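Summary: pandas is an open-source library for data analysis. A DataFrame closely resembles an Excel sheet or a table in a relational database. Note that a DataFrame has an index, similar to a primary key in a relational database. Topics covered include DataFrame statistical functions, and grouping and aggregation, where the agg() method applies a series of NumPy functions to groups of data. concat() concatenates DataFrames, and rows are appended with append().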
pandas (Python data analysis) is an open-source Python library for data analysis.
pandas has two main data structures: DataFrame and Series.
Installation: pandas depends on NumPy, python-dateutil, and pytz.
pip install pandas
DataFrame
A DataFrame is a labeled two-dimensional object, closely resembling an Excel sheet or a table in a relational database. A DataFrame can be created in the following ways:
From another DataFrame
From a NumPy array with a two-dimensional shape, or from a composite structure of arrays
From Series objects
From a file such as a CSV file
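The four construction routes listed above can be sketched as follows. This is only an illustration: the column names "a" and "b" and the inline CSV text are made up, and an in-memory buffer stands in for a real file.

```python
import io

import numpy as np
import pandas as pd

# From a 2-D NumPy array; the column names are illustrative.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# From another DataFrame (copy=True avoids sharing the underlying data).
from_frame = pd.DataFrame(from_array, copy=True)

# From Series objects: each Series becomes one column.
from_series = pd.DataFrame({"a": pd.Series([1, 2, 3]),
                            "b": pd.Series([4.0, 5.0, 6.0])})

# From CSV text; io.StringIO stands in for a file on disk.
from_csv = pd.read_csv(io.StringIO("a,b\n1,2\n3,4\n"))

print(from_array.shape, from_frame.shape, from_series.shape, from_csv.shape)
```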
Data preparation: download a CSV data file from http://www.exporedata.net/Dow...
    from pandas.io.parsers import read_csv

    # Load the CSV file into a DataFrame and inspect its basic attributes
    df = read_csv("WHO_first9cols.csv")
    print("Dataframe", df)
    print("Shape", df.shape)
    print("Length", len(df))
    print("Column Headers", df.columns)
    print("Data types", df.dtypes)
    print("Index", df.index)
    print("Values", df.values)
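Note: a DataFrame has an index, similar to a primary key in a relational database. The index can be created manually or automatically, and is accessed through df.index.
To iterate over the data, use df.values to get all the values; entries that are not numbers (for example, missing entries) are shown as nan in the output.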
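As a small illustration of the note above, the sketch below builds one DataFrame with an automatic index and one with a manual index, then iterates over df.values. The data is made up for the example.

```python
import pandas as pd

# Automatic index: pandas assigns 0, 1, 2, ...
auto = pd.DataFrame({"Country": ["A", "B", "C"], "Value": [1.5, None, 3.0]})

# Manual index: the labels are supplied by the caller.
manual = pd.DataFrame({"Value": [1.5, None, 3.0]}, index=["A", "B", "C"])

print(auto.index)
print(manual.index)

# df.values exposes the data as a NumPy array; the missing entry prints as nan.
for row in manual.values:
    print(row)
```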
A Series is a one-dimensional array whose elements may be of different types; this data structure also has labels. A Series can be created in the following ways:
From a Python dictionary
From a NumPy array
From a single scalar value
When creating a Series, you can pass the constructor a set of axis labels, which are usually called the index.
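A minimal sketch of the three ways to build a Series, each with labels. The values and labels are invented for illustration.

```python
import numpy as np
import pandas as pd

# From a Python dictionary: the keys become the index labels.
from_dict = pd.Series({"a": 1, "b": 2, "c": 3})

# From a NumPy array, passing the axis labels to the constructor.
from_array = pd.Series(np.array([10.0, 20.0, 30.0]), index=["x", "y", "z"])

# From a single scalar: the value is repeated for every label.
from_scalar = pd.Series(7, index=["p", "q", "r"])

print(from_dict, from_array, from_scalar, sep="\n")
```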
Querying a DataFrame column returns a Series.
    from pandas.io.parsers import read_csv
    import numpy as np

    df = read_csv("WHO_first9cols.csv")
    # Querying a DataFrame column returns a Series
    country_col = df["Country"]
    print("Type df", type(df))
    print("Type country col", type(country_col))
    print("Series shape", country_col.shape)
    print("Series index", country_col.index)
    print("Series values", country_col.values)
    print("Series name", country_col.name)
    print("Last 2 countries", country_col[-2:])
    print("Last 2 countries type", type(country_col[-2:]))
    # NumPy functions also work on pandas DataFrames and Series
    # (restricted to numeric columns, since np.sign is undefined for strings)
    print("df signs", np.sign(df.select_dtypes(include="number")))
    last_col = df.columns[-1]
    print("Last df column signs", last_col, np.sign(df[last_col]))
    print(np.sum(df[last_col] - df[last_col].values))
Querying data with pandas
Data preparation: pip install Quandl, or download the CSV file manually from http://www.quandl.com/SIDC/SU...
    # Later releases of this package may need to be imported as lowercase `quandl`
    import Quandl

    # Data from http://www.quandl.com/SIDC/SUNSPOTS_A-Sunspot-Numbers-Annual
    # PyPi url https://pypi.python.org/pypi/Quandl
    sunspots = Quandl.get("SIDC/SUNSPOTS_A")
    print("Head 2", sunspots.head(2))
    print("Tail 2", sunspots.tail(2))
    last_date = sunspots.index[-1]
    # Label-based lookup with loc
    print("Last value", sunspots.loc[last_date])
    print("Values slice by date", sunspots["20020101": "20131231"])
    # Position-based lookup with iloc and iat
    print("Slice from a list of indices", sunspots.iloc[[2, 4, -4, -2]])
    print("Scalar with Iloc", sunspots.iloc[0, 0])
    print("Scalar with iat", sunspots.iat[1, 0])
    # Boolean selection
    print("Boolean selection", sunspots[sunspots > sunspots.mean()])
    print("Boolean selection with column label", sunspots[sunspots.Number > sunspots.Number.mean()])
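If the Quandl data is not at hand, the same selection techniques can be tried offline. The sketch below uses a small made-up table (not real sunspot data) with a date index and a single column named Number, mirroring the column name used above.

```python
import pandas as pd

# Hypothetical yearly values; these are NOT real sunspot numbers.
dates = pd.to_datetime(["2000-12-31", "2001-12-31", "2002-12-31",
                        "2003-12-31", "2004-12-31", "2005-12-31"])
df = pd.DataFrame({"Number": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]}, index=dates)

last_date = df.index[-1]
last_value = df.loc[last_date, "Number"]             # label-based lookup
by_date = df.loc["2002-01-01":"2004-12-31"]          # slice rows by date
by_position = df.iloc[[2, 4, -2]]                    # position-based lookup
scalar = df.iat[1, 0]                                # fast scalar access
above_mean = df[df.Number > df.Number.mean()]        # boolean selection

print(last_value, scalar)
print(by_date, by_position, above_mean, sep="\n")
```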
DataFrame statistical functions
describe, count, mad, median, min, max, mode, std, var, skew, kurt
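A short sketch of several of the functions listed above, run on a tiny invented table. mad() is left out because it was removed in pandas 2.0, so calling it would fail on current versions.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 2.0, 6.0, 10.0]})

print(df.describe())   # count, mean, std, min, quartiles, max per column
counts = df.count()    # number of non-missing values per column
medians = df.median()
variances = df.var()   # sample variance (divides by n - 1)

print(df.min(), df.max(), sep="\n")
print(df.mode())       # most frequent value(s) per column
print(df.std(), df.skew(), df.kurt(), sep="\n")
```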
Grouping and aggregation
    import pandas as pd
    import numpy as np
    from numpy.random import seed, rand, randint

    seed(42)
    df = pd.DataFrame({
        "Weather": ["cold", "hot", "cold", "hot", "cold", "hot", "cold"],
        "Food": ["soup", "soup", "icecream", "chocolate", "icecream", "icecream", "soup"],
        "Price": 10 * rand(7),
        "Number": randint(1, 10, size=7)})  # integers from 1 to 9 inclusive
    print(df)

    # Group the rows by weather and walk through the groups
    weather_group = df.groupby("Weather")
    for i, (name, group) in enumerate(weather_group, start=1):
        print("Group", i, name)
        print(group)

    print("Weather group first\n", weather_group.first())
    print("Weather group last\n", weather_group.last())
    # numeric_only=True skips the string column Food
    print("Weather group mean\n", weather_group.mean(numeric_only=True))

    # Group by two columns at once
    wf_group = df.groupby(["Weather", "Food"])
    print("WF Groups", wf_group.groups)
    # The agg() method applies a series of NumPy functions to each group
    print("WF Aggregated\n", wf_group.agg([np.mean, np.median]))
Concatenating and appending DataFrames
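Database tables support inner and outer joins. DataFrames have similar operations, namely concatenation and appending.
The concat() function concatenates DataFrames, and the append() method appends rows. (DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions pd.concat() covers both cases.)
For example:
    pd.concat([df[:3], df[3:]])
    df[:3].append(df[5:])  # pandas < 2.0 only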
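Both operations can be written with pd.concat() alone, which also runs on pandas versions that no longer have DataFrame.append(). The sketch below uses a made-up seven-row frame in place of the df from the earlier example.

```python
import pandas as pd

df = pd.DataFrame({"n": range(7)})

# Concatenating the first three rows with the rest rebuilds the whole frame.
whole = pd.concat([df[:3], df[3:]])

# Equivalent of df[:3].append(df[5:]): rows 3 and 4 are left out.
partial = pd.concat([df[:3], df[5:]])

print(whole)
print(partial)
```

Note that the original index labels are kept; pass ignore_index=True to pd.concat() to renumber the rows.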
pandas provides the merge() function and the DataFrame join() method, both of which implement database-style joins. By default, join() joins on the index, which is not always what we want.
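To see why the default index-based join() can be misleading, the sketch below builds the two small tables inline (same values as the tips.csv and dest.csv files used in this section) and compares a default join() with one where EmpNr is made the index first. Using set_index() here is one possible workaround, not the only one.

```python
import pandas as pd

dests = pd.DataFrame({"EmpNr": [5, 3, 9],
                      "Dest": ["The Hague", "Amsterdam", "Rotterdam"]})
tips = pd.DataFrame({"EmpNr": [5, 9, 7], "Amount": [10.0, 5.0, 2.5]})

# Default join(): rows are paired by the automatic index 0, 1, 2, not by EmpNr,
# so row 1 combines employee 3 with the tip of employee 9.
by_index = dests.join(tips, lsuffix="Dest", rsuffix="Tips")

# Making EmpNr the index on both sides turns join() into a key-based join.
by_key = dests.set_index("EmpNr").join(tips.set_index("EmpNr"), how="inner")

print(by_index)
print(by_key)
```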
Data preparation:
tips.csv
    EmpNr,Amount
    5,10
    9,5
    7,2.5
dest.csv
    EmpNr,Dest
    5,The Hague
    3,Amsterdam
    9,Rotterdam
    import pandas as pd

    dests = pd.read_csv("dest.csv")
    tips = pd.read_csv("tips.csv")
    # Use merge() to join on the employee number
    print("Merge() on key\n", pd.merge(dests, tips, on="EmpNr"))
    # With join(), suffixes are needed to tell the left and right operands apart
    print("Dests join() tips\n", dests.join(tips, lsuffix="Dest", rsuffix="Tips"))
    # A more explicit way to perform an inner join with merge()
    print("Inner join with merge()\n", pd.merge(dests, tips, how="inner"))
    # A small change turns this into a full outer join; missing data becomes NaN
    print("Outer join\n", pd.merge(dests, tips, how="outer"))
Handling missing data
Missing data is represented as NaN (not a number); there is a similar marker, NaT (not a time), for dates. The pandas functions isnull() and notnull() test for missing values, and the fillna() method replaces missing data with a scalar value.
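A self-contained sketch of these functions on invented data, with one numeric column containing NaN and one date column containing NaT:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "value": [1.0, np.nan, 3.0],
    "when": pd.to_datetime(["2020-01-01", None, "2020-01-03"]),  # None becomes NaT
})

mask = pd.isnull(df)                 # True where a value is missing (NaN or NaT)
present = pd.notnull(df).sum()       # number of non-missing values per column
filled = df["value"].fillna(0)       # replace missing numbers with a scalar

print(mask)
print(present)
print(filled)
```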
    import pandas as pd
    import numpy as np

    df = pd.read_csv("WHO_first9cols.csv")
    # Select the first 2 rows of Country and Net primary school enrolment ratio male (%)
    df = df[["Country", df.columns[-2]]][:2]
    print("New df\n", df)
    print("Null Values\n", pd.isnull(df))
    print("Total Null Values\n", pd.isnull(df).sum())
    print("Not Null Values\n", df.notnull())
    # Arithmetic with NaN propagates NaN
    print("Last Column Doubled\n", 2 * df[df.columns[-1]])
    print("Last Column plus NaN\n", df[df.columns[-1]] + np.nan)
    print("Zero filled\n", df.fillna(0))
Handling dates
http://pandas.pydata.org/pand...
Frequency (freq) short codes:
B business day frequency
C custom business day frequency (experimental)
D calendar day frequency
W weekly frequency
M month end frequency
SM semi-month end frequency (15th and end of month)
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
SMS semi-month start frequency (1st and 15th)
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter end frequency
QS quarter start frequency
BQS business quarter start frequency
A year end frequency
BA business year end frequency
AS year start frequency
BAS business year start frequency
BH business hour frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
L, ms milliseconds
U, us microseconds
N nanoseconds
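A few of these codes in use with pd.date_range(). Some of the codes in the table have, as far as I know, been renamed in recent pandas releases, so this sketch sticks to D, B, W and MS, which should behave the same across versions. The start dates are arbitrary.

```python
import pandas as pd

# D: every calendar day
days = pd.date_range("2024-01-01", periods=3, freq="D")

# B: business days only; 2024-01-05 is a Friday, so the weekend is skipped
business = pd.date_range("2024-01-05", periods=3, freq="B")

# MS: first day of each month on or after the start date
month_starts = pd.date_range("2024-01-15", periods=3, freq="MS")

# W: weekly, anchored on Sundays by default
weeks = pd.date_range("2024-01-01", periods=2, freq="W")

for name, rng in [("D", days), ("B", business), ("MS", month_starts), ("W", weeks)]:
    print(name, list(rng.strftime("%Y-%m-%d")))
```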
    import pandas as pd
    from pandas.tseries.offsets import DateOffset

    print("Date range", pd.date_range("1/1/1900", periods=42, freq="D"))

    # Dates this early fall outside the supported timestamp range
    try:
        print("Date range", pd.date_range("1/1/1677", periods=4, freq="D"))
    except Exception as err:
        print("Error encountered", type(err), err)

    # Integer division keeps the number of seconds an int
    offset = DateOffset(seconds=2 ** 63 // 10 ** 9)
    mid = pd.to_datetime("1/1/1970")
    print("Start valid range", mid - offset)
    print("End valid range", mid + offset)
    # Mixed separators; recent pandas versions may require format="mixed" here
    print(pd.to_datetime(["1900/1/1", "1901.12.11"]))
    print("With format", pd.to_datetime(["19021112", "19031230"], format="%Y%m%d"))
    # An unparseable string raises an error by default
    try:
        print("Illegal date", pd.to_datetime(["1902-11-12", "not a date"]))
    except ValueError as err:
        print("Illegal date", err)
    # errors="coerce" turns unparseable entries into NaT
    print("Illegal date coerced", pd.to_datetime(["1902-11-12", "not a date"], errors="coerce"))
Pivot tables (pivot_table)
Pivot tables are used to summarize data. pandas offers the pivot_table() function and a corresponding DataFrame method.
    import pandas as pd
    from numpy.random import seed, rand, randint

    seed(42)
    N = 7
    df = pd.DataFrame({
        "Weather": ["cold", "hot", "cold", "hot", "cold", "hot", "cold"],
        "Food": ["soup", "soup", "icecream", "chocolate", "icecream", "icecream", "soup"],
        "Price": 10 * rand(N),
        "Number": randint(1, 10, size=N)})  # integers from 1 to 9 inclusive
    print("DataFrame\n", df)
    # values lists the columns to aggregate, columns gives the column whose
    # values become the output columns, and aggfunc is the aggregation function
    print(pd.pivot_table(df, values=["Price", "Number"], columns=["Food"], aggfunc="sum"))
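With fixed numbers the effect of the aggregation is easier to check by hand. The sketch below uses a small invented table and additionally passes index, so that Weather labels the rows and Food labels the columns; the two cold/soup prices are summed into one cell.

```python
import pandas as pd

df = pd.DataFrame({
    "Weather": ["cold", "hot", "cold", "hot", "cold"],
    "Food": ["soup", "soup", "icecream", "icecream", "soup"],
    "Price": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Rows: Weather, columns: Food, cells: sum of Price for that combination.
table = pd.pivot_table(df, values="Price", index="Weather",
                       columns="Food", aggfunc="sum")
print(table)
```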
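Copyright of this article belongs to the author; do not reproduce it without permission. If this article violates any rules, you can contact the administrator to have it removed.
When reposting, please cite this article's address: http://systransis.cn/yun/38355.html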