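Abstract: This article introduces the basic ways to create and access the two core pandas data structures, Series and DataFrame. A Series is a one-dimensional, array-like object made up of a set of data (a one-dimensional array object) and an associated set of data labels (the index). When no index is specified for the data, Series automatically creates an integer index. A Series created from a dict can be regarded as a fixed-length ordered dict.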
Introduction
pandas is one of the best-known data analysis packages for Python. It is built on NumPy and provides higher-level data structures and tools. pandas revolves around two core data structures, Series and DataFrame. This article introduces the basic ways to create and access them.
A Series is a one-dimensional, array-like object made up of a set of data (a one-dimensional ndarray) and an associated set of data labels (the index).
Note: NumPy (Numerical Python) gives Python a multi-dimensional array object, ndarray, which supports vectorized operations and is fast and memory-efficient.
(1) The pandas documentation describes Series as follows:
    """
    One-dimensional ndarray with axis labels (including time series).

    Labels need not be unique but must be a hashable type. The object
    supports both integer- and label-based indexing and provides a host of
    methods for performing operations involving the index. Statistical
    methods from ndarray have been overridden to automatically exclude
    missing data (currently represented as NaN).

    Operations between Series (+, -, /, *, **) align values based on their
    associated index values-- they need not be the same length. The result
    index will be the sorted union of the two indexes.

    Parameters
    ----------
    data : array-like, dict, or scalar value
        Contains data stored in Series
    index : array-like or Index (1d)
        Values must be hashable and have the same length as `data`.
        Non-unique index values are allowed. Will default to
        RangeIndex(len(data)) if not provided. If both a dict and index
        sequence are used, the index will override the keys found in the
        dict.
    dtype : numpy.dtype or None
        If None, dtype will be inferred
    copy : boolean, default False
        Copy input data
    """
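The docstring's remark about index alignment is not demonstrated elsewhere in this article, so here is a minimal sketch of it. The names `s1`, `s2` and `total` and all values are made up for illustration.

```python
import pandas as pd

# Two Series whose indexes only partly overlap (illustrative values).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["c", "b"])

# Addition pairs values by label, not by position.
total = s1 + s2
print(total)

# A label present in only one operand has no partner, so its result is NaN.
print(total.isna())
```

Here "b" and "c" are summed across both operands, while "a" exists only in `s1` and therefore comes out as NaN.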
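(2) The basic way to create a Series is shown below. The data can be array-like (a list or an ndarray), a dict, or a scalar value. All examples in this article assume the following imports:

    >>> import numpy as np
    >>> import pandas as pd
    >>> s = pd.Series(data, index=index)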
    >>> s = pd.Series([-1.55666192, -0.75414753, 0.47251231, -1.37775038, -1.64899442], index=["a", "b", "c", "d", "e"], dtype="int8")
    >>> s
    a   -1
    b    0
    c    0
    d   -1
    e   -1
    dtype: int8
    >>> s = pd.Series(["a", -0.75414753, 123, 66666, -1.64899442], index=["a", "b", "c", "d", "e"])
    >>> s
    a           a
    b   -0.754148
    c         123
    d       66666
    e    -1.64899
    dtype: object
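Note: Series supports the numpy.dtype types, including integers, floats, complex numbers, booleans and strings. As with ndarray creation, when no type is specified pandas tries to infer a suitable one. In the example, data that mixes numbers and strings is inferred as object, and data created with dtype="int8" is shown as int8. The int8 output above comes from the pandas version used when this article was written. Recent releases refuse this lossy float-to-integer conversion in the constructor and raise ValueError, in which case the float Series can be converted explicitly with astype("int8").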
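The inference rule described in the note can be checked directly. This is a small sketch, not the article's original code: the variable names are hypothetical, and `astype` is used for the integer conversion on the assumption that a current pandas release is installed.

```python
import pandas as pd

# No dtype given: pandas infers one from the values.
floats = pd.Series([-1.55666192, -0.75414753, 0.47251231])
mixed = pd.Series(["a", -0.75414753, 123])

print(floats.dtype)  # float64
print(mixed.dtype)   # object

# Explicit, lossy conversion: the fractional part is truncated toward zero.
as_int8 = floats.astype("int8")
print(as_int8.tolist())  # [-1, 0, 0]
```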
    >>> s = pd.Series(np.random.randn(5))
    >>> s
    0    0.485468
    1   -0.912130
    2    0.771970
    3   -1.058117
    4    0.926649
    dtype: float64
    >>> s.index
    RangeIndex(start=0, stop=5, step=1)
    >>> s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
    >>> s
    a    0.485468
    b   -0.912130
    c    0.771970
    d   -1.058117
    e    0.926649
    dtype: float64
Note: when no index is specified for the data, Series automatically creates an integer index (a RangeIndex starting at 0).
    >>> s = pd.Series({"a": 0., "b": 1., "c": 2.})
    >>> s
    a    0.0
    b    1.0
    c    2.0
    dtype: float64
    >>> s = pd.Series({"a": 0., "b": 1., "c": 2.}, index=["b", "c", "d", "a"])
    >>> s
    b    1.0
    c    2.0
    d    NaN
    a    0.0
    dtype: float64
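Note: a Series created from a Python dict can be regarded as a fixed-length ordered dict. If only a dict is passed, the dict's keys become the index of the Series. If an index is also passed, each index label is looked up in the dict and the matching value is placed at that position. Labels with no matching key get NaN.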
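A short sketch of what can be done with the NaN that appears for an unmatched label. The variable names are illustrative, and `isna`, `dropna` and `fillna` are the standard pandas methods for missing data.

```python
import pandas as pd

data = {"a": 0.0, "b": 1.0, "c": 2.0}
s = pd.Series(data, index=["b", "c", "d", "a"])

# "d" has no entry in the dict, so its value is missing.
missing = s.isna()
print(missing)

# Either drop the missing entries or replace them with a default.
print(s.dropna())
print(s.fillna(0.0))
```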
    >>> s = pd.Series(5., index=["a", "b", "c", "d", "e"])
    >>> s
    a    5.0
    b    5.0
    c    5.0
    d    5.0
    e    5.0
    dtype: float64
Note: a scalar value is repeated to match the length of the index.
(3) Accessing the elements and the index of a Series
    >>> s = pd.Series({"a": 0., "b": 1., "c": 2.}, index=["b", "c", "d", "a"])
    >>> s
    b    1.0
    c    2.0
    d    NaN
    a    0.0
    dtype: float64
    >>> s.values
    [ 1.  2. nan  0.]
    >>> s.index
    Index([u"b", u"c", u"d", u"a"], dtype="object")
Note: the values and index attributes return the array representation of a Series and its index object.
    >>> s["a"]
    0.0
    >>> s[["a", "b"]]
    a    0.0
    b    1.0
    dtype: float64
    >>> s[["a", "b", "c"]]
    a    0.0
    b    1.0
    c    2.0
    dtype: float64
    >>> s[:2]
    b    1.0
    c    2.0
    dtype: float64
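Note: a single value or a group of values can be selected from a Series by indexing: a label returns one value, a list of labels returns a sub-Series, and an integer slice such as s[:2] selects by position.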
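The plain `[]` operator mixes label and position lookups. As a sketch, the same selections can be written with the explicit `loc` (label) and `iloc` (position) accessors, which pandas also provides on Series. The sample values are made up.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 0.0], index=["b", "c", "a"])

by_label = s.loc["a"]             # single value by label
several = s.loc[["a", "b"]]       # list of labels -> sub-Series
first_two = s.iloc[:2]            # first two entries by position
label_slice = s.loc["b":"c"]      # label slices include both endpoints
above_half = s[s > 0.5]           # boolean mask keeps matching entries

print(by_label, first_two.tolist(), label_slice.tolist())
```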
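A DataFrame is a tabular (two-dimensional) data structure. It holds an ordered collection of columns, and each column can have a different value type (numeric, string, boolean and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index.
(1) The pandas documentation describes DataFrame as follows: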
    """
    Two-dimensional size-mutable, potentially heterogeneous tabular
    data structure with labeled axes (rows and columns). Arithmetic
    operations align on both row and column labels. Can be thought of as a
    dict-like container for Series objects. The primary pandas data
    structure

    Parameters
    ----------
    data : numpy ndarray (structured or homogeneous), dict, or DataFrame
        Dict can contain Series, arrays, constants, or list-like objects
    index : Index or array-like
        Index to use for resulting frame. Will default to np.arange(n) if
        no indexing information part of input data and no index provided
    columns : Index or array-like
        Column labels to use for resulting frame. Will default to
        np.arange(n) if no column labels are provided
    dtype : dtype, default None
        Data type to force. Only a single dtype is allowed. If None, infer
    copy : boolean, default False
        Copy data from inputs. Only affects DataFrame / 2d ndarray input
    """
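(2) The basic way to create a DataFrame is shown below. The data can be a dict of lists, one-dimensional ndarrays or Series (lists and arrays must all have the same length), a two-dimensional ndarray, a dict of dicts, and so on.

    >>> df = pd.DataFrame(data, index=index)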
    >>> df = pd.DataFrame({"one": [1., 2., 3., 5], "two": [1., 2., 3., 4.]})
    >>> df
       one  two
    0  1.0  1.0
    1  2.0  2.0
    2  3.0  3.0
    3  5.0  4.0
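Note: when a dict of lists is passed, each list becomes one column of the DataFrame. Writing the lists directly inside braces, as in df = pd.DataFrame({[1., 2., 3., 5], [1., 2., 3., 4.]}), does not work: braces without keys build a Python set, and a list is unhashable, so Python raises TypeError before pandas is even called.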
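The failure described in the note, and two ways around it, can be sketched as follows. The variable names are hypothetical. The second alternative treats each inner list as a row rather than a column.

```python
import pandas as pd

error_message = None
try:
    # Braces without keys form a set literal; lists cannot be set members.
    bad = {[1.0, 2.0, 3.0, 5.0], [1.0, 2.0, 3.0, 4.0]}
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Option 1: name each list, so that it becomes a column.
df_cols = pd.DataFrame({"one": [1.0, 2.0, 3.0, 5.0], "two": [1.0, 2.0, 3.0, 4.0]})

# Option 2: pass a list of lists; each inner list becomes a row and the
# columns get default integer labels.
df_rows = pd.DataFrame([[1.0, 2.0, 3.0, 5.0], [1.0, 2.0, 3.0, 4.0]])

print(df_cols.shape, df_rows.shape)
```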
    >>> df = pd.DataFrame([[1., 2., 3., 5], [1., 2., 3., 4.]], index=["a", "b"], columns=["one", "two", "three", "four"])
    >>> df
       one  two  three  four
    a  1.0  2.0    3.0   5.0
    b  1.0  2.0    3.0   4.0
Note: a nested list creates a table with 2 rows and 4 columns here. The index and columns parameters set the row labels and the column names.
    >>> data = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "a10")])
    >>> data
    [(0, 0., "") (0, 0., "")]
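Note: zeros(shape, dtype=float, order="C") returns an array of the given shape and type, filled with zeros. Here the dtype is a structured type with three named fields: A, B and C.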
    >>> data[:] = [(1, 2., "Hello"), (2, 3., "World")]
    >>> df = pd.DataFrame(data)
    >>> df
       A    B      C
    0  1  2.0  Hello
    1  2  3.0  World
    >>> df = pd.DataFrame(data, index=["first", "second"])
    >>> df
            A    B      C
    first   1  2.0  Hello
    second  2  3.0  World
    >>> df = pd.DataFrame(data, columns=["C", "A", "B"])
    >>> df
           C  A    B
    0  Hello  1  2.0
    1  World  2  3.0
Note: as with Series, a DataFrame adds a default index when none is specified. When columns are specified, they are arranged in the given order.
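A self-contained variant of the structured-array example, as a sketch. One deliberate change from the article: the string field uses "U10" (unicode text) instead of "a10", because "a10" denotes a byte-string field, which shows up as b"Hello" under Python 3. The name `records` is made up.

```python
import numpy as np
import pandas as pd

# Structured array with three named fields; "U10" is an assumption made
# here so that the text column holds str rather than bytes.
records = np.zeros((2,), dtype=[("A", "i4"), ("B", "f4"), ("C", "U10")])
records[:] = [(1, 2.0, "Hello"), (2, 3.0, "World")]

# Field names become column names; index and columns can be set together.
df = pd.DataFrame(records, index=["first", "second"], columns=["C", "A", "B"])
print(df)
print(df.dtypes)
```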
    >>> data = {"one": pd.Series([1., 2., 3.], index=["a", "b", "c"]),
    ...         "two": pd.Series([1., 2., 3., 4.], index=["a", "b", "c", "d"])}
    >>> df = pd.DataFrame(data)
    >>> df
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  3.0
    d  NaN  4.0
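Note: when a dict of Series is passed, each Series becomes one column. If no index is given explicitly, the indexes of all the Series are merged into the row index of the result, and NaN stands in for the values a column is missing.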
    >>> df = pd.DataFrame(data, index=["d", "b", "a"])
    >>> df
       one  two
    d  NaN  4.0
    b  2.0  2.0
    a  1.0  1.0
    >>> df = pd.DataFrame(data, index=["d", "b", "a"], columns=["two", "three"])
    >>> df
       two three
    d  4.0   NaN
    b  2.0   NaN
    a  1.0   NaN
    >>> data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
    >>> df = pd.DataFrame(data2)
    >>> df
       a   b     c
    0  1   2   NaN
    1  5  10  20.0
Note: when a list of dicts is passed, each dict becomes one row of the DataFrame, and the union of all the dict keys becomes the column labels.
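One detail in the output above is worth a quick check: column c is displayed as 20.0 while a and b stay integers. The sketch below (variable names are illustrative) prints the column dtypes to show why.

```python
import pandas as pd

rows = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(rows)

# The columns are the union of all keys seen in the dicts.
print(df.columns.tolist())

# "c" is missing from the first dict, so the column must hold NaN and is
# stored as float64; "a" and "b" are complete and stay int64.
print(df.dtypes)
```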
    >>> df = pd.DataFrame(data2, index=["first", "second"])
    >>> df
            a   b     c
    first   1   2   NaN
    second  5  10  20.0
    >>> df = pd.DataFrame(data2, columns=["a", "b"])
    >>> df
       a   b
    0  1   2
    1  5  10
    >>> df = pd.DataFrame({("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
    ...                    ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
    ...                    ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
    ...                    ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
    ...                    ("b", "b"): {("A", "D"): 9, ("A", "B"): 10}})
    >>> df
           a              b
           a    b    c    a     b
    A B  4.0  1.0  5.0  8.0  10.0
      C  3.0  2.0  6.0  7.0   NaN
      D  NaN  NaN  NaN  NaN   9.0
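Note: when a dict of dicts is passed, the outer keys are merged into the column index of the result and each inner dict becomes one column. The inner keys are merged into the row index. Because the keys in this example are tuples, both indexes are hierarchical (multi-level).
(3) Accessing the elements and the index of a DataFrame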
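To make the tuple-key behaviour concrete, here is a reduced sketch using three of the five columns from the example. It assumes only that tuple keys produce a `MultiIndex` on both axes, which is what the displayed output suggests.

```python
import pandas as pd

df = pd.DataFrame({
    ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
    ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
    ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
})

# Both axes are hierarchical because every key is a tuple.
print(type(df.columns).__name__, type(df.index).__name__)

# A full tuple selects one column; a full tuple on that column's index
# then selects one cell.
col = df[("a", "b")]
print(col.loc[("A", "B")])

# Row ("A", "D") only exists in column ("b", "b"), so it is NaN elsewhere.
print(col.loc[("A", "D")])
```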
    >>> data = {"one": pd.Series([1., 2., 3.], index=["a", "b", "c"]),
    ...         "two": pd.Series([1., 2., 3., 4.], index=["a", "b", "c", "d"])}
    >>> df = pd.DataFrame(data)
    >>> df
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  3.0
    d  NaN  4.0
    >>> df["one"]    # or: df.one
    a    1.0
    b    2.0
    c    3.0
    d    NaN
    Name: one, dtype: float64
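Note: a DataFrame column can be retrieved as a Series either with dict-style notation or as an attribute. The returned Series has the same index as the original DataFrame, and its name attribute is set to the column name.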
    >>> df[0:1]
       one  two
    a  1.0  1.0
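Note: slicing with [] selects rows by position, not columns. df[0:1] returns the first row only, since the end of the slice is excluded.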
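Because `[]` means different things depending on what is passed to it, a side-by-side sketch may help. The frame below is a simplified stand-in for the article's `df`.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [1.0, 2.0, 3.0]},
    index=["a", "b", "c"],
)

first_row = df[0:1]                  # integer slice -> rows by position
first_two_rows = df[0:2]             # end position is excluded
one_column = df["one"]               # single label -> one column (Series)
both_columns = df[["one", "two"]]    # list of labels -> columns (DataFrame)

print(first_row.index.tolist(), first_two_rows.index.tolist())
```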
    >>> df.loc["a"]
    one    1.0
    two    1.0
    Name: a, dtype: float64
    >>> df.loc[:, ["one", "two"]]
       one  two
    a  1.0  1.0
    b  2.0  2.0
    c  3.0  3.0
    d  NaN  4.0
    >>> df.loc[["a"], ["one", "two"]]
       one  two
    a  1.0  1.0
    >>> df.loc["a", "one"]
    1.0
Note: loc selects data by label.
    >>> df.iloc[0:2, 0:1]
       one
    a  1.0
    b  2.0
    >>> df.iloc[0:2]
       one  two
    a  1.0  1.0
    b  2.0  2.0
    >>> df.iloc[[0, 2], [0, 1]]    # pick arbitrary row positions and column positions
       one  two
    a  1.0  1.0
    c  3.0  3.0
Note: iloc selects data by integer position.
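One practical difference between the two accessors is how slices treat their end point. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, 4.0], "two": [10.0, 20.0, 30.0, 40.0]},
    index=["a", "b", "c", "d"],
)

# Label slice: both end points are included.
by_label = df.loc["a":"c", "one"]

# Position slice: the end position is excluded, as in plain Python.
by_position = df.iloc[0:2, 0]

print(by_label.index.tolist())     # ['a', 'b', 'c']
print(by_position.index.tolist())  # ['a', 'b']
```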
    >>> df.ix["a"]
    one    1.0
    two    1.0
    Name: a, dtype: float64
    >>> df.ix["a", ["one", "two"]]
    one    1.0
    two    1.0
    Name: a, dtype: float64
    >>> df.ix["a", [0, 1]]
    one    1.0
    two    1.0
    Name: a, dtype: float64
    >>> df.ix[["a", "b"], [0, 1]]
       one  two
    a  1.0  1.0
    b  2.0  2.0
    >>> df.ix[1, [0, 1]]
    one    2.0
    two    2.0
    Name: b, dtype: float64
    >>> df.ix[[0, 1], [0, 1]]
       one  two
    a  1.0  1.0
    b  2.0  2.0
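Note: ix accepts labels and integer positions interchangeably when selecting rows and columns. It was deprecated in pandas 0.20 and removed in pandas 1.0, so these calls only run on older versions.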
    >>> df.ix[df.one > 1, :1]
       one
    b  2.0
    c  3.0
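Note: rows can also be selected by a condition. This picks the rows where column one is greater than 1, restricted to the first column.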
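On pandas 1.0 and later, where `ix` is no longer available, the mixed selections above can be approximated with `loc` and `iloc`. The following is one possible translation, not the only one. The helper expression `df.columns[...]`, which turns column positions into labels, is a choice made here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, np.nan], "two": [1.0, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# df.ix["a", [0, 1]]: row by label, columns by position.
row_a = df.loc["a", df.columns[[0, 1]]]

# df.ix[1, [0, 1]]: row and columns by position.
row_b = df.iloc[1, [0, 1]]

# df.ix[df.one > 1, :1]: boolean row mask, first column by position.
selected = df.loc[df["one"] > 1, df.columns[:1]]

print(selected)
```

The NaN in row d compares as False, so only rows b and c pass the condition.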
    >>> df["one"] = 16.8
    >>> df
        one  two
    a  16.8  1.0
    b  16.8  2.0
    c  16.8  3.0
    d  16.8  4.0
    >>> val = pd.Series([2, 2, 2], index=["b", "c", "d"])
    >>> df["one"] = val
    >>> df
       one  two
    a  NaN  1.0
    b  2.0  2.0
    c  2.0  3.0
    d  2.0  4.0
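Note: a column can be modified by assignment. When a list or an array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, it is aligned exactly on the DataFrame's index, and any gaps are filled with NaN.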
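Both rules in the note can be seen in a few lines. This is a sketch with a simplified frame; `length_error` is a hypothetical name used to capture the exception text.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0, 4.0]}, index=["a", "b", "c", "d"])

# A Series is matched to the frame by label; "a" has no partner -> NaN.
val = pd.Series([2, 2, 2], index=["b", "c", "d"])
df["one"] = val
print(df)

# A plain list carries no labels, so only its length can be checked.
length_error = None
try:
    df["two"] = [1, 2, 3]  # three values for four rows
except ValueError as exc:
    length_error = str(exc)
    print(length_error)
```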
    >>> df["four"] = [3, 3, 3, 3]
    >>> df
       one  two  four
    a  NaN  1.0     3
    b  2.0  2.0     3
    c  2.0  3.0     3
    d  2.0  4.0     3
Note: assigning to a column that does not exist creates a new column.
    >>> df.index.get_loc("a")
    0
    >>> df.index.get_loc("b")
    1
    >>> df.columns.get_loc("one")
    0
Note: get_loc on the row or column index returns the integer position of a label.
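One use for `get_loc` is to start from a label and then move by position, which `loc` alone cannot express. A small sketch with made-up data and variable names:

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

row_pos = df.index.get_loc("b")      # label -> integer position
col_pos = df.columns.get_loc("two")

# Row "b" and the row right after it, in column "two".
window = df.iloc[row_pos:row_pos + 2, col_pos]
print(window)
```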
For more discussion of quantitative trading with Python, follow the WeChat public account PythonQT-YuanXiao.
When reposting, please cite the address of this article: http://systransis.cn/yun/41409.html