Learning Big Data and Cloud Computing: Data Analysis (Part 1)

dunizb, published 2019-07-30 14:48 / 2775 views

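Python basics

Start with the basics

Points to note

Slicing

Note that slicing a list uses a half-open interval that is closed on the left and open on the right: [3,6) covers indices 3, 4, and 5, so the element 7 at index 6 is not printed.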
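A minimal sketch of that slice. The list below is hypothetical, chosen only so that index 6 holds the value 7 as in the note above.

```python
# Hypothetical list chosen so that index 6 holds the value 7
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# The start index 3 is included, the stop index 6 is excluded
middle = numbers[3:6]
print(middle)        # [4, 5, 6]
print(numbers[6])    # 7, which is not part of the slice
```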

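Changing elements in a list

Adding and removing elements

Two ways to copy a list

list2 is assigned to y: changing y also changes list2

list2 is copied to y: changing y leaves list2 unchanged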
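The two copying behaviours can be sketched as follows. The names list2 and y come from the text; the values and the nested example are made up for illustration.

```python
import copy

list2 = [1, 2, 3]

# Plain assignment: y is just another name for the same list object
y = list2
y[0] = 100
print(list2)   # [100, 2, 3]

# Slicing builds a new (shallow) copy, so list2 is left alone
list2 = [1, 2, 3]
y = list2[:]          # list2.copy() and list(list2) behave the same way
y[0] = 100
print(list2)   # [1, 2, 3]

# For nested lists a shallow copy still shares the inner lists;
# copy.deepcopy copies them as well
nested = [[1, 2], [3, 4]]
deep = copy.deepcopy(nested)
deep[0][0] = 100
print(nested)  # [[1, 2], [3, 4]]
```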

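The difference between deleting an instance attribute and deleting a dictionary entry

# Deleting a key from a dictionary
a = {"a":1,"b":2}
del a["a"]
# Deleting an attribute from an instance (classname and attrname are placeholders)
a = classname()
del a.attrname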

with as

https://www.cnblogs.com/DswCn...

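if __name__ == "__main__":

if __name__ == "__main__":

A Python file can be used in two ways:
first, it can be executed directly as a script;
second, it can be imported into another Python script and called from there (module reuse).
So if __name__ == "__main__":
controls which code runs in each of these two cases.
Code under if __name__ == "__main__": runs only in the first case (when the file is executed directly as a script),
and is not executed when the file is imported into another script. ...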
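A small sketch of the guard in use. The function name main and its return value are illustrative only.

```python
def main():
    """Work that should run only when the file is executed as a script."""
    return "running as a script"

# __name__ is "__main__" when the file is run directly,
# and the module's own name when the file is imported.
if __name__ == "__main__":
    print(main())
```

Importing this file from another script makes main available without printing anything.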

Functions / methods, regular expressions

See here for the basics

import re
line = "jwxddxsw33"
if line == "jxdxsw33":
    print("yep")
else:
    print("no")

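# ^ anchors the match at the start of the string
regex_str = "^j.*"
if re.match(regex_str, line):
    print("yes")
# $ anchors the match at the end of the string
regex_str1 = "^j.*3$"
if re.match(regex_str1, line):
    print("yes")

# . matches exactly one character, so "^j.3$" only fits a 3-character string and does not match here
regex_str1 = "^j.3$"
if re.match(regex_str1, line):
    print("yes")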
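# Greedy matching
regex_str2 = ".*(d.*w).*"
match_obj = re.match(regex_str2, line)
if match_obj:
    print(match_obj.group(1))
# Non-greedy matching
# The ? makes the leading .* stop at the first d
regex_str3 = ".*?(d.*w).*"
match_obj = re.match(regex_str3, line)
if match_obj:
    print(match_obj.group(1))
# * means zero or more repetitions; + means one or more
# ? after a quantifier switches it to non-greedy mode
# + requires at least one occurrence, so .+ means any character, at least once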
line1 = "jxxxxxxdxsssssswwwwjjjww123"
regex_str3 = ".*(w.+w).*"
match_obj = re.match(regex_str3, line1)
if match_obj:
    print(match_obj.group(1))
# {2} requires exactly 2 repetitions of the preceding item; {2,} means 2 or more; {2,5} means at least 2 and at most 5
line2 = "jxxxxxxdxsssssswwaawwjjjww123"
regex_str3 = ".*(w.{3}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

line2 = "jxxxxxxdxsssssswwaawwjjjww123"
regex_str3 = ".*(w.{2}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

line2 = "jxxxxxxdxsssssswbwaawwjjjww123"
regex_str3 = ".*(w.{5,}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

# | means alternation (or)

line3 = "jx123"
regex_str4 = "((jx|jxjx)123)"
match_obj = re.match(regex_str4, line3)
if match_obj:
    print(match_obj.group(1))
    print(match_obj.group(2))
# [] matches any one of the characters inside the brackets
line4 = "ixdxsw123"
regex_str4 = "([hijk]xdxsw123)"
match_obj = re.match(regex_str4, line4)
if match_obj:
    print(match_obj.group(1))
# [0-9]{9}: any digit from 0 to 9, repeated 9 times (9 digits)
line5 = "15955224326"
regex_str5 = "(1[234567][0-9]{9})"
match_obj = re.match(regex_str5, line5)
if match_obj:
    print(match_obj.group(1))
# [^1]{9}
line6 = "15955224326"
regex_str6 = "(1[234567][^1]{9})"
match_obj = re.match(regex_str6, line6)
if match_obj:
    print(match_obj.group(1))

# [.*]: inside brackets, . and * stand for the literal characters . and *
line7 = "1.*59224326"
regex_str7 = "(1[.*][^1]{9})"
match_obj = re.match(regex_str7, line7)
if match_obj:
    print(match_obj.group(1))

# \s matches a whitespace character
line8 = "你 好"
regex_str8 = r"(你\s好)"
match_obj = re.match(regex_str8, line8)
if match_obj:
    print(match_obj.group(1))

# \S matches any character that is not whitespace
line9 = "你真好"
regex_str9 = r"(你\S好)"
match_obj = re.match(regex_str9, line9)
if match_obj:
    print(match_obj.group(1))

# \w matches a word character. Unlike ., in ASCII terms it means [A-Za-z0-9_]
# (on Python 3 str patterns it also matches Unicode letters unless re.ASCII is set)
line9 = "你adsfs好"
regex_str9 = r"(你\w\w\w\w\w好)"
match_obj = re.match(regex_str9, line9)
if match_obj:
    print(match_obj.group(1))

line10 = "你adsf_好"
regex_str10 = r"(你\w\w\w\w\w好)"
match_obj = re.match(regex_str10, line10)
if match_obj:
    print(match_obj.group(1))
# \W (uppercase) matches anything that is not a word character
line11 = "你 好"
regex_str11 = r"(你\W好)"
match_obj = re.match(regex_str11, line11)
if match_obj:
    print(match_obj.group(1))

# Unicode range: [\u4E00-\u9FA5] matches Chinese characters
line12= "鏡心的小樹屋"
regex_str12= "([\u4E00-\u9FA5]+)"
match_obj = re.match(regex_str12,line12)
if match_obj:
    print(match_obj.group(1))

print("----- greedy matching -----")
line13 = "reading in 鏡心的小樹屋"
regex_str13 = ".*([\u4E00-\u9FA5]+樹屋)"
match_obj = re.match(regex_str13, line13)
if match_obj:
    print(match_obj.group(1))

print("----- non-greedy matching -----")
line13 = "reading in 鏡心的小樹屋"
regex_str13 = ".*?([\u4E00-\u9FA5]+樹屋)"
match_obj = re.match(regex_str13, line13)
if match_obj:
    print(match_obj.group(1))

# \d matches a digit
line14 = "XXX出生于2011年"
regex_str14 = r".*(\d{4})年"
match_obj = re.match(regex_str14, line14)
if match_obj:
    print(match_obj.group(1))

regex_str15 = r".*?(\d+)年"
match_obj = re.match(regex_str15, line14)
if match_obj:
    print(match_obj.group(1))

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

###
# Exercise: write a regular expression that validates email addresses.
# Version 1 should accept addresses like:
#[email protected]
#[email protected]
###

import re
addr = "[email protected]"
addr2 = "[email protected]"
def is_valid_email(addr):
    # Local part: letters, underscore, dot; domain: letters and dots
    return bool(re.match(r"[a-zA-Z_.]*@[a-zA-Z.]*",addr))

print(is_valid_email(addr))
print(is_valid_email(addr2))

# Version 2 extracts the name from an address that carries one:
#  [email protected] => Tom Paris
# [email protected] => bob

addr3 = " [email protected]"
addr4 = "[email protected]"

def name_of_email(addr):
    # Assumption: the first group is an optional "<"; \w and \s have their backslashes restored
    r=re.compile(r"^(<?)([\w\s]*)@([\w.]*)$")
    m = r.match(addr)
    if not m:
        return None
    return m.group(2)

print(name_of_email(addr3))
print(name_of_email(addr4))
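One possible shape for the two exercise versions, as a sketch rather than the author's solution. The patterns, the "<Name> user@host" layout, and every address below are assumptions made up for illustration.

```python
import re

# Hypothetical pattern: a non-empty local part, "@", and a dotted domain
EMAIL_RE = re.compile(r"^[\w.]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

def is_valid_email(addr):
    """Return True when addr looks like name@domain.tld."""
    return EMAIL_RE.match(addr) is not None

def name_of_email(addr):
    """Return the display name in '<Name> user@host', else the local part."""
    m = re.match(r"^\s*(?:<([\w\s]+)>\s*)?([\w.]+)@[\w.-]+$", addr)
    if m is None:
        return None
    return m.group(1) or m.group(2)

print(is_valid_email("alice.smith@example.com"))          # True
print(is_valid_email("alice@example"))                    # False
print(name_of_email("<Alice Smith> alice@example.org"))   # Alice Smith
print(name_of_email("bob@example.com"))                   # bob
```

Real-world address validation is considerably looser than this; the sketch only covers the simple shapes the exercise describes.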

Case study

Find the most frequent word in a text

text = "the clown ran after the car and the car ran into the tent and the tent fell down on the clown and the car"
words = text.split()
print(words)

for word in words:  # print every word in the list
    print(word)


# Step 1: build the list of distinct words (de-duplication)
unique_words = list()  # start with an empty list
for word in words:
   if(word not in unique_words):  # use in to test whether an element is in the list
       unique_words.append(word)
print(unique_words)


# Step 2: initialize the list of counts

# [e]*n is a quick way to initialize a list
counts = [0] * len(unique_words)
print(counts)

# Step 3: count the occurrences
for word in words:
    index = unique_words.index(word)

    counts[index] = counts[index] + 1
    print(counts[index])
print(counts)
# Step 4: find the highest count and the word it belongs to
bigcount = None  # None means "no value yet"
bigword = None

for i in range(len(counts)):
    if bigcount is None or counts[i] > bigcount:
        bigword = unique_words[i]
        bigcount = counts[i]
print(bigword,bigcount)

Using a dictionary:

# Case study revisited: find the most frequent word in a text

text = """the clown ran after the car and the car ran into the tent 
        and the tent fell down on the clown and the car"""
words = text.split() # get the list of words

# A dictionary simplifies the steps considerably
# Build the word -> count dictionary
counts = dict() # start with an empty dictionary
for word in words:
    counts[word] = counts.get(word, 0) + 1  # get needs the default 0 so that a word seen for the first time ends up with a count of 1
print(counts)

# Look up the most frequent word in the dictionary
bigcount = None
bigword = None
for word,count in counts.items():
    if bigcount is None or count > bigcount:
        bigword = word
        bigcount = count

print(bigword, bigcount)
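As an aside, the standard library's collections.Counter does the same counting in one step. This is an alternative sketch, not part of the original walkthrough.

```python
from collections import Counter

text = "the clown ran after the car and the car ran into the tent and the tent fell down on the clown and the car"

# Counter maps each word to the number of times it occurs
counts = Counter(text.split())

# most_common(1) returns a list holding the single (word, count) pair with the highest count
bigword, bigcount = counts.most_common(1)[0]
print(bigword, bigcount)   # the 7
```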

A custom weekly pay calculator function

# Use input() to read text typed at the keyboard
# a = input("Enter some text: ")
# print("You entered:", a)

def salary_calculator(): # a function with no parameters
    user = "" # initialize user as an empty string
    print("----Pay calculator----")

    while True:
        user = input("\nEnter your name, or enter 0 to end the report: ")

        if user == "0":
            print("End of report")
            break
        else:
            hours = float(input("Enter the number of hours worked: "))
            payrate = float(input("Enter your hourly pay rate: ￥"))

            if hours <= 40:
                print("Employee name:", user)
                print("Overtime hours: 0")
                print("Overtime pay: ￥0.00")
                regularpay = round(hours * payrate,2) # round keeps two decimal places
                print("Gross pay: ￥" + str(regularpay))

            elif hours > 40:
                overtimehours = round(hours - 40, 2)
                print("Employee name: " + user)
                print("Overtime hours: " + str(overtimehours))

                regularpay = round(40 * payrate, 2)
                overtimerate = round(payrate * 1.5, 2)
                overtimepay = round(overtimehours * overtimerate, 2)
                grosspay = round(regularpay + overtimepay, 2)

                print("Regular pay: ￥" + str(regularpay))
                print("Overtime pay: ￥" + str(overtimepay))
                print("Gross pay: ￥" + str(grosspay))

# Call salary_calculator

salary_calculator()

In this example, watch out for a small pitfall in Python's round function.
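The pitfall most often meant here is that the built-in round rounds exact ties to the nearest even digit and operates on binary floats, so results can differ from schoolbook rounding. A short sketch, with the decimal module shown as one possible workaround (an assumption about what the note refers to):

```python
from decimal import Decimal, ROUND_HALF_UP

# Built-in round() sends exact ties to the nearest even digit
print(round(2.5))   # 2
print(round(3.5))   # 4

# 2.675 cannot be stored exactly as a binary float; the stored value is
# slightly below 2.675, so it rounds down
print(round(2.675, 2))   # 2.67

# Decimal works on the decimal digits themselves and lets you choose the rule
amount = Decimal("2.675").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(amount)   # 2.68
```

For money calculations such as the pay calculator above, building Decimal values from strings avoids the float representation issue entirely.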

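Data structures, functions, conditionals and loops; package management

See which Python packages are popular here: awesome-python

NumPy: array processing / numerical computing extension

ndarray: a multidimensional array object

Data processing with arrays

File input and output for arrays

Multidimensional operations

Linear algebra

Random number generation

Random walks

Advanced NumPy

ndarray object internals

Advanced array manipulation

Broadcasting

Advanced ufunc usage

Structured and record arrays

More about sorting

NumPy's matrix class

Advanced array input and output

Matplotlib: data visualization

Pandas: data analysis

pandas data structures

Essential functionality

Summarizing and computing descriptive statistics

Handling missing data

Hierarchical indexing

Aggregation and grouping

Basic principles of logistic regression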

jupyter

pip3 install jupyter
jupyter notebook

scipy

Descriptive statistics

Scikit-learn: data mining, machine learning

keras: artificial neural networks

tensorflow: neural networks

Install pip, the Python package manager, which is mainly used to install packages from PyPI

Installation tutorial

sudo apt-get install python3-pip
pip3 install numpy
pip3 install scipy
pip3 install matplotlib

Or download the installer script get-pip.py

How to import packages

Because Python is an object-oriented language, the recommended import style is still

import numpy
numpy.array([1,2,3])

Data storage, data manipulation, generating data

Generate a two-dimensional array with 5000 elements, where each element holds a height and a weight

import numpy as np
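One way to build such an array, as a sketch. The means and standard deviations below (170 cm and 65 kg) are assumed values, not taken from the original.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded so that reruns give the same numbers

# Assumed distributions: height ~ N(170 cm, 10), weight ~ N(65 kg, 12)
height = rng.normal(170, 10, 5000)
weight = rng.normal(65, 12, 5000)

# Stack the two 1-D arrays as columns: one row per person
people = np.column_stack((height, weight))
print(people.shape)   # (5000, 2)
print(people[:3])     # first three (height, weight) pairs
```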

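Generate 1000 longitude/latitude positions near (117, 32) and write them to a CSV file

import pandas as pd
import numpy as np

# Any number of equal-length lists or arrays will do
lng = np.random.normal(117,0.20,1000)

lat = np.random.normal(32.00,0.20,1000)

# The dictionary keys become the column names in the CSV
dataframe = pd.DataFrame({"lng":lng,"lat":lat})


# Save the DataFrame as CSV; index controls whether row labels are written (default True)
dataframe.to_csv("data/lng-lat.csv",index = False, sep="," )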

Common NumPy operations

#encoding=utf-8 
import numpy as np 
def main():
    lst = [[1,3,5],[2,4,6]]
    print(type(lst))
    np_lst = np.array(lst)
    print(type(np_lst))
    # A single numpy.array can hold only one data type
    # Set the NumPy data type explicitly
    # Available types include: bool int int8 int16 int32 int64 uint8 uint16 uint32 uint64 float16/32/64 complex64/128
    np_lst = np.array(lst,dtype=np.float64)

    print(np_lst.shape)
    print(np_lst.ndim) # number of dimensions
    print(np_lst.dtype) # data type
    print(np_lst.itemsize) # size of each element in bytes
    print(np_lst.size) # total number of elements

    # numpy array
    print(np.zeros([2,4])) # a 2x4 array of zeros
    print(np.ones([3,5]))

    print("---------Random numbers: rand-------") 
    print(np.random.rand(2,4)) # rand draws random numbers in [0, 1); here a 2x4 array
    print(np.random.rand())
    print("---------Random numbers: randint-------")
    print(np.random.randint(1,10)) # one random integer from 1 to 9 (the upper bound 10 is excluded)
    print(np.random.randint(1,10,3)) # three random integers from 1 to 9
    print("---------Random numbers: randn (standard normal distribution)-------")
    print(np.random.randn(2,4)) # a 2x4 array of floats drawn from the standard normal distribution
    print("---------Random numbers: choice-------")
    print(np.random.choice([10,20,30])) # pick one value at random from 10, 20, 30
    print("---------Distributions-------")
    print(np.random.beta(1,10,100)) # 100 samples from a beta distribution
if __name__ == "__main__":
    main()

Examples of common functions

Compute the mean of every attribute in the red wine data (that is, the mean of each column)
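A sketch of the per-column mean. The small array below is a made-up stand-in for the wine data, which is not included in the text.

```python
import numpy as np

# Made-up stand-in for the wine data: 4 samples (rows) x 3 attributes (columns)
wine = np.array([
    [7.4, 0.70, 9.4],
    [7.8, 0.88, 9.8],
    [7.8, 0.76, 9.8],
    [11.2, 0.28, 9.8],
])

# axis=0 averages down the rows, giving one mean per column
column_means = wine.mean(axis=0)
print(column_means)
```

With axis=1 the same call would instead return one mean per row (per sample).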

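Data analysis tools: data visualization

Exploring data
Presenting data
Data ---> story

matplotlib plotting basics

Plotting function curves

Setting figure details

Case study: visualizing sales records

Bar charts

Multiple plots

Pie charts

Scatter plots

Histograms

seaborn: a data visualization package

Scatter plots of categorical data

Box plots of categorical data

Multivariate plots

For more, see here: data visualization

Installing matplotlib

Note that the following error may be raised here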

ImportError: No module named "_tkinter", please install the python3-tk package

You need to install python3-tk

More examples: line plots

Scatter plots & bar charts

Data analysis

pandas

High-level data operations

The DataFrame data structure

import pandas as pd
brics = pd.read_csv("/home/wyc/study/python_lession/python_lessions/數(shù)據(jù)分析/brics.csv",index_col = 0)

Basic pandas operations


import numpy as np
import pandas as pd

def main():

    #Data Structure
    s = pd.Series([i*2 for i in range(1,11)])
    print(type(s))

    dates = pd.date_range("20170301",periods=8)
    df = pd.DataFrame(np.random.randn(8,5),index=dates,columns=list("ABCDE"))
    print(df)
    # basic

    print(df.head(3))
    print(df.tail(3))
    print(df.index)
    print(df.values)
    print(df.T)
    # print(df.sort_values(by="C"))
    print(df.sort_index(axis=1,ascending=False))
    print(df.describe())

    #select
    print(type(df["A"]))
    print(df[:3])
    print(df["20170301":"20170304"])
    print(df.loc[dates[0]])
    print(df.loc["20170301":"20170304",["B","D"]])
    print(df.at[dates[0],"C"])


    print(df.iloc[1:3,2:4])
    print(df.iloc[1,4])
    print(df.iat[1,4])

    print(df[(df.B>0) & (df.A<0)])
    print(df[df>0])
    print(df[df["E"].isin([1,2])])

    # Set
    s1 = pd.Series(list(range(10,18)),index = pd.date_range("20170301",periods=8))
    df["F"]= s1
    print(df)
    df.at[dates[0],"A"] = 0
    print(df)
    df.iat[1,1] = 1
    df.loc[:,"D"] = np.array([4]*len(df))
    print(df)

    df2 = df.copy()
    df2[df2>0] = -df2
    print(df2)

    # Missing Value
    df1 = df.reindex(index=dates[:4],columns = list("ABCD") + ["G"])
    df1.loc[dates[0]:dates[1],"G"]=1
    print(df1)
    print(df1.dropna())
    print(df1.fillna(value=1))

    # Statistic
    print(df.mean())
    print(df.var())

    s = pd.Series([1,2,4,np.nan,5,7,9,10],index=dates)
    print(s)
    print(s.shift(2))
    print(s.diff())
    print(s.value_counts())
    print(df.apply(np.cumsum))
    print(df.apply(lambda x:x.max()-x.min()))

    #Concat
    pieces = [df[:3],df[-3:]]
    print(pd.concat(pieces))

    left = pd.DataFrame({"key":["x","y"],"value":[1,2]})
    right = pd.DataFrame({"key":["x","z"],"value":[3,4]})
    print("LEFT",left)
    print("RIGHT", right)
    print(pd.merge(left,right,on="key",how="outer"))
    df3 = pd.DataFrame({"A": ["a","b","c","b"],"B":list(range(4))})
    print(df3.groupby("A").sum())



if __name__ == "__main__":
    main()

# First create a dictionary called gdp
gdp = {"country":["United States", "China", "Japan", "Germany", "United Kingdom"],
       "capital":["Washington, D.C.", "Beijing", "Tokyo", "Berlin", "London"],
       "population":[323, 1389, 127, 83, 66],
       "gdp":[19.42, 11.8, 4.84, 3.42, 2.5],
       "continent":["North America", "Asia", "Asia", "Europe", "Europe"]}

import pandas as pd
gdp_df = pd.DataFrame(gdp)
print(gdp_df)

# The index option adds custom row labels
# The columns option sets the order of the columns
gdp_df = pd.DataFrame(gdp, columns = ["country", "capital", "population", "gdp", "continent"],index = ["us", "cn", "jp", "de", "uk"])
print(gdp_df)

# Changing the row and column labels
# They can also be changed directly through index and columns
gdp_df.index=["US", "CN", "JP", "DE", "UK"]
gdp_df.columns = ["Country", "Capital", "Population", "GDP", "Continent"]
print(gdp_df)
# Add a rank column saying that their GDP is in the top 5
gdp_df["rank"] = "Top5 GDP"
# Add a land-area variable, in millions of square kilometers (data source: http://data.worldbank.org/)
gdp_df["Area"] = [9.15, 9.38, 0.37, 0.35, 0.24]
print(gdp_df)


# A minimal Series
series = pd.Series([2,4,5,7,3],index = ["a","b","c","d","e"])
print(series)
# Accessing a variable with the dot operator returns a pandas Series
# Using dot access in the boolean filters later on keeps the code short
# US,...,UK are the index labels
print(gdp_df.GDP)


# The index can be inspected directly
print(gdp_df.GDP.index)
# The type is pandas.core.series.Series
print(type(gdp_df.GDP))

# Returns a boolean Series, used heavily in the DataFrame boolean indexing shown later
print(gdp_df.GDP > 4)

# A Series can also be seen as a fixed-length, ordered dictionary, and some dictionary operations work on it
gdp_dict = {"US": 19.42, "CN": 11.80, "JP": 4.84, "DE": 3.42, "UK": 2.5}
gdp_series = pd.Series(gdp_dict)
print(gdp_series)

# Check whether the label "US" is in gdp_series

print("US" in gdp_series)
# Select columns with the variable name followed by [[]]
print(gdp_df[["Country"]])
# Several columns can be selected at once
print(gdp_df[["Country", "GDP"]])


# Using a single [] produces a Series
print(type(gdp_df["Country"]))
# Row selection works much like a 2-D array
# With [], slicing is the only way to select rows
print(gdp_df[2:5]) # the end index is not included!

# The loc method
# The example above selected rows by position. Can rows be selected by label instead?
# loc is exactly the method for label-based selection
print(gdp_df.loc[["JP","DE"]])
# The example above selects all columns
# We can add the column labels we need
print(gdp_df.loc[["JP","DE"],["Country","GDP","Continent"]])

# To select all rows, use :
print(gdp_df.loc[:,["Country","GDP","Continent"]])

# Equivalent to gdp_df.loc[["JP","DE"]]
print(gdp_df.iloc[[2,3]])

print(gdp_df.loc[["JP","DE"],["Country", "GDP", "Continent"]])
print(gdp_df.iloc[[2,3],[0,3,4]])

# Select the Asian countries; the two commands below give the same result
print(gdp_df[gdp_df.Continent == "Asia"])

print(gdp_df.loc[gdp_df.Continent == "Asia"])
# Select the European countries with a GDP above 3 trillion US dollars
print(gdp_df[(gdp_df.Continent == "Europe") & (gdp_df.GDP > 3)])

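Missing value handling; data mining

Case study: the Iris data set
Let's look at the classic iris data:

The iris flower data set, from the UCI Machine Learning Repository

It was originally collected by Edgar Anderson

Four features are used for quantitative analysis of the samples: the length and width of the sepal and of the petal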

#####
# Loading and inspecting the data
#####
import pandas as pd
# Store the column labels in a list
col_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
# Read the data and assign a label to each column
iris = pd.read_csv("data/iris.txt", names = col_names)

# Use head/tail to look at the start and end of the data

print(iris.head(10))

# Use the info method for an overview of the data
iris.info()

# shape gives the number of rows and columns of the DataFrame
# iris has 150 observations and 5 variables
print(iris.shape)
# species is a categorical variable
# The unique method lists the species names in the Series
print(iris.species.unique())


# Count the number of samples per species
# Use the Series value_counts method
print(iris.species.value_counts())

# Select the petal data, i.e. the petal_length and petal_width columns
# Method 1: use [[ ]]
petal = iris[["petal_length","petal_width"]]
print(petal.head())
# Method 2: use .loc[ ]
petal = iris.loc[:,["petal_length","petal_width"]]
print(petal.head())
# Method 3: use .iloc[ ]
petal = iris.iloc[:,2:4]
print(petal.head())

# Select the rows with index 5 through 10
# Method 1: use []
print(iris[5:11])
# Method 2: use .iloc[]
print(iris.iloc[5:11,:])

# Select the rows whose species is Iris-versicolor
versicolor = iris[iris.species == "Iris-versicolor"]
print(versicolor.head())


####
# Visualizing the data
####
# Scatter plot
import matplotlib.pyplot as plt
# Start with a scatter plot: petal length on the x axis, petal width on the y axis
# What do we observe?
iris.plot(kind = "scatter", x="petal_length", y="petal_width")
# plt.show()

# Use boolean indexing to get the data for each of the three species
setosa = iris[iris.species == "Iris-setosa"]
versicolor = iris[iris.species == "Iris-versicolor"]
virginica = iris[iris.species == "Iris-virginica"]

ax = setosa.plot(kind="scatter", x="petal_length", y="petal_width", color="Red", label="setosa", figsize=(10,6))
versicolor.plot(kind="scatter", x="petal_length", y="petal_width", color="Green", ax=ax, label="versicolor")
virginica.plot(kind="scatter", x="petal_length", y="petal_width", color="Orange", ax=ax, label="virginica")
plt.show()

# Box plot
# Use mean() to get the mean petal width
print(iris.petal_width.mean())
# Use median() to get the median petal width
print(iris.petal_width.median())
# describe summarizes the numeric variables
print(iris.describe())


# Draw a box plot of petal width
# A box plot shows the median, quartiles, maximum and minimum of the data
iris.petal_width.plot(kind="box")
# plt.show()

# Group by species and draw a box plot of petal width for each species
iris[["petal_width","species"]].boxplot(grid=False,by="species",figsize=(10,6))
# plt.show()

setosa.describe()

# What are the minimum and mean of each attribute (sepal and petal length and width) for each species? (Hint: use the min and mean methods.)
print(iris.groupby(["species"]).agg(["min","mean"]))

# For each species, count the rows whose sepal length (sepal_length) is greater than 6 cm.
# Method 1
print(iris[iris["sepal_length"]> 6].groupby("species").size())
# Method 2
def more_len(group,length=6):
    return len(group[group["sepal_length"] > length])
print(iris.groupby(["species"]).apply(more_len,6))

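Missing value handling and pivot tables

Missing value handling: the fillna() method in pandas

pandas represents missing data as NaN (not a number). The methods for handling missing data are:

dropna: removes NaN data

fillna: fills in a default value

isnull: returns an object of boolean values showing which entries are NaN and which are not

notnull: the negation of isnull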
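The four methods can be tried on a tiny frame. The column names and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with two missing ages and one missing cabin
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "cabin": ["C85", "E46", None, "B20"],
})

print(df.isnull())    # True where a value is missing
print(df.notnull())   # the opposite of isnull
print(df.dropna())    # keeps only the rows with no missing value

# Fill the missing ages with the median of the observed ages
filled = df["age"].fillna(df["age"].median())
print(filled.tolist())   # [22.0, 28.5, 35.0, 28.5]
```

Note that dropna keeps a single row here, which illustrates why dropping rows can throw away a lot of otherwise usable data.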

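Pivot tables: the pivot_table function in pandas

We use a case study, the Titanic data, to illustrate these two topics
Missing value handling:

Real data often has missing values in some variables.

Here, Cabin has more than 70% missing values, so we can consider dropping this variable altogether. -- delete a column

For an important variable like Age, with about 20% missing values, we can consider filling the gaps with the median. -- fill in missing values

We generally do not recommend removing rows that contain missing values, because the other, non-missing variables may carry useful information. -- delete rows with missing values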

# Load the commonly used packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Read the data
titanic_df = pd.read_csv("data/titanic.csv")

# Look at the first five rows
print(titanic_df.head())

# Descriptive statistics
# describe shows the distribution of selected variables
# Because Survived is a 0-1 variable, its mean is the share of passengers who survived, which is a very handy trick
print(titanic_df[["Survived","Age","SibSp","Parch"]].describe())

# Use include=[object] to look at the categorical variables
# count: number of non-missing values
# unique: number of distinct values
# top: most frequent value
# freq: number of times the most frequent value occurs

print(titanic_df.describe(include=[object]))

# How are passengers distributed across the classes?
# Method 1: value_counts
# Look at the distribution across classes
# First class: 24%; second class: 21%; third class: 55%
# value_counts counts frequencies, len() gives the number of rows
print(titanic_df.Pclass.value_counts() / len(titanic_df))
# There are 891 passengers in total
# Age has 714 non-missing values and Cabin only 204. We will see how to handle missing values
print(titanic_df.info())

# Method 2: groupby
# sort_values sorts the result
print((titanic_df.groupby("Pclass").agg("size")/len(titanic_df)).sort_values(ascending=False))

# Fill the missing values in the age data
# Use the median age of all passengers directly
# Before processing, look at the statistics of the Age column
print(titanic_df.Age.describe())

# Reload the original data
titanic_df=pd.read_csv("data/titanic.csv")

# Compute the median age of all passengers
age_median1 = titanic_df.Age.median()

# Fill the missing values with fillna and assign the result back to the Age column of titanic_df
titanic_df["Age"] = titanic_df["Age"].fillna(age_median1)
# Look at the statistics of the Age column
print(titanic_df.Age.describe())
#print(titanic_df.info())

# Take sex into account: fill with the median age of male and female passengers separately
# Reload the original data
titanic_df=pd.read_csv("data/titanic.csv")
# Compute the median age by sex; the result is a Series indexed by Sex
age_median2 = titanic_df.groupby("Sex").Age.median()
# Set Sex as the index
titanic_df.set_index("Sex",inplace=True)
# Fill the missing values with fillna, matched on the index values
titanic_df["Age"] = titanic_df["Age"].fillna(age_median2)
# Reset the index, i.e. remove Sex as the index
titanic_df.reset_index(inplace=True)
# Look at the statistics of the Age column
print(titanic_df.Age.describe())

# Take both sex and class into account

# Reload the original data
titanic_df=pd.read_csv("data/titanic.csv")
# Compute the median age by class and sex; the result is a Series indexed by Pclass, Sex
age_median3 = titanic_df.groupby(["Pclass", "Sex"]).Age.median()
# Set Pclass, Sex as the index; inplace=True modifies titanic_df directly
titanic_df.set_index(["Pclass","Sex"], inplace=True)
print(titanic_df)

# Fill the missing values with fillna, matched on the index values
titanic_df["Age"] = titanic_df["Age"].fillna(age_median3)
# Reset the index, i.e. remove Pclass, Sex as the index
titanic_df.reset_index(inplace=True)

# Look at the statistics of the Age column
print(titanic_df.Age.describe())

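Discretizing continuous variables

Discretizing continuous variables is a common technique in modeling

Discretization means splitting the range of a variable into several smaller intervals and representing all observations that fall into the same interval with the same symbol

Take age as an example: the minimum is 0.42 (an infant) and the maximum is 80. To produce five levels, we can use the cut or qcut function

cut splits the age range into 5 intervals of equal width, while qcut chooses the intervals so that each contains the same number of observations (quintiles). The demonstration here uses cut.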
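The difference between the two functions shows up on a small sample. The ages below are made up; only the minimum 0.42 and maximum 80 echo the text.

```python
import pandas as pd

# Made-up ages; the Titanic column itself is not used here
ages = pd.Series([0.42, 2, 9, 18, 21, 25, 28, 30, 36, 45, 52, 80])

# cut: 5 intervals of equal width across the observed range
equal_width = pd.cut(ages, 5)
print(equal_width.value_counts(sort=False))

# qcut: 5 intervals holding roughly equal numbers of observations
equal_count = pd.qcut(ages, 5)
print(equal_count.value_counts(sort=False))
```

With cut the interval widths match but the counts differ; with qcut the counts are balanced and the widths differ.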



###
# Which factors determine the probability of survival?
###

# Class and survival probability
# Compute the survival probability for each class
# Method 1: the classic group-aggregate-compute pattern
# Note: because Survived is a 0-1 variable, its mean is the share of survivors
print(titanic_df[["Pclass", "Survived"]].groupby("Pclass").mean() 
    .sort_values(by="Survived", ascending=False))

# Method 2: the pivot_table function does the same job (new in this lesson)
# values: the values the calculation is applied to after grouping; here we apply mean
# index: the grouping variable
# aggfunc: the function to apply
print(titanic_df.pivot_table(values="Survived", index="Pclass", aggfunc=np.mean))

# Draw a bar chart of class against survival probability
# Use sns.barplot; the y axis shows the point estimate of the mean of Survived
#sns.barplot(data=titanic_df,x="Pclass",y="Survived",ci=None)
# plt.show()

#####
# Sex and survival probability
#####
# Method 1: groupby
print(titanic_df[["Sex", "Survived"]].groupby("Sex").mean() 
    .sort_values(by="Survived", ascending=False))
# Method 2: pivot_table
print(titanic_df.pivot_table(values="Survived",index="Sex",aggfunc=np.mean))

# Draw a bar chart
#sns.barplot(data=titanic_df,x="Sex",y="Survived",ci=None)
#plt.show()


#####
# Class and sex together, and their relationship to survival probability
#####
# Method 1: groupby
print(titanic_df[["Pclass","Sex", "Survived"]].groupby(["Pclass", "Sex"]).mean())

# Method 2: pivot_table
titanic_df.pivot_table(values="Survived", index=["Pclass", "Sex"], aggfunc=np.mean)

# Method 3: pivot_table
# columns names a second categorical variable, laid out across the columns instead of the rows, which is why the argument is called columns
print(titanic_df.pivot_table(values="Survived",index="Pclass",columns="Sex",aggfunc=np.mean))

# Draw a bar chart with sns.barplot
#sns.barplot(data=titanic_df,x="Pclass",y="Survived",hue="Sex",ci=None)
# plt.show()

# Draw a line chart with sns.pointplot
sns.pointplot(data=titanic_df,x="Pclass",y="Survived",hue="Sex",ci=None)
#plt.show()

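####
# Age and survival
####
# Unlike the categorical variables above (class, sex), age is a continuous variable

# Histograms of the age distribution for survivors and victims
# Use FacetGrid().map() from seaborn to produce good-looking figures quickly
# col="Survived" draws the age histograms of survivors and victims side by side in one row
(sns.FacetGrid(titanic_df,col="Survived")
    .map(plt.hist,"Age",bins=20,density=True))
# plt.show()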


###
# Discretizing a continuous variable
###
# We use the cut function
# Each interval has the same width, about 16 years

titanic_df["AgeBand"] = pd.cut(titanic_df["Age"],5)
print(titanic_df.head())

# Count the passengers in each age interval
# Method 1: value_counts(); sort=False leaves the result unsorted
print(titanic_df.AgeBand.value_counts(sort=False))

# Method 2: pivot_table
print(titanic_df.pivot_table(values="Survived",index="AgeBand",aggfunc="count"))

# Survival rate in each age interval
print(titanic_df.pivot_table(values="Survived",index="AgeBand",aggfunc=np.mean))
sns.barplot(data=titanic_df,x="AgeBand",y="Survived",ci=None)
plt.xticks(rotation=60)
plt.show()


####
# Age, sex and survival probability
####
# Survival probability of men and women in each age interval
print(titanic_df.pivot_table(values="Survived",index="AgeBand", columns="Sex", aggfunc=np.mean))

sns.pointplot(data=titanic_df, x="AgeBand", y="Survived", hue="Sex", ci=None)
plt.xticks(rotation=60)

plt.show()

####
# Age, class, sex and survival probability
####
titanic_df.pivot_table(values="Survived",index="AgeBand", columns=["Sex", "Pclass"], aggfunc=np.mean)



# Recap: sns.pointplot of class, sex and survival probability
sns.pointplot(data=titanic_df, x="Pclass", y="Survived", hue="Sex", ci=None)

Artificial neural networks

https://keras.io

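Machine learning: feature engineering

What exactly is feature engineering?

Case study: demand for shared bikes
Feature engineering

Data and features set the upper limit of what machine learning can achieve; a good model merely approaches that limit

Our goal is to extract as much useful information as possible from the raw data. Raw data often cannot be used directly as model variables.

Feature engineering is the process of using domain knowledge about the data to create features that allow machine learning algorithms to perform at their best.

Handling date variables

Take datetime as an example: this feature carries two important pieces of information, the date and the time of day. From the date we can further derive the month and the day of the week.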

# Which factors determine the number of bike rentals?
# Import the data analysis packages
import numpy as np
import pandas as pd

# Import the plotting packages
import matplotlib.pyplot as plt
import seaborn as sns

# Import the tools for handling date and time variables
import calendar
from datetime import datetime

# Read the data
BikeData = pd.read_csv("data/bike.csv")


#####
# Get to know the size of the data
# Look at the first/last few rows
# Check the data types and missing values
####
# Step 1: check the size of the data

print(BikeData.shape)

# Step 2: look at the first 10 rows
print(BikeData.head(10))


# Step 3: check the data types and missing values
# Most variables are integers; temperature and wind speed are floats
# datetime has type object; we process it further below
# There are no missing values!
print(BikeData.info())


####
# Handling date variables
####

# Take one element of datetime as an example; it is a string, so we can split it with the split method
# Date plus timestamp is a very common data format
ex = BikeData.datetime[1]
print(ex)

print(type(ex))

# Split the string with split
ex.split()

# Get the date part
ex.split()[0]

# First get the date: define a function that uses split to separate the date-plus-timestamp string into date and time
def get_date(x):
    return(x.split()[0])

# Use the pandas apply method to apply get_date to datetime
BikeData["date"] = BikeData.datetime.apply(get_date)

print(BikeData.head())

# Build the rental hour (24-hour clock)
# To get the hour we need to split further
print(ex.split()[1])
# ":" is the separator
print(ex.split()[1].split(":")[0])

# Wrap the steps above in a get_hour function, then apply it to the datetime feature
def get_hour(x):
    return (x.split()[1].split(":")[0])
# Use apply to get the hour for the whole column
BikeData["hour"] = BikeData.datetime.apply(get_hour)

print(BikeData.head())

####
# Build the day of the week for each date
####
# First look at day_name from calendar, which lists Monday through Sunday
print(calendar.day_name[:])

# Get the date as a string
dateString = ex.split()[0]

# Use strptime from datetime to convert the string to a datetime object
# Note that datetime here is the class imported from the datetime module, not the column in our DataFrame
# "%Y-%m-%d" says the input date is ordered year-month-day; other data may use month-day-year
print(dateString)
dateDT = datetime.strptime(dateString,"%Y-%m-%d")
print(dateDT)
print(type(dateDT))

# Then use the weekday method to get the day of the week
# It is an integer from 0 to 6: Monday is 0, Sunday is 6
week_day = dateDT.weekday()

print(week_day)
# Map the number to the name of the day
print(calendar.day_name[week_day])


# Now combine the steps above into one function that returns the day of the week
def get_weekday(dateString):
    week_day = datetime.strptime(dateString,"%Y-%m-%d").weekday()
    return (calendar.day_name[week_day])

# Use apply to get the day of the week for the whole date column
BikeData["weekday"] = BikeData.date.apply(get_weekday)

print(BikeData.head())


####
# Build the month for each date
####

# Following the same pattern, we can extract the month of each date
# Note: month is an attribute, not a function, so it takes no parentheses

def get_month(dateString):
    return (datetime.strptime(dateString,"%Y-%m-%d").month)
# Use apply to get the month for the whole date column
BikeData["month"] = BikeData.date.apply(get_month)
print(BikeData.head())
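As an alternative sketch, pandas can parse the whole column once with pd.to_datetime and expose the same parts through the .dt accessor. The two timestamps below are made up, and this is a swapped-in technique rather than the method used above.

```python
import pandas as pd

# Made-up timestamps in the same "date time" layout as the datetime column
stamps = pd.Series(["2011-01-01 01:00:00", "2011-01-03 17:00:00"])

# Parse the strings once, then read the parts through the .dt accessor
parsed = pd.to_datetime(stamps)
features = pd.DataFrame({
    "date": parsed.dt.date,
    "hour": parsed.dt.hour,            # an integer, unlike the string from get_hour
    "weekday": parsed.dt.day_name(),
    "month": parsed.dt.month,
})
print(features)
```

This avoids calling a Python function once per row, which tends to matter on larger data sets.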

####
# Data visualization example
####

# Draw a box plot of the rental count, and box plots of the count by hour of day (24 hours)
# Set the figure size
fig = plt.figure(figsize=(18,5))

# Add the first subplot
# Box plot of the rental count
ax1 = fig.add_subplot(121)
sns.boxplot(data=BikeData,y="count")
ax1.set(ylabel="Count",title="Box Plot On Count")


# Add the second subplot
# Box plots of the rental count by hour
# Business insight: how does the rental count change over the day?
ax2 = fig.add_subplot(122)
sns.boxplot(data=BikeData,y="count",x="hour")
ax2.set(xlabel="Hour",ylabel="Count",title="Box Plot On Count Across Hours")
plt.show()

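Machine learning

Machine learning is a branch of artificial intelligence. Its goal is to solve problems by using algorithms to build models from existing data (learning).

Machine learning is an interdisciplinary field that draws on probability and statistics, optimization, computer programming, and more.

Extremely wide range of uses: from predicting credit card default risk and the five-year survival probability of cancer patients to self-driving cars, machine learning is at work.

Highly valued: in decision analysis, people increasingly rely on quantitative approaches to judge how good a decision is.

Supervised learning:

Supervised learning: learn a function from a given training data set so that, when new data arrives, the function can be used to predict the result. The training data for supervised learning must include both inputs and outputs, also called features and targets.

Supervised learning splits further into two main kinds of problem: regression (predicting a continuous value) and classification. House price prediction is a typical regression problem, where the target, the price, is a continuous variable. Credit card default prediction is a typical classification problem, where the target, whether the customer defaults, is a categorical variable.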
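A toy illustration of "learning a function" from inputs and outputs, using a least-squares line fit with NumPy. The data is made up and generated from y = 2x + 1; a real project would more likely use a library such as scikit-learn.

```python
import numpy as np

# Made-up training data: feature x and a continuous target y (a regression problem)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# "Training": fit a straight line y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# "Prediction": apply the learned function to a new, unseen input
x_new = 6.0
prediction = slope * x_new + intercept
print(prediction)   # close to 13.0
```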

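Unsupervised learning

Unsupervised learning: the training set has no human-labeled results. We explore the patterns in the input data itself.

Examples of unsupervised learning include image clustering, topic discovery in articles, gene sequence analysis, and dimensionality reduction of high-dimensional data.

Case study: Boston-area house prices
Note that the Boston house price data was one of the toy datasets in scikit-learn and could be loaded directly with datasets.load_boston(); the function was removed in scikit-learn 1.2, so this only works on older versions.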

Learning resources

Machine learning tutorials and notes
https://www.datacamp.com/
http://matplotlib.org/2.1.0/g...
https://www.kesci.com/
https://keras.io

Competitions

https://www.kaggle.com/
Tianchi big data competition compared with Kaggle and DataCastle: which is better?
Tianchi beginner practice competitions

References

The Python Tutorial
A summary of several ways to write CSV files in Python
Common problems when installing third-party libraries
imooc: Applications of Python in data science
imooc: Python data analysis, basic techniques
Python for Data Analysis
DataLearningTeam/PythonData
Visualization
Scientific computing with NumPy
Descriptive statistics with Python
Documentation of scikit-learn 0.19.1
Seaborn tutorial
Feature engineering