用Python處理csv文件的一些小玩意兒

wuyumin 發(fā)布于2019-07-25 11:12 / 2436人閱讀

摘要：整理了一些個人在利用處理文件時經(jīng)常用到的一些自定義方法，放在這里主要方便自己查閱，也可以給其他人做參考目錄輸出文件某列的匹配不匹配的記錄調(diào)整文件的列的順序轉(zhuǎn)換器抽取特定列除去完全重復(fù)的記錄根據(jù)列名排序鍵值互換輸出文件某列的匹配不匹配的記錄主

Python CSV Toolkit

整理了一些個人在利用python處理csv文件時經(jīng)常用到的一些自定義方法，放在這里主要方便自己查閱，也可以給其他人做參考

輸出CSV文件某列的匹配/不匹配的記錄

調(diào)整csv文件的列的順序

CSV轉(zhuǎn)換器

抽取特定列

除去完全重復(fù)的記錄

根據(jù)列名排序

鍵值互換

輸出CSV文件某列的匹配/不匹配的記錄

主要用于從csv文件中抽取出匹配特定列的特定字段集合的記錄，比如現(xiàn)有這么一個csv文件（表格化后）

name	age	sex
Danny	24	male
Daisy	23	female
Lancelot	23	unknown
Lydia	21	female
...	...	...

需要輸出其中age為23的記錄到新的csv文件，則我們可以先把23這么個關(guān)鍵詞用一個列表收集起來，然后通過下列代碼從csv文件中找出所有符合條件的記錄并輸出

import sys
import csv

# try to fix "_csv.Error: field larger than field limit (131072)"
csv.field_size_limit(sys.maxint)

# write to common csv file with delimiter ","
# output the rows with matched id in id_list to a new csv file
def csv_match(id_list,key,input_file,output_file):
    with open(input_file, "rb") as f:
        reader = csv.DictReader(f)
        rows = [row for row in reader if row[key] in set(id_list)]

    header = rows[0].keys()
    with open(output_file, "w") as f:
        f.write(",".join(header))
        f.write("
")
        for data in rows:
            f.write(",".join(data[h] for h in header))
            f.write("
")

調(diào)用的時候：

lst=["23"]
csv_match(lst,"age","in.csv","out.csv")

key為需要匹配的列名，另外我們也可以提取不符合該條件的記錄，‘取個反’就行了

# output the rows with not matched id in id_list to a new csv file
def csv_not_match(id_list, key, input_file, output_file):
    with open(input_file, "rb") as f:
        reader = csv.DictReader(f)
        rows = [row for row in reader if not row[key] in set(id_list)]

    header = rows[0].keys()
    with open(output_file, "w") as f:
        f.write(",".join(header))
        f.write("
")
        for data in rows:
            f.write(",".join(data[h] for h in header))
            f.write("
")

對于需要判斷csv文件中多個列的值的情況，只需修改對應(yīng)的判別條件和傳入?yún)?shù)情況即可

# output the rows with matched key1 or key2 in refer_list to a new csv file
# @params
# refer_list: the list referred to
# key,key2: column name of csv file to check the value in the refer_list or not
def csv_match2(refer_list, key1, key2, input_file, output_file):
    with open(input_file, "rb") as f:
        reader = csv.DictReader(f)
        rows = [row for row in reader if (row[key1] in set(refer_list)) or (row[key2] in set(refer_list))]

    header = rows[0].keys()
    with open(output_file, "w") as f:
        f.write(",".join(header))
        f.write("
")
        for data in rows:
            f.write(",".join(data[h] for h in header))
            f.write("
")

調(diào)整csv文件的列的順序

有時候我們輸出的或者拿到的csv文件的列的順序不夠‘人性化’，為了讓我們看起來更加直觀，更舒服一點(diǎn)，我們可以按照我們的需要調(diào)整列的順序

import csv
# reorder the column of the csv file to what you want
def csv_reorder(in_file, out_file,lst_order):
    with open(in_file, "rb") as infile, open(out_file, "wb") as outfile:
        fieldnames=lst_order
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in csv.DictReader(infile):
            writer.writerow(row)

其中lst_order為我們需要的列名順序，用list存儲，舉個例子

season_id,league_name,league_size
2003,scottish-premiership,12
2016,1-hnl,10
2004,alka-superligaen,12
2006,allsvenskan,14
1992,premier-league,22
...

現(xiàn)在我們想調(diào)整他的順序，按照league_name,season_id,league_size的順序重新組合一下
則調(diào)用

lst_order = ["league_name","season_id","league_size"]
csv_reorder("leagues_size.csv", "leagues_size_new.csv", lst_order)

得到結(jié)果

league_name,season_id,league_size
scottish-premiership,2003,12
1-hnl,2016,10
alka-superligaen,2004,12
allsvenskan,2006,14
premier-league,1992,22
...

CSV轉(zhuǎn)換器

這個主要是用來進(jìn)行csv和python的一些內(nèi)置的容器例如list,dict之類的轉(zhuǎn)換，包括一些特殊的多級字典，或者是嵌套列表的字典等等，這里只是把他們打個包放在一起，具體的可以參照我之前寫的一篇文章

import csv

#---------------------------------------------------csv <--> dict--------------------------------------------

# convert csv file to dict
# @params:
# key/value: the column of original csv file to set as the key and value of dict
def csv2dict(in_file,key,value):
    new_dict = {}
    with open(in_file, "rb") as f:
        reader = csv.reader(f, delimiter=",")
        fieldnames = next(reader)
        reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=",")
        for row in reader:
            new_dict[row[key]] = row[value]
    return new_dict


# convert csv file to dict(key-value pairs each row)
# default: set row[0] as key and row[1] as value of the dict
def row_csv2dict(csv_file):
    dict_club={}
    with open(csv_file)as f:
        reader=csv.reader(f,delimiter=",")
        for row in reader:
            dict_club[row[0]]=row[1]
    return dict_club

# write dict to csv file
# write each key/value pair on a separate row
def dict2csv(dict, file):
    with open(file, "wb") as f:
        w = csv.writer(f)
        # write each key/value pair on a separate row
        w.writerows(dict.items())

# write dict to csv file
# write all keys on one row and all values on the next
def dict2csv2(dict, file):
    with open(file, "wb") as f:
        w = csv.writer(f)
        # write all keys on one row and all values on the next
        w.writerow(dict.keys())
        w.writerow(dict.values())

# build a dict of list like {key:[...element of lst_inner_value...]}
# key is certain column name of csv file
# the lst_inner_value is a list of specific column name of csv file
def build_list_dict(source_file, key, lst_inner_value):
    new_dict = {}
    with open(source_file, "rb")as csv_file:
        data = csv.DictReader(csv_file, delimiter=",")
        for row in data:
            for element in lst_inner_value:
                new_dict.setdefault(row[key], []).append(row[element])
    return new_dict
# sample:
# test_club=build_list_dict("test_info.csv","season",["move from","move to"])
# print test_club


# build specific nested dict from csv files
# @params:
#   source_file
#   outer_key:the outer level key of nested dict
#   inner_key:the inner level key of nested dict,and rest key-value will be store as the value of inner key
def build_level2_dict(source_file,outer_key,inner_key):
    new_dict = {}
    with open(source_file, "rb")as csv_file:
        reader = csv.reader(csv_file, delimiter=",")
        fieldnames = next(reader)
        inner_keyset=fieldnames
        inner_keyset.remove(outer_key)
        inner_keyset.remove(inner_key)
        csv_file.seek(0)
        data = csv.DictReader(csv_file, delimiter=",")
        for row in data:
            item = new_dict.get(row[outer_key], dict())
            item[row[inner_key]] = {k: row[k] for k in inner_keyset}
            new_dict[row[outer_key]] = item
    return new_dict

# build specific nested dict from csv files
# @params:
#   source_file
#   outer_key:the outer level key of nested dict
#   inner_key:the inner level key of nested dict
#   inner_value:set the inner value for the inner key
def build_level2_dict2(source_file,outer_key,inner_key,inner_value):
    new_dict = {}
    with open(source_file, "rb")as csv_file:
        data = csv.DictReader(csv_file, delimiter=",")
        for row in data:
            item = new_dict.get(row[outer_key], dict())
            item[row[inner_key]] = row[inner_value]
            new_dict[row[outer_key]] = item
    return new_dict

# build specific nested dict from csv files
# @params:
#   source_file
#   outer_key:the outer level key of nested dict
#   lst_inner_value: a list of column name,for circumstance that the inner value of the same outer_key are not distinct
#   {outer_key:[{pairs of lst_inner_value}]}
def build_level2_dict3(source_file,outer_key,lst_inner_value):
    new_dict = {}
    with open(source_file, "rb")as csv_file:
        data = csv.DictReader(csv_file, delimiter=",")
        for row in data:
            new_dict.setdefault(row[outer_key], []).append({k: row[k] for k in lst_inner_value})
    return new_dict

# build specific nested dict from csv files
# @params:
#   source_file
#   outer_key:the outer level key of nested dict
#   lst_inner_value: a list of column name,for circumstance that the inner value of the same outer_key are not distinct
#   {outer_key:{key of lst_inner_value:[...value of lst_inner_value...]}}
def build_level2_dict4(source_file,outer_key,lst_inner_value):
    new_dict = {}
    with open(source_file, "rb")as csv_file:
        data = csv.DictReader(csv_file, delimiter=",")
        for row in data:
            # print row
            item = new_dict.get(row[outer_key], dict())
            # item.setdefault("move from",[]).append(row["move from"])
            # item.setdefault("move to", []).append(row["move to"])
            for element in lst_inner_value:
                item.setdefault(element, []).append(row[element])
            new_dict[row[outer_key]] = item
    return new_dict

# build specific nested dict from csv files
# @params:
#   source_file
#   outer_key:the outer level key of nested dict
#   lst_inner_key:a list of column name
#   lst_inner_value: a list of column name,for circumstance that the inner value of the same lst_inner_key are not distinct
#   {outer_key:{lst_inner_key:[...lst_inner_value...]}}
def build_list_dict2(source_file,outer_key,lst_inner_key,lst_inner_value):
    new_dict = {}
    with open(source_file, "rb")as csv_file:
        data = csv.DictReader(csv_file, delimiter=",")
        for row in data:
            # print row
            item = new_dict.get(row[outer_key], dict())
            item.setdefault(row[lst_inner_key], []).append(row[lst_inner_value])
            new_dict[row[outer_key]] = item
    return new_dict

# dct=build_list_dict2("test_info.csv","season","move from","move to")

# build specific nested dict from csv files
# a dict like {outer_key:{inner_key1:{inner_key2:{rest_key:rest_value...}}}}
# the params are extract from the csv column name as you like
def build_level3_dict(source_file,outer_key,inner_key1,inner_key2):
    new_dict = {}
    with open(source_file, "rb")as csv_file:
        reader = csv.reader(csv_file, delimiter=",")
        fieldnames = next(reader)
        inner_keyset=fieldnames
        inner_keyset.remove(outer_key)
        inner_keyset.remove(inner_key1)
        inner_keyset.remove(inner_key2)
        csv_file.seek(0)
        data = csv.DictReader(csv_file, delimiter=",")
        for row in data:
            item = new_dict.get(row[outer_key], dict())
            sub_item = item.get(row[inner_key1], dict())
            sub_item[row[inner_key2]] = {k: row[k] for k in inner_keyset}
            item[row[inner_key1]] = sub_item
            new_dict[row[outer_key]] = item
    return new_dict

# build specific nested dict from csv files
# a dict like {outer_key:{inner_key1:{inner_key2:inner_value}}}
# the params are extract from the csv column name as you like
def build_level3_dict2(source_file,outer_key,inner_key1,inner_key2,inner_value):
    new_dict = {}
    with open(source_file, "rb")as csv_file:
        data = csv.DictReader(csv_file, delimiter=",")
        for row in data:
            item = new_dict.get(row[outer_key], dict())
            sub_item = item.get(row[inner_key1], dict())
            sub_item[row[inner_key2]] = row[inner_value]
            item[row[inner_key1]] = sub_item
            new_dict[row[outer_key]] = item
    return new_dict
   

# build specific nested dict from csv files
# a dict like {outer_key:{inner_key1:{inner_key2:[inner_value]}}}
# for multiple inner_value with the same inner_key2,thus gather them in a list
# the params are extract from the csv column name as you like
def build_level3_dict3(source_file,outer_key,inner_key1,inner_key2,inner_value):
    new_dict = {}
    with open(source_file, "rb")as csv_file:
        data = csv.DictReader(csv_file, delimiter=",")
        for row in data:
            item = new_dict.get(row[outer_key], dict())
            sub_item = item.get(row[inner_key1], dict())
            sub_item.setdefault(row[inner_key2], []).append(row[inner_value])
            item[row[inner_key1]] = sub_item
            new_dict[row[outer_key]] = item
    return new_dict

#----------------------------------------------------------------------------------------------------------

#---------------------------------------------------csv <--> list--------------------------------------------

def list2csv(list, file):
# def list2csv(list):
#     wr = csv.writer(open(file, "wb"), quoting=csv.QUOTE_ALL)
    wr=open(file,"w")
    for word in list:
        # print "".join(word)
        # wr.writerow([word])
        wr.write(word+"
")
        # wr.writerow(str.split(word,""")[0])
        # print [word]

# test_list = ["United States", "China", "America", "England"]

# list2csv(test_list,"small_test.csv")

# write nested list of dict to csv
def nestedlist2csv(list, out_file):
    with open(out_file, "wb") as f:
        w = csv.writer(f)
        fieldnames=list[0].keys()  # solve the problem to automatically write the header
        w.writerow(fieldnames)
        for row in list:
            w.writerow(row.values())


# my_list = [{"players.vis_name": "Khazri", "players.role": "Midfielder", "players.country": "Tunisia",
#             "players.last_name": "Khazri", "players.player_id": "989", "players.first_name": "Wahbi",
#             "players.date_of_birth": "08/02/1991", "players.team": "Bordeaux"},
#            {"players.vis_name": "Khazri", "players.role": "Midfielder", "players.country": "Tunisia",
#             "players.last_name": "Khazri", "players.player_id": "989", "players.first_name": "Wahbi",
#             "players.date_of_birth": "08/02/1991", "players.team": "Sunderland"},
#            {"players.vis_name": "Lewis Baker", "players.role": "Midfielder", "players.country": "England",
#             "players.last_name": "Baker", "players.player_id": "9574", "players.first_name": "Lewis",
#             "players.date_of_birth": "25/04/1995", "players.team": "Vitesse"}
#            ]

# nestedlist2csv(my_list, "dict2csv_test.csv")



# collect and convert the first column of csv file to list
def csv2list(csv_file):
    lst = []
    with open(csv_file, "rb")as f:
        reader = csv.reader(f, delimiter=",")
        for row in reader:
            lst.append(row[0])
    return list(set(lst))
#----------------------------------------------------------------------------------------------------------

抽取特定列

抽取特定列的所有值并存儲于列表

根據(jù)下標(biāo)抽取特定列到某個新的csv文件

抽取特定列的所有值并存儲于列表

獲取某列原始的數(shù)據(jù)并保存為列表

# get certain column value of csv(for common csv file(","))
def get_origin_column_value(file, column_name):
    with open(file, "rb") as f:
        role_list = []
        reader = csv.reader(f, delimiter=",")
        fieldnames = next(reader)
        reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=",")
        for row in reader:
            role_list.append(row[column_name])
        return role_list

對于某些有特殊需要的可以直接修改代碼，比如對原始的列的值進(jìn)行除重和排序后獲取，如下

# get certain column value of csv(for common csv file(",")),and judge if it"s repeated
def get_column_value2(file, column_name):
    with open(file, "rb") as f:
        role_list = []
        reader = csv.reader(f, delimiter=",")
        fieldnames = next(reader)
        reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=",")
        for row in reader:
            role_list.append(row[column_name])
        role_set = set(role_list)
        return sorted(list(role_set))

根據(jù)下標(biāo)抽取特定列到某個新的csv文件

import csv
# extract certain column from csv file according to the column#
def column_extract(file_in,file_out,index):
    with open(file_in,"r") as f_in:
        with open(file_out,"w") as f_out:
            for line in f_in:
                f_out.write(line.split(",")[index])
                f_out.write("
") # comment if a new line already exists

除去完全重復(fù)的記錄

# eliminated the completely repeated record in repeated file for further analysis
def eliminate_repeated_row(in_file,out_file):
    with open(in_file,"rb") as in_file,open(out_file,"wb")as out_file:
        seen=set()
        for line in in_file:
            # print line
            if line in seen:continue

            seen.add(line)
            out_file.write(line)

對csv文件按照某一列排序

# sort the csv file by certain column to put the similar record together for further analysis
def sort_csv_byColumn(in_file, out_file,column_name):
    with open(in_file, "rb") as f:
        reader = csv.reader(f, delimiter=",")
        fieldnames = next(reader)
        reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=",")
        sorted_list = sorted(reader, key=lambda row: row[column_name], reverse=True)
        # print sorted_list
        csv_converter.nestedlist2csv(sorted_list, out_file)

例如我們按照league_name排序(注意這里調(diào)用了csv轉(zhuǎn)換器中的方法將列表的字典轉(zhuǎn)換為csv文件)

sort_csv_byColumn("leagues_size.csv","ordered_leagues_size.csv","league_name")

得到結(jié)果

season_id,league_name,league_size
2016,ykkonen,9
2003,ykkonen,14
2005,ykkonen,14
2006,ykkonen,14
2007,ykkonen,14
2010,ykkonen,13
2011,ykkonen,10
2009,ykkonen,14
2008,ykkonen,14
2012,ykkonen,10
2013,ykkonen,10
2014,ykkonen,10
2015,ykkonen,10
2016,wiener-stadtliga,16
1988,wiener-stadtliga,16
1993,wiener-stadtliga,16
1994,wiener-stadtliga,16
1995,wiener-stadtliga,16
1996,wiener-stadtliga,16
1997,wiener-stadtliga,16
1998,wiener-stadtliga,16

如果我們按league_size排序

sort_csv_byColumn("leagues_size.csv",
                    "orderedbysize_leagues_size.csv","league_size")

得到結(jié)果

season_id,league_name,league_size
2008,virsliga,9
2010,virsliga,9
2012,a-lyga,9
2012,a-pojat-sm-sarja,9
2013,a-pojat-sm-sarja,9
1953,salzburger-liga,9
2010,3-lig-grup-1,9
2013,armenian-first-league,9
2016,ykkonen,9
2014,stirling-sports-premiership,9
2014,hong-kong-premier-league,9
2015,hong-kong-premier-league,9
1996,s-league,9
2015,s-league,9
2013,united-football-league,9
2016,i-league,9

鍵值互換

csv文件每一條記錄其實(shí)可以看作是一個字典，有時csv文件里有不同的鍵對應(yīng)同一個值的情況，我們想講記錄反轉(zhuǎn)一下，即讓值作為鍵，對應(yīng)的鍵作為值

# return a dict with the same value in original as new key and keys as value
def dict_same_value(original_dict):
    new_dict={}
    for k,v in original_dict.iteritems():
        new_dict.setdefault(v,[]).append(k)
    return new_dict

最后歡迎大家fork關(guān)于這個的github上的repository，一起豐富更多好玩的功能~

更新日志
1、2016-12-18 修復(fù)了從csv文件中獲取特定的列的值保存為集合的問題，而是存儲為原始的列表
2、2016-12-22 改進(jìn)了csv轉(zhuǎn)換器中的構(gòu)建二級字典的方法，使其變得更加靈活
3、2016年12月24日14:57:48 在csv轉(zhuǎn)換器部分加入三級字典構(gòu)造的參照方法
4、2017年1月9日11:28:45 在csv轉(zhuǎn)換器部分，三級字典構(gòu)造中，加入了最內(nèi)部存儲值為列表的構(gòu)造方法
5、2017年1月16日10:43:41 在csv轉(zhuǎn)換器部分，加入了構(gòu)造列表字典的方法以及構(gòu)造特殊的二級字典（內(nèi)部為列表）的方法
6、2017年2月9日10:58:17 在csv轉(zhuǎn)換器部分，加入了新的構(gòu)造特殊的二級字典（內(nèi)部為列表）的方法
7、2017年2月10日11:21:45 在csv轉(zhuǎn)換器部分，改進(jìn)了簡單的csv文件轉(zhuǎn)換為字典的方法，此外在Csv_Match部分，加入了匹配判斷多個列對應(yīng)的元素條件的方法

GPU云服務(wù)器云服務(wù)器 ddos一些比較小的網(wǎng)站小玩意兒 csv文件導(dǎo)入數(shù)據(jù)庫的好處 python文件處理

文章版權(quán)歸作者所有，未經(jīng)允許請勿轉(zhuǎn)載,若此文章存在違規(guī)行為，您可以聯(lián)系管理員刪除。

轉(zhuǎn)載請注明本文地址：http://systransis.cn/yun/38266.html

發(fā)表評論

登陸后可評論

0條評論

wuyumin

男|高級講師

我要關(guān)注我要私信

TA的文章

云虛擬主機(jī)如何使用-該怎么選擇合適自己的云虛擬主機(jī)？

閱讀 1641·2021-09-22 15:25
SpartanHost：西雅圖VPS八折優(yōu)惠 $2.8/月起，Cera聯(lián)通CUVIP線路，10Gbp

閱讀 1519·2021-09-07 10:06
CSS_一些問題

閱讀 3195·2019-08-30 15:53
深入理解ES6之《解構(gòu)》

閱讀 1099·2019-08-29 13:12
面試分享：一年經(jīng)驗(yàn)初探阿里巴巴前端社招

閱讀 3390·2019-08-29 13:07
CSS 強(qiáng)化裝備！Sass 安裝與使用

閱讀 738·2019-08-28 18:19
xml和json相互轉(zhuǎn)換

閱讀 2278·2019-08-27 10:57
字蛛font-spider報錯,<web font not found>,碰到的最新問題

閱讀 995·2019-08-26 13:29

成人国产在线小视频_日韩寡妇人妻调教在线播放_色成人www永久在线观看_2018国产精品久久_亚洲欧美高清在线30p_亚洲少妇综合一区_黄色在线播放国产_亚洲另类技巧小说校园_国产主播xx日韩_a级毛片在线免费

資訊專欄INFORMATION COLUMN

上云采購季！| 2核2G4M爆款云服務(wù)器低至59元/年，更有多臺、長期優(yōu)惠，快來選購！

用Python處理csv文件的一些小玩意兒

相關(guān)文章

**【Python Library】csv — CSV File Reading and Writing**

**數(shù)據(jù)分析之在線JupyterNotebook使用小技巧｜Python技能樹測評**

**利用python合并csv文件的方式實(shí)例**

**怎么使用Python實(shí)時向文件寫入數(shù)據(jù)**

**功能式Python中的探索性數(shù)據(jù)分析**

發(fā)表評論

0條評論

wuyumin

男|高級講師

TA的文章

云虛擬主機(jī)如何使用-該怎么選擇合適自己的云虛擬主機(jī)？

SpartanHost：西雅圖VPS八折優(yōu)惠 $2.8/月起，Cera聯(lián)通CUVIP線路，10Gbp

CSS_一些問題

深入理解ES6之《解構(gòu)》

面試分享：一年經(jīng)驗(yàn)初探阿里巴巴前端社招

CSS 強(qiáng)化裝備！Sass 安裝與使用

xml和json相互轉(zhuǎn)換

字蛛font-spider報錯,<web font not found>,碰到的最新問題

最新活動

資訊專欄INFORMATION COLUMN

上云采購季！| 2核2G4M爆款云服務(wù)器低至59元/年，更有多臺、長期優(yōu)惠，快來選購！

用Python處理csv文件的一些小玩意兒

相關(guān)文章

發(fā)表評論

0條評論

男|高級講師

TA的文章

最新活動

上云采購季！| 2核2G4M爆款云服務(wù)器低至59元/年，更有多臺、長期優(yōu)惠，快來選購！