摘要:整理了一些個人在利用處理文件時經(jīng)常用到的一些自定義方法,放在這里主要方便自己查閱,也可以給其他人做參考目錄輸出文件某列的匹配不匹配的記錄調(diào)整文件的列的順序轉(zhuǎn)換器抽取特定列除去完全重復(fù)的記錄根據(jù)列名排序鍵值互換輸出文件某列的匹配不匹配的記錄主
Python CSV Toolkit
整理了一些個人在利用python處理csv文件時經(jīng)常用到的一些自定義方法,放在這里主要方便自己查閱,也可以給其他人做參考
輸出CSV文件某列的匹配/不匹配的記錄
調(diào)整csv文件的列的順序
CSV轉(zhuǎn)換器
抽取特定列
除去完全重復(fù)的記錄
根據(jù)列名排序
鍵值互換
輸出CSV文件某列的匹配/不匹配的記錄主要用于從csv文件中抽取出匹配特定列的特定字段集合的記錄,比如現(xiàn)有這么一個csv文件(表格化后)
name | age | sex |
---|---|---|
Danny | 24 | male |
Daisy | 23 | female |
Lancelot | 23 | unknown |
Lydia | 21 | female |
... | ... | ... |
需要輸出其中age為23的記錄到新的csv文件,則我們可以先把23這么個關(guān)鍵詞用一個列表收集起來,然后通過下列代碼從csv文件中找出所有符合條件的記錄并輸出
import sys import csv # try to fix "_csv.Error: field larger than field limit (131072)" csv.field_size_limit(sys.maxint) # write to common csv file with delimiter "," # output the rows with matched id in id_list to a new csv file def csv_match(id_list,key,input_file,output_file): with open(input_file, "rb") as f: reader = csv.DictReader(f) rows = [row for row in reader if row[key] in set(id_list)] header = rows[0].keys() with open(output_file, "w") as f: f.write(",".join(header)) f.write(" ") for data in rows: f.write(",".join(data[h] for h in header)) f.write(" ")
調(diào)用的時候:
lst=["23"] csv_match(lst,"age","in.csv","out.csv")
key為需要匹配的列名,另外我們也可以提取不符合該條件的記錄,‘取個反’就行了
# output the rows with not matched id in id_list to a new csv file def csv_not_match(id_list, key, input_file, output_file): with open(input_file, "rb") as f: reader = csv.DictReader(f) rows = [row for row in reader if not row[key] in set(id_list)] header = rows[0].keys() with open(output_file, "w") as f: f.write(",".join(header)) f.write(" ") for data in rows: f.write(",".join(data[h] for h in header)) f.write(" ")
對于需要判斷csv文件中多個列的值的情況,只需修改對應(yīng)的判別條件和傳入?yún)?shù)情況即可
# output the rows with matched key1 or key2 in refer_list to a new csv file # @params # refer_list: the list referred to # key,key2: column name of csv file to check the value in the refer_list or not def csv_match2(refer_list, key1, key2, input_file, output_file): with open(input_file, "rb") as f: reader = csv.DictReader(f) rows = [row for row in reader if (row[key1] in set(refer_list)) or (row[key2] in set(refer_list))] header = rows[0].keys() with open(output_file, "w") as f: f.write(",".join(header)) f.write(" ") for data in rows: f.write(",".join(data[h] for h in header)) f.write(" ")調(diào)整csv文件的列的順序
有時候我們輸出的或者拿到的csv文件的列的順序不夠‘人性化’,為了讓我們看起來更加直觀,更舒服一點(diǎn),我們可以按照我們的需要調(diào)整列的順序
import csv # reorder the column of the csv file to what you want def csv_reorder(in_file, out_file,lst_order): with open(in_file, "rb") as infile, open(out_file, "wb") as outfile: fieldnames=lst_order writer = csv.DictWriter(outfile, fieldnames=fieldnames) writer.writeheader() for row in csv.DictReader(infile): writer.writerow(row)
其中lst_order為我們需要的列名順序,用list存儲,舉個例子
season_id,league_name,league_size 2003,scottish-premiership,12 2016,1-hnl,10 2004,alka-superligaen,12 2006,allsvenskan,14 1992,premier-league,22 ...
現(xiàn)在我們想調(diào)整他的順序,按照league_name,season_id,league_size的順序重新組合一下
則調(diào)用
lst_order = ["league_name","season_id","league_size"] csv_reorder("leagues_size.csv", "leagues_size_new.csv", lst_order)
得到結(jié)果
league_name,season_id,league_size scottish-premiership,2003,12 1-hnl,2016,10 alka-superligaen,2004,12 allsvenskan,2006,14 premier-league,1992,22 ...CSV轉(zhuǎn)換器
這個主要是用來進(jìn)行csv和python的一些內(nèi)置的容器例如list,dict之類的轉(zhuǎn)換,包括一些特殊的多級字典,或者是嵌套列表的字典等等,這里只是把他們打個包放在一起,具體的可以參照我之前寫的一篇文章
import csv #---------------------------------------------------csv <--> dict-------------------------------------------- # convert csv file to dict # @params: # key/value: the column of original csv file to set as the key and value of dict def csv2dict(in_file,key,value): new_dict = {} with open(in_file, "rb") as f: reader = csv.reader(f, delimiter=",") fieldnames = next(reader) reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=",") for row in reader: new_dict[row[key]] = row[value] return new_dict # convert csv file to dict(key-value pairs each row) # default: set row[0] as key and row[1] as value of the dict def row_csv2dict(csv_file): dict_club={} with open(csv_file)as f: reader=csv.reader(f,delimiter=",") for row in reader: dict_club[row[0]]=row[1] return dict_club # write dict to csv file # write each key/value pair on a separate row def dict2csv(dict, file): with open(file, "wb") as f: w = csv.writer(f) # write each key/value pair on a separate row w.writerows(dict.items()) # write dict to csv file # write all keys on one row and all values on the next def dict2csv2(dict, file): with open(file, "wb") as f: w = csv.writer(f) # write all keys on one row and all values on the next w.writerow(dict.keys()) w.writerow(dict.values()) # build a dict of list like {key:[...element of lst_inner_value...]} # key is certain column name of csv file # the lst_inner_value is a list of specific column name of csv file def build_list_dict(source_file, key, lst_inner_value): new_dict = {} with open(source_file, "rb")as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: for element in lst_inner_value: new_dict.setdefault(row[key], []).append(row[element]) return new_dict # sample: # test_club=build_list_dict("test_info.csv","season",["move from","move to"]) # print test_club # build specific nested dict from csv files # @params: # source_file # outer_key:the outer level key of nested dict # inner_key:the inner level key of nested dict,and rest key-value will be store as the value of inner key def build_level2_dict(source_file,outer_key,inner_key): new_dict = {} with open(source_file, "rb")as csv_file: reader = csv.reader(csv_file, delimiter=",") fieldnames = next(reader) inner_keyset=fieldnames inner_keyset.remove(outer_key) inner_keyset.remove(inner_key) csv_file.seek(0) data = csv.DictReader(csv_file, delimiter=",") for row in data: item = new_dict.get(row[outer_key], dict()) item[row[inner_key]] = {k: row[k] for k in inner_keyset} new_dict[row[outer_key]] = item return new_dict # build specific nested dict from csv files # @params: # source_file # outer_key:the outer level key of nested dict # inner_key:the inner level key of nested dict # inner_value:set the inner value for the inner key def build_level2_dict2(source_file,outer_key,inner_key,inner_value): new_dict = {} with open(source_file, "rb")as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: item = new_dict.get(row[outer_key], dict()) item[row[inner_key]] = row[inner_value] new_dict[row[outer_key]] = item return new_dict # build specific nested dict from csv files # @params: # source_file # outer_key:the outer level key of nested dict # lst_inner_value: a list of column name,for circumstance that the inner value of the same outer_key are not distinct # {outer_key:[{pairs of lst_inner_value}]} def build_level2_dict3(source_file,outer_key,lst_inner_value): new_dict = {} with open(source_file, "rb")as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: new_dict.setdefault(row[outer_key], []).append({k: row[k] for k in lst_inner_value}) return new_dict # build specific nested dict from csv files # @params: # source_file # outer_key:the outer level key of nested dict # lst_inner_value: a list of column name,for circumstance that the inner value of the same outer_key are not distinct # {outer_key:{key of lst_inner_value:[...value of lst_inner_value...]}} def build_level2_dict4(source_file,outer_key,lst_inner_value): new_dict = {} with open(source_file, "rb")as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: # print row item = new_dict.get(row[outer_key], dict()) # item.setdefault("move from",[]).append(row["move from"]) # item.setdefault("move to", []).append(row["move to"]) for element in lst_inner_value: item.setdefault(element, []).append(row[element]) new_dict[row[outer_key]] = item return new_dict # build specific nested dict from csv files # @params: # source_file # outer_key:the outer level key of nested dict # lst_inner_key:a list of column name # lst_inner_value: a list of column name,for circumstance that the inner value of the same lst_inner_key are not distinct # {outer_key:{lst_inner_key:[...lst_inner_value...]}} def build_list_dict2(source_file,outer_key,lst_inner_key,lst_inner_value): new_dict = {} with open(source_file, "rb")as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: # print row item = new_dict.get(row[outer_key], dict()) item.setdefault(row[lst_inner_key], []).append(row[lst_inner_value]) new_dict[row[outer_key]] = item return new_dict # dct=build_list_dict2("test_info.csv","season","move from","move to") # build specific nested dict from csv files # a dict like {outer_key:{inner_key1:{inner_key2:{rest_key:rest_value...}}}} # the params are extract from the csv column name as you like def build_level3_dict(source_file,outer_key,inner_key1,inner_key2): new_dict = {} with open(source_file, "rb")as csv_file: reader = csv.reader(csv_file, delimiter=",") fieldnames = next(reader) inner_keyset=fieldnames inner_keyset.remove(outer_key) inner_keyset.remove(inner_key1) inner_keyset.remove(inner_key2) csv_file.seek(0) data = csv.DictReader(csv_file, delimiter=",") for row in data: item = new_dict.get(row[outer_key], dict()) sub_item = item.get(row[inner_key1], dict()) sub_item[row[inner_key2]] = {k: row[k] for k in inner_keyset} item[row[inner_key1]] = sub_item new_dict[row[outer_key]] = item return new_dict # build specific nested dict from csv files # a dict like {outer_key:{inner_key1:{inner_key2:inner_value}}} # the params are extract from the csv column name as you like def build_level3_dict2(source_file,outer_key,inner_key1,inner_key2,inner_value): new_dict = {} with open(source_file, "rb")as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: item = new_dict.get(row[outer_key], dict()) sub_item = item.get(row[inner_key1], dict()) sub_item[row[inner_key2]] = row[inner_value] item[row[inner_key1]] = sub_item new_dict[row[outer_key]] = item return new_dict # build specific nested dict from csv files # a dict like {outer_key:{inner_key1:{inner_key2:[inner_value]}}} # for multiple inner_value with the same inner_key2,thus gather them in a list # the params are extract from the csv column name as you like def build_level3_dict3(source_file,outer_key,inner_key1,inner_key2,inner_value): new_dict = {} with open(source_file, "rb")as csv_file: data = csv.DictReader(csv_file, delimiter=",") for row in data: item = new_dict.get(row[outer_key], dict()) sub_item = item.get(row[inner_key1], dict()) sub_item.setdefault(row[inner_key2], []).append(row[inner_value]) item[row[inner_key1]] = sub_item new_dict[row[outer_key]] = item return new_dict #---------------------------------------------------------------------------------------------------------- #---------------------------------------------------csv <--> list-------------------------------------------- def list2csv(list, file): # def list2csv(list): # wr = csv.writer(open(file, "wb"), quoting=csv.QUOTE_ALL) wr=open(file,"w") for word in list: # print "".join(word) # wr.writerow([word]) wr.write(word+" ") # wr.writerow(str.split(word,""")[0]) # print [word] # test_list = ["United States", "China", "America", "England"] # list2csv(test_list,"small_test.csv") # write nested list of dict to csv def nestedlist2csv(list, out_file): with open(out_file, "wb") as f: w = csv.writer(f) fieldnames=list[0].keys() # solve the problem to automatically write the header w.writerow(fieldnames) for row in list: w.writerow(row.values()) # my_list = [{"players.vis_name": "Khazri", "players.role": "Midfielder", "players.country": "Tunisia", # "players.last_name": "Khazri", "players.player_id": "989", "players.first_name": "Wahbi", # "players.date_of_birth": "08/02/1991", "players.team": "Bordeaux"}, # {"players.vis_name": "Khazri", "players.role": "Midfielder", "players.country": "Tunisia", # "players.last_name": "Khazri", "players.player_id": "989", "players.first_name": "Wahbi", # "players.date_of_birth": "08/02/1991", "players.team": "Sunderland"}, # {"players.vis_name": "Lewis Baker", "players.role": "Midfielder", "players.country": "England", # "players.last_name": "Baker", "players.player_id": "9574", "players.first_name": "Lewis", # "players.date_of_birth": "25/04/1995", "players.team": "Vitesse"} # ] # nestedlist2csv(my_list, "dict2csv_test.csv") # collect and convert the first column of csv file to list def csv2list(csv_file): lst = [] with open(csv_file, "rb")as f: reader = csv.reader(f, delimiter=",") for row in reader: lst.append(row[0]) return list(set(lst)) #----------------------------------------------------------------------------------------------------------抽取特定列
抽取特定列的所有值并存儲于列表
根據(jù)下標(biāo)抽取特定列到某個新的csv文件
抽取特定列的所有值并存儲于列表獲取某列原始的數(shù)據(jù)并保存為列表
# get certain column value of csv(for common csv file(",")) def get_origin_column_value(file, column_name): with open(file, "rb") as f: role_list = [] reader = csv.reader(f, delimiter=",") fieldnames = next(reader) reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=",") for row in reader: role_list.append(row[column_name]) return role_list
對于某些有特殊需要的可以直接修改代碼,比如對原始的列的值進(jìn)行除重和排序后獲取,如下
# get certain column value of csv(for common csv file(",")),and judge if it"s repeated def get_column_value2(file, column_name): with open(file, "rb") as f: role_list = [] reader = csv.reader(f, delimiter=",") fieldnames = next(reader) reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=",") for row in reader: role_list.append(row[column_name]) role_set = set(role_list) return sorted(list(role_set))根據(jù)下標(biāo)抽取特定列到某個新的csv文件
import csv # extract certain column from csv file according to the column# def column_extract(file_in,file_out,index): with open(file_in,"r") as f_in: with open(file_out,"w") as f_out: for line in f_in: f_out.write(line.split(",")[index]) f_out.write(" ") # comment if a new line already exists除去完全重復(fù)的記錄
# eliminated the completely repeated record in repeated file for further analysis def eliminate_repeated_row(in_file,out_file): with open(in_file,"rb") as in_file,open(out_file,"wb")as out_file: seen=set() for line in in_file: # print line if line in seen:continue seen.add(line) out_file.write(line)對csv文件按照某一列排序
# sort the csv file by certain column to put the similar record together for further analysis def sort_csv_byColumn(in_file, out_file,column_name): with open(in_file, "rb") as f: reader = csv.reader(f, delimiter=",") fieldnames = next(reader) reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=",") sorted_list = sorted(reader, key=lambda row: row[column_name], reverse=True) # print sorted_list csv_converter.nestedlist2csv(sorted_list, out_file)
例如我們按照league_name排序(注意這里調(diào)用了csv轉(zhuǎn)換器中的方法將列表的字典轉(zhuǎn)換為csv文件)
sort_csv_byColumn("leagues_size.csv","ordered_leagues_size.csv","league_name")
得到結(jié)果
season_id,league_name,league_size 2016,ykkonen,9 2003,ykkonen,14 2005,ykkonen,14 2006,ykkonen,14 2007,ykkonen,14 2010,ykkonen,13 2011,ykkonen,10 2009,ykkonen,14 2008,ykkonen,14 2012,ykkonen,10 2013,ykkonen,10 2014,ykkonen,10 2015,ykkonen,10 2016,wiener-stadtliga,16 1988,wiener-stadtliga,16 1993,wiener-stadtliga,16 1994,wiener-stadtliga,16 1995,wiener-stadtliga,16 1996,wiener-stadtliga,16 1997,wiener-stadtliga,16 1998,wiener-stadtliga,16
如果我們按league_size排序
sort_csv_byColumn("leagues_size.csv", "orderedbysize_leagues_size.csv","league_size")
得到結(jié)果
season_id,league_name,league_size 2008,virsliga,9 2010,virsliga,9 2012,a-lyga,9 2012,a-pojat-sm-sarja,9 2013,a-pojat-sm-sarja,9 1953,salzburger-liga,9 2010,3-lig-grup-1,9 2013,armenian-first-league,9 2016,ykkonen,9 2014,stirling-sports-premiership,9 2014,hong-kong-premier-league,9 2015,hong-kong-premier-league,9 1996,s-league,9 2015,s-league,9 2013,united-football-league,9 2016,i-league,9鍵值互換
csv文件每一條記錄其實(shí)可以看作是一個字典,有時csv文件里有不同的鍵對應(yīng)同一個值的情況,我們想講記錄反轉(zhuǎn)一下,即讓值作為鍵,對應(yīng)的鍵作為值
# return a dict with the same value in original as new key and keys as value def dict_same_value(original_dict): new_dict={} for k,v in original_dict.iteritems(): new_dict.setdefault(v,[]).append(k) return new_dict
最后歡迎大家fork關(guān)于這個的github上的repository,一起豐富更多好玩的功能~
更新日志
1、2016-12-18 修復(fù)了從csv文件中獲取特定的列的值保存為集合的問題,而是存儲為原始的列表
2、2016-12-22 改進(jìn)了csv轉(zhuǎn)換器中的構(gòu)建二級字典的方法,使其變得更加靈活
3、2016年12月24日14:57:48 在csv轉(zhuǎn)換器部分加入三級字典構(gòu)造的參照方法
4、2017年1月9日11:28:45 在csv轉(zhuǎn)換器部分,三級字典構(gòu)造中,加入了最內(nèi)部存儲值為列表的構(gòu)造方法
5、2017年1月16日10:43:41 在csv轉(zhuǎn)換器部分,加入了構(gòu)造列表字典的方法以及構(gòu)造特殊的二級字典(內(nèi)部為列表)的方法
6、2017年2月9日10:58:17 在csv轉(zhuǎn)換器部分,加入了新的構(gòu)造特殊的二級字典(內(nèi)部為列表)的方法
7、2017年2月10日11:21:45 在csv轉(zhuǎn)換器部分,改進(jìn)了簡單的csv文件轉(zhuǎn)換為字典的方法,此外在Csv_Match部分,加入了匹配判斷多個列對應(yīng)的元素條件的方法
文章版權(quán)歸作者所有,未經(jīng)允許請勿轉(zhuǎn)載,若此文章存在違規(guī)行為,您可以聯(lián)系管理員刪除。
轉(zhuǎn)載請注明本文地址:http://systransis.cn/yun/38266.html
摘要:一概述逗號分隔符文件是表格與數(shù)據(jù)庫操作之中最常用的輸入輸出格式。但是盡管不同規(guī)范的之中,分隔符和引用符千差萬別,他們的格式還是大體相似的,因此制作一個可以高效處理文件中的數(shù)據(jù)同時還能將讀寫的細(xì)節(jié)隱去的模塊并不是什么難事兒。 一、概述 CSV(逗號分隔符)文件是表格與數(shù)據(jù)庫操作之中最常用的輸入輸出格式。在RFC4180中的文件描述標(biāo)準(zhǔn)對CSV格式進(jìn)行規(guī)范之前,CSV格式文件就已經(jīng)被應(yīng)用了...
摘要:三種運(yùn)行方式的快捷鍵見菜單分別表示只運(yùn)行當(dāng)前單元格運(yùn)行并選擇下一個單元格和運(yùn)行并向下插入一個單元格。也可以使用直接在中顯示網(wǎng)頁,代碼效果直接在單元格下方展示差異情況。等未來正式發(fā)布的時候,一定非常好用后,沒有的用戶們也可以使用該功能。 ...
小編寫這篇文章的主要目的,主要是給大家做一個解答,解答有幾個方向,包括利用python合并csv文件的一些相關(guān)實(shí)例,下面就給大家做出一個解答?! ?.用concat方法合并csv 將兩個相同的csv文件進(jìn)行數(shù)據(jù)合并,通過pandas的read_csv和to_csv來完成,即采用concat方法: #加載第三方庫 importpandasaspd importnumpyasnp #...
小編寫這篇文章的主要目的,主要是講解一下關(guān)于Python的一些知識,比如教大家怎么能夠?qū)崟r的去進(jìn)行寫入數(shù)據(jù),提高工作的效率,關(guān)于其具體的代碼,下面小編給大家詳細(xì)解答下。 之前在做數(shù)據(jù)分析的時候,我們需要處理大量高并發(fā)的數(shù)據(jù),比如對新的數(shù)據(jù)去進(jìn)行處理,那么,怎么處理呢?下面給大家進(jìn)行一個詳細(xì)的解答。 1:實(shí)時向csv文件寫入數(shù)據(jù) 假設(shè)需要生成一張csv表,里面的字段對應(yīng)一些數(shù)據(jù),由于后續(xù)的...
摘要:在中運(yùn)行不同的實(shí)驗(yàn)似乎比試圖在中進(jìn)行這種探索性的操作更有效。理論上,我們可以在中做很多的探索。我們?nèi)绾卫^續(xù)第一步是獲取格式的原始數(shù)據(jù)。這些列將包含來自使用該代理鍵的一個請求的一行數(shù)據(jù)。這是重構(gòu)的另一部分。數(shù)據(jù)的最終顯示保持完全分離。 歡迎大家前往騰訊云+社區(qū),獲取更多騰訊海量技術(shù)實(shí)踐干貨哦~ 這里有一些技巧來處理日志文件提取。假設(shè)我們正在查看一些Enterprise Splunk提取。...
閱讀 1641·2021-09-22 15:25
閱讀 1519·2021-09-07 10:06
閱讀 3195·2019-08-30 15:53
閱讀 1099·2019-08-29 13:12
閱讀 3390·2019-08-29 13:07
閱讀 738·2019-08-28 18:19
閱讀 2278·2019-08-27 10:57
閱讀 995·2019-08-26 13:29