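Summary: The results show that the threads did not run concurrently at all; they ran one after another, so the multithreading achieved nothing. Based on what I know and on this experiment, my conclusion is that with multithreading Python's download speed can beat nodejs, but Python is not as fast as nodejs at parsing web pages. After all, JS was born for the web, and a complex crawler can't do everything with string searching.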
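Preface
I've long heard how good Nodejs's asynchronous model is and how awesome its I/O is... basically how great it is in every way. Today I'm going to compare nodejs and python. I can't think of a project that shows off asynchronous design and I/O strength better than a crawler, so let's settle it with a crawler project.
Crawler project: Zhongchou (众筹网), projects currently crowdfunding, http://www.zhongchou.com/brow.... We use this site as the example: crawl every project that is currently crowdfunding, get the URL of each project's detail page, and store the URLs in a txt file.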
Hands-on comparison
Python: original version

# -*- coding:utf-8 -*-
"""
Created on 20160827
@author: qiukang
"""
import requests,time
from BeautifulSoup import BeautifulSoup    # HTML

# request headers (header values are strings)
headers = {
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding":"gzip, deflate, sdch",
    "Accept-Language":"zh-CN,zh;q=0.8",
    "Connection":"keep-alive",
    "Host":"www.zhongchou.com",
    "Upgrade-Insecure-Requests":"1",
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36"
}

# get the list of project URLs
def getItems(allpage):
    no = 0
    items = open("pystandard.txt","a")
    for page in range(allpage):
        # the first page has no page suffix; later pages are di-p2, di-p3, ...
        if page==0:
            url = "http://www.zhongchou.com/browse/di"
        else:
            url = "http://www.zhongchou.com/browse/di-p"+str(page+1)
        # print url  #①
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode("utf8")
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class":"ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a["href"]
            items.write(href+" ")
            no +=1
    items.close()
    return no

if __name__ == "__main__":
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print("it takes %s Seconds to get %s items "%(end-start,no))
Results of 5 runs:

it takes 48.1727159614 Seconds to get 720 items
it takes 45.3397999415 Seconds to get 720 items
it takes 44.4811429862 Seconds to get 720 items
it takes 44.4619293082 Seconds to get 720 items
it takes 46.669706593 Seconds to get 720 items

Python: multithreaded version
# -*- coding:utf-8 -*-
"""
Created on 20160827
@author: qiukang
"""
import requests,time,threading
from BeautifulSoup import BeautifulSoup    # HTML

# request headers (header values are strings)
headers = {
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding":"gzip, deflate, sdch",
    "Accept-Language":"zh-CN,zh;q=0.8",
    "Connection":"keep-alive",
    "Host":"www.zhongchou.com",
    "Upgrade-Insecure-Requests":"1",
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36"
}

items = open("pymulti.txt","a")
no = 0
lock = threading.Lock()

# get the list of project URLs
def getItems(urllist):
    # print urllist  #①
    global items,no,lock
    for url in urllist:
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode("utf8")
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class":"ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a["href"]
            # the output file and the counter are shared between threads
            lock.acquire()
            items.write(href+" ")
            no +=1
            # print no
            lock.release()

if __name__ == "__main__":
    start = time.clock()
    allpage = 30
    allthread = 30
    per = (int)(allpage/allthread)
    urllist = []
    ths = []
    for page in range(allpage):
        if page==0:
            url = "http://www.zhongchou.com/browse/di"
        else:
            url = "http://www.zhongchou.com/browse/di-p"+str(page+1)
        urllist.append(url)
    for i in range(allthread):
        # print urllist[i*(per):(i+1)*(per)]
        th = threading.Thread(target = getItems,args= (urllist[i*(per):(i+1)*(per)],))
        th.start()
        th.join()
    items.close()
    end = time.clock()
    print("it takes %s Seconds to get %s items "%(end-start,no))
Results of 5 runs:

it takes 45.5222291114 Seconds to get 720 items
it takes 46.7097831417 Seconds to get 720 items
it takes 45.5334646156 Seconds to get 720 items
it takes 48.0242797553 Seconds to get 720 items
it takes 44.804855018 Seconds to get 720 items

This multithreaded version shows no advantage. Uncommenting the line marked #① shows that this so-called multithreaded code actually runs just like the single-threaded one.
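Improving the Python version
Single-threaded
First, let's improve the HTML parsing step. Inspecting the page shows that

lists = soup.findAll("a",attrs={"class":"siteCardICH3"})

works better than

lists = soup.findAll(attrs={"class":"ssCardItem"})

because it looks for the a elements directly, instead of first finding each div and then the a under that div.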
Results of 5 runs after this change, showing a modest improvement:

it takes 41.0018861912 Seconds to get 720 items
it takes 42.0260390497 Seconds to get 720 items
it takes 42.249635988 Seconds to get 720 items
it takes 41.295524133 Seconds to get 720 items
it takes 42.9022894154 Seconds to get 720 items

Multithreaded
Change getItems(urllist) to getItems(urllist,thno).
At the start and end of the function, add print thno," begin at",time.clock() and print thno," end at",time.clock(). Result:

0 begin at 0.00100631078628
0 end at 1.28625832936
1 begin at 1.28703230691
1 end at 2.61739476075
2 begin at 2.61801291642
2 end at 3.92514717937
3 begin at 3.9255829208
3 end at 5.38870235361
4 begin at 5.38921134066
4 end at 6.670658786
5 begin at 6.67125734731
5 end at 8.01520989534
6 begin at 8.01566383155
6 end at 9.42006780585
7 begin at 9.42053340537
7 end at 11.0386755513
8 begin at 11.0391565464
8 end at 12.421359168
9 begin at 12.4218294329
9 end at 13.9932716671
10 begin at 13.9939957256
10 end at 15.3535799145
11 begin at 15.3540870354
11 end at 16.6968289314
12 begin at 16.6972665389
12 end at 17.9798803157
13 begin at 17.9804714125
13 end at 19.326706238
14 begin at 19.3271438455
14 end at 20.8744308886
15 begin at 20.8751017624
15 end at 22.5306500245
16 begin at 22.5311450156
16 end at 23.7781693541
17 begin at 23.7787245279
17 end at 25.1775114499
18 begin at 25.178350742
18 end at 26.5497330734
19 begin at 26.5501776789
19 end at 27.970799259
20 begin at 27.9712727895
20 end at 29.4595075375
21 begin at 29.4599959972
21 end at 30.9507299602
22 begin at 30.9513989679
22 end at 32.2762763982
23 begin at 32.2767182045
23 end at 33.6476256057
24 begin at 33.648137392
24 end at 35.1100517711
25 begin at 35.1104907783
25 end at 36.462657099
26 begin at 36.4632234696
26 end at 37.7908515759
27 begin at 37.7912845182
27 end at 39.4359928956
28 begin at 39.436448698
28 end at 40.9955021593
29 begin at 40.9960871912
29 end at 42.6425665264
it takes 42.6435882327 Seconds to get 720 items
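Clearly these threads did not run concurrently at all: each one begins only after the previous one has ended, so the multithreading achieved nothing. Where is the problem? It turns out that in my loop the two lines

th.start()
th.join()

come one right after the other. th.join() blocks the main thread until that thread has finished, so the next thread is not started until the previous one is done. I changed the loop to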
    for i in range(allthread):
        # print urllist[i*(per):(i+1)*(per)]
        th = threading.Thread(target = getItems,args= (urllist[i*(per):(i+1)*(per)],i))
        ths.append(th)
    # start every thread first, and only then wait for them
    for th in ths:
        th.start()
    for th in ths:
        th.join()
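The difference between the two loop shapes can be seen in isolation, without any network access. The following is a minimal sketch, not the crawler itself: `worker`, `run_join_inside_loop` and `run_join_after_all_started` are hypothetical names, and `time.sleep` stands in for a page download. It is written for Python 3, so it uses `time.perf_counter()` rather than `time.clock()`.

```python
import threading
import time


def worker(delay):
    # Stand-in for one page download: block for `delay` seconds.
    time.sleep(delay)


def run_join_inside_loop(n, delay):
    # start() immediately followed by join(): the main thread waits for
    # each worker before creating the next one, so the work is serial.
    t0 = time.perf_counter()
    for _ in range(n):
        th = threading.Thread(target=worker, args=(delay,))
        th.start()
        th.join()
    return time.perf_counter() - t0


def run_join_after_all_started(n, delay):
    # Start every worker first, then wait for all of them.
    t0 = time.perf_counter()
    threads = [threading.Thread(target=worker, args=(delay,)) for _ in range(n)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return time.perf_counter() - t0


if __name__ == "__main__":
    print("join inside the loop:   %.2f s" % run_join_inside_loop(5, 0.2))
    print("join after all started: %.2f s" % run_join_after_all_started(5, 0.2))
```

With five workers that each sleep 0.2 s, the first variant should take roughly the sum of the delays and the second roughly one delay. A sleeping worker is an idealised download, so this only illustrates the scheduling pattern, not what the real crawler will measure.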
Results after this change:
0 begin at 0.0010814225325
1 begin at 0.00135201143191
2 begin at 0.00191744892518
3 begin at 0.0021311208492
4 begin at 0.00247495536449
5 begin at 0.0027334144167
6 begin at 0.00320601192551
7 begin at 0.00379011072218
8 begin at 0.00425431064445
9 begin at 0.00511692939449
10 begin at 0.0132038052264
11 begin at 0.0165926979253
12 begin at 0.0170886220634
13 begin at 0.0174665134574
14 begin at 0.018348726576
15 begin at 0.0189780790334
16 begin at 0.0201896641572
17 begin at 0.0220576606283
18 begin at 0.0231484138125
19 begin at 0.0238804034387
20 begin at 0.0273901280772
21 begin at 0.0300363009005
22 begin at 0.0362878375422
23 begin at 0.0395512329756
24 begin at 0.0431556637289
25 begin at 0.0459581249682
26 begin at 0.0482254733323
27 begin at 0.0535430117384
28 begin at 0.0584971212607
29 begin at 0.0598136762161
16 end at 65.2657542222
24 end at 66.2951247811
21 end at 66.3849747583
4 end at 66.6230160119
5 end at 67.5501632164
29 end at 67.7516992283
23 end at 68.6985322418
7 end at 69.1060433231
22 end at 69.2743398214
2 end at 69.5523713152
14 end at 69.6454986837
15 end at 69.8333400981
12 end at 69.9508018062
10 end at 70.2860348602
26 end at 70.3670659719
13 end at 70.3847232972
27 end at 70.3941635841
11 end at 70.5132838156
1 end at 70.7272351926
0 end at 70.9115253609
6 end at 71.0876563409
8 end at 71.1124805398
25 end at 71.1145248855
3 end at 71.4606034226
19 end at 71.6103622486
18 end at 71.6674453096
20 end at 71.725601862
17 end at 71.7778992318
9 end at 71.7847479301
28 end at 71.7921004837
it takes 71.7931912368 Seconds to get 720 items

Reflection
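The threads above now do run concurrently, but the total run time is far longer than the single-threaded version (about 72 s versus about 42 s)... I haven't found the cause yet. My guess: does BeautifulSoup not support multithreading? Any advice is welcome. To test this idea, I'll drop BeautifulSoup and use plain string searching instead. First, the change to the single-threaded version: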
# -*- coding:utf-8 -*-
"""
Created on 20160827
@author: qiukang
"""
import requests,time
from BeautifulSoup import BeautifulSoup    # HTML (no longer used in this version)

# request headers
headers = {
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding":"gzip, deflate, sdch",
    "Accept-Language":"zh-CN,zh;q=0.8",
    "Connection":"keep-alive",
    "Host":"www.zhongchou.com",
    "Upgrade-Insecure-Requests":"1",
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36"
}

# get the list of project URLs
def getItems(allpage):
    no = 0
    data = set()
    for page in range(allpage):
        if page==0:
            url = "http://www.zhongchou.com/browse/di"
        else:
            url = "http://www.zhongchou.com/browse/di-p"+str(page+1)
        # print url  #①
        r1 = requests.get(url,headers=headers)
        html = r1.text.encode("utf8")
        # skip the first 5000 characters, then look for each "deal-show" link
        start = 5000
        while True:
            index = html.find("deal-show", start)
            if index == -1:
                break
            # print "http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+" "
            # time.sleep(100)
            data.add("http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+" ")
            # jump ahead 1000 characters before searching for the next match
            start = index + 1000
    items = open("pystandard.txt","a")
    items.write("".join(data))
    items.close()
    return len(data)

if __name__ == "__main__":
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print("it takes %s Seconds to get %s items "%(end-start,no))
Results of 3 runs:

it takes 11.6800132309 Seconds to get 720 items
it takes 11.3621804427 Seconds to get 720 items
it takes 11.6811991567 Seconds to get 720 items
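The string-search idea can be tried on a small in-memory string. The sketch below is an illustration under assumptions rather than the code that was benchmarked: `find_deal_urls` and `SAMPLE` are hypothetical, the project ids in `SAMPLE` are invented, and the search advances one character at a time instead of using the 5000 and 1000 character offsets from the listing above, which depend on the real page layout.

```python
def find_deal_urls(html, marker="deal-show", id_len=9):
    """Collect detail-page URLs by scanning for `marker` with str.find."""
    urls = set()  # a set drops links that appear more than once
    start = 0
    while True:
        index = html.find(marker, start)
        if index == -1:
            break
        # The marker is followed by "/" and then the id, so the id starts
        # len(marker) + 1 characters after the match.
        id_start = index + len(marker) + 1
        urls.add("http://www.zhongchou.com/deal-show/" + html[id_start:id_start + id_len])
        start = index + 1
    return urls


# Invented sample markup; the ids are placeholders, not real projects.
SAMPLE = (
    '<a href="http://www.zhongchou.com/deal-show/id-000001">A</a>'
    '<a href="http://www.zhongchou.com/deal-show/id-000002">B</a>'
    '<a href="http://www.zhongchou.com/deal-show/id-000001">A again</a>'
)

if __name__ == "__main__":
    for url in sorted(find_deal_urls(SAMPLE)):
        print(url)
```

The fixed-width slice is what makes this approach fast and also what makes it fragile: it assumes every id has the same length and that the marker only occurs inside the links of interest.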
Then the same change is made to the multithreaded version:
# -*- coding:utf-8 -*-
"""
Created on 20160827
@author: qiukang
"""
import requests,time,threading
from BeautifulSoup import BeautifulSoup    # HTML (no longer used in this version)

# request headers
header = {
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding":"gzip, deflate, sdch",
    "Accept-Language":"zh-CN,zh;q=0.8",
    "Connection":"keep-alive",
    "Host":"www.zhongchou.com",
    "Upgrade-Insecure-Requests":"1",
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36"
}

data = set()
no = 0
lock = threading.Lock()

# get the list of project URLs
def getItems(urllist,thno):
    # print urllist
    # print thno," begin at",time.clock()
    global no,lock,data
    for url in urllist:
        r1 = requests.get(url,headers=header)
        html = r1.text.encode("utf8")
        start = 5000
        while True:
            index = html.find("deal-show", start)
            if index == -1:
                break
            # the result set is shared between threads
            lock.acquire()
            data.add("http://www.zhongchou.com/deal-show/"+html[index+10:index+19]+" ")
            start = index + 1000
            lock.release()
    # print thno," end at",time.clock()

if __name__ == "__main__":
    start = time.clock()
    allpage = 30      # number of pages
    allthread = 10    # number of threads
    per = (int)(allpage/allthread)
    urllist = []
    ths = []
    for page in range(allpage):
        if page==0:
            url = "http://www.zhongchou.com/browse/di"
        else:
            url = "http://www.zhongchou.com/browse/di-p"+str(page+1)
        urllist.append(url)
    for i in range(allthread):
        # print urllist[i*(per):(i+1)*(per)]
        low = i*allpage/allthread    # note how this is written
        high = (i+1)*allpage/allthread
        # print low," ",high
        th = threading.Thread(target = getItems,args= (urllist[low:high],i))
        ths.append(th)
    for th in ths:
        th.start()
    for th in ths:
        th.join()
    items = open("pymulti.txt","a")
    items.write("".join(data))
    items.close()
    end = time.clock()
    print("it takes %s Seconds to get %s items "%(end-start,len(data)))
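Results of 3 runs:

it takes 1.4781525123 Seconds to get 720 items
it takes 1.44905954029 Seconds to get 720 items
it takes 1.49297891786 Seconds to get 720 items

So multithreading really is several times faster than the single-threaded version (about 1.5 s versus about 11.5 s here). For a simple crawling task, the built-in string methods are also much faster than parsing the HTML with BeautifulSoup.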
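The multithreaded listing computes each thread's slice as i*allpage/allthread instead of reusing per. One plausible reason, and this is my reading rather than something the listing states, is that multiplying before dividing still covers every page when the page count is not a multiple of the thread count. A small sketch with hypothetical helper names (`split_pages`, `split_with_per`), using `//` so that it also runs on Python 3:

```python
def split_pages(urls, n_threads):
    # Multiply first, then integer-divide: the slice boundaries run from
    # 0 to len(urls) with no gaps, so every page is assigned exactly once.
    total = len(urls)
    chunks = []
    for i in range(n_threads):
        low = i * total // n_threads
        high = (i + 1) * total // n_threads
        chunks.append(urls[low:high])
    return chunks


def split_with_per(urls, n_threads):
    # Fixed chunk size: pages past n_threads * per are never assigned.
    per = len(urls) // n_threads
    return [urls[i * per:(i + 1) * per] for i in range(n_threads)]


if __name__ == "__main__":
    pages = list(range(1, 31))  # stand-ins for the 30 page URLs
    print([len(c) for c in split_pages(pages, 7)])
    print([len(c) for c in split_with_per(pages, 7)])
```

With 30 pages and 10 threads both functions give ten slices of three pages, so the benchmark above is unaffected either way; the two only differ for thread counts such as 7 that do not divide 30.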
NodeJs

// npm install request -g   # this does not seem to work; go into the directory containing the code and run: npm install --save request
// npm install cheerio -g   # npm install --save cheerio
var request = require("request");
var cheerio = require("cheerio");
var fs = require("fs");
var t1 = new Date().getTime();
var allpage = 30;
var urllist = new Array()
var urldata = "";
var mark = 0;
var no = 0;
for (var i=0; i
// [part of the listing is missing from the source at this point]
= 0; i--) {
        // console.log(href[i].attribs["href"]);
        urldata += (href[i].attribs["href"]+" ");
        no += 1;
    }
    mark += 1;
    if (mark==allpage) {
        // console.log(urldata);
        fs.writeFile("./nodestandard.txt",urldata,function(err){
            if(err) throw err;
        });
        var t2 = new Date().getTime();
        console.log("it takes " + ((t2-t1)/1000).toString() + " Seconds to get " + no.toString() + " items");
    }
}
Results of 5 runs:

it takes 3.949 Seconds to get 720 items
it takes 3.642 Seconds to get 720 items
it takes 3.641 Seconds to get 720 items
it takes 3.938 Seconds to get 720 items
it takes 3.783 Seconds to get 720 items

So with the same approach of parsing the HTML, nodejs completely outpaces Python (under 4 s versus more than 40 s). What about string searching?
var request = require("request");
var cheerio = require("cheerio");
var fs = require("fs");
var t1 = new Date().getTime();
var allpage = 30;
var urllist = new Array() ;
var urldata = new Array();
var mark = 0;
var no = 0;
for (var i=0; i
// [the rest of the listing is missing from the source]

Results of 5 runs:

it takes 3.695 Seconds to get 720 items
it takes 3.781 Seconds to get 720 items
it takes 3.94 Seconds to get 720 items
it takes 3.705 Seconds to get 720 items
it takes 3.601 Seconds to get 720 items

As you can see, this takes about the same time as the parsing version.
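To sum up, based on what I know and on this experiment, my conclusion is: with multithreading, Python's download speed can beat nodejs, but Python is not as fast as nodejs at parsing web pages. After all, JS was born for writing web pages, and a complex crawler can't do everything with string searching.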
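2016.9.13 addendum: regarding time.time(), which was raised in the comments, thanks to the veteran who pointed out my mistake. I switched the Python multithreaded string-search version to time.time(); over 3 runs the average was 2.3 s, which is still faster than the nodejs version. Could the difference you saw come from your network environment being different from mine? I plan to try again from a different classroom... As for whether this misleads anyone, I think readers will try it for themselves and draw their own conclusions.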
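Some background on why the choice of timer matters: on Unix, time.clock() returned processor time, which does not advance while a thread is blocked waiting on the network, whereas on Windows it returned wall-clock time; the function was removed in Python 3.8. The sketch below uses the Python 3 replacements time.perf_counter() (wall clock) and time.process_time() (CPU time) to show how far apart the two can be for I/O-bound work. `measure` and `fake_download` are hypothetical names, and the sleep only stands in for waiting on a response.

```python
import time


def measure(fn):
    # Return (wall-clock seconds, CPU seconds) spent inside fn().
    wall0 = time.perf_counter()
    cpu0 = time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0


def fake_download():
    # Waiting on the network uses almost no CPU; sleep imitates that.
    time.sleep(0.3)


if __name__ == "__main__":
    wall, cpu = measure(fake_download)
    print("wall clock: %.3f s, CPU time: %.3f s" % (wall, cpu))
```

For a crawler benchmark the wall-clock figure is the one that corresponds to what the nodejs versions measure with new Date().getTime().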
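Python does have asynchronous I/O (twisted), and nodejs does have multiple processes (child_process). A comparison that pushes for maximum performance would need a much deeper study of both languages. I only half understand this myself at the moment; I'll spend time on it as soon as I can and try to carry out that comparison. (The goal here is not to compare programming styles, but simply to compare the best each language can do on the same machine and the same network. The road is long and hard.)
Then there is the parser. I used the parser that comes with Python here. The official site says lxml is indeed faster than the built-in one, but after I switched, multithreading still showed no advantage, so I'm still puzzled: does BeautifulSoup not support multithreading? I couldn't find any documentation about this on the official site, so advice is welcome. Also, from BeautifulSoup import BeautifulSoup is indeed much slower than from bs4 import BeautifulSoup; this is down to the BeautifulSoup version. Thanks to the commenter who pointed it out.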