As everyone knows, Baidu Yun hosts a huge number of shared resources: software, video tutorials of every kind, e-books, even movies and BT torrents. Yet Baidu Yun itself offers no search over them, and hunting for software and US TV shows by hand is painful. So I decided to try building a search system for Baidu Yun resources.
Resource crawler approach:
What matters most for a search engine is a large pool of resources; once you have those, layering full-text retrieval on top already gives you a simple search engine. So the first step is to crawl Baidu Yun's shared resources. The idea: open any sharer's home page, yun.baidu.com/share/home?uk=xxxxxx&view=share#category/type=0, and you will see that every sharer has subscriptions (follows) and fans. By recursively traversing follows and fans you can collect a large number of sharer uk values, and through them a large number of shared resources.
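A minimal sketch of that traversal (the getfollowlist endpoint and the follow_list / follow_uk field names match what the API returned at the time; the breadth-first loop and stopping condition are illustrative):

import json
import requests

URL_FOLLOW = "http://yun.baidu.com/pcloud/friend/getfollowlist?query_uk={uk}&limit=20&start={start}"

def collect_uks(seed_uk, max_users=1000):
    # Breadth-first walk over the "follows" graph, collecting sharer uks.
    seen = set([seed_uk])
    queue = [seed_uk]
    while queue and len(seen) < max_users:
        uk = queue.pop(0)
        data = json.loads(requests.get(URL_FOLLOW.format(uk=uk, start=0)).text)
        for item in data.get("follow_list", []):
            new_uk = item["follow_uk"]
            if new_uk not in seen:
                seen.add(new_uk)
                queue.append(new_uk)
    return seen

The real crawler below does the same walk, but persists the frontier in MySQL so it can be paged, resumed, and shared across threads.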
System environment:
Language: Python
Operating system: Linux
Other middleware: nginx, MySQL, Sphinx
The system consists of several independent parts:
A standalone resource crawler built on requests
A resource indexer built on the open-source full-text search engine Sphinx
A simple website built with Django + Bootstrap 3, deployed on nginx 1.8 + FastCGI (flup) + Python (a minimal config sketch follows). Demo site: http://www.itjujiao.com
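For reference, a minimal nginx-to-flup setup of this kind might look like the following (port, paths, and server_name are assumptions; Django of that era could be served by flup via the since-removed "python manage.py runfcgi host=127.0.0.1 port=8801" command):

# nginx.conf fragment (illustrative)
server {
    listen 80;
    server_name www.itjujiao.com;

    location / {
        # hand every request to the flup FastCGI server running Django
        fastcgi_pass 127.0.0.1:8801;
        include fastcgi_params;
        fastcgi_param PATH_INFO $fastcgi_script_name;
    }
}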
后續(xù)優(yōu)化:
Word segmentation. Segmented search results are currently poor, and pointers from anyone experienced would be welcome. For example, searching “功夫熊貓之卷軸的秘密” (Kung Fu Panda: Secrets of the Scroll) returns nothing at all, while “功夫熊貓” does return results (功丶夫熊貓⒊英語(yǔ)中英字幕.mp4, 功丶夫熊貓2.Kung.Fu.Panda.2.2011.BDrip.720P.國(guó)粵英臺(tái)四語(yǔ).特效中英字幕.mp4, 功丶夫熊貓3(韓版)2016.高清中字.mkv, etc.), as does “卷軸的秘密” ([美國(guó)]功夫潘達(dá)之卷軸的秘密.2016.1080p.mp4, g夫熊貓之卷軸的秘密.HD1280超清中英雙字.mp4, etc.).
Data deduplication. Many of the crawled entries turn out to be copies of the same shared resource; I plan to deduplicate based on MD5, as in the sketch below.
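A minimal sketch of that dedup (it assumes the key is a normalized filename, since the crawler stores only share metadata, not file contents; the normalization rule is likewise an assumption):

import hashlib

def dedup_key(filename):
    # Collapse trivially different copies of the same resource to one key.
    # strip + lowercase is an assumed normalization, not the final rule.
    normalized = filename.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

A UNIQUE index on this key in the share tables would then let MySQL reject duplicates at insert time.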
PS:
The crawler has fetched roughly 40 million records so far, and Sphinx's memory requirements are enormous; a real pitfall.
Baidu applies IP-based rate limits to crawlers, so I wrote a simple collector for xicidaili proxies; requests can be pointed at an HTTP proxy, for example:
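import requests

# Route the request through a scraped HTTP proxy
# (the proxy address is one of the entries scraped above; the uk is illustrative)
proxies = {"http": "http://42.121.33.160:809"}
r = requests.get(
    "http://yun.baidu.com/pcloud/friend/getfollowlist?query_uk=123456&limit=20&start=0",
    proxies=proxies, timeout=10)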
Word segmentation is Sphinx's built-in implementation. It supports Chinese, but via unigram segmentation (one character per token), which over-segments and gives mediocre results: searching for “葉問(wèn)3” (Ip Man 3) also returns things like “葉子的問(wèn)題第3版” (roughly "problems of leaves, 3rd edition"), which is not what anyone intends. English tokenization has room for improvement too: searching xart does not match x-art, even though x-art is exactly the result set I want (you know why).
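The unigram behaviour comes from Sphinx's built-in N-gram options; the usual CJK setup (a sketch of the standard settings, not necessarily the exact config used here) looks like:

# sphinx.conf fragment: treat every CJK character as one token
index share_index
{
    # source, path, etc. omitted
    ngram_len   = 1
    ngram_chars = U+4E00..U+9FBF    # CJK ideograph range (indicative)
}

With ngram_len = 1 every Chinese character becomes its own token, which is precisely why “葉問(wèn)3” can match documents that merely contain 葉, 問(wèn), and 3 separately.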
The database is MySQL. With the practical row limit of a single table in mind, the resource table is split into 10 tables. Sphinx builds a full index after the first complete crawl, and incremental (delta) indexes afterwards; the standard pattern is sketched below.
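Sphinx's standard answer to incremental indexing is the main + delta pattern: a counter table records the highest id covered by the full index, and the delta index only picks up rows beyond it (a sketch; the table, column, and counter names are assumptions):

# sphinx.conf fragment: main + delta indexing over one of the share tables
source share_main
{
    # remember how far the full index got
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM share_0
    sql_query     = SELECT id, userid, filename FROM share_0 \
                    WHERE id <= (SELECT max_id FROM sph_counter WHERE counter_id = 1)
}

source share_delta : share_main
{
    sql_query_pre =
    sql_query     = SELECT id, userid, filename FROM share_0 \
                    WHERE id > (SELECT max_id FROM sph_counter WHERE counter_id = 1)
}

The delta index can be rebuilt frequently and folded into the main one (indexer --merge) at longer intervals.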
Crawler implementation code (this just shows the approach; the code is a bit messy):
#coding: utf8
import re
import time
import json
import threading
from Queue import Queue

import requests
import MySQLdb as mdb

DB_HOST = "127.0.0.1"
DB_USER = "root"
DB_PASS = ""

# extract scheduling parameters back out of the fetched URL
re_start = re.compile(r"start=(\d+)")
re_uid = re.compile(r"query_uk=(\d+)")
re_pptt = re.compile(r"&pptt=(\d+)")
re_urlid = re.compile(r"&urlid=(\d+)")

ONEPAGE = 20        # page size for follow/fans lists
ONESHAREPAGE = 20   # page size for share lists

URL_SHARE = "http://yun.baidu.com/pcloud/feed/getsharelist?auth_type=1&start={start}&limit=20&query_uk={uk}&urlid={id}"
URL_FOLLOW = "http://yun.baidu.com/pcloud/friend/getfollowlist?query_uk={uk}&limit=20&start={start}&urlid={id}"
URL_FANS = "http://yun.baidu.com/pcloud/friend/getfanslist?query_uk={uk}&limit=20&start={start}&urlid={id}"

QNUM = 1000
hc_q = Queue(20)    # pending request queue
hc_r = Queue(QNUM)  # fetched response queue

success = 0
failed = 0

PROXY_LIST = [[0, 10, "42.121.33.160", 809, "", "", 0],
              [5, 0, "218.97.195.38", 81, "", "", 0]]

def req_worker(inx):
    # fetch URLs from hc_q and push (body, url) onto hc_r
    s = requests.Session()
    while True:
        req_item = hc_q.get()
        req_type = req_item[0]
        url = req_item[1]
        r = s.get(url)
        hc_r.put((r.text, url))
        print "req_worker#", inx, url

def response_worker():
    # parse fetched pages: store users/shares, schedule remaining pages
    dbconn = mdb.connect(DB_HOST, DB_USER, DB_PASS, "baiduyun", charset="utf8")
    dbcurr = dbconn.cursor()
    dbcurr.execute("SET NAMES utf8")
    dbcurr.execute("set global wait_timeout=60000")
    while True:
        metadata, effective_url = hc_r.get()
        #print "response_worker:", effective_url
        try:
            tnow = int(time.time())
            id = re_urlid.findall(effective_url)[0]
            start = re_start.findall(effective_url)[0]
            if "getfollowlist" in effective_url:    # type = 1
                follows = json.loads(metadata)
                uid = re_uid.findall(effective_url)[0]
                # on the first page, enqueue all remaining pages
                if "total_count" in follows.keys() and follows["total_count"] > 0 and str(start) == "0":
                    for i in range((follows["total_count"] - 1) / ONEPAGE):
                        try:
                            dbcurr.execute("INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 1, 0)"
                                           % (uid, str(ONEPAGE * (i + 1)), str(ONEPAGE)))
                        except Exception as ex:
                            print "E1", str(ex)
                if "follow_list" in follows.keys():
                    for item in follows["follow_list"]:
                        try:
                            dbcurr.execute("INSERT INTO user(userid, username, files, status, downloaded, lastaccess) VALUES(%s, '%s', 0, 0, 0, %s)"
                                           % (item["follow_uk"], item["follow_uname"], str(tnow)))
                        except Exception as ex:
                            print "E13", str(ex)
                else:
                    print "delete 1", uid, start
                    dbcurr.execute("delete from urlids where uk=%s and type=1 and start>%s" % (uid, start))
            elif "getfanslist" in effective_url:    # type = 2
                fans = json.loads(metadata)
                uid = re_uid.findall(effective_url)[0]
                if "total_count" in fans.keys() and fans["total_count"] > 0 and str(start) == "0":
                    for i in range((fans["total_count"] - 1) / ONEPAGE):
                        try:
                            dbcurr.execute("INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 2, 0)"
                                           % (uid, str(ONEPAGE * (i + 1)), str(ONEPAGE)))
                        except Exception as ex:
                            print "E2", str(ex)
                if "fans_list" in fans.keys():
                    for item in fans["fans_list"]:
                        try:
                            dbcurr.execute("INSERT INTO user(userid, username, files, status, downloaded, lastaccess) VALUES(%s, '%s', 0, 0, 0, %s)"
                                           % (item["fans_uk"], item["fans_uname"], str(tnow)))
                        except Exception as ex:
                            print "E23", str(ex)
                else:
                    print "delete 2", uid, start
                    dbcurr.execute("delete from urlids where uk=%s and type=2 and start>%s" % (uid, start))
            else:                                   # type = 0: share list
                shares = json.loads(metadata)
                uid = re_uid.findall(effective_url)[0]
                if "total_count" in shares.keys() and shares["total_count"] > 0 and str(start) == "0":
                    for i in range((shares["total_count"] - 1) / ONESHAREPAGE):
                        try:
                            dbcurr.execute("INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 0, 0)"
                                           % (uid, str(ONESHAREPAGE * (i + 1)), str(ONESHAREPAGE)))
                        except Exception as ex:
                            print "E3", str(ex)
                if "records" in shares.keys():
                    for item in shares["records"]:
                        try:
                            dbcurr.execute("INSERT INTO share(userid, filename, shareid, status) VALUES(%s, '%s', %s, 0)"
                                           % (uid, item["title"], item["shareid"]))
                        except Exception as ex:
                            #print "E33", str(ex), item
                            pass
                else:
                    print "delete 0", uid, start
                    dbcurr.execute("delete from urlids where uk=%s and type=0 and start>%s" % (uid, str(start)))
            dbcurr.execute("delete from urlids where id=%s" % (id,))
            dbconn.commit()
        except Exception as ex:
            print "E5", str(ex), id
        # release the proxy slot this request was using, if any
        pid = re_pptt.findall(effective_url)
        if pid:
            print "pid>>>", pid
            ppid = int(pid[0])
            PROXY_LIST[ppid][6] -= 1
    dbcurr.close()
    dbconn.close()

def worker():
    # turn pending urlids/user rows into concrete URLs for req_worker
    global success, failed
    dbconn = mdb.connect(DB_HOST, DB_USER, DB_PASS, "baiduyun", charset="utf8")
    dbcurr = dbconn.cursor()
    dbcurr.execute("SET NAMES utf8")
    dbcurr.execute("set global wait_timeout=60000")
    while True:
        #dbcurr.execute("select * from urlids where status=0 order by type limit 1")
        dbcurr.execute("select * from urlids where status=0 and type>0 limit 1")
        d = dbcurr.fetchall()
        if d:
            id = d[0][0]
            uk = d[0][1]
            start = d[0][2]
            limit = d[0][3]
            type = d[0][4]
            dbcurr.execute("update urlids set status=1 where id=%s" % (str(id),))
            url = ""
            if type == 0:
                url = URL_SHARE.format(uk=uk, start=start, id=id).encode("utf-8")
            elif type == 1:
                url = URL_FOLLOW.format(uk=uk, start=start, id=id).encode("utf-8")
            elif type == 2:
                url = URL_FANS.format(uk=uk, start=start, id=id).encode("utf-8")
            if url:
                hc_q.put((type, url))
                #print "processed", url
        else:
            # no pending pages: seed new tasks from unvisited users
            dbcurr.execute("select * from user where status=0 limit 1000")
            d = dbcurr.fetchall()
            if d:
                for item in d:
                    try:
                        dbcurr.execute("insert into urlids(uk, start, limited, type, status) values('%s', 0, %s, 0, 0)" % (item[1], str(ONESHAREPAGE)))
                        dbcurr.execute("insert into urlids(uk, start, limited, type, status) values('%s', 0, %s, 1, 0)" % (item[1], str(ONEPAGE)))
                        dbcurr.execute("insert into urlids(uk, start, limited, type, status) values('%s', 0, %s, 2, 0)" % (item[1], str(ONEPAGE)))
                        dbcurr.execute("update user set status=1 where userid=%s" % (item[1],))
                    except Exception as ex:
                        print "E6", str(ex)
            else:
                time.sleep(1)
        dbconn.commit()
    dbcurr.close()
    dbconn.close()

# 16 fetcher threads, 1 scheduler thread; the main thread parses responses
for item in range(16):
    t = threading.Thread(target=req_worker, args=(item,))
    t.setDaemon(True)
    t.start()

s = threading.Thread(target=worker, args=())
s.setDaemon(True)
s.start()

response_worker()