

Design and architecture of a Python-based Baidu Yun (Baidu cloud drive) resource search engine

williamwen1986 / 2988 reads

Abstract: Everyone knows Baidu Yun holds plenty of shared resources, including software, all kinds of video tutorials, e-books, and even movies and torrents, yet Baidu Yun offers no corresponding search feature. So I set out to build a search system for Baidu Yun resources.

Everyone knows Baidu Yun (Baidu's cloud drive) holds plenty of shared resources: software, all kinds of video tutorials, e-books, even movies and BT torrents, you name it. Yet Baidu Yun offers no corresponding search feature, which makes hunting down software or US TV shows a real pain. So I set out to build a search system for Baidu Yun resources.

Crawler approach:
The most important thing for a search engine is a large pool of resources; once you have the resources, layering full-text retrieval on top of them already gives you a simple search engine. So the first step is to crawl Baidu Yun's shared resources. The idea: open any sharer's home page, yun.baidu.com/share/home?uk=xxxxxx&view=share#category/type=0, and you will see that each sharer has subscriptions ("follows") and fans. Recursively traversing the follows and fans yields a large set of sharer uk values, and through them a large set of shared resources.
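The follows/fans traversal described above is essentially a breadth-first walk over sharer uk values. A minimal sketch of that walk, where fetch_related is a hypothetical stand-in for the real getfollowlist/getfanslist requests:

```python
from collections import deque

def crawl_uks(seed_uks, fetch_related, max_uks=1000):
    """Breadth-first traversal of sharer uk values.

    fetch_related(uk) is a placeholder for the real
    getfollowlist/getfanslist calls; it should return an
    iterable of neighbouring uk values.
    """
    seen = set(seed_uks)
    queue = deque(seed_uks)
    while queue and len(seen) < max_uks:
        uk = queue.popleft()
        for neighbour in fetch_related(uk):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

# Toy relation graph standing in for the follow/fan lists:
graph = {1: [2, 3], 2: [4], 3: [4, 5], 4: [], 5: [1]}
print(sorted(crawl_uks([1], lambda uk: graph.get(uk, []))))  # [1, 2, 3, 4, 5]
```

The real crawler below does the same thing, but persists the frontier in MySQL (the urlids table) instead of an in-memory queue, so the crawl survives restarts.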

System environment

語(yǔ)言:python

Operating system: Linux

Other middleware: nginx, MySQL, Sphinx

系統(tǒng)包括幾個(gè)獨(dú)立的部分

A standalone resource crawler built on requests

A resource indexer built on Sphinx, the open-source full-text search engine

A simple website built with Django + Bootstrap 3, served with nginx 1.8 + FastCGI (flup) + Python. Demo site: http://www.itjujiao.com

后續(xù)優(yōu)化

Word segmentation. The segmented search results are currently not ideal; pointers from anyone experienced would be welcome. For example, searching "功夫熊貓之卷軸的秘密" (Kung Fu Panda: Secrets of the Scroll) returns no results at all, while searching "功夫熊貓" does return results (功丶夫熊貓⒊英語中英字幕.mp4, 功丶夫熊貓2.Kung.Fu.Panda.2.2011.BDrip.720P.國粵英台四語.特效中英字幕.mp4, 功丶夫熊貓3(韓版)2016.高清中字.mkv, etc.), as does searching "卷軸的秘密" ([美國]功夫潘達之卷軸的秘密.2016.1080p.mp4, g夫熊貓之卷軸的秘密.HD1280超清中英雙字.mp4, etc.).

數(shù)據(jù)去重,目前發(fā)現(xiàn)抓取的數(shù)據(jù)很多是共享資源,后續(xù)考慮基于MD5去重

PS:

The crawler has fetched about 40 million records so far; Sphinx's memory requirements are enormous, a real pitfall.
Baidu applies IP-based rate limiting to crawlers, so I wrote a simple scraper for xicidaili proxy lists; requests can be configured with an HTTP proxy.

Segmentation is Sphinx's built-in implementation. It supports Chinese via unigram (single-character) segmentation, which over-segments somewhat, so results are not ideal: searching the keyword "葉問3" (Ip Man 3), for instance, returns "葉子的問題第3版" ("Leaf problems, 3rd edition") among the hits, which is not what was intended. English tokenization also has room for improvement: searching xart does not surface x-art results, even though x-art is exactly what I want (you know why).
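The "葉問3" mismatch follows directly from unigram indexing: every CJK character becomes its own token, so any document containing all of a query's characters can match. A toy illustration:

```python
def unigrams(text):
    """Unigram tokenization: every character is its own token."""
    return {ch for ch in text if not ch.isspace()}

query = "葉問3"            # Ip Man 3
doc = "葉子的問題第3版"     # "Leaf problems, 3rd edition"
# Every character of the query appears somewhere in the document,
# so a unigram index treats the document as a match:
print(unigrams(query) <= unigrams(doc))  # True
```

A bigram (two-character) tokenizer or a dictionary-based segmenter would keep 葉問 together as one token and avoid this false match.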

數(shù)據(jù)庫(kù)是mysql,資源表,考慮單表記錄上限,分了10個(gè)表。第一次爬完sphinx做全量索引,后續(xù)做增量索引。

Crawler implementation (just to show the idea; the code is a bit messy):

    #coding: utf8
    
    import re
    import urllib2
    import time
    from Queue import Queue
    import threading, errno, datetime
    import json
    import requests
    import MySQLdb as mdb
    
    DB_HOST = "127.0.0.1"
    DB_USER = "root"
    DB_PASS = ""
    
    
    re_start = re.compile(r"start=(\d+)")
    re_uid = re.compile(r"query_uk=(\d+)")
    re_pptt = re.compile(r"&pptt=(\d+)")
    re_urlid = re.compile(r"&urlid=(\d+)")
    
    ONEPAGE = 20
    ONESHAREPAGE = 20
    
    URL_SHARE = "http://yun.baidu.com/pcloud/feed/getsharelist?auth_type=1&start={start}&limit=20&query_uk={uk}&urlid={id}"
    URL_FOLLOW = "http://yun.baidu.com/pcloud/friend/getfollowlist?query_uk={uk}&limit=20&start={start}&urlid={id}"
    URL_FANS = "http://yun.baidu.com/pcloud/friend/getfanslist?query_uk={uk}&limit=20&start={start}&urlid={id}"
    
    QNUM = 1000
    hc_q = Queue(20)
    hc_r = Queue(QNUM)
    
    success = 0
    failed = 0
    
    PROXY_LIST = [[0, 10, "42.121.33.160", 809, "", "", 0],
                    [5, 0, "218.97.195.38", 81, "", "", 0],
                    ]
    
    def req_worker(inx):
        s = requests.Session()
        while True:
            req_item = hc_q.get()
            
            req_type = req_item[0]
            url = req_item[1]
            r = s.get(url)
            hc_r.put((r.text, url))
            print "req_worker#", inx, url
            
    def response_worker():
        dbconn = mdb.connect(DB_HOST, DB_USER, DB_PASS, "baiduyun", charset="utf8")
        dbcurr = dbconn.cursor()
        dbcurr.execute("SET NAMES utf8")
        dbcurr.execute("set global wait_timeout=60000")
        while True:
            
            metadata, effective_url = hc_r.get()
            #print "response_worker:", effective_url
            try:
                tnow = int(time.time())
                id = re_urlid.findall(effective_url)[0]
                start = re_start.findall(effective_url)[0]
                if True:
                    if "getfollowlist" in effective_url: #type = 1
                        follows = json.loads(metadata)
                        uid = re_uid.findall(effective_url)[0]
                        if "total_count" in follows.keys() and follows["total_count"]>0 and str(start) == "0":
                            for i in range((follows["total_count"]-1)/ONEPAGE):
                                try:
                                    dbcurr.execute("INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 1, 0)" % (uid, str(ONEPAGE*(i+1)), str(ONEPAGE)))
                                except Exception as ex:
                                    print "E1", str(ex)
                                    pass
                        
                        if "follow_list" in follows.keys():
                            for item in follows["follow_list"]:
                                try:
                                    dbcurr.execute("INSERT INTO user(userid, username, files, status, downloaded, lastaccess) VALUES(%s, '%s', 0, 0, 0, %s)" % (item["follow_uk"], item["follow_uname"], str(tnow)))
                                except Exception as ex:
                                    print "E13", str(ex)
                                    pass
                        else:
                            print "delete 1", uid, start
                            dbcurr.execute("delete from urlids where uk=%s and type=1 and start>%s" % (uid, start))
                    elif "getfanslist" in effective_url: #type = 2
                        fans = json.loads(metadata)
                        uid = re_uid.findall(effective_url)[0]
                        if "total_count" in fans.keys() and fans["total_count"]>0 and str(start) == "0":
                            for i in range((fans["total_count"]-1)/ONEPAGE):
                                try:
                                    dbcurr.execute("INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 2, 0)" % (uid, str(ONEPAGE*(i+1)), str(ONEPAGE)))
                                except Exception as ex:
                                    print "E2", str(ex)
                                    pass
                        
                        if "fans_list" in fans.keys():
                            for item in fans["fans_list"]:
                                try:
                                    dbcurr.execute("INSERT INTO user(userid, username, files, status, downloaded, lastaccess) VALUES(%s, '%s', 0, 0, 0, %s)" % (item["fans_uk"], item["fans_uname"], str(tnow)))
                                except Exception as ex:
                                    print "E23", str(ex)
                                    pass
                        else:
                            print "delete 2", uid, start
                            dbcurr.execute("delete from urlids where uk=%s and type=2 and start>%s" % (uid, start))
                    else:
                        shares = json.loads(metadata)
                        uid = re_uid.findall(effective_url)[0]
                        if "total_count" in shares.keys() and shares["total_count"]>0 and str(start) == "0":
                            for i in range((shares["total_count"]-1)/ONESHAREPAGE):
                                try:
                                    dbcurr.execute("INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 0, 0)" % (uid, str(ONESHAREPAGE*(i+1)), str(ONESHAREPAGE)))
                                except Exception as ex:
                                    print "E3", str(ex)
                                    pass
                        if "records" in shares.keys():
                            for item in shares["records"]:
                                try:
                                    dbcurr.execute("INSERT INTO share(userid, filename, shareid, status) VALUES(%s, '%s', %s, 0)" % (uid, item["title"], item["shareid"]))
                                except Exception as ex:
                                    #print "E33", str(ex), item
                                    pass
                        else:
                            print "delete 0", uid, start
                            dbcurr.execute("delete from urlids where uk=%s and type=0 and start>%s" % (uid, str(start)))
                    dbcurr.execute("delete from urlids where id=%s" % (id, ))
                    dbconn.commit()
            except Exception as ex:
                print "E5", str(ex), id
    
            
            pid = re_pptt.findall(effective_url)
            
            if pid:
                print "pid>>>", pid
                ppid = int(pid[0])
                PROXY_LIST[ppid][6] -= 1
        dbcurr.close()
        dbconn.close()
        
    def worker():
        global success, failed
        dbconn = mdb.connect(DB_HOST, DB_USER, DB_PASS, "baiduyun", charset="utf8")
        dbcurr = dbconn.cursor()
        dbcurr.execute("SET NAMES utf8")
        dbcurr.execute("set global wait_timeout=60000")
        while True:
    
            #dbcurr.execute("select * from urlids where status=0 order by type limit 1")
            dbcurr.execute("select * from urlids where status=0 and type>0 limit 1")
            d = dbcurr.fetchall()
            #print d
            if d:
                id = d[0][0]
                uk = d[0][1]
                start = d[0][2]
                limit = d[0][3]
                type = d[0][4]
                dbcurr.execute("update urlids set status=1 where id=%s" % (str(id),))
                url = ""
                if type == 0:
                    url = URL_SHARE.format(uk=uk, start=start, id=id).encode("utf-8")
                elif  type == 1:
                    url = URL_FOLLOW.format(uk=uk, start=start, id=id).encode("utf-8")
                elif type == 2:
                    url = URL_FANS.format(uk=uk, start=start, id=id).encode("utf-8")
                if url:
                    hc_q.put((type, url))
                    
                #print "processed", url
            else:
                dbcurr.execute("select * from user where status=0 limit 1000")
                d = dbcurr.fetchall()
                if d:
                    for item in d:
                        try:
                            dbcurr.execute("insert into urlids(uk, start, limited, type, status) values('%s', 0, %s, 0, 0)" % (item[1], str(ONESHAREPAGE)))
                            dbcurr.execute("insert into urlids(uk, start, limited, type, status) values('%s', 0, %s, 1, 0)" % (item[1], str(ONEPAGE)))
                            dbcurr.execute("insert into urlids(uk, start, limited, type, status) values('%s', 0, %s, 2, 0)" % (item[1], str(ONEPAGE)))
                            dbcurr.execute("update user set status=1 where userid=%s" % (item[1],))
                        except Exception as ex:
                            print "E6", str(ex)
                else:
                    time.sleep(1)
                    
            dbconn.commit()
        dbcurr.close()
        dbconn.close()
            
        
    for item in range(16):    
        t = threading.Thread(target = req_worker, args = (item,))
        t.setDaemon(True)
        t.start()
    
    s = threading.Thread(target = worker, args = ())
    s.setDaemon(True)
    s.start()
    
    response_worker()

This article is copyrighted by its author. Please do not reproduce it without permission. If it violates any rules, you may contact the administrator to have it removed.

轉(zhuǎn)載請(qǐng)注明本文地址:http://systransis.cn/yun/37931.html

相關(guān)文章

  • 亦真亦幻 彈性云網(wǎng)絡(luò)

    摘要:運(yùn)營(yíng)商網(wǎng)絡(luò)大致可劃分為四朵云公有云平臺(tái)云云網(wǎng)絡(luò)云。網(wǎng)絡(luò)即云,云網(wǎng)一體化將成為未來(lái)運(yùn)營(yíng)商網(wǎng)絡(luò)的最顯著特征。 5月25日消息互聯(lián)網(wǎng)+是要讓信息技術(shù)、網(wǎng)絡(luò)技術(shù)深度融合于經(jīng)濟(jì)社會(huì)各領(lǐng)域之中,使互聯(lián)網(wǎng)下沉為各行各業(yè)都能調(diào)用的基礎(chǔ)設(shè)施資源。預(yù)計(jì)到2025年,全球?qū)⒂?5億互聯(lián)網(wǎng)用戶,使用80億個(gè)智能手機(jī),創(chuàng)建1000億個(gè)連接,產(chǎn)生176ZB的數(shù)據(jù)流量,全面實(shí)現(xiàn)泛在的連接。在未來(lái),網(wǎng)絡(luò)需要滿足海量終端的接...

    Jiavan 評(píng)論0 收藏0
  • 實(shí)用開源百度云分享爬蟲項(xiàng)目yunshare - 安裝篇

    摘要:今天開源了一個(gè)百度云網(wǎng)盤爬蟲項(xiàng)目,地址是。推薦使用命令安裝依賴,最簡(jiǎn)單的安裝方式更多安裝的命令可以去上面找。啟動(dòng)項(xiàng)目使用進(jìn)行進(jìn)程管理,運(yùn)行啟動(dòng)所有的后臺(tái)任務(wù),檢查任務(wù)是否正常運(yùn)行可以用命令,正常運(yùn)行的應(yīng)該有個(gè)任務(wù)。 今天開源了一個(gè)百度云網(wǎng)盤爬蟲項(xiàng)目,地址是https://github.com/callmelanmao/yunshare。 百度云分享爬蟲項(xiàng)目 github上有好幾個(gè)這樣的...

    lei___ 評(píng)論0 收藏0
  • mineserver:一家騙子IDC,垃圾服務(wù)商家,國(guó)人以英文站放海外運(yùn)營(yíng)!

    摘要:怎么樣繼一家引力主機(jī)企鵝小屋之后,又一家不良垃圾服務(wù)商,客戶工單全部不會(huì)回復(fù),極有可能會(huì)成為又一家跑路商家。果然,證明這是一家無(wú)良的了官網(wǎng)目前已經(jīng)失聯(lián),客戶服務(wù)器出問(wèn)題只能提交工單處理。mineserver怎么樣?mineserver繼一家inlicloud引力主機(jī)、企鵝小屋之后,又一家不良垃圾服務(wù)商,客戶工單全部不會(huì)回復(fù),極有可能會(huì)成為又一家跑路IDC商家。果然,mineserver證明這...

    terro 評(píng)論0 收藏0
  • 用做WPS思路重寫了一套私有云系統(tǒng)

    摘要:年初,金山啟動(dòng)私有云項(xiàng)目,該項(xiàng)目旨在為向金山提出了私有云網(wǎng)盤存儲(chǔ)需求的政府大型企業(yè)以及中型企業(yè)提供服務(wù),項(xiàng)目組由金山云楊鋼牽頭組建。中文站對(duì)楊鋼進(jìn)行了專訪,了解其私有云服務(wù)的技術(shù)組成和業(yè)務(wù)狀態(tài)。 2013年初,金山啟動(dòng)私有云項(xiàng)目,該項(xiàng)目旨在為向金山提出了私有云網(wǎng)盤/存儲(chǔ)需求的政府、大型企業(yè)以及中型企業(yè)提供服務(wù),項(xiàng)目組由金山云CTO楊鋼牽頭組建。InfoQ中文站對(duì)楊鋼進(jìn)行了專訪,了解其私有云服...

    Achilles 評(píng)論0 收藏0
  • 云網(wǎng)融合,擴(kuò)展運(yùn)營(yíng)商B2B商業(yè)邊界

    摘要:華為云網(wǎng)融合解決方案使能運(yùn)營(yíng)商增長(zhǎng)運(yùn)營(yíng)商基礎(chǔ)網(wǎng)絡(luò)設(shè)施優(yōu)勢(shì)明顯,網(wǎng)絡(luò)覆蓋廣接入媒介全機(jī)房光纜豐富。目前,在中國(guó)歐洲及東南亞等全球多個(gè)國(guó)家與地區(qū),華為已與多家運(yùn)營(yíng)商在云網(wǎng)融合領(lǐng)域開展商業(yè)合作,支撐運(yùn)營(yíng)商產(chǎn)品升級(jí),提升運(yùn)營(yíng)商競(jìng)爭(zhēng)力。企業(yè)ICT需求4大變化Gartner調(diào)研顯示,企業(yè)上云不是一蹴而就,而是根據(jù)應(yīng)用復(fù)雜性和上云后的業(yè)務(wù)風(fēng)險(xiǎn),由低至高逐步將企業(yè)應(yīng)用遷移至云上。隨著企業(yè)上云的不斷深入,業(yè)務(wù)...

    Lemon_95 評(píng)論0 收藏0

發(fā)表評(píng)論

0條評(píng)論

最新活動(dòng)
閱讀需要支付1元查看
<