As everyone knows, Baidu Yun hosts a huge number of shared resources: software, video tutorials of every kind, e-books, even movies and BT torrents. Yet Baidu Yun itself offers no search over them, and hunting for software and US TV shows by hand is painful. So I decided to try building a search system for Baidu Yun resources.
Resource crawler approach:
What matters most for a search engine is a large pool of resources; once you have those, layering full-text retrieval on top already gives you a simple search engine. So the first step is to crawl Baidu Yun's shared resources. The idea: open any sharer's home page, yun.baidu.com/share/home?uk=xxxxxx&view=share#category/type=0, and you will see that every sharer has subscriptions (follows) and fans. By recursively traversing follows and fans you can collect a large number of sharer uk values, and through them a large number of shared resources.
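A minimal sketch of that traversal (the getfollowlist endpoint and the follow_list / follow_uk field names match what the API returned at the time; the breadth-first loop and stopping condition are illustrative):

import json
import requests

URL_FOLLOW = "http://yun.baidu.com/pcloud/friend/getfollowlist?query_uk={uk}&limit=20&start={start}"

def collect_uks(seed_uk, max_users=1000):
    # Breadth-first walk over the "follows" graph, collecting sharer uks.
    seen = set([seed_uk])
    queue = [seed_uk]
    while queue and len(seen) < max_users:
        uk = queue.pop(0)
        data = json.loads(requests.get(URL_FOLLOW.format(uk=uk, start=0)).text)
        for item in data.get("follow_list", []):
            new_uk = item["follow_uk"]
            if new_uk not in seen:
                seen.add(new_uk)
                queue.append(new_uk)
    return seen

The real crawler below does the same walk, but persists the frontier in MySQL so it can be paged, resumed, and shared across threads.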
System environment:
Language: Python
Operating system: Linux
Other middleware: nginx, MySQL, Sphinx
The system consists of several independent parts:
A standalone resource crawler built on requests
A resource indexer built on the open-source full-text search engine Sphinx
A simple website built with Django + Bootstrap 3, deployed on nginx 1.8 + FastCGI (flup) + Python (a minimal config sketch follows). Demo site: http://www.itjujiao.com
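For reference, a minimal nginx-to-flup setup of this kind might look like the following (port, paths, and server_name are assumptions; Django of that era could be served by flup via the since-removed "python manage.py runfcgi host=127.0.0.1 port=8801" command):

# nginx.conf fragment (illustrative)
server {
    listen 80;
    server_name www.itjujiao.com;

    location / {
        # hand every request to the flup FastCGI server running Django
        fastcgi_pass 127.0.0.1:8801;
        include fastcgi_params;
        fastcgi_param PATH_INFO $fastcgi_script_name;
    }
}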
后續(xù)優(yōu)化:
Word segmentation. Segmented search results are currently poor, and pointers from anyone experienced would be welcome. For example, searching “功夫熊貓之卷軸的秘密” (Kung Fu Panda: Secrets of the Scroll) returns nothing at all, while “功夫熊貓” does return results (功丶夫熊貓⒊英語(yǔ)中英字幕.mp4, 功丶夫熊貓2.Kung.Fu.Panda.2.2011.BDrip.720P.國(guó)粵英臺(tái)四語(yǔ).特效中英字幕.mp4, 功丶夫熊貓3(韓版)2016.高清中字.mkv, etc.), as does “卷軸的秘密” ([美國(guó)]功夫潘達(dá)之卷軸的秘密.2016.1080p.mp4, g夫熊貓之卷軸的秘密.HD1280超清中英雙字.mp4, etc.).
Data deduplication. Many of the crawled entries turn out to be copies of the same shared resource; I plan to deduplicate based on MD5, as in the sketch below.
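A minimal sketch of that dedup (it assumes the key is a normalized filename, since the crawler stores only share metadata, not file contents; the normalization rule is likewise an assumption):

import hashlib

def dedup_key(filename):
    # Collapse trivially different copies of the same resource to one key.
    # strip + lowercase is an assumed normalization, not the final rule.
    normalized = filename.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

A UNIQUE index on this key in the share tables would then let MySQL reject duplicates at insert time.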
PS:
The crawler has fetched roughly 40 million records so far, and Sphinx's memory requirements are enormous; a real pitfall.
Baidu applies IP-based rate limits to crawlers, so I wrote a simple collector for xicidaili proxies; requests can be pointed at an HTTP proxy, for example:
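import requests

# Route the request through a scraped HTTP proxy
# (the proxy address is one of the entries scraped above; the uk is illustrative)
proxies = {"http": "http://42.121.33.160:809"}
r = requests.get(
    "http://yun.baidu.com/pcloud/friend/getfollowlist?query_uk=123456&limit=20&start=0",
    proxies=proxies, timeout=10)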
Word segmentation is Sphinx's built-in implementation. It supports Chinese, but via unigram segmentation (one character per token), which over-segments and gives mediocre results: searching for “葉問(wèn)3” (Ip Man 3) also returns things like “葉子的問(wèn)題第3版” (roughly "problems of leaves, 3rd edition"), which is not what anyone intends. English tokenization has room for improvement too: searching xart does not match x-art, even though x-art is exactly the result set I want (you know why).
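The unigram behaviour comes from Sphinx's built-in N-gram options; the usual CJK setup (a sketch of the standard settings, not necessarily the exact config used here) looks like:

# sphinx.conf fragment: treat every CJK character as one token
index share_index
{
    # source, path, etc. omitted
    ngram_len   = 1
    ngram_chars = U+4E00..U+9FBF    # CJK ideograph range (indicative)
}

With ngram_len = 1 every Chinese character becomes its own token, which is precisely why “葉問(wèn)3” can match documents that merely contain 葉, 問(wèn), and 3 separately.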
The database is MySQL. With the practical row limit of a single table in mind, the resource table is split into 10 tables. Sphinx builds a full index after the first complete crawl, and incremental (delta) indexes afterwards; the standard pattern is sketched below.
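Sphinx's standard answer to incremental indexing is the main + delta pattern: a counter table records the highest id covered by the full index, and the delta index only picks up rows beyond it (a sketch; the table, column, and counter names are assumptions):

# sphinx.conf fragment: main + delta indexing over one of the share tables
source share_main
{
    # remember how far the full index got
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM share_0
    sql_query     = SELECT id, userid, filename FROM share_0 \
                    WHERE id <= (SELECT max_id FROM sph_counter WHERE counter_id = 1)
}

source share_delta : share_main
{
    sql_query_pre =
    sql_query     = SELECT id, userid, filename FROM share_0 \
                    WHERE id > (SELECT max_id FROM sph_counter WHERE counter_id = 1)
}

The delta index can be rebuilt frequently and folded into the main one (indexer --merge) at longer intervals.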
Crawler implementation code (this just shows the approach; the code is a bit messy):
#coding: utf8
import re
import time
import json
import threading
from Queue import Queue

import requests
import MySQLdb as mdb

DB_HOST = "127.0.0.1"
DB_USER = "root"
DB_PASS = ""

# extract scheduling parameters back out of the fetched URL
re_start = re.compile(r"start=(\d+)")
re_uid = re.compile(r"query_uk=(\d+)")
re_pptt = re.compile(r"&pptt=(\d+)")
re_urlid = re.compile(r"&urlid=(\d+)")

ONEPAGE = 20        # page size for follow/fans lists
ONESHAREPAGE = 20   # page size for share lists

URL_SHARE = "http://yun.baidu.com/pcloud/feed/getsharelist?auth_type=1&start={start}&limit=20&query_uk={uk}&urlid={id}"
URL_FOLLOW = "http://yun.baidu.com/pcloud/friend/getfollowlist?query_uk={uk}&limit=20&start={start}&urlid={id}"
URL_FANS = "http://yun.baidu.com/pcloud/friend/getfanslist?query_uk={uk}&limit=20&start={start}&urlid={id}"

QNUM = 1000
hc_q = Queue(20)    # pending request queue
hc_r = Queue(QNUM)  # fetched response queue

success = 0
failed = 0

PROXY_LIST = [[0, 10, "42.121.33.160", 809, "", "", 0],
              [5, 0, "218.97.195.38", 81, "", "", 0]]

def req_worker(inx):
    # fetch URLs from hc_q and push (body, url) onto hc_r
    s = requests.Session()
    while True:
        req_item = hc_q.get()
        req_type = req_item[0]
        url = req_item[1]
        r = s.get(url)
        hc_r.put((r.text, url))
        print "req_worker#", inx, url

def response_worker():
    # parse fetched pages: store users/shares, schedule remaining pages
    dbconn = mdb.connect(DB_HOST, DB_USER, DB_PASS, "baiduyun", charset="utf8")
    dbcurr = dbconn.cursor()
    dbcurr.execute("SET NAMES utf8")
    dbcurr.execute("set global wait_timeout=60000")
    while True:
        metadata, effective_url = hc_r.get()
        #print "response_worker:", effective_url
        try:
            tnow = int(time.time())
            id = re_urlid.findall(effective_url)[0]
            start = re_start.findall(effective_url)[0]
            if "getfollowlist" in effective_url:    # type = 1
                follows = json.loads(metadata)
                uid = re_uid.findall(effective_url)[0]
                # on the first page, enqueue all remaining pages
                if "total_count" in follows.keys() and follows["total_count"] > 0 and str(start) == "0":
                    for i in range((follows["total_count"] - 1) / ONEPAGE):
                        try:
                            dbcurr.execute("INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 1, 0)"
                                           % (uid, str(ONEPAGE * (i + 1)), str(ONEPAGE)))
                        except Exception as ex:
                            print "E1", str(ex)
                if "follow_list" in follows.keys():
                    for item in follows["follow_list"]:
                        try:
                            dbcurr.execute("INSERT INTO user(userid, username, files, status, downloaded, lastaccess) VALUES(%s, '%s', 0, 0, 0, %s)"
                                           % (item["follow_uk"], item["follow_uname"], str(tnow)))
                        except Exception as ex:
                            print "E13", str(ex)
                else:
                    print "delete 1", uid, start
                    dbcurr.execute("delete from urlids where uk=%s and type=1 and start>%s" % (uid, start))
            elif "getfanslist" in effective_url:    # type = 2
                fans = json.loads(metadata)
                uid = re_uid.findall(effective_url)[0]
                if "total_count" in fans.keys() and fans["total_count"] > 0 and str(start) == "0":
                    for i in range((fans["total_count"] - 1) / ONEPAGE):
                        try:
                            dbcurr.execute("INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 2, 0)"
                                           % (uid, str(ONEPAGE * (i + 1)), str(ONEPAGE)))
                        except Exception as ex:
                            print "E2", str(ex)
                if "fans_list" in fans.keys():
                    for item in fans["fans_list"]:
                        try:
                            dbcurr.execute("INSERT INTO user(userid, username, files, status, downloaded, lastaccess) VALUES(%s, '%s', 0, 0, 0, %s)"
                                           % (item["fans_uk"], item["fans_uname"], str(tnow)))
                        except Exception as ex:
                            print "E23", str(ex)
                else:
                    print "delete 2", uid, start
                    dbcurr.execute("delete from urlids where uk=%s and type=2 and start>%s" % (uid, start))
            else:                                   # type = 0: share list
                shares = json.loads(metadata)
                uid = re_uid.findall(effective_url)[0]
                if "total_count" in shares.keys() and shares["total_count"] > 0 and str(start) == "0":
                    for i in range((shares["total_count"] - 1) / ONESHAREPAGE):
                        try:
                            dbcurr.execute("INSERT INTO urlids(uk, start, limited, type, status) VALUES(%s, %s, %s, 0, 0)"
                                           % (uid, str(ONESHAREPAGE * (i + 1)), str(ONESHAREPAGE)))
                        except Exception as ex:
                            print "E3", str(ex)
                if "records" in shares.keys():
                    for item in shares["records"]:
                        try:
                            dbcurr.execute("INSERT INTO share(userid, filename, shareid, status) VALUES(%s, '%s', %s, 0)"
                                           % (uid, item["title"], item["shareid"]))
                        except Exception as ex:
                            #print "E33", str(ex), item
                            pass
                else:
                    print "delete 0", uid, start
                    dbcurr.execute("delete from urlids where uk=%s and type=0 and start>%s" % (uid, str(start)))
            dbcurr.execute("delete from urlids where id=%s" % (id,))
            dbconn.commit()
        except Exception as ex:
            print "E5", str(ex), id
        # release the proxy slot this request was using, if any
        pid = re_pptt.findall(effective_url)
        if pid:
            print "pid>>>", pid
            ppid = int(pid[0])
            PROXY_LIST[ppid][6] -= 1
    dbcurr.close()
    dbconn.close()

def worker():
    # turn pending urlids/user rows into concrete URLs for req_worker
    global success, failed
    dbconn = mdb.connect(DB_HOST, DB_USER, DB_PASS, "baiduyun", charset="utf8")
    dbcurr = dbconn.cursor()
    dbcurr.execute("SET NAMES utf8")
    dbcurr.execute("set global wait_timeout=60000")
    while True:
        #dbcurr.execute("select * from urlids where status=0 order by type limit 1")
        dbcurr.execute("select * from urlids where status=0 and type>0 limit 1")
        d = dbcurr.fetchall()
        if d:
            id = d[0][0]
            uk = d[0][1]
            start = d[0][2]
            limit = d[0][3]
            type = d[0][4]
            dbcurr.execute("update urlids set status=1 where id=%s" % (str(id),))
            url = ""
            if type == 0:
                url = URL_SHARE.format(uk=uk, start=start, id=id).encode("utf-8")
            elif type == 1:
                url = URL_FOLLOW.format(uk=uk, start=start, id=id).encode("utf-8")
            elif type == 2:
                url = URL_FANS.format(uk=uk, start=start, id=id).encode("utf-8")
            if url:
                hc_q.put((type, url))
                #print "processed", url
        else:
            # no pending pages: seed new tasks from unvisited users
            dbcurr.execute("select * from user where status=0 limit 1000")
            d = dbcurr.fetchall()
            if d:
                for item in d:
                    try:
                        dbcurr.execute("insert into urlids(uk, start, limited, type, status) values('%s', 0, %s, 0, 0)" % (item[1], str(ONESHAREPAGE)))
                        dbcurr.execute("insert into urlids(uk, start, limited, type, status) values('%s', 0, %s, 1, 0)" % (item[1], str(ONEPAGE)))
                        dbcurr.execute("insert into urlids(uk, start, limited, type, status) values('%s', 0, %s, 2, 0)" % (item[1], str(ONEPAGE)))
                        dbcurr.execute("update user set status=1 where userid=%s" % (item[1],))
                    except Exception as ex:
                        print "E6", str(ex)
            else:
                time.sleep(1)
        dbconn.commit()
    dbcurr.close()
    dbconn.close()

# 16 fetcher threads, 1 scheduler thread; the main thread parses responses
for item in range(16):
    t = threading.Thread(target=req_worker, args=(item,))
    t.setDaemon(True)
    t.start()

s = threading.Thread(target=worker, args=())
s.setDaemon(True)
s.start()

response_worker()