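Learning Scrapy (4): Scraping Weibo Data

LiveVideoStack, published 2019-07-25 11:29 / 1395 reads

Summary: The goal of the crawler is to collect each user's post count, following count, and follower count. For the item data I only need basic information: profile details, post count, following count, and follower count.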

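Preface

Learning Scrapy (3): Scraping Douban Book Information

This post continues from the previous one. This time we crawl Weibo, which can only be accessed after logging in.
The goal of the crawler is to collect each user's post count, following count, and follower count, as a data reserve for building a user relationship graph (not implemented yet).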

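Preparation

Install the third-party libraries requests and pymongo

Install MongoDB

Create a weibo crawler project

Earlier posts already covered how to create a Scrapy project, so let's get straight to the point.

Creating the Items

For the item data I only need basic information: profile details, post count, following count, and follower count.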

from scrapy import Item, Field


class ProfileItem(Item):
    """
    Post count, following count, follower count, and profile details of an account
    """
    _id = Field()
    nick_name = Field()
    profile_pic = Field()
    tweet_stats = Field()
    following_stats = Field()
    follower_stats = Field()
    sex = Field()
    location = Field()
    birthday = Field()
    bio = Field()


class FollowingItem(Item):
    """
    Weibo accounts that this account follows
    """
    _id = Field()
    relationship = Field()


class FollowedItem(Item):
    """
    Weibo accounts that follow this account
    """
    _id = Field()
    relationship = Field()

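Writing the Spider

To make crawling easier, we use the mobile version of Weibo, http://weibo.cn/, as the login entry point.

A Weibo uid can be obtained by visiting the user's profile page or from the href attribute of a follow link.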
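As a rough illustration of pulling a uid out of an href, here is a minimal sketch using only the standard library. The helper name `extract_uid` and the sample links are hypothetical; the two link shapes (a `uid=` query parameter, or a path that starts with the numeric id) are the ones the spider's regular expressions look for.

```python
import re


def extract_uid(href):
    """Return the numeric uid found in a weibo.cn link, or None."""
    # Links that carry the uid as a query parameter, e.g. "...?uid=123&rl=1"
    match = re.search(r"uid=(\d+)", href)
    if match:
        return match.group(1)
    # Links whose path contains the numeric uid, e.g. "/2634877355/info"
    match = re.search(r"/(\d+)", href)
    return match.group(1) if match else None


# Hypothetical sample links, for illustration only
print(extract_uid("/2634877355/info"))
print(extract_uid("/attention/add?uid=1234567890&rl=1"))
print(extract_uid("/some_custom_name"))
```

A link to a custom domain carries no digits, so the helper returns None and the page has to be visited to find the real uid.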

import re

import scrapy
from scrapy.exceptions import CloseSpider

# Assumption: items.py lives in the project package one level above spiders/.
from ..items import ProfileItem, FollowingItem, FollowedItem


class WeiboSpiderSpider(scrapy.Spider):
    name = "weibo_spider"
    allowed_domains = ["weibo.cn"]
    url = "http://weibo.cn/"
    start_urls = ["2634877355",...] # seed Weibo IDs to start crawling from
    task_set = set(start_urls) # IDs waiting to be crawled
    tasked_set = set() # IDs already crawled
    ...

    def start_requests(self):
        while len(self.task_set) > 0 :
            _id = self.task_set.pop()
            if _id in self.tasked_set:
                raise CloseSpider(reason="Data already exists: %s" % (_id))
            self.tasked_set.add(_id)
            info_url = self.url + _id
            info_item = ProfileItem()
            following_url = info_url + "/follow"
            following_item = FollowingItem()
            following_item["_id"] = _id
            following_item["relationship"] = []
            follower_url = info_url + "/fans"
            follower_item = FollowedItem()
            follower_item["_id"] = _id
            follower_item["relationship"] = []
            yield scrapy.Request(info_url, meta={"item":info_item}, callback=self.account_parse)
            yield scrapy.Request(following_url, meta={"item":following_item}, callback=self.relationship_parse)
            yield scrapy.Request(follower_url, meta={"item":follower_item}, callback=self.relationship_parse)

    def account_parse(self, response):
        item = response.meta["item"]
        sel = scrapy.Selector(response)
        profile_url = sel.xpath('//div[@class="ut"]/a/@href').extract()[1]
        counts = sel.xpath('//div[@class="u"]/div[@class="tip2"]').extract_first()
        # The labels below are the simplified-Chinese strings shown on weibo.cn.
        item["_id"] = re.findall(r"^/(\d+)/info", profile_url)[0]
        item["tweet_stats"] = re.findall(r"微博\[(\d+)\]", counts)[0]
        item["following_stats"] = re.findall(r"关注\[(\d+)\]", counts)[0]
        item["follower_stats"] = re.findall(r"粉丝\[(\d+)\]", counts)[0]
        # Heuristic for "zombie" accounts: few posts, many followings, few followers.
        if int(item["tweet_stats"]) < 4500 and int(item["following_stats"]) > 1000 and int(item["follower_stats"]) < 500:
            raise CloseSpider("zombie follower")
        yield scrapy.Request("http://weibo.cn"+profile_url, meta={"item": item},callback=self.profile_parse)

    def profile_parse(self,response):
        item = response.meta["item"]
        sel = scrapy.Selector(response)
        info = sel.xpath('//div[@class="tip"]/following-sibling::div[@class="c"]').extract_first()
        item["profile_pic"] = sel.xpath('//div[@class="c"]/img/@src').extract_first()
        # Assumption: each field in the extracted HTML ends with a <br> tag
        # (the delimiter was lost when the original post was rendered).
        item["nick_name"] = re.findall(r"昵称:(.*?)<br>", info)[0]
        item["sex"] = re.findall(r"性别:(.*?)<br>", info) and re.findall(r"性别:(.*?)<br>", info)[0] or ""
        item["location"] = re.findall(r"地区:(.*?)<br>", info) and re.findall(r"地区:(.*?)<br>", info)[0] or ""
        item["birthday"] = re.findall(r"生日:(.*?)<br>", info) and re.findall(r"生日:(.*?)<br>", info)[0] or ""
        item["bio"] = re.findall(r"简介:(.*?)<br>", info) and re.findall(r"简介:(.*?)<br>", info)[0] or ""
        yield item

    def relationship_parse(self, response):
        item = response.meta["item"]
        sel = scrapy.Selector(response)
        uids = sel.xpath("//table/tr/td[last()]/a[last()]/@href").extract()
        new_uids = []
        for uid in uids:
            if "uid" in uid:
                new_uids.append(re.findall(r"uid=(\d+)&", uid)[0])
            else:
                try:
                    new_uids.append(re.findall(r"/(\d+)", uid)[0])
                except IndexError:
                    # No numeric id in the link (for example a custom domain); log and skip it.
                    print("--------",uid)
        item["relationship"].extend(new_uids)
        for i in new_uids:
            if i not in self.tasked_set:
                self.task_set.add(i)
        # "下页" is the "next page" link text on weibo.cn.
        next_page = sel.xpath('//*[@id="pagelist"]/form/div/a[text()="下页"]/@href').extract_first()
        if next_page:
            yield scrapy.Request("http://weibo.cn"+next_page, meta={"item": item},callback=self.relationship_parse)
        else:
            yield item
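The counters on the account page appear as a label followed by a number in square brackets, so the brackets and the digit class both need escaping in the regular expression. The following standalone sketch shows that parsing step in isolation; the function name `parse_counts` and the sample markup are assumptions made for illustration, not the exact HTML served by weibo.cn.

```python
import re

# Label text as displayed on weibo.cn (simplified Chinese), mapped to item fields
PATTERNS = {
    "tweet_stats": r"微博\[(\d+)\]",
    "following_stats": r"关注\[(\d+)\]",
    "follower_stats": r"粉丝\[(\d+)\]",
}


def parse_counts(html):
    """Extract the three bracketed counters; a missing counter becomes None."""
    counts = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, html)
        counts[field] = int(match.group(1)) if match else None
    return counts


# Hypothetical markup, for illustration only
sample = (
    '<div class="tip2"><span>微博[120]</span>'
    '<a href="/1/follow">关注[80]</a>'
    '<a href="/1/fans">粉丝[3000]</a></div>'
)
print(parse_counts(sample))
```

Returning None instead of indexing `[0]` directly avoids an IndexError when the page layout differs from what the pattern expects.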

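A few points in the code are worth noting.

start_urls

What we fill in here are Weibo uids. Some users have a custom domain (shown in a screenshot in the original post), and their real uid only becomes visible after visiting the page.
start_urls should contain more than 10 initial seeds, to make sure that the new seeds discovered later can still be added to the queue of IDs waiting to be crawled. The likely mechanism is that Scrapy consumes start_requests lazily: the while loop ends as soon as task_set is empty, so with too few seeds the generator can finish before any callback has added new IDs. I took the figure of 10 from the Scrapy documentation:

REACTOR_THREADPOOL_MAXSIZE
Default: 10
This is the default size of the Twisted thread pool (The maximum limit for Twisted Reactor thread pool size.)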
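To see why the number of seeds matters, here is a simplified model of the pending-set logic written with plain generators. It is only a sketch of the idea, not Scrapy's scheduler: `drain` plays the role of start_requests, and the helper names and the `links` mapping are hypothetical.

```python
def drain(task_set, done):
    """Yield ids until the pending set is empty, like the while loop in start_requests."""
    while task_set:
        uid = task_set.pop()
        if uid in done:
            continue
        done.add(uid)
        yield uid


def crawl(seeds, links):
    """Responses arrive while the generator is still alive, so new ids get picked up."""
    task_set, done, visited = set(seeds), set(), []
    for uid in drain(task_set, done):
        visited.append(uid)
        # Stands in for relationship_parse adding newly found ids
        task_set.update(links.get(uid, ()))
    return visited


def crawl_with_late_response(seed, late_ids):
    """The generator runs dry before the first response comes back."""
    task_set, done = {seed}, set()
    gen = drain(task_set, done)
    requested = [next(gen)]
    requested.extend(gen)      # pending set is empty, so the generator finishes here
    task_set.update(late_ids)  # the response adds new ids too late
    requested.extend(gen)      # a finished generator never yields them
    return requested


print(sorted(crawl(["a"], {"a": ["b", "c"], "b": ["d"]})))
print(crawl_with_late_response("a", ["b", "c"]))
```

In the second case only the seed is ever requested, which is the situation a larger seed list is meant to avoid.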

CloseSpider

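When you hit a link that does not need further crawling (an already crawled link, a link flagged as a zombie follower, and so on), you can raise CloseSpider. Note that in Scrapy, a CloseSpider raised from a callback asks the engine to stop the whole spider rather than only the current request, so to skip a single link it is enough to return from the callback without yielding anything.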
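If the intent is only to skip one account and keep crawling the rest, the check can be factored into a small predicate and the callback can return early. The sketch below uses plain functions so it runs without Scrapy; `is_zombie` and `handle_account` are hypothetical names, and the thresholds are the ones used in account_parse.

```python
def is_zombie(tweets, following, followers):
    """Same thresholds as account_parse: few posts, many followings, few followers."""
    return tweets < 4500 and following > 1000 and followers < 500


def handle_account(counts, seen):
    """Return the uid to keep crawling, or None to skip this account only."""
    uid = counts["_id"]
    if uid in seen:
        return None  # already crawled; in a Scrapy callback this is a plain `return`
    if is_zombie(counts["tweet_stats"], counts["following_stats"], counts["follower_stats"]):
        return None  # skipped, but the spider keeps running
    seen.add(uid)
    return uid


seen = set()
normal = {"_id": "1", "tweet_stats": 300, "following_stats": 200, "follower_stats": 150}
zombie = {"_id": "2", "tweet_stats": 3, "following_stats": 2000, "follower_stats": 10}
print(handle_account(normal, seen), handle_account(zombie, seen), handle_account(normal, seen))
```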

Writing the middlewares

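import random

# Assumption: the cookie-fetching code below is saved as cookies.py in the same package.
from .cookies import cookies


class CookiesMiddleware(object):
    """ Rotate cookies """

    def process_request(self, request, spider):
        cookie = random.choice(cookies)
        request.cookies = cookie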
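The rotation idea can be tried outside Scrapy with a stand-in request object. This is a minimal sketch under stated assumptions: the class name `RotatingCookies`, the injected cookie pool, and the sample cookie values are all hypothetical, and `SimpleNamespace` only imitates the `cookies` attribute of a real request.

```python
import random
from types import SimpleNamespace


class RotatingCookies:
    """Pick one cookie dict per request from a pool of logged-in sessions."""

    def __init__(self, cookie_pool, rng=None):
        if not cookie_pool:
            raise ValueError("cookie pool is empty; log in with at least one account")
        self.cookie_pool = list(cookie_pool)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        request.cookies = self.rng.choice(self.cookie_pool)
        return None  # returning None lets Scrapy continue processing the request


# Hypothetical cookie values, for illustration only
pool = [{"SUB": "cookie-for-account-1"}, {"SUB": "cookie-for-account-2"}]
middleware = RotatingCookies(pool, rng=random.Random(0))
request = SimpleNamespace(cookies=None)
middleware.process_request(request, spider=None)
print(request.cookies in pool)
```

Failing fast on an empty pool makes a failed login visible at startup instead of surfacing later as an error from `random.choice`.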

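Writing the cookie retrieval code

I originally wanted to simulate a login through the mobile version of Weibo, but the CAPTCHA was just too hard to deal with. So I used code that someone else had already published for logging in to the web version of Weibo, SinaSpider. It is well written and worth a look if you are interested. Another author there wrote a simulated login that handles the CAPTCHA, which I tested and found to work. I just haven't figured out how to integrate it into my project yet.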

# encoding=utf-8
import json
import base64
import requests

myWeiBo = [
    {"no": "[email protected]", "psw": "xx"},
    {"no": "[email protected]", "psw": "xx"},
]


def getCookies(weibo):
    """ Get cookies """
    cookies = []
    loginURL = r"https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.15)"
    for elem in weibo:
        account = elem["no"]
        password = elem["psw"]
        username = base64.b64encode(account.encode("utf-8")).decode("utf-8")
        postData = {
            "entry": "sso",
            "gateway": "1",
            "from": "null",
            "savestate": "30",
            "useticket": "0",
            "pagerefer": "",
            "vsnf": "1",
            "su": username,
            "service": "sso",
            "sp": password,
            "sr": "1440*900",
            "encoding": "UTF-8",
            "cdult": "3",
            "domain": "sina.com.cn",
            "prelt": "0",
            "returntype": "TEXT",
        }
        session = requests.Session()
        r = session.post(loginURL, data=postData)
        jsonStr = r.content.decode("gbk")
        info = json.loads(jsonStr)
        if info["retcode"] == "0":
            print("Get Cookie Success!( Account:%s )" % account)
            cookie = session.cookies.get_dict()
            cookies.append(cookie)
        else:
            print("Failed!( Reason:%s )" % info["reason"])
    return cookies

cookies = getCookies(myWeiBo)

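Login and getting past anti-crawling measures is probably the hardest part of the whole project. ~~There is a lot I still don't understand. I'll look into it when I have time.~~

Writing the pipelines

Here you only need to pay attention to which type of Item is stored in which collection.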

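import logging

from pymongo import MongoClient
from scrapy.utils.project import get_project_settings

# Assumption: items.py sits next to pipelines.py in the project package.
from .items import ProfileItem, FollowingItem, FollowedItem

# get_project_settings() and logging stand in for the legacy scrapy.conf / scrapy.log modules.
logger = logging.getLogger(__name__)
settings = get_project_settings()


class MongoDBPipeline(object):
    def __init__(self):
        connection = MongoClient(
            host=settings["MONGODB_SERVER"],
            port=settings["MONGODB_PORT"]
        )
        db = connection[settings["MONGODB_DB"]]
        self.info = db[settings["INFO"]]
        self.following = db[settings["FOLLOWING"]]
        self.followed = db[settings["FOLLOWED"]]

    def process_item(self, item, spider):
        # insert_one replaces Collection.insert, which PyMongo 4 removed.
        if isinstance(item, ProfileItem):
            self.info.insert_one(dict(item))
        elif isinstance(item, FollowingItem):
            self.following.insert_one(dict(item))
        elif isinstance(item, FollowedItem):
            self.followed.insert_one(dict(item))
        logger.debug("Weibo item added to MongoDB database!")
        return item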

Run the program and you will see the data we want in MongoDB.
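The routing rule in the pipeline (one collection per item type) can be sketched without a database. Everything here is a stand-in: the three classes imitate the Scrapy items with plain dicts, the collection names are assumptions, and in-memory dicts keyed by `_id` replace MongoDB. Keying by `_id` means a repeated account overwrites the earlier record, which is one possible way to handle re-crawls, whereas a plain insert of an existing `_id` into MongoDB would fail with a duplicate key error.

```python
class ProfileItem(dict):
    """Stand-in for the Scrapy ProfileItem."""


class FollowingItem(dict):
    """Stand-in for the Scrapy FollowingItem."""


class FollowedItem(dict):
    """Stand-in for the Scrapy FollowedItem."""


# Hypothetical collection names, one per item type
ROUTES = {ProfileItem: "info", FollowingItem: "following", FollowedItem: "followed"}


def route(item, collections):
    """Store the item in the collection that matches its type and return that name."""
    for item_type, name in ROUTES.items():
        if isinstance(item, item_type):
            collections[name][item["_id"]] = dict(item)  # a repeated _id overwrites
            return name
    raise TypeError("unexpected item type: %s" % type(item).__name__)


collections = {"info": {}, "following": {}, "followed": {}}
print(route(ProfileItem(_id="1", nick_name="first"), collections))
print(route(FollowedItem(_id="1", relationship=["2", "3"]), collections))
route(ProfileItem(_id="1", nick_name="second"), collections)
print(collections["info"]["1"]["nick_name"])
```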

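Summary

DOWNLOAD_DELAY in settings had to be set to 5 to avoid being banned by Weibo.

I tried falling back to a simulated login when logging in with cookies failed, but the results were far from ideal.

I tried using a proxy IP pool to get around anti-crawling measures, but the attempt failed ~~mainly because I don't really know how~~.

In the future I plan to visualize the crawled data with D3.js (~~maybe~~).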
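For reference, the settings mentioned throughout this post could be collected in settings.py roughly as follows. This is a sketch: DOWNLOAD_DELAY = 5 and the setting names read by the pipeline come from the post, while every value for the MongoDB settings, the package name `weibo`, and the priority numbers are assumptions to adapt to your own project.

```python
# settings.py (sketch)

# From the post: a delay of 5 seconds between requests to avoid being banned.
DOWNLOAD_DELAY = 5

# Names read by MongoDBPipeline; the values are assumptions.
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "weibo"
INFO = "info"
FOLLOWING = "following"
FOLLOWED = "followed"

# Assumed module paths; replace "weibo" with the actual project package.
DOWNLOADER_MIDDLEWARES = {
    "weibo.middlewares.CookiesMiddleware": 401,
}
ITEM_PIPELINES = {
    "weibo.pipelines.MongoDBPipeline": 300,
}
```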

Project repository: weibo_spider

