2016-04-10
Scrapy Crawler - Fetching Zhihu User Data
Installing the Scrapy framework
How to install Python and the Scrapy framework is not covered here; please search online for instructions.
Initialization
Once Scrapy is installed, run scrapy startproject myspider
You will then see a myspider folder with the following directory structure:
scrapy.cfg
myspider/
    items.py
    pipelines.py
    settings.py
    __init__.py
    spiders/
        __init__.py
Writing the spider
Create users.py under the spiders directory:
# -*- coding: utf-8 -*-
import scrapy
import os
import time
from myspider.items import UserItem
from myspider.myconfig import UsersConfig  # crawler configuration

class UsersSpider(scrapy.Spider):
    name = 'users'
    domain = 'https://www.zhihu.com'
    login_url = 'https://www.zhihu.com/login/email'
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'Host': 'www.zhihu.com',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36'
    }

    def __init__(self, url = None):
        self.user_url = url

    def start_requests(self):
        # request the homepage first to obtain cookies and the _xsrf token
        yield scrapy.Request(
            url = self.domain,
            headers = self.headers,
            meta = {
                'proxy': UsersConfig['proxy'],
                'cookiejar': 1
            },
            callback = self.request_captcha
        )

    def request_captcha(self, response):
        # extract the _xsrf value
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        # build the captcha URL
        captcha_url = 'http://www.zhihu.com/captcha.gif?r=' + str(time.time() * 1000)
        # download the captcha image
        yield scrapy.Request(
            url = captcha_url,
            headers = self.headers,
            meta = {
                'proxy': UsersConfig['proxy'],
                'cookiejar': response.meta['cookiejar'],
                '_xsrf': _xsrf
            },
            callback = self.download_captcha
        )

    def download_captcha(self, response):
        # save the captcha image
        with open('captcha.gif', 'wb') as fp:
            fp.write(response.body)
        # open the captcha image (Windows 'start' command)
        os.system('start captcha.gif')
        # read the captcha from the terminal
        print 'Please enter captcha: '
        captcha = raw_input()
        # submit the login form
        yield scrapy.FormRequest(
            url = self.login_url,
            headers = self.headers,
            formdata = {
                'email': UsersConfig['email'],
                'password': UsersConfig['password'],
                '_xsrf': response.meta['_xsrf'],
                'remember_me': 'true',
                'captcha': captcha
            },
            meta = {
                'proxy': UsersConfig['proxy'],
                'cookiejar': response.meta['cookiejar']
            },
            callback = self.request_zhihu
        )

    def request_zhihu(self, response):
        # crawl the seed user's profile, followees and followers
        yield scrapy.Request(
            url = self.user_url + '/about',
            headers = self.headers,
            meta = {
                'proxy': UsersConfig['proxy'],
                'cookiejar': response.meta['cookiejar'],
                'from': {'sign': 'else', 'data': {}}
            },
            callback = self.user_item,
            dont_filter = True
        )
        yield scrapy.Request(
            url = self.user_url + '/followees',
            headers = self.headers,
            meta = {
                'proxy': UsersConfig['proxy'],
                'cookiejar': response.meta['cookiejar'],
                'from': {'sign': 'else', 'data': {}}
            },
            callback = self.user_start,
            dont_filter = True
        )
        yield scrapy.Request(
            url = self.user_url + '/followers',
            headers = self.headers,
            meta = {
                'proxy': UsersConfig['proxy'],
                'cookiejar': response.meta['cookiejar'],
                'from': {'sign': 'else', 'data': {}}
            },
            callback = self.user_start,
            dont_filter = True
        )

    def user_start(self, response):
        sel_root = response.xpath('//h2[@class="zm-list-content-title"]')
        # skip empty follow lists
        if len(sel_root):
            for sel in sel_root:
                people_url = sel.xpath('a/@href').extract()[0]
                yield scrapy.Request(
                    url = people_url + '/about',
                    headers = self.headers,
                    meta = {
                        'proxy': UsersConfig['proxy'],
                        'cookiejar': response.meta['cookiejar'],
                        'from': {'sign': 'else', 'data': {}}
                    },
                    callback = self.user_item,
                    dont_filter = True
                )
                yield scrapy.Request(
                    url = people_url + '/followees',
                    headers = self.headers,
                    meta = {
                        'proxy': UsersConfig['proxy'],
                        'cookiejar': response.meta['cookiejar'],
                        'from': {'sign': 'else', 'data': {}}
                    },
                    callback = self.user_start,
                    dont_filter = True
                )
                yield scrapy.Request(
                    url = people_url + '/followers',
                    headers = self.headers,
                    meta = {
                        'proxy': UsersConfig['proxy'],
                        'cookiejar': response.meta['cookiejar'],
                        'from': {'sign': 'else', 'data': {}}
                    },
                    callback = self.user_start,
                    dont_filter = True
                )

    def user_item(self, response):
        # return the first match or an empty string for optional fields
        def value(list):
            return list[0] if len(list) else ''
        sel = response.xpath('//div[@class="zm-profile-header ProfileCard"]')
        item = UserItem()
        item['url'] = response.url[:-6]  # strip the trailing '/about'
        item['name'] = sel.xpath('//a[@class="name"]/text()').extract()[0].encode('utf-8')
        item['bio'] = value(sel.xpath('//span[@class="bio"]/@title').extract()).encode('utf-8')
        item['location'] = value(sel.xpath('//span[contains(@class, "location")]/@title').extract()).encode('utf-8')
        item['business'] = value(sel.xpath('//span[contains(@class, "business")]/@title').extract()).encode('utf-8')
        item['gender'] = 0 if sel.xpath('//i[contains(@class, "icon-profile-female")]') else 1
        item['avatar'] = value(sel.xpath('//img[@class="Avatar Avatar--l"]/@src').extract())
        item['education'] = value(sel.xpath('//span[contains(@class, "education")]/@title').extract()).encode('utf-8')
        item['major'] = value(sel.xpath('//span[contains(@class, "education-extra")]/@title').extract()).encode('utf-8')
        item['employment'] = value(sel.xpath('//span[contains(@class, "employment")]/@title').extract()).encode('utf-8')
        item['position'] = value(sel.xpath('//span[contains(@class, "position")]/@title').extract()).encode('utf-8')
        item['content'] = value(sel.xpath('//span[@class="content"]/text()').extract()).strip().encode('utf-8')
        item['ask'] = int(sel.xpath('//div[contains(@class, "profile-navbar")]/a[2]/span[@class="num"]/text()').extract()[0])
        item['answer'] = int(sel.xpath('//div[contains(@class, "profile-navbar")]/a[3]/span[@class="num"]/text()').extract()[0])
        item['agree'] = int(sel.xpath('//span[@class="zm-profile-header-user-agree"]/strong/text()').extract()[0])
        item['thanks'] = int(sel.xpath('//span[@class="zm-profile-header-user-thanks"]/strong/text()').extract()[0])
        yield item
Adding the crawler configuration file
Create myconfig.py under the myspider directory and add the following, filling in your own details where indicated:
# -*- coding: utf-8 -*-

UsersConfig = {
    # proxy
    'proxy': '',

    # Zhihu account email and password
    'email': 'your email',
    'password': 'your password',
}

DbConfig = {
    # db config
    'user': 'db user',
    'passwd': 'db password',
    'db': 'db name',
    'host': 'db host',
}
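Hard-coding the account password in myconfig.py is convenient but easy to leak if the project ever lands in a public repo. One possible alternative, sketched below under the assumption that you set environment variables yourself (the ZHIHU_* names are this sketch's own convention, not part of the original project), is to read the values at import time:

# -*- coding: utf-8 -*-
# Sketch: environment-variable based myconfig.py.
import os

UsersConfig = {
    'proxy': os.environ.get('ZHIHU_PROXY', ''),
    'email': os.environ.get('ZHIHU_EMAIL', ''),
    'password': os.environ.get('ZHIHU_PASSWORD', ''),
}

DbConfig = {
    'user': os.environ.get('ZHIHU_DB_USER', ''),
    'passwd': os.environ.get('ZHIHU_DB_PASSWD', ''),
    'db': os.environ.get('ZHIHU_DB_NAME', ''),
    'host': os.environ.get('ZHIHU_DB_HOST', 'localhost'),
}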
Modify items.py
# -*- coding: utf-8 -*-
import scrapy

class UserItem(scrapy.Item):
    # define the fields for your item here like:
    url = scrapy.Field()
    name = scrapy.Field()
    bio = scrapy.Field()
    location = scrapy.Field()
    business = scrapy.Field()
    gender = scrapy.Field()
    avatar = scrapy.Field()
    education = scrapy.Field()
    major = scrapy.Field()
    employment = scrapy.Field()
    position = scrapy.Field()
    content = scrapy.Field()
    ask = scrapy.Field()
    answer = scrapy.Field()
    agree = scrapy.Field()
    thanks = scrapy.Field()
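A scrapy.Item behaves like a dict restricted to the declared fields, which is why the spider can assign item['name'] and the pipeline can read it back. A tiny usage sketch (field values invented; run it from the project root or inside scrapy shell so the myspider package is importable):

# -*- coding: utf-8 -*-
# Sketch: scrapy.Item acts like a dict limited to the declared fields.
from myspider.items import UserItem

item = UserItem()
item['name'] = 'some user'   # ok: declared in UserItem
item['agree'] = 42           # ok: declared in UserItem
print dict(item)             # {'name': 'some user', 'agree': 42}
# item['foo'] = 1 would raise KeyError, because 'foo' is not a declared field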
Storing user data in a MySQL database
Modify pipelines.py:
# -*- coding: utf-8 -*-
import MySQLdb
import datetime
from myspider.myconfig import DbConfig

class UserPipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user = DbConfig['user'], passwd = DbConfig['passwd'],
            db = DbConfig['db'], host = DbConfig['host'], charset = 'utf8', use_unicode = True)
        self.cursor = self.conn.cursor()
        # empty the table
        # self.cursor.execute('truncate table weather;')
        # self.conn.commit()

    def process_item(self, item, spider):
        curTime = datetime.datetime.now()
        try:
            self.cursor.execute(
                """INSERT IGNORE INTO users (url, name, bio, location, business, gender, avatar,
                education, major, employment, position, content, ask, answer, agree, thanks, create_at)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)""",
                (
                    item['url'],
                    item['name'],
                    item['bio'],
                    item['location'],
                    item['business'],
                    item['gender'],
                    item['avatar'],
                    item['education'],
                    item['major'],
                    item['employment'],
                    item['position'],
                    item['content'],
                    item['ask'],
                    item['answer'],
                    item['agree'],
                    item['thanks'],
                    curTime
                )
            )
            self.conn.commit()
        except MySQLdb.Error, e:
            print 'Error %d %s' % (e.args[0], e.args[1])
        return item
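The pipeline assumes a users table whose columns match the INSERT statement above, but the post never shows the schema. A guessed schema that lines up with those columns (the types, lengths, and unique key are assumptions, not taken from the original project) could be created once like this, run from the project root so myspider.myconfig is importable:

# -*- coding: utf-8 -*-
# Sketch: create a users table matching the columns used by UserPipeline.
# Column types and lengths are guesses; adjust as needed.
import MySQLdb
from myspider.myconfig import DbConfig

conn = MySQLdb.connect(user = DbConfig['user'], passwd = DbConfig['passwd'],
    db = DbConfig['db'], host = DbConfig['host'], charset = 'utf8')
conn.cursor().execute("""
    CREATE TABLE IF NOT EXISTS users (
        url VARCHAR(255) NOT NULL,
        name VARCHAR(255),
        bio VARCHAR(255),
        location VARCHAR(255),
        business VARCHAR(255),
        gender TINYINT,
        avatar VARCHAR(255),
        education VARCHAR(255),
        major VARCHAR(255),
        employment VARCHAR(255),
        position VARCHAR(255),
        content TEXT,
        ask INT,
        answer INT,
        agree INT,
        thanks INT,
        create_at DATETIME,
        UNIQUE KEY uk_url (url)
    ) DEFAULT CHARSET = utf8
""")
conn.commit()
conn.close()

The UNIQUE KEY on url is what lets the pipeline's INSERT IGNORE silently skip users that have already been stored.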
Modify settings.py
Find ITEM_PIPELINES and change it to:
ITEM_PIPELINES = {
    'myspider.pipelines.UserPipeline': 300,
}
Then add the following at the end to limit the crawl depth:
DEPTH_LIMIT = 10
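DEPTH_LIMIT only caps how many hops of followees/followers are expanded from the seed user; Zhihu will still throttle or ban aggressive clients, so it can be worth slowing the crawl down as well. A possible settings.py addition (the values are illustrative guesses, not from the original post):

# Optional throttling additions to settings.py (values are illustrative guesses)
DOWNLOAD_DELAY = 2           # wait a couple of seconds between requests
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to server latency
COOKIES_ENABLED = True       # the spider's 'cookiejar' meta relies on cookies being enabled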
Crawling Zhihu user data
Make sure MySQL is running, open a terminal in the project root,
and run scrapy crawl users -a url=https://www.zhihu.com/people/
Here the user at the end of the URL is the seed user of the crawl; the spider then expands outward through that user's followees and followers.
The captcha image is then downloaded. If it does not open automatically, open captcha.gif in the project root yourself and type the captcha into the terminal.
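Note that os.system('start captcha.gif') in download_captcha only works on Windows; on macOS or Linux the image will not pop up and you have to open it by hand. If you want it to open automatically everywhere, a small hedged tweak is to pick the platform's usual opener ('start' on Windows, 'open' on macOS, 'xdg-open' on most Linux desktops):

# -*- coding: utf-8 -*-
# Sketch: open the captcha with the platform's default image viewer.
import os
import sys
import subprocess

def show_captcha(path = 'captcha.gif'):
    if sys.platform.startswith('win'):
        os.system('start ' + path)      # Windows
    elif sys.platform == 'darwin':
        subprocess.call(['open', path])      # macOS
    else:
        subprocess.call(['xdg-open', path])  # Linux desktops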
Data being crawled (screenshot).
The full source code can be found on github.