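[Python Crawler Study] Python 3.7 Scrapy: Installation, Demo Example, and Practice: Crawling Baidu

asoren, published 2019-07-30 18:36 / 3544 reads

Summary: possible installation problems and fixes; demo tutorial and Chinese tutorial docs. Step 1: create the project directory. Step 2: enter it and create the spider. Step 3: create the storage container by copying the file under the project and renaming it. Step 4: edit the spider to extract data and import the item container. Step 5: fix the blank result when crawling the Baidu home page by configuring the settings: set the user agent, handle the related debug messages, and fix garbled saved data…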

Install with pip: pip install scrapy

Possible problems:
Problem/fix: error: Microsoft Visual C++ 14.0 is required.

Demo example tutorial (Chinese tutorial documentation)
Step 1: create the project directory

scrapy startproject tutorial

Step 2: enter the tutorial directory and create the spider

scrapy genspider baidu www.baidu.com

Step 3: create the storage container: copy items.py in the project and rename the copy to BaiduItems.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class BaiduItems(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    pass

Step 4: edit spiders/baidu.py to extract data with XPath

# -*- coding: utf-8 -*-
import scrapy
# Import the item container
from tutorial.BaiduItems import BaiduItems

class BaiduSpider(scrapy.Spider):
    name = "baidu"
    # Domain and start URL as generated by `scrapy genspider baidu www.baidu.com`
    allowed_domains = ["www.baidu.com"]
    start_urls = ["http://www.baidu.com/"]

    def parse(self, response):
        # Each <li> inside a <ul> becomes one item
        for sel in response.xpath("//ul/li"):
            item = BaiduItems()
            item["title"] = sel.xpath("a/text()").extract()
            item["link"] = sel.xpath("a/@href").extract()
            item["desc"] = sel.xpath("text()").extract()
            yield item
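To see what the three XPath expressions in parse() pick out without running a crawl, here is a minimal sketch that uses only the standard library. It rests on several assumptions: the markup is a made-up, well-formed snippet rather than the real Baidu page, and xml.etree.ElementTree supports only a subset of XPath, so this imitates the selection logic and is not Scrapy's selector API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a downloaded page.
SAMPLE = """
<html><body>
  <ul>
    <li><a href="https://news.baidu.com">News</a> headlines</li>
    <li><a href="https://map.baidu.com">Maps</a> directions</li>
  </ul>
</body></html>
""".strip()


def extract_items(markup):
    """Return one dict per <li>, mirroring the title/link/desc fields."""
    root = ET.fromstring(markup)
    items = []
    for li in root.findall(".//ul/li"):      # roughly "//ul/li"
        anchors = li.findall("a")
        items.append({
            "title": [a.text for a in anchors],        # roughly "a/text()"
            "link": [a.get("href") for a in anchors],  # roughly "a/@href"
            # ElementTree stores the text following an <a> as its tail;
            # unlike Scrapy's "text()", whitespace is stripped here.
            "desc": [a.tail.strip() for a in anchors
                     if a.tail and a.tail.strip()],
        })
    return items


items = extract_items(SAMPLE)
for item in items:
    print(item)
```

Every field is a list, which matches what extract() returns in the spider: a list of all matching strings, not a single value.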

Step 5: fix the blank result when crawling the Baidu home page by editing settings.py

# Set the user agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"

# Resolve the robots.txt-related debug messages
ROBOTSTXT_OBEY = False
# Fix garbled characters in the data Scrapy saves
FEED_EXPORT_ENCODING = "utf-8"
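The "garbled" output that FEED_EXPORT_ENCODING addresses is usually not corruption: when the setting is left unset, Scrapy's JSON export writes non-ASCII characters as \uXXXX escape sequences. The sketch below reproduces that difference with the standard json module. It is an illustration under that assumption, not Scrapy's exporter, and the sample title is made up for the demo.

```python
import json

# Hypothetical scraped item with a Chinese title.
item = {"title": ["百度一下，你就知道"]}

# Default behaviour: non-ASCII characters are escaped as \uXXXX.
escaped = json.dumps(item)

# Keeping the characters as-is, which is what a UTF-8 export looks like.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode to the same data; only the on-disk form differs.
same_data = json.loads(escaped) == json.loads(readable)
print(same_data)
```

The escaped form is still valid JSON and loads back to the same item, so the setting mainly matters when people read the file directly.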

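Final step: run the crawl command and save the data to a specified file
Running it may fail with: No module named "win32api" (this module is provided by the pywin32 package); you can download and install the matching version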

scrapy crawl baidu -o baidu.json
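With the .json extension, the export is a JSON array with one object per yielded item. A minimal sketch of reading such a file back follows; the sample content and the load_items helper are hypothetical, and a temporary file stands in for a real baidu.json.

```python
import json
import os
import tempfile

# Hypothetical sample of what the export could contain.
sample = [
    {"title": ["News"], "link": ["https://news.baidu.com"], "desc": []},
    {"title": [], "link": [], "desc": ["plain text item"]},
]


def load_items(path):
    """Load the exported items as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "baidu.json")
    # Stand-in for the file that the crawl command would write.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    loaded = load_items(path)

# Each field is a list, so guard against empty matches before indexing.
first_titles = [item["title"][0] for item in loaded if item["title"]]
print(first_titles)
```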

Deep crawl: the Baidu home page and the pages linked from its navigation menu

# -*- coding: utf-8 -*-
import scrapy

# Adjust the package name to your own project (the earlier steps used tutorial)
from scrapyProject.BaiduItems import BaiduItems

class BaiduSpider(scrapy.Spider):
    name = "baidu"
    # The nav tabs point to other domains; list them here or those requests are filtered out
    allowed_domains = [
        "www.baidu.com",
        "v.baidu.com",
        "map.baidu.com",
        "news.baidu.com",
        "tieba.baidu.com",
        "xueshu.baidu.com"
    ]
    start_urls = ["https://www.baidu.com/"]

    def parse(self, response):
        item = BaiduItems()
        item["title"] = response.xpath("//title/text()").extract()
        yield item
        # BaiduItems must also declare nav and href fields,
        # otherwise the assignments below raise KeyError
        for sel in response.xpath('//a[@class="mnav"]'):
            item = BaiduItems()
            item["nav"] = sel.xpath("text()").extract()
            item["href"] = sel.xpath("@href").extract()
            yield item
            # Build a new request from the extracted nav URL and handle it in the callback;
            # urljoin turns relative or scheme-less hrefs into absolute URLs
            yield scrapy.Request(response.urljoin(item["href"][0]), callback=self.parse_newpage)

    # Extract the title of each tab page
    def parse_newpage(self, response):
        item = BaiduItems()
        item["title"] = response.xpath("//title/text()").extract()
        yield item
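Two things decide whether a followed link is actually crawled: the href must be an absolute URL, and its host must match an entry in allowed_domains. The sketch below approximates both checks with urllib.parse. It is a simplified illustration, not Scrapy's own filtering code, and the link list and the is_allowed and resolve helpers are hypothetical.

```python
from urllib.parse import urljoin, urlparse

# Shortened, hypothetical allow-list for the demo.
ALLOWED = ["www.baidu.com", "news.baidu.com"]


def is_allowed(url, allowed_domains):
    """True if the URL's host equals an allowed domain or is a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def resolve(base_url, href):
    """Turn a relative or scheme-less href into an absolute URL."""
    return urljoin(base_url, href)


base = "https://www.baidu.com/"
links = ["//news.baidu.com/", "https://map.baidu.com/", "/more/"]

resolved = [resolve(base, href) for href in links]
kept = [url for url in resolved if is_allowed(url, ALLOWED)]

print(resolved)
print(kept)
```

Here map.baidu.com is dropped only because it is missing from the shortened list, which is the same reason the spider lists every tab domain explicitly.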

Crawling past a login
a. Solving image CAPTCHAs: pytesseract

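Copyright of this article belongs to the author; do not reproduce it without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reproducing, please cite this article's address: http://systransis.cn/yun/42720.html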
