Scrapy Study Notes (3): Saving Data with Items and Pipelines

13651657101, published 2019-07-25 11:48 / 2945 views

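I've been swamped lately...

The previous post used MongoDB directly inside the spider, which isn't ideal; in Scrapy, the proper approach is to use items.
Define what you want to save in an item, then handle the item in a pipeline. The crawl flow becomes:

Scrape --> collect the needed data according to the item definition --> process it with a pipeline (storage, etc.)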

Define the item: declare the fields to scrape in items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class GetquotesItem(scrapy.Item):
    # define the fields for your item here like:
    # Define what we want to scrape:
    # 1. The quote text
    # 2. The author
    # 3. The tags
    content = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

We keep the database connection settings in settings.py so they are easy to reference:

MONGODB_HOST = "localhost"
MONGODB_PORT = 27017
MONGODB_DBNAME = "store_quotes2"
MONGODB_TABLE = "quotes2"

Also, be sure to uncomment ITEM_PIPELINES in settings.py, otherwise the pipeline will not run:

#ITEM_PIPELINES = {
#    "getquotes.pipelines.SomePipeline": 300,
#}

Change it to

ITEM_PIPELINES = {
    "getquotes.pipelines.GetquotesPipeline": 300,
}
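
The number 300 is the pipeline's order value: when several pipelines are enabled, items pass through them from the lowest number to the highest, and values are customarily chosen in the 0-1000 range. The sketch below is a simplified model of that ordering in plain Python, not Scrapy's internal code; both pipeline classes and the run_pipelines helper are hypothetical names made up for illustration.

```python
# Simplified model of how the numbers in ITEM_PIPELINES order the pipelines.
# This is NOT Scrapy's internal code; the two pipeline classes are hypothetical.

class StripContentPipeline:
    """Hypothetical pipeline: trims whitespace from the quote text."""
    def process_item(self, item, spider):
        item["content"] = item["content"].strip()
        return item


class AddLengthPipeline:
    """Hypothetical pipeline: records the length of the cleaned text."""
    def process_item(self, item, spider):
        item["length"] = len(item["content"])
        return item


# Same shape as the ITEM_PIPELINES setting, but holding the classes directly
# instead of import paths, to keep the sketch self-contained.
ITEM_PIPELINES = {
    AddLengthPipeline: 800,
    StripContentPipeline: 300,
}


def run_pipelines(item, pipelines, spider=None):
    # Lower numbers run first; each pipeline receives the previous one's result.
    for cls, _order in sorted(pipelines.items(), key=lambda pair: pair[1]):
        item = cls().process_item(item, spider)
    return item


result = run_pipelines({"content": "  To be or not to be.  "}, ITEM_PIPELINES)
print(result)
```

Because StripContentPipeline has the lower number, the length is computed on the already-trimmed text even though AddLengthPipeline is listed first in the dict.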

Now define how items are handled in pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

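import pymongo


class GetquotesPipeline(object):

    # Connect to the database
    def __init__(self, host, port, dbname, table):
        client = pymongo.MongoClient(host=host, port=port)

        # Select the database and the collection
        db = client[dbname]
        self.table = db[table]

    # Read the connection info defined in settings.py.
    # The scrapy.conf module is not available in current Scrapy releases,
    # so the settings are taken from the crawler instead.
    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            host=settings.get("MONGODB_HOST"),
            port=settings.getint("MONGODB_PORT"),
            dbname=settings.get("MONGODB_DBNAME"),
            table=settings.get("MONGODB_TABLE"),
        )

    # Process each item
    def process_item(self, item, spider):
        # Convert the item to a dict, then insert it into the database
        # (insert_one replaces the deprecated Collection.insert)
        quote_info = dict(item)
        self.table.insert_one(quote_info)
        return item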
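
Since process_item only needs an object that accepts an insert call, the storage step can be tried out without a running MongoDB. The following is a minimal sketch under that assumption: FakeCollection and StoragePipeline are hypothetical stand-ins written for this example (not part of Scrapy or PyMongo), the sample item is made up, and the sketch calls insert_one, PyMongo's current replacement for the older insert method.

```python
class FakeCollection:
    """Hypothetical stand-in for a pymongo collection; records inserts in a list."""
    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class StoragePipeline:
    """Storage step only: the collection is passed in rather than built from settings."""
    def __init__(self, table):
        self.table = table

    def process_item(self, item, spider):
        # Same steps as the pipeline above: convert to a plain dict, store it,
        # then hand the item back.
        self.table.insert_one(dict(item))
        return item


table = FakeCollection()
pipeline = StoragePipeline(table)

# Made-up sample data with the three fields declared in GetquotesItem
item = {"author": "Some Author", "content": "sample quote text", "tags": ["sample"]}
returned = pipeline.process_item(item, spider=None)

print(table.docs)
print(returned is item)
```

Returning the item matters: Scrapy passes the return value on to any pipeline with a higher order number, so a pipeline that stores data should still hand the item back.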

Accordingly, the code in myspider.py changes a little

import scrapy
import pymongo

# Don't forget to import the item we defined
from getquotes.items import GetquotesItem

class myspider(scrapy.Spider):

    # Set the spider name
    name = "get_quotes"

    # Set the start URL
    start_urls = ["http://quotes.toscrape.com"]

    """
        # Configure the client; default address localhost, port 27017
        client = pymongo.MongoClient("localhost",27017)
        # Create a database named store_quotes
        db_name = client["store_quotes"]
        # Create a collection
        quotes_list = db_name["quotes"]
    """
    def parse(self, response):

        # Select elements with CSS selectors; BeautifulSoup or similar also works if you prefer
        # First locate each whole quote block, then extract the author, quote text, and tags within it
        for quote in response.css(".quote"):
            """
            # Store the scraped data in mongodb using insert
            yield self.quotes_list.insert({
                "author" : quote.css("small.author::text").extract_first(),
                "tags" : quote.css("div.tags a.tag::text").extract(),
                "content" : quote.css("span.text::text").extract_first()
            })
            """
            item = GetquotesItem()
            item["author"] = quote.css("small.author::text").extract_first()
            item["content"] = quote.css("span.text::text").extract_first()
            item["tags"] = quote.css("div.tags a.tag::text").extract()
            yield item


        # Use XPath to get the href attribute of the "next" button
        next_href = response.xpath('//li[@class="next"]/a/@href').extract_first()
        # Check whether next_href exists
        if next_href is not None:

            # If there is a next page, use urljoin to build its full URL,
            # e.g. http://quotes.toscrape.com/page/2/
            next_page = response.urljoin(next_href)

            # Request the next page, with parse as the callback
            yield scrapy.Request(next_page,callback=self.parse)
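
The urljoin step is needed because the href on the next button is a relative link, while scrapy.Request expects an absolute URL. response.urljoin is roughly a wrapper around the standard library's urllib.parse.urljoin with the response's own URL as the base, so the behavior can be sketched without Scrapy. The page URL and href values below are assumptions chosen for illustration.

```python
from urllib.parse import urljoin

# Assumed URL of the page currently being parsed
page_url = "http://quotes.toscrape.com/page/1/"

# Assumed value extracted from the next button's href attribute
next_href = "/page/2/"

# Combine the base URL and the relative link into an absolute URL
next_page = urljoin(page_url, next_href)
print(next_page)

# The same call also works from the start URL, which has no path
first_next = urljoin("http://quotes.toscrape.com", "/page/2/")
print(first_next)
```

On the last page there is no next button, extract_first() returns None, and the spider simply stops yielding requests.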

You can see in Scrapy's log output that the pipeline is enabled.

Now check what was saved in the database.

Everything was saved correctly.

When reposting, please cite this article's address: http://systransis.cn/yun/38619.html
