Abstract: a simple Scrapy crawler for Autohome (汽車之家) car models. It registers a custom item pipeline, redefines the crawl entry points as one listing page per initial letter A-Z, and in the default parse callback extracts each brand, its sub-categories, and every car's name, detail-page URL, and price range.
A simple crawler for Autohome (汽車之家) car models
spider
# -*- coding: utf-8 -*-
# (the Python 2 reload(sys)/setdefaultencoding("utf8") hack from the original
# is unnecessary on Python 3 and has been dropped)
import scrapy
from scrapy import Request
from mininova.items import carItem


class SplashSpider(scrapy.Spider):
    # spider name
    name = "car_home"
    allowed_domains = ["autohome.com.cn"]
    start_urls = []

    # custom per-spider settings
    custom_settings = {
        "ITEM_PIPELINES": {
            "mininova.pipelines.CarPipeline": 300,
        }
    }

    def start_requests(self):
        # redefine the crawl entry points: one listing page per initial letter
        words = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
                 "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"]
        for word in words:
            url = "https://www.autohome.com.cn/grade/carhtml/" + word + ".html"
            self.start_urls.append(url)
            # yield inside the same loop so each request carries its own letter;
            # yielding from a second loop (as in the original) would attach the
            # last letter "Z" to every request
            yield Request(url, meta={"word": word})

    # default parse callback
    def parse(self, response):
        print(response.url)
        word = response.meta["word"]
        car_cates = response.xpath("//dl").extract()
        total_cars = []
        for brand_index in range(len(car_cates)):
            # brand index (XPath positions are 1-based)
            brand_num = str(brand_index + 1)
            # brand name
            brand = response.xpath("//dl[" + brand_num + "]/dt/div[1]/a/text()").extract()[0]
            print("brand:" + brand)
            # brand logo
            brand_logo_url = response.xpath("//dl[" + brand_num + "]/dt//img[1]/@src").extract()[0]
            # brand sub-categories
            brand_items = response.xpath("//dl[" + brand_num + "]/dd//div[@class='h3-tit']/a/text()").extract()
            # pages of the brand sub-categories
            brand_item_urls = response.xpath("//dl[" + brand_num + "]/dd//div[@class='h3-tit']/a/@href").extract()
            for brand_item_index in range(len(brand_items)):
                # sub-category index
                brand_item_num = str(brand_item_index + 1)
                # sub-category name
                brand_item = brand_items[brand_item_index]
                # url of the sub-category's page
                brand_item_url = brand_item_urls[brand_item_index]
                print("brand_item:" + brand_item)
                print("brand_item_url:" + brand_item_url)
                # all cars in this sub-category
                cars = response.xpath("//dl[" + brand_num + "]/dd//ul[@class='rank-list-ul'][" + brand_item_num + "]/li[@id]").extract()
                print("cars_count:" + str(len(cars)))
                for car_index in range(len(cars)):
                    car_num = str(car_index + 1)
                    # name of the specific car
                    name = response.xpath("//dl[" + brand_num + "]/dd//ul[@class='rank-list-ul'][" + brand_item_num + "]/li[@id][" + car_num + "]/h4/a/text()").extract()[0]
                    # the car's detail page
                    url = response.xpath("//dl[" + brand_num + "]/dd//ul[@class='rank-list-ul'][" + brand_item_num + "]/li[@id][" + car_num + "]/h4/a/@href").extract()[0]
                    # quoted price (lowest-highest)
                    price = response.xpath("//dl[" + brand_num + "]/dd//ul[@class='rank-list-ul'][" + brand_item_num + "]/li[@id][" + car_num + "]/div[1]/a/text()").extract()[0]
                    prices = price.split("-")
                    price_base = "萬"  # unit: 10,000 CNY
                    if len(prices) != 2:
                        # no quote available
                        max_price = "暫無"
                        min_price = "暫無"
                    else:
                        max_price = str(prices[1].replace(price_base, ""))
                        min_price = str(prices[0])
                    print("car:" + name + " max_price:" + str(max_price) + " min_price:" + str(min_price) + " price_base:" + price_base)
                    car_item = carItem()
                    car_item["name"] = name
                    car_item["url"] = url
                    car_item["brand_item"] = brand_item
                    car_item["first_word"] = word
                    car_item["brand"] = brand
                    car_item["brand_logo_url"] = brand_logo_url
                    car_item["max_price"] = max_price
                    car_item["min_price"] = min_price
                    total_cars.append(car_item)
        return total_cars
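The spider is started with scrapy crawl car_home from the project root. The price cell on the listing page holds either a range like "10.98-19.98萬" or a placeholder text, which is why parse() checks len(prices) before splitting. The splitting logic can be sanity-checked standalone; the sample strings below are made up for illustration, not taken from a live page:

# Standalone check of the price-splitting logic used in parse();
# sample inputs are invented.
price_base = "萬"
for price in ["10.98-19.98萬", "暫無報價"]:
    prices = price.split("-")
    if len(prices) != 2:
        max_price = min_price = "暫無"
    else:
        max_price = prices[1].replace(price_base, "")
        min_price = prices[0]
    print(min_price, max_price)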
item
# -*- coding: utf-8 -*-
import scrapy


class carItem(scrapy.Item):
    # name of the specific car
    name = scrapy.Field()
    # url of the car's detail page
    url = scrapy.Field()
    # highest quoted price, unit: 萬 (10,000 CNY)
    max_price = scrapy.Field()
    # lowest quoted price, unit: 萬 (10,000 CNY)
    min_price = scrapy.Field()
    # brand name
    brand = scrapy.Field()
    # brand logo
    brand_logo_url = scrapy.Field()
    # brand sub-category name
    brand_item = scrapy.Field()
    # first letter of the brand
    first_word = scrapy.Field()
mongo_car
from mininova.mongodb import Mongo
from mininova.settings import mongo_setting


class MongoCar():
    db_name = "car"
    brand_set_name = "brand"
    brand_item_set_name = "brand_item"
    car_set_name = "car"

    def __init__(self):
        self.db = Mongo(mongo_setting["mongo_host"], mongo_setting["mongo_port"],
                        mongo_setting["mongo_user"], mongo_setting["mongo_password"])

    def insert(self, item):
        brand_where = {"name": item["brand"]}
        brand = self.brand_exist(self.db, brand_where)
        if brand == False:
            brand = {"name": item["brand"], "first_word": item["first_word"]}
            brand = self.insert_brand(self.db, brand)
            print("brand insert ok!")
        else:
            brand = {"name": item["brand"], "first_word": item["first_word"], "logo_url": item["brand_logo_url"]}
            brand = self.update_brand(self.db, brand_where, brand)
            print("brand_exist!")

        brand_item_where = {"name": item["brand_item"]}
        brand_item = self.brand_item_exist(self.db, brand_item_where)
        if brand_item == False:
            brand_item = {"name": item["brand_item"], "first_word": item["first_word"], "brand_id": brand["_id"]}
            brand_item = self.insert_brand_item(self.db, brand_item)
            print("brand_item insert ok!")
        else:
            print("brand_item_exist!")

        # look the car up by name within its sub-category; the original dict
        # repeated the "name" key twice, so only the second value survived
        car_where = {"name": item["name"], "brand_item_id": brand_item["_id"]}
        car = self.car_exist(self.db, car_where)
        if car == False:
            car = {"name": item["name"], "url": item["url"], "max_price": item["max_price"],
                   "min_price": item["min_price"], "first_word": item["first_word"],
                   "brand_id": brand["_id"], "brand_item_id": brand_item["_id"]}
            car = self.insert_car(self.db, car)
            print("car insert ok!")
        else:
            print("car_exist!")
        return car != False

    def update_brand(self, db, brand_where, brand):
        my_set = db.set(self.db_name, self.brand_set_name)
        my_set.update_one(brand_where, {"$set": brand})
        exist = my_set.find_one(brand_where)
        if exist is None:
            return False
        return exist

    def brand_exist(self, db, brand):
        my_set = db.set(self.db_name, self.brand_set_name)
        exist = my_set.find_one(brand)
        if exist is None:
            return False
        return exist

    def insert_brand(self, db, brand):
        my_set = db.set(self.db_name, self.brand_set_name)
        my_set.insert_one(brand)
        return my_set.find_one(brand)

    def brand_item_exist(self, db, brand_item):
        my_set = db.set(self.db_name, self.brand_item_set_name)
        exist = my_set.find_one(brand_item)
        if exist is None:
            return False
        return exist

    def insert_brand_item(self, db, brand_item):
        my_set = db.set(self.db_name, self.brand_item_set_name)
        my_set.insert_one(brand_item)
        return my_set.find_one(brand_item)

    def car_exist(self, db, car):
        my_set = db.set(self.db_name, self.car_set_name)
        exist = my_set.find_one(car)
        if exist is None:
            return False
        return exist

    def insert_car(self, db, car):
        my_set = db.set(self.db_name, self.car_set_name)
        my_set.insert_one(car)
        return my_set.find_one(car)
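The Mongo helper imported from mininova.mongodb is not shown in the post; the code above only implies its interface (a constructor taking host/port/user/password and a set(db_name, set_name) method returning a collection). A minimal stand-in built on pymongo might look like this; the class body is a guess, not the author's implementation:

# Guessed, minimal stand-in for mininova.mongodb.Mongo; only the constructor
# signature and set() are implied by MongoCar above.
import pymongo

class Mongo:
    def __init__(self, host, port, user, password):
        # authSource and other connection options may differ in the real project
        self.client = pymongo.MongoClient(host, port, username=user, password=password)

    def set(self, db_name, set_name):
        # return the pymongo Collection for this database/collection pair
        return self.client[db_name][set_name]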
pipeline
from mininova.mongo_car import MongoCar
# (unused imports from the author's larger project — pymysql, os, Bookdb,
# MongoNovel, copy — have been dropped)


class CarPipeline(object):
    def process_item(self, item, spider):
        mongo_car = MongoCar()
        mongo_car.insert(item)
        print(item["name"])
        print("item insert ok!")
        # pipelines should return the item so later pipelines can process it
        return item
setting
mongo_setting = {
    "mongo_host": "xxx.xxx.xxx.xxx",
    "mongo_port": 27017,
    "mongo_user": "username",
    "mongo_password": "password",
}
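After a crawl has run, the result can be spot-checked directly with pymongo using the same settings; a quick sanity check, assuming the car database and collections created by MongoCar above:

# Quick sanity check after a crawl; reuses the credentials from mongo_setting.
import pymongo
from mininova.settings import mongo_setting

client = pymongo.MongoClient(mongo_setting["mongo_host"], mongo_setting["mongo_port"],
                             username=mongo_setting["mongo_user"],
                             password=mongo_setting["mongo_password"])
db = client["car"]
print(db["brand"].count_documents({}), "brands")
print(db["car"].count_documents({}), "cars")
print(db["car"].find_one({}, {"name": 1, "min_price": 1, "max_price": 1}))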