4. Web crawlers: downloading images with Scrapy tag selectors, and matching tags with regular expressions

KitorinZero, published 2019-07-31 10:33 / 3455 reads

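Summary: HtmlXPathSelector() creates a tag selector object from the html object returned by the response callback and has to be imported first. select(), a method of HtmlXPathSelector, takes a selector rule and returns a list whose elements are tag objects. extract() returns the content filtered by the selector as a list of strings. The selector rules are described below.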

[Baidu Cloud search, find all kinds of resources: http://bdy.lqkweb.com]

[Net-disk search, find all kinds of resources: http://www.swpan.cn]

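Tag selector objects

HtmlXPathSelector() creates a tag selector object; its parameter is the html object that the response callback receives
Required import: from scrapy.selector import HtmlXPathSelector

select() is the tag selector method of HtmlXPathSelector; it takes a selector rule and returns a list whose elements are tag objects

extract() returns the content that the selector filtered out, as a list whose elements are strings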

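Selector rules

  //x  searches any number of levels down for the given tag, e.g. //div finds all div tags
  /x  searches exactly one level down for the given tag
  /@x  selects the given attribute and can be chained onto a path, e.g. @id, @src
  [@class="class name"]  selects tags whose attribute equals the given value; it can be chained onto a path, here matching tags whose class equals the given name
  /text()  gets the text content of a tag
  [x]  picks one element of the set by index; XPath indexes start at 1, not 0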
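The rules above can be tried without a crawler. The following is only a sketch: it uses Python's standard-library xml.etree.ElementTree, which understands a limited subset of XPath, as a stand-in for Scrapy's selectors. The SAMPLE markup and the example.com URLs are made up for illustration. ElementTree has no /@x or /text() step, so attributes are read with .get() instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for a downloaded page (not the real site).
SAMPLE = """
<html><body>
  <div class="showlist">
    <li><a href="/a"><img alt="first" src="http://example.com/1.jpg"/></a></li>
    <li><a href="/b"><img alt="second" src="http://example.com/2.jpg"/></a></li>
  </div>
  <div class="other"><li>ignored</li></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# //div[@class="showlist"]/li : every li directly under a matching div
items = root.findall('.//div[@class="showlist"]/li')

# li[1] is the FIRST li, because XPath positions start at 1
first_img = root.find('.//div[@class="showlist"]/li[1]//img')
second_img = root.find('.//div[@class="showlist"]/li[2]//img')

# Attributes are read with .get() here instead of an /@alt step
print(len(items), first_img.get("alt"), second_img.get("src"))
```

Note that li[0] would match nothing, which matters when an index comes from a Python loop counter.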

Getting the specified tag objects

# -*- coding: utf-8 -*-
import scrapy                                   # import the crawler module
from scrapy.selector import HtmlXPathSelector   # import HtmlXPathSelector (deprecated in later Scrapy releases)
from urllib import request                      # import the request module
import os


class AdcSpider(scrapy.Spider):
    name = "adc"                                # name of the spider
    allowed_domains = ["www.shaimn.com"]
    start_urls = ["http://www.shaimn.com/xinggan/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)       # create the HtmlXPathSelector object from the page response

        # Tag selector: every li directly under a div whose class equals "showlist"
        items = hxs.select('//div[@class="showlist"]/li')
        print(items)                            # prints the list of tag objects

Looping over each li tag to get its child tags, attributes, or text

# -*- coding: utf-8 -*-
import scrapy                                   # import the crawler module
from scrapy.selector import HtmlXPathSelector   # import HtmlXPathSelector (deprecated in later Scrapy releases)
from urllib import request                      # import the request module
import os


class AdcSpider(scrapy.Spider):
    name = "adc"                                # name of the spider
    allowed_domains = ["www.shaimn.com"]
    start_urls = ["http://www.shaimn.com/xinggan/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)       # create the HtmlXPathSelector object from the page response

        # Tag selector: every li directly under a div whose class equals "showlist"
        items = hxs.select('//div[@class="showlist"]/li')
        # print(items)                          # prints the list of tag objects
        # XPath positions start at 1, so loop from 1 to len(items) inclusive
        for i in range(1, len(items) + 1):
            # alt attribute of the img tag under the i-th li
            title = hxs.select('//div[@class="showlist"]/li[%d]//img/@alt' % i).extract()
            # src attribute of the img tag under the i-th li
            src = hxs.select('//div[@class="showlist"]/li[%d]//img/@src' % i).extract()
            if title and src:
                print(title, src)               # each value is a list of strings
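An alternative to rebuilding the full path with an index is to query relative to each li that was already selected. The sketch below shows the idea with the standard-library xml.etree.ElementTree as a stand-in; the SAMPLE markup is invented. With Scrapy the per-item call would be along the lines of li.xpath(".//img/@alt"), since each element of a selector list is itself a selector.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the second li deliberately has no image.
SAMPLE = """
<div class="showlist">
  <li><img alt="first" src="http://example.com/1.jpg"/></li>
  <li><span>no image here</span></li>
  <li><img alt="third" src="http://example.com/3.jpg"/></li>
</div>
"""

root = ET.fromstring(SAMPLE)

pairs = []
for li in root.findall("./li"):      # iterate over the li elements themselves
    img = li.find(".//img")          # search only inside the current li
    if img is None:                  # skip items without an image
        continue
    pairs.append((img.get("alt"), img.get("src")))

print(pairs)
```

Because no index is formatted into the path, the off-by-one question between Python's 0-based range and XPath's 1-based positions does not come up.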

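Downloading the images to local disk

urlretrieve() saves a file locally; parameter 1 is the src of the file to save, parameter 2 is the path to save it to
urlretrieve is a function in urllib's request module, so it needs the import: from urllib import request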

# -*- coding: utf-8 -*-
import scrapy                                   # import the crawler module
from scrapy.selector import HtmlXPathSelector   # import HtmlXPathSelector (deprecated in later Scrapy releases)
from urllib import request                      # import the request module
import os


class AdcSpider(scrapy.Spider):
    name = "adc"                                # name of the spider
    allowed_domains = ["www.shaimn.com"]
    start_urls = ["http://www.shaimn.com/xinggan/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)       # create the HtmlXPathSelector object from the page response
        img_dir = os.path.join(os.getcwd(), "img")
        os.makedirs(img_dir, exist_ok=True)     # make sure the target directory exists

        # Tag selector: every li directly under a div whose class equals "showlist"
        items = hxs.select('//div[@class="showlist"]/li')
        # print(items)                          # prints the list of tag objects
        # XPath positions start at 1, so loop from 1 to len(items) inclusive
        for i in range(1, len(items) + 1):
            # alt and src attributes of the img tag under the i-th li
            title = hxs.select('//div[@class="showlist"]/li[%d]//img/@alt' % i).extract()
            src = hxs.select('//div[@class="showlist"]/li[%d]//img/@src' % i).extract()
            if title and src:
                # print(title[0], src[0])       # index 0 gives the string content
                file_path = os.path.join(img_dir, title[0] + ".jpg")   # build the save path
                request.urlretrieve(src[0], file_path)                 # save the image: arg 1 is the src, arg 2 the save path
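The download step can be exercised without network access. This is a sketch under stated assumptions: save_image is a hypothetical helper, not part of the original spider, and the source "image" is a local temporary file addressed through a file:// URL, which urlretrieve accepts. The helper also replaces characters that are awkward in file names, since an alt text used as a title may contain them.

```python
import os
import re
import tempfile
from pathlib import Path
from urllib import request


def save_image(src, title, out_dir):
    """Download src into out_dir, naming the file after title (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    # Replace characters that are unsafe in file names with underscores
    safe_title = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_") or "image"
    file_path = os.path.join(out_dir, safe_title + ".jpg")
    request.urlretrieve(src, file_path)   # arg 1: source URL, arg 2: save path
    return file_path


with tempfile.TemporaryDirectory() as tmp:
    # A local file stands in for the remote image
    source = Path(tmp) / "source.jpg"
    source.write_bytes(b"fake image bytes")

    saved = save_image(source.as_uri(), "my photo: 1", os.path.join(tmp, "img"))
    copied_ok = Path(saved).read_bytes() == b"fake image bytes"
    print(os.path.basename(saved), copied_ok)
```

In a real spider the src argument would be the extracted image URL. urlretrieve blocks until the download finishes, so for many images Scrapy's own download machinery may be a better fit.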

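xpath() is the tag selector method of the Selector class; its parameter is the selection rule [recommended]

The selector rules are the same as above

Selector() creates a selector object and takes the html response object
Required import: from scrapy.selector import Selector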

# -*- coding: utf-8 -*-
import scrapy                                   # import the crawler module
from scrapy.selector import Selector            # import the Selector class


class AdcSpider(scrapy.Spider):
    name = "adc"                                # name of the spider
    allowed_domains = ["www.shaimn.com"]
    start_urls = ["http://www.shaimn.com/xinggan/"]

    def parse(self, response):
        items = Selector(response=response).xpath('//div[@class="showlist"]/li').extract()
        # print(items)                          # prints the extracted li markup as a list of strings
        # XPath positions start at 1, so loop from 1 to len(items) inclusive
        for i in range(1, len(items) + 1):
            title = Selector(response=response).xpath('//div[@class="showlist"]/li[%d]//img/@alt' % i).extract()
            src = Selector(response=response).xpath('//div[@class="showlist"]/li[%d]//img/@src' % i).extract()
            print(title, src)

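Using regular expressions

Regular expressions cover the cases where selector rules alone cannot filter precisely enough.

There are two ways to use them:

  1. Apply a regex to the result that a selector rule has already filtered

  2. Apply a regex inside the selector rule itself

1. Apply a regex to the result filtered by a selector rule, and let the regex extract the final content

Append .re("regex") to the end of the selector call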

# -*- coding: utf-8 -*-
import scrapy                                   # import the crawler module
from scrapy.selector import Selector            # import the Selector class


class AdcSpider(scrapy.Spider):
    name = "adc"                                # name of the spider
    allowed_domains = ["www.shaimn.com"]
    start_urls = ["http://www.shaimn.com/xinggan/"]

    def parse(self, response):
        items = Selector(response=response).xpath('//div[@class="showlist"]/li//img')[0].extract()
        print(items)                            # prints the markup of the first img tag
        # .re() applies the regex to the selected tag and returns the captured groups as a list
        items2 = Selector(response=response).xpath('//div[@class="showlist"]/li//img')[0].re(r'alt="(\w+)"')
        print(items2)

# Output of print(items2):
# ["人體藝術(shù)mmSunny前凸后翹性感誘惑寫真"]
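What .re() does to a selected tag can be approximated with the standard-library re module. This is a sketch rather than Scrapy's implementation: the tag string is a made-up example, and re.findall plays the role of .re() by returning the captured groups as a list.

```python
import re

# Hypothetical img markup, as extract() might return it for one tag
tag = '<img alt="Sunny_portrait_01" src="http://example.com/1.jpg">'

# \w+ matches letters, digits and underscores (including CJK characters in Python 3)
alts = re.findall(r'alt="(\w+)"', tag)

# [^"]+ takes everything up to the closing quote, so URLs with / and . also match
srcs = re.findall(r'src="([^"]+)"', tag)

print(alts, srcs)
```

The pattern needs the backslash in \w; written as w+ it would only match runs of the literal letter w.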

2. Apply a regex inside the selector rule itself

[re:test(@attribute, "regex")]

# -*- coding: utf-8 -*-
import scrapy                                   # import the crawler module
from scrapy.selector import Selector            # import the Selector class


class AdcSpider(scrapy.Spider):
    name = "adc"                                # name of the spider
    allowed_domains = ["www.shaimn.com"]
    start_urls = ["http://www.shaimn.com/xinggan/"]

    def parse(self, response):
        items = Selector(response=response).xpath('//div').extract()
        # print(items)                          # prints the markup of every div
        # re:test keeps the div elements whose class attribute matches the regex "showlist"
        items2 = Selector(response=response).xpath('//div[re:test(@class, "showlist")]').extract()
        print(items2)
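The difference between an exact attribute test and a regex test can be seen in a small stand-alone sketch. It assumes that re:test behaves like a regex search on the attribute value, and it imitates that with the standard-library re and xml.etree.ElementTree modules rather than Scrapy. The SAMPLE markup is invented.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup with one exact and one partial class match
SAMPLE = """<body>
  <div class="showlist">a</div>
  <div class="showlist wide">b</div>
  <div class="footer">c</div>
</body>"""

root = ET.fromstring(SAMPLE)

# [@class="showlist"] compares the whole attribute value
exact = [d.text for d in root.findall('.//div[@class="showlist"]')]

# A regex search matches anywhere inside the attribute value
pattern = re.compile(r"showlist")
by_regex = [d.text for d in root.iter("div") if pattern.search(d.get("class", ""))]

print(exact, by_regex)
```

Anchoring the pattern, for example ^showlist$, would bring the regex version back to an exact match.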

[Reposted from: http://www.leiqiankun.com/?id=47]


When reposting, please cite this article's address: http://systransis.cn/yun/44026.html
