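Preface
After I finished crawling Twitter, the company asked me to crawl Facebook as well, and ten thousand ××× instantly stampeded through my mind. After all, social networking sites like these have very good anti-crawling mechanisms. But since the task was assigned from above, I had no choice but to grit my teeth and do it. By capturing packets, I found that the data POSTed when logging in to m.facebook.com is simpler than for facebook.com, so I wrote a crawler that uses Scrapy to crawl Facebook.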
Simulated login

from scrapy import Spider
from scrapy.http import Request, FormRequest


class FacebookLogin(Spider):
    download_delay = 0.5
    usr = "××××"  # your username/email/phone number
    pwd = "××××"  # account password

    def start_requests(self):
        return [Request("https://m.facebook.com/", callback=self.parse)]

    def parse(self, response):
        # Fill in and submit the login form found on the mobile home page.
        return FormRequest.from_response(
            response,
            formdata={"email": self.usr, "pass": self.pwd},
            callback=self.remember_browser)

    def remember_browser(self, response):
        # if re.search(r"(checkpoint)", response.url):
        # Use "save_device" instead of "dont_save" to save device
        return FormRequest.from_response(
            response,
            formdata={"name_action_selected": "dont_save"},
            callback=self.after_login)

    def after_login(self, response):
        # Subclasses override this to start crawling once logged in.
        pass
Note: to be on the safe side, you can add a mobile USER_AGENT to the settings file.
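The note above can be sketched as a single entry in the project's settings.py. The user-agent string below is only an example value and does not come from the original post; any reasonably current mobile browser string should serve the same purpose.

```python
# settings.py (sketch): make the crawler identify itself as a mobile browser.
# The string below is an example value, not one taken from the original post.
USER_AGENT = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) "
    "AppleWebKit/601.1.46 (KHTML, like Gecko) "
    "Version/9.0 Mobile/13B143 Safari/601.1"
)
```

Scrapy reads USER_AGENT from the settings module and sends it as the default User-Agent header on every request.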
Crawling basic user information

# -*- coding: UTF-8 -*-
import re
from urllib.parse import urljoin  # Python 3; on Python 2 this lived in urlparse

from scrapy import Item, Field
from scrapy.http import Request

from facebook_login import FacebookLogin


class FacebookItems(Item):
    id = Field()
    url = Field()
    name = Field()
    work = Field()
    education = Field()
    family = Field()
    skills = Field()
    address = Field()
    contact_info = Field()
    basic_info = Field()
    bio = Field()
    quote = Field()
    nicknames = Field()
    relationship = Field()
    image_urls = Field()


class FacebookProfile(FacebookLogin):
    download_delay = 2
    name = "fb"
    links = None
    start_ids = [
        "plok74122", "bear.black.12", "tabaco.wang", "chaolin.chang.q",
        "ahsien.liu", "kaiwen.cheng.100", "liang.kevin.92",
        "bingheng.tsai.9", "psppupu", "cscgbakery", "hc.shiao.l",
        "asusisbad", "benjamin", "franklin",
        # "RobertScoble"
    ]
    # "https://m.facebook.com/tabaco.wang?v=info","https://m.facebook.com/RobertScoble?v=info"]

    def after_login(self, response):
        # Request the "info" view of every profile once login has succeeded.
        for id in self.start_ids:
            url = "https://m.facebook.com/%s?v=info" % id
            yield Request(url, callback=self.parse_profile, meta={"id": id})

    def parse_profile(self, response):
        item = FacebookItems()
        item["id"] = response.meta["id"]
        item["url"] = response.url
        item["name"] = "".join(response.css("#root strong *::text").extract())
        item["work"] = self.parse_info_has_image(response, response.css("#work"))
        item["education"] = self.parse_info_has_image(response, response.css("#education"))
        item["family"] = self.parse_info_has_image(response, response.css("#family"))
        item["address"] = self.parse_info_has_table(response.css("#living"))
        item["contact_info"] = self.parse_info_has_table(response.css("#contact-info"))
        item["basic_info"] = self.parse_info_has_table(response.css("#basic-info"))
        item["nicknames"] = self.parse_info_has_table(response.css("#nicknames"))
        item["skills"] = self.parse_info_text_only(response.css("#skills"))
        item["bio"] = self.parse_info_text_only(response.css("#bio"))
        item["quote"] = self.parse_info_text_only(response.css("#quote"))
        item["relationship"] = self.parse_info_text_only(response.css("#relationship"))
        yield item

    def parse_info_has_image(self, response, css_path):
        # Sections whose entries have a thumbnail, a linked title and a description.
        info_list = []
        for div in css_path.xpath("div/div[2]/div"):
            url = urljoin(response.url, "".join(div.css("div > a::attr(href)").extract()))
            title = "".join(div.css("div").xpath("span | h3").xpath("a/text()").extract())
            info = " ".join(div.css("div").xpath("span | h3").xpath("text()").extract())
            if url and title and info:
                info_list.append({"url": url, "title": title, "info": info})
        return info_list

    def parse_info_has_table(self, css_path):
        # Sections laid out as key/value rows; repeated keys are merged with commas.
        info_dict = {}
        for div in css_path.xpath("div/div[2]/div"):
            key = "".join(div.css("td:first-child div").xpath("span | span/span[1]").xpath("text()").extract())
            value = "".join(div.css("td:last-child").xpath("div//text()").extract()).strip()
            if key and value:
                if key in info_dict:
                    info_dict[key] += ", %s" % value
                else:
                    info_dict[key] = value
        return info_dict

    def parse_info_text_only(self, css_path):
        # Sections that contain only free text; drop empty strings and the "Edit" link.
        text = css_path.xpath("div/div[2]//text()").extract()
        text = [t.strip() for t in text]
        text = [t for t in text if re.search(r"\w+", t) and t != "Edit"]
        return " ".join(text)

Crawling all of a user's photos
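Although photos are displayed at https://m.facebook.com/%s?v=info, the real image link can only be obtained after several more requests. Following the principle of doing as little as possible inside a single spider, I wrote the photo crawling as a separate crawler, as follows: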
# -*- coding: UTF-8 -*-
import hashlib

from scrapy import Item, Field
from scrapy.http import Request
from scrapy.selector import Selector

from facebook_login import FacebookLogin


class FacebookPhotoItems(Item):
    url = Field()
    id = Field()
    photo_links = Field()
    md5 = Field()


class CrawlPhoto(FacebookLogin):
    name = "fbphoto"
    timeline_photo = None
    id = None
    links = []
    start_ids = [
        "plok74122", "bear.black.12", "tabaco.wang", "chaolin.chang.q",
        # "ashien.liu",
        "liang.kevin.92", "qia.chen", "bingheng.tsai.9", "psppupu",
        "cscgbakery", "hc.shiao.l", "asusisbad", "benjamin", "franklin",
        # "RobertScoble"
    ]

    def after_login(self, response):
        for url in self.start_ids:
            yield Request("https://m.facebook.com/%s/photos" % url,
                          callback=self.parse_item, meta={"id": url})
        # yield Request("https://m.facebook.com/%s/photos"%self.id,callback=self.parse_item)

    def parse_item(self, response):
        # print response.body
        urls = response.xpath("//span").extract()
        next_page = None
        try:
            next_page = response.xpath('//div[@class="co"]/a/@href').extract()[0].strip()
        except IndexError:
            pass
        # urls = response.xpath('//div[@data-sigil="marea"]').extract()
        for i in urls:
            # if i.find(u"时间线照片")!=-1:  # "Timeline Photos"
            try:
                self.timeline_photo = Selector(text=i).xpath("//span/a/@href").extract()[0]
                if self.timeline_photo is not None:
                    yield Request("https://m.facebook.com/%s" % self.timeline_photo,
                                  callback=self.parse_photos, meta=response.meta)
            except IndexError:
                # This <span> holds no album link; move on to the next one.
                continue
        if next_page:
            print("-----------------------next image page -----------------------------------------")
            yield Request("https://m.facebook.com/%s" % next_page,
                          callback=self.parse_item, meta=response.meta)

    def parse_photos(self, response):
        urls = response.xpath('//a[@class="bw bx"]/@href').extract()
        # urls = response.xpath('//a[@class="_39pi _4i6j"]/@href').extract()
        for i in urls:
            yield Request("https://m.facebook.com/%s" % i,
                          callback=self.process_photo_url, meta=response.meta)
        if len(urls) == 12:
            # Twelve links is treated as a full page, so follow the "more" link.
            next_page = response.xpath('//div[@id="m_more_item"]/a/@href').extract()[0]
            yield Request("https://m.facebook.com/%s" % next_page,
                          callback=self.parse_photos, meta=response.meta)

    def process_photo_url(self, response):
        # photo_url = response.xpath('//i[@class="img img"]').extract()
        item = FacebookPhotoItems()
        item["url"] = response.url
        item["id"] = response.meta["id"]
        photo_url = response.xpath('//div[@style="text-align:center;"]/img/@src').extract()[0]
        item["photo_links"] = photo_url
        item["md5"] = self.getstr_md5(item["photo_links"]) + ".jpg"
        yield item

    def writefile(self, text):
        # Debug helper: dump a page body to disk for inspection.
        with open("temp2.html", "w") as file:
            file.write(text)
            file.write("\n")

    def getstr_md5(self, input):
        # The MD5 of the photo URL is used as a stable file name.
        if input is None:
            input = ""
        md = hashlib.md5()
        md.update(input.encode("utf-8"))
        return md.hexdigest()
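The spider above builds each follow-up URL by formatting the extracted href into "https://m.facebook.com/%s". If an href already starts with a slash, the result contains a double slash, and an absolute href would be mangled. A minimal sketch of a more forgiving alternative, using only the standard library, follows; the helper name absolute_url and the sample hrefs are hypothetical and used purely for illustration.

```python
from urllib.parse import urljoin

BASE = "https://m.facebook.com/"


def absolute_url(href, base=BASE):
    """Resolve an href extracted from a page against the mobile site root."""
    return urljoin(base, href)


# Hypothetical hrefs, in the three shapes a page might contain.
site_relative = "/photo.php?fbid=1"
bare_relative = "photo.php?fbid=1"
already_absolute = "https://example.com/a.jpg"

# String formatting, as used in the spider, keeps the leading slash.
formatted = "https://m.facebook.com/%s" % site_relative

# urljoin normalises all three shapes.
joined_site = absolute_url(site_relative)
joined_bare = absolute_url(bare_relative)
joined_abs = absolute_url(already_absolute)
```

Inside a Scrapy callback the same effect is available through response.urljoin(href), which resolves against the URL of the current response.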
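Since I picked up Python along the way rather than learning it formally, I have not yet found a good way to integrate the photo-link crawling into the spider that crawls basic information. If any expert knows how, I would appreciate some pointers.
The photos are not downloaded with Scrapy's ImagesPipeline but with the wget command. The reason is the one given above: my Python is just not good enough...
Below is the image download pipeline I wrote myself: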
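For comparison, enabling Scrapy's built-in ImagesPipeline is mostly a matter of settings. The fragment below is a sketch under assumptions rather than the author's configuration: the store directory "image" simply mirrors the folder name used by the wget approach, the items would need to carry a list of URLs in image_urls plus an images field for the results, and the pipeline requires the Pillow library to be installed.

```python
# settings.py (sketch): switch on Scrapy's built-in image pipeline.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Directory where downloaded images are stored. "image" is an assumption that
# mirrors the folder used by the wget approach; any writable path works.
IMAGES_STORE = "image"

# Item fields the pipeline reads URLs from and writes results to.
# These two values are Scrapy's defaults, spelled out here for clarity.
IMAGES_URLS_FIELD = "image_urls"
IMAGES_RESULT_FIELD = "images"
```

With these settings the pipeline downloads every URL listed in an item's image_urls field and records the outcome in images, so no external command is needed.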
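import os
import subprocess


class MyOwenImageDownload(object):
    def process_item(self, item, spider):
        if len(item) > 6:
            # Profile items carry more than six fields; only photo items are downloaded.
            pass
        else:
            file = "image/" + item["id"]
            if os.path.exists(file):
                pass
            else:
                os.makedirs(file)
            # Pass the arguments as a list so the URL is never interpreted by a shell.
            cmd = ["wget", item["photo_links"], "-O", file + "/" + item["md5"],
                   "-P", file, "--timeout=10", "-q"]
            subprocess.call(cmd)
        return item

Conclusion
At this point, the basic structure of the whole crawler is complete... (Source code link)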
In the end, we will remember not the words of our enemies but the silence of our friends
When reposting, please cite this article's address: http://systransis.cn/yun/38167.html