Abstract: In a Scrapy-cluster distributed crawl, URLs are distributed based on IP address. Note that to manage the nodes as one cluster, the crawler project name must be identical on every node and each node must be configured with the same spider tasks. Opening the SpiderKeeper UI in a browser then shows both cluster nodes when starting a crawler; starting the same-named spider on both launches the distributed crawl.
Building a Scrapy-cluster
The kafka-monitor component of the Scrapy-cluster library makes distributed crawling possible.
Scrapyd + SpiderKeeper provide visual management of the crawlers.
Environment

IP | Role
---|---
168.*.*.118 | Scrapy-cluster, scrapyd, spiderkeeper
168.*.*.119 | Scrapy-cluster, scrapyd, kafka, redis, zookeeper
# cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
# python -V
Python 2.7.5
# java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

Zookeeper standalone setup
Download and configure
# wget http://mirror.bit.edu.cn/apache/zookeeper/zookeeper-3.4.13/zookeeper-3.4.13.tar.gz
# tar -zxvf zookeeper-3.4.13.tar.gz
# cd zookeeper-3.4.13/conf
# cp zoo_sample.cfg zoo.cfg
# cd ..
# PATH=/opt/zookeeper-3.4.13/bin:$PATH
# echo 'export PATH=/opt/zookeeper-3.4.13/bin:$PATH' > /etc/profile.d/zoo.sh
(Note the single quotes on the echo line, so that $PATH is expanded at login rather than hard-coded now.)
Start the single node
# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper-3.4.13/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.
# zkServer.sh start
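After starting, re-running the status command should report the node as up; on a default 3.4.x standalone install the output ends with a mode line like this (a quick sanity check, not from the original log):

# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper-3.4.13/bin/../conf/zoo.cfg
Mode: standalone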
Kafka standalone setup

Download
# wget http://mirrors.hust.edu.cn/apache/kafka/2.0.0/kafka_2.12-2.0.0.tgz
# tar -zxvf kafka_2.12-2.0.0.tgz
# cd kafka_2.12-2.0.0/
Configure
# vim config/server.properties

############################# Server Basics #############################
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0
# Address to bind to (comments must sit on their own line in .properties files)
host.name=168.*.*.119
# Default port
port=9092
# Switch to enable topic deletion or not, default value is false
delete.topic.enable=true

############################# Zookeeper #############################
zookeeper.connect=localhost:2181
Start
nohup bin/kafka-server-start.sh config/server.properties &
Stop command: bin/kafka-server-stop.sh config/server.properties
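As a quick smoke test you can create and list a topic against the local zookeeper (the topic name "test" is arbitrary, not part of the original setup):

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
bin/kafka-topics.sh --list --zookeeper localhost:2181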
Redis standalone setup

Install and configure
# yum -y install redis
# vim /etc/redis.conf
bind 168.*.*.119
Start
# systemctl start redis.service
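A quick way to confirm redis is reachable on the bound address (redis-cli ships with the yum package):

# redis-cli -h 168.*.*.119 ping
PONG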
Scrapy-cluster standalone setup

# git clone https://github.com/istresearch/scrapy-cluster.git
# cd scrapy-cluster
# pip install -r requirements.txt
Run the offline unit tests to make sure everything looks right:
# ./run_offline_tests.sh
Modify the configuration
# vim kafka-monitor/settings.py
# vim redis-monitor/settings.py
# vim crawler/crawling/settings.py
In each file, change the following:
# Redis host configuration
REDIS_HOST = "168.*.*.119"
REDIS_PORT = 6379
REDIS_DB = 0

KAFKA_HOSTS = "168.*.*.119:9092"
KAFKA_TOPIC_PREFIX = "demo"
KAFKA_CONN_TIMEOUT = 5
KAFKA_APPID_TOPICS = False
KAFKA_PRODUCER_BATCH_LINGER_MS = 25  # 25 ms before flush
KAFKA_PRODUCER_BUFFER_BYTES = 4 * 1024 * 1024  # 4MB before blocking

# Zookeeper Settings
ZOOKEEPER_ASSIGN_PATH = "/scrapy-cluster/crawler/"
ZOOKEEPER_ID = "all"
ZOOKEEPER_HOSTS = "168.*.*.119:2181"
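Before starting the monitors, a quick connectivity check against these endpoints can save debugging time. A minimal sketch, assuming the redis, kafka-python and kazoo packages pulled in by requirements.txt, and using the same masked address as the settings above:

# check_connectivity.py - sanity-check the endpoints configured above
import redis
from kafka import KafkaProducer
from kazoo.client import KazooClient

HOST = "168.*.*.119"  # same masked address as in settings.py

# redis: ping() returns True when the server is reachable
r = redis.StrictRedis(host=HOST, port=6379, db=0)
print("redis ping: %s" % r.ping())

# kafka: the producer constructor raises NoBrokersAvailable if the broker is down
try:
    producer = KafkaProducer(bootstrap_servers="%s:9092" % HOST)
    print("kafka: connected")
    producer.close()
except Exception as exc:
    print("kafka: failed - %s" % exc)

# zookeeper: start() raises if the ensemble cannot be reached within the timeout
zk = KazooClient(hosts="%s:2181" % HOST)
zk.start(timeout=5)
print("zookeeper: connected")
zk.stop()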
Start the monitors
# nohup python kafka_monitor.py run >> /root/scrapy-cluster/kafka-monitor/kafka_monitor.log 2>&1 &
# nohup python redis_monitor.py >> /root/scrapy-cluster/redis-monitor/redis_monitor.log 2>&1 &

Scrapyd crawler management tool setup
Install
# pip install scrapyd
Configure
# sudo mkdir /etc/scrapyd
# sudo vi /etc/scrapyd/scrapyd.conf
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 5
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 10
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root

[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
Start
# nohup scrapyd >> /root/scrapy-cluster/scrapyd.log 2>&1 &
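To confirm scrapyd is up, hit the daemonstatus.json endpoint registered in the [services] section above (the response shape follows the scrapyd web service; node_name will differ per host):

# curl http://168.*.*.118:6800/daemonstatus.json
{"node_name": "ambari", "status": "ok", "pending": 0, "running": 0, "finished": 0}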
It is recommended to put scrapyd behind an Nginx reverse proxy, e.g. to add access control.
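A minimal sketch of such a proxy with basic auth; the external port 6801 and the htpasswd path are illustrative assumptions, not part of the original setup:

server {
    listen 6801;                                   # external port for the proxy
    location / {
        proxy_pass http://127.0.0.1:6800;          # scrapyd listening locally
        auth_basic "scrapyd";
        auth_basic_user_file /etc/nginx/htpasswd;  # create with the htpasswd tool
    }
}

If you proxy this way, also change bind_address in scrapyd.conf from 0.0.0.0 to 127.0.0.1 so scrapyd is not reachable directly.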
Startup error
File "/usr/local/lib/python3.6/site-packages/scrapyd-1.2.0-py3.6.egg/scrapyd/app.py", line 2, in <module>
    from twisted.application.internet import TimerService, TCPServer
File "/usr/local/lib64/python3.6/site-packages/twisted/application/internet.py", line 54, in <module>
    from automat import MethodicalMachine
File "/usr/local/lib/python3.6/site-packages/automat/__init__.py", line 2, in <module>
    from ._methodical import MethodicalMachine
File "/usr/local/lib/python3.6/site-packages/automat/_methodical.py", line 210, in <module>
    class MethodicalInput(object):
File "/usr/local/lib/python3.6/site-packages/automat/_methodical.py", line 220, in MethodicalInput
    @argSpec.default
builtins.TypeError: "_Nothing" object is not callable
Failed to load application: "_Nothing" object is not callable
Fix: downgrade Automat
pip install Automat==0.6.0

SpiderKeeper crawler management UI setup
Install
pip install SpiderKeeper
Start
mkdir /root/spiderkeeper/
nohup spiderkeeper --server=http://168.*.*.118:6800 --username=admin --password=admin --database-url=sqlite:////root/spiderkeeper/SpiderKeeper.db >> /root/scrapy-cluster/spiderkeeper.log 2>&1 &
Open http://168.*.*.118:5000 in a browser.
Managing crawlers with SpiderKeeper

Deploy the crawler project with scrapyd-deploy. First, modify the scrapy.cfg configuration:
vim /root/scrapy-cluster/crawler/scrapy.cfg
[settings]
default = crawling.settings

[deploy]
url = http://168.*.*.118:6800/
project = crawling
Add new spiders under the spiders directory (a sketch of a minimal spider follows below):
cd /root/scrapy-cluster/crawler/crawling/spiders
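For reference, a trimmed-down sketch of what such a spider can look like, modeled on the bundled crawling/spiders/link_spider.py; the spider name "demo" is hypothetical, and the import paths and RawResponseItem fields should be verified against crawling/items.py in your checkout:

from crawling.items import RawResponseItem
from crawling.spiders.redis_spider import RedisSpider


class DemoSpider(RedisSpider):
    # hypothetical spider name, used later when scheduling crawls
    name = "demo"

    def parse(self, response):
        # scrapy-cluster threads appid/crawlid through response.meta
        item = RawResponseItem()
        item["appid"] = response.meta["appid"]
        item["crawlid"] = response.meta["crawlid"]
        item["url"] = response.request.url
        item["response_url"] = response.url
        item["status_code"] = response.status
        item["body"] = response.body
        item["links"] = []
        item["attrs"] = {}
        yield item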
Deploy the project with scrapyd-deploy
# cd /root/scrapy-cluster/crawler
# scrapyd-deploy
Packing version 1536225989
Deploying to project "crawling" in http://168.*.*.118:6800/addversion.json
Server response (200):
{"status": "ok", "project": "crawling", "version": "1536225989", "spiders": 3, "node_name": "ambari"}

Configuring the crawler project in SpiderKeeper
Log in to SpiderKeeper and create a project.
Use the project name configured in scrapy.cfg (crawling).
After creation, all spiders show up under Spiders -> Dashboard.
Scrapy-cluster distributed crawling

Scrapy Cluster coordinates the crawler servers to maximize content throughput while controlling how fast the cluster as a whole crawls each site.
Scrapy Cluster provides two main strategies for throttling how hard the crawlers hit each domain, keyed by spider type and by IP address; both operate on per-domain queues.
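The corresponding knobs live in crawling/settings.py; the values below reflect the stock defaults at the time of writing and should be verified against your checkout:

# allow QUEUE_HITS requests per QUEUE_WINDOW seconds for each domain
QUEUE_HITS = 10
QUEUE_WINDOW = 60
# key the throttle by spider type and/or by crawler IP address
SCHEDULER_TYPE_ENABLED = True
SCHEDULER_IP_ENABLED = True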
In a Scrapy-cluster distributed crawl, URLs are distributed based on IP address: with the cluster running on different machines, each crawler on each server pulls links off the shared queue.
Deploying the second scrapy-cluster node

Set up a new server following the scrapy-cluster standalone setup above, reusing the first server's kafka-monitor/settings.py, redis-monitor/settings.py and crawling/settings.py.
The "Current public ip" problem

Because both servers sit on the same internal network, every spider obtains the same public IP at runtime, which prevents the scrapy-cluster scheduler from distributing links by IP:
2018-09-07 16:08:29,684 [sc-crawler] DEBUG: Current public ip: b"110.*.*.1"
The code in question is around line 282 of /root/scrapy-cluster/crawler/crawling/distributed_scheduler.py:
try:
    obj = urllib.request.urlopen(settings.get("PUBLIC_IP_URL",
                                              "http://ip.42.pl/raw"))
    results = self.ip_regex.findall(obj.read())
    if len(results) > 0:
        # results[0] is the public IP address, here 110.*.*.1
        self.my_ip = results[0]
    else:
        raise IOError("Could not get valid IP Address")
    obj.close()
    self.logger.debug("Current public ip: {ip}".format(ip=self.my_ip))
except IOError:
    self.logger.error("Could not reach out to get public ip")
    pass
Suggested fix: change the code to use the machine's own LAN IP instead.
self.my_ip = [(s.connect(("8.8.8.8", 53)), s.getsockname()[0], s.close())
              for s in [socket.socket(socket.AF_INET, socket.SOCK_DGRAM)]][0][1]
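Unpacked into a readable form, the one-liner amounts to this (a sketch of the same trick; with a SOCK_DGRAM socket no packet is actually sent, connect() merely selects the outbound interface):

import socket

def get_local_ip():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # 8.8.8.8:53 is just any routable address; nothing is transmitted
        s.connect(("8.8.8.8", 53))
        return s.getsockname()[0]
    finally:
        s.close()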
Running the distributed crawler

Run the same spider on both scrapy-cluster nodes:
execute(["scrapy", "runspider", "crawling/spiders/link_spider.py"])
Feed multiple links via python kafka_monitor.py feed; with DEBUG logging enabled you can watch how the links are distributed across the nodes.
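For example (the payload format follows the scrapy-cluster documentation; the appid and crawlid values here are arbitrary):

# python kafka_monitor.py feed '{"url": "http://example.com", "appid": "testapp", "crawlid": "abc1234"}'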
Managing the distributed crawler with SpiderKeeper

Configure scrapyd to manage the second scrapy-cluster node: on the second server, install and configure scrapyd as described in the scrapyd setup section above, then change the deploy target in scrapy.cfg:
[settings]
default = crawling.settings

[deploy]
url = http://168.*.*.119:6800/
project = crawling
After starting scrapyd, use scrapyd-deploy to deploy the crawler project on both scrapy-cluster nodes.
Connecting SpiderKeeper to multiple scrapy-cluster nodes: restart SpiderKeeper, pointing it at both nodes' scrapyd instances:
nohup spiderkeeper --server=http://168.*.*.118:6800 --server=http://168.*.*.119:6800 --username=admin --password=admin --database-url=sqlite:////root/spiderkeeper/SpiderKeeper.db >> /root/scrapy-cluster/spiderkeeper.log 2>&1 &
Note: for SpiderKeeper to manage them as one cluster, the crawler project name must be identical on every node, and every scrapy-cluster node must be configured with the same spider tasks.
Open http://168.*.*.118:5000 in a browser. When starting a crawler you will now see both scrapy-cluster nodes; starting the same-named spider on both of them kicks off the distributed crawl.
Status after the distributed crawler has started.