To begin, here are the official definitions of each piece of software used in this article.
Scrapy
An open source and collaborative framework for extracting the data you
need from websites. In a fast, simple, yet extensible way.
Scrapyd
Scrapy comes with a built-in service, called “Scrapyd”, which allows
you to deploy (aka. upload) your projects and control their spiders
using a JSON web service.
Scrapydweb
A full-featured web UI for Scrapyd cluster management,
with Scrapy log analysis & visualization supported.
Docker
Docker Container: A container is a standard unit of software that packages up code and
all its dependencies so the application runs quickly and reliably from
one computing environment to another. A Docker container image is a
lightweight, standalone, executable package of software that includes
everything needed to run an application: code, runtime, system tools,
system libraries and settings.
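The system as a whole does not depend on Docker. Docker gives us a standardized runtime environment, which lowers operations cost and offers a fast, uniform approach for distributed deployment later on. Likewise, scrapyd + scrapydweb only serve to provide a UI for observing and testing.
scrapy, scrapyd, and scrapydweb could also be split into three separate images; for ease of explanation, a single Docker image is used here.
A scrapy project can be deployed to scrapyd either with the command-line tool scrapyd-deploy or from the deploy console in the scrapydweb admin UI. Either way, the scrapyd service must already be running and listening (port 6800 by default).
The scrapyd service can run on the internal network only: scrapydweb reaches the servers listed in SCRAPYD_SERVERS through their internal addresses, and only scrapydweb itself needs to expose its listening port (5000 by default) to the outside.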
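As a rough illustration of the JSON web service that scrapyd exposes on port 6800, the sketch below builds requests for the listprojects.json and schedule.json endpoints using only the Python standard library. The base URL and the helper names (`endpoint`, `schedule_request`, `call`) are assumptions made for this example, not part of scrapyd or scrapyd-client.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen

# Assumption: scrapyd listens on its default port on the local machine.
SCRAPYD_URL = "http://127.0.0.1:6800/"


def endpoint(name, base=SCRAPYD_URL):
    """Build the URL of a scrapyd JSON endpoint such as 'listprojects.json'."""
    return urljoin(base, name)


def schedule_request(project, spider, base=SCRAPYD_URL):
    """Prepare (but do not send) the POST request that schedules a spider run."""
    body = urlencode({"project": project, "spider": spider}).encode("ascii")
    return Request(endpoint("schedule.json", base), data=body, method="POST")


def call(request_or_url, timeout=5):
    """Send the request and decode scrapyd's JSON reply."""
    with urlopen(request_or_url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (needs a running scrapyd; both names below are hypothetical):
#   call(endpoint("listprojects.json"))
#   call(schedule_request("myproject", "myspider"))
```

Keeping request construction separate from sending makes it possible to inspect the URL and body before anything touches the network.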
The Dockerfile is adapted from aciobanu/scrapy:
FROM alpine:latest
RUN echo "https://mirror.tuna.tsinghua.edu.cn/alpine/latest-stable/main/" > /etc/apk/repositories
#RUN apk update && apk upgrade
RUN apk -U add gcc bash bash-doc bash-completion libffi-dev libxml2-dev libxslt-dev libevent-dev musl-dev openssl-dev python-dev py-imaging py-pip redis curl ca-certificates && update-ca-certificates && rm -rf /var/cache/apk/*
RUN pip install --upgrade pip && pip install Scrapy
RUN pip install scrapyd && pip install scrapyd-client && pip install scrapydweb
RUN pip install fake_useragent && pip install scrapy_proxies && pip install sqlalchemy && pip install mongoengine && pip install redis
WORKDIR /runtime/app
EXPOSE 5000
COPY launch.sh /runtime/launch.sh
RUN chmod +x /runtime/launch.sh
# Uncomment the line below once testing succeeds
# ENTRYPOINT ["/runtime/launch.sh"]
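If you split scrapy + scrapyd + scrapydweb into three separate images, split the service-startup part below accordingly and let the containers communicate through the link option at container startup.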
#!/bin/sh
# kill any existing scrapyd process if any
kill -9 $(pidof scrapyd)
# enter directory where configure file lies and launch scrapyd
cd /runtime/app/scrapyd && nohup /usr/bin/scrapyd > ./scrapyd.log 2>&1 &
cd /runtime/app/scrapydweb && /usr/bin/scrapydweb
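The launch script starts scrapyd in the background and then starts scrapydweb immediately, so scrapydweb may come up before scrapyd is accepting connections. One possible way to guard against that is a small readiness check like the sketch below. The helper name `wait_for_port` and its default timings are assumptions for this example and do not appear in the original setup.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Usage (hypothetical): block until scrapyd answers on its default port.
#   if not wait_for_port("127.0.0.1", 6800):
#       raise SystemExit("scrapyd did not start in time")
```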
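The directory structure of /runtime/app is:
Root directory (/usr/local/src/scrapy-d-web [actual directory on the host] : /runtime/app [directory inside the container])
Dockerfile - after editing, run [docker build -t scrapy-d-web:v1 .] to build the image. I first built on an Alibaba Cloud instance with 1 CPU and 1 GB of memory, but lxml kept failing; after upgrading to 2 GB of memory the build succeeded.
scrapyd - holds scrapyd's configuration file and other directories
scrapydweb - holds scrapydweb's configuration file
knowsmore - project directory 1, created with scrapy startproject
pxn - project directory 2, created with scrapy startproject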
Now let's start each service by hand and walk through them step by step. First, start the container and open a shell:
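# Create a custom network. This step can be skipped if the services are not split into separate containers, because they then listen on localhost; once split, fixed IP addresses are needed for the scrapyd + scrapydweb configuration below.
docker network create --subnet=192.168.0.0/16 mynetwork
# Set the network address and container name; set up the directory mapping and port mapping.
docker run -it --rm --net mynetwork --ip 192.168.1.100 --name scrapyd -p 5000:5000 -v /usr/local/src/scrapy-d-web/:/runtime/app scrapy-d-web:v1 /bin/sh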
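A fixed container address only makes sense if it lies inside the subnet of the custom network. As a quick sanity check, the sketch below uses Python's standard ipaddress module to confirm that 192.168.1.100 belongs to 192.168.0.0/16. The function name `ip_in_subnet` is made up for this example.

```python
import ipaddress


def ip_in_subnet(ip, subnet):
    """Return True if the address lies inside the given CIDR subnet."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(subnet)


# The values passed to "docker network create" and "docker run" above.
print(ip_in_subnet("192.168.1.100", "192.168.0.0/16"))
```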
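Change into the directory that contains scrapyd.conf (/runtime/app/scrapyd). Here I use the scrapyd.conf in the current directory; for the order in which scrapyd looks up configuration files at startup, consult the official scrapyd documentation. The official example configuration file follows: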
[scrapyd]
eggs_dir = eggs
logs_dir = logs
items_dir =
jobs_to_keep = 5
dbs_dir = dbs
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
# Not changed to 0.0.0.0 because external access is not needed
bind_address = 127.0.0.1
# If you change the port number here, remember to update the scrapydweb configuration as well
http_port = 6800
debug = off
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root

[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
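Because the bind address and port in scrapyd.conf must agree with what scrapydweb is told, it can help to read them back programmatically. The sketch below parses a shortened sample of the configuration with Python's standard configparser; the sample text and the helper names (`scrapyd_endpoint`, `reachable_from_other_hosts`) are assumptions for illustration, and the fallbacks assume the defaults shown in the example file.

```python
import configparser

# Shortened sample; only the keys relevant to connectivity are included.
SAMPLE = """\
[scrapyd]
bind_address = 127.0.0.1
http_port = 6800
logs_dir = logs
"""


def scrapyd_endpoint(conf_text):
    """Return (bind_address, http_port) from the text of a scrapyd.conf file."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    section = parser["scrapyd"]
    host = section.get("bind_address", fallback="127.0.0.1")
    port = section.getint("http_port", fallback=6800)
    return host, port


def reachable_from_other_hosts(host):
    """A loopback bind address means only local clients can connect."""
    return host not in ("127.0.0.1", "localhost", "::1")


host, port = scrapyd_endpoint(SAMPLE)
print(host, port, reachable_from_other_hosts(host))
```

With the loopback address, a scrapydweb instance running in a different container or on a different host could not reach scrapyd, which is why bind_address has to change when the services are split.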
Open another terminal and enter the Docker container started above, change into the directory that contains the scrapydweb configuration file (/runtime/app/scrapydweb), and start scrapydweb.
docker exec -it scrapyd /bin/bash
For full details on the scrapydweb project, see its GitHub page. Part of my configuration follows:
############################## ScrapydWeb #####################################
# Setting SCRAPYDWEB_BIND to "0.0.0.0" or IP-OF-CURRENT-HOST would make
# ScrapydWeb server visible externally, otherwise, set it to "127.0.0.1".
# The default is "0.0.0.0".
SCRAPYDWEB_BIND = "0.0.0.0"
# Accept connections on the specified port, the default is 5000.
SCRAPYDWEB_PORT = 5000

# The default is False, set it to True to enable basic auth for web UI.
ENABLE_AUTH = True
# In order to enable basic auth, both USERNAME and PASSWORD should be non-empty strings.
USERNAME = "user"
PASSWORD = "pass"

############################## Scrapy #########################################
# ScrapydWeb is able to locate projects in the SCRAPY_PROJECTS_DIR,
# so that you can simply select a project to deploy, instead of eggifying it in advance.
# e.g., "C:/Users/username/myprojects/" or "/home/username/myprojects/"
SCRAPY_PROJECTS_DIR = "/runtime/app/"

############################## Scrapyd ########################################
# Make sure that [Scrapyd](https://github.com/scrapy/scrapyd) has been installed
# and started on all of your hosts.
# Note that for remote access, you have to manually set "bind_address = 0.0.0.0"
# in the configuration file of Scrapyd and restart Scrapyd to make it visible externally.
# Check out "https://scrapyd.readthedocs.io/en/latest/config.html#example-configuration-file" for more info.

# - the string format: username:password@ip:port#group
#   - The default port would be 6800 if not provided,
#   - Both basic auth and group are optional.
#   - e.g., "127.0.0.1" or "username:password@192.168.123.123:6801#group"
# - the tuple format: (username, password, ip, port, group)
#   - When the username, password, or group is too complicated (e.g., contains ":@#"),
#   - or if ScrapydWeb fails to parse the string format passed in,
#   - it's recommended to pass in a tuple of 5 elements.
#   - e.g., ("", "", "127.0.0.1", "", "") or ("username", "password", "192.168.123.123", "6801", "group")
SCRAPYD_SERVERS = [
    "192.168.1.100:6800",  # If scrapyd runs in the same container, just use localhost; this shows the case of a different container or host
    # "username:password@localhost:6801#group",
    # ("username", "password", "localhost", "6801", "group"),
]

# If the IP part of a Scrapyd server is added as "127.0.0.1" in the SCRAPYD_SERVERS above,
# ScrapydWeb would try to read Scrapy logs directly from disk, instead of making a request
# to the Scrapyd server.
# Check out this link to find out where the Scrapy logs are stored:
# https://scrapyd.readthedocs.io/en/stable/config.html#logs-dir
# e.g., "C:/Users/username/logs/" or "/home/username/logs/"
SCRAPYD_LOGS_DIR = "/runtime/app/scrapyd/logs/"
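To make the two SCRAPYD_SERVERS formats concrete, here is a minimal sketch that normalizes either form into a 5-tuple. It is not ScrapydWeb's own parser; the function `parse_server` is hypothetical, and it deliberately handles only the simple case where the username, password, and group contain none of the characters ":@#" (the case for which the settings comments recommend the tuple format instead).

```python
def parse_server(entry, default_port="6800"):
    """Normalize a server entry to (username, password, ip, port, group)."""
    if isinstance(entry, tuple):
        # Tuple format: already split into its five parts.
        username, password, ip, port, group = entry
    else:
        # String format: username:password@ip:port#group
        rest, _, group = entry.partition("#")
        auth, at, address = rest.rpartition("@")
        username, _, password = auth.partition(":") if at else ("", "", "")
        ip, _, port = address.partition(":")
    return (username, password, ip, port or default_port, group)


for server in [
    "192.168.1.100:6800",
    "username:password@localhost:6801#group",
    ("", "", "127.0.0.1", "", ""),
]:
    print(parse_server(server))
```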
Then simply visit http://[YOUR IP ADDRESS]:5000.
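Since ENABLE_AUTH is set to True in the configuration above, a scripted check of the web UI has to send HTTP basic auth credentials. The sketch below prepares such a request with the standard library; the helper name `ui_request` is an assumption for this example, and the default credentials are simply the placeholder values from the settings shown earlier.

```python
import base64
from urllib.request import Request


def ui_request(host, port=5000, username="user", password="pass"):
    """Prepare a GET request for the web UI with an HTTP basic auth header."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return Request(
        f"http://{host}:{port}/",
        headers={"Authorization": f"Basic {token}"},
    )


# Usage (needs a running scrapydweb; replace the host with your own address):
#   from urllib.request import urlopen
#   with urlopen(ui_request("127.0.0.1"), timeout=5) as resp:
#       print(resp.status)
```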
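Copyright of this article belongs to the author. Please do not repost without permission. If this article violates any rules, you may contact the administrator to have it removed.
When reposting, please cite this article's address: http://systransis.cn/yun/27664.html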