摘要:今天介紹一下基于asyncio和aiohttp的異步爬蟲(chóng)的編寫(xiě),解析html用的是xpath。通過(guò)輸入問(wèn)題,該爬蟲(chóng)能爬取關(guān)于健康方面的數(shù)據(jù)。先讀取規(guī)則,再爬取數(shù)據(jù)。
今天介紹一下基于asyncio和aiohttp的異步爬蟲(chóng)的編寫(xiě),解析html用的是xpath。
該爬蟲(chóng)實(shí)現(xiàn)了以下功能:
1.讀取csv文件中的爬取規(guī)則,根據(jù)規(guī)則爬取數(shù)據(jù);代碼中添加了對(duì)3個(gè)網(wǎng)站的不同提取規(guī)則,如有需要,還可以繼續(xù)添加;
2.將爬取到的數(shù)據(jù)保存到mysql數(shù)據(jù)庫(kù)中。
通過(guò)輸入問(wèn)題,該爬蟲(chóng)能爬取關(guān)于健康方面的數(shù)據(jù)。
具體代碼如下:
# coding:utf-8
"""
Async health-data spider based on asyncio + aiohttp; html parsed with xpath.

Per-site crawl rules (url template and xpath expressions) are read from
rules.csv; scraped records are stored into MySQL through aiomysql.
"""
from lxml import etree
import csv
import re
import os
import asyncio
import aiohttp
import aiomysql
from datetime import datetime

from config import Config


class HealthSpider(object):
    """Crawl one site (described by a rule row from rules.csv) for a keyword."""

    def __init__(self, user_id, keyword, url, hrule, drule, count, trule):
        self.user_id = user_id
        self.keyword = keyword
        self.url = url            # entry (search-result) url of the site
        self.hrule = hrule        # xpath: item urls on the list page
        self.drule = drule        # xpath: content block on a detail page
        self.count = count        # items to keep; also discriminates the site layout
        self.trule = trule        # xpath: title on a detail page
        self.headers = ""         # replaced by a real dict in main()
        self.urls_done = []       # urls already crawled
        self.urls_will = []       # urls waiting to be crawled
        self.spider_data = {}     # result: {"user_id", "keyword", "data": [...]}

    @staticmethod
    def handle_flag(html):
        """Strip inline style="...;" attributes so the html renders cleanly.

        :param html: html fragment as a string
        :return: the fragment with style attributes removed
        """
        # BUG FIX: the original pattern literal had unescaped double quotes
        # inside a double-quoted string and did not even compile.
        pattern = re.compile(r' style=".*?;"', re.S)
        return pattern.sub("", html)

    async def get_html(self, url, session):
        """Fetch *url* and return the body text (None for non-2xx statuses).

        :param url: page to fetch
        :param session: shared aiohttp.ClientSession
        :raises Exception: on any network/timeout error
        """
        try:
            async with session.get(url, headers=self.headers, timeout=5) as resp:
                if resp.status in (200, 201):
                    return await resp.text()
        except Exception as e:
            raise Exception("數(shù)據(jù)搜索錯(cuò)誤") from e

    def get_url(self, resp):
        """Fill self.urls_will with detail-page urls taken from the list page.

        :param resp: html of the list page
        """
        root = etree.HTML(str(resp))
        items = root.xpath(self.hrule)
        # dxy.com (count == 5) yields relative urls that need the host prefixed;
        # the other configured sites already give absolute urls.
        if self.count == 5:
            self.urls_will = ["https://dxy.com" + i for i in items[:5]]
        else:
            self.urls_will = list(items[:self.count])

    async def get_data(self, url, session, pool):
        """Crawl one detail page and insert the record into MySQL.

        :param url: detail-page url
        :param session: shared aiohttp.ClientSession
        :param pool: aiomysql connection pool
        """
        html = await self.get_html(url, session)
        root = etree.HTML(str(html))
        html_data = ""
        try:
            title = "".join(root.xpath(self.trule))
        except Exception:
            title = ""
        try:
            data = root.xpath(self.drule)
            if data:
                # Sites differ in whether the rule matches many nodes or one.
                if self.count == 3:
                    html_data = "".join(map(etree.tounicode, data))
                else:
                    html_data = etree.tounicode(data[0])
                # Drop style attributes before storing/serving the fragment.
                html_data = HealthSpider.handle_flag(html_data)
        except Exception:
            # BUG FIX: was `html_data = []` — keep the type consistent (str).
            html_data = ""
        self.urls_done.append(url)
        # Persist: user id, keyword, date, main url, sub url, title, html data.
        if html_data:
            self.spider_data["data"].append({"title": title,
                                             "html_data": html_data})
            spide_date = datetime.now()
            record = (self.user_id, self.keyword, spide_date, self.url, url,
                      title, html_data)
            stmt = ("INSERT INTO spider_data (user_id, keyword, spide_date, "
                    "main_url, sub_url, title, html_data) "
                    "VALUES (%s, %s, %s, %s, %s, %s, %s)")
            try:
                async with pool.acquire() as conn:
                    async with conn.cursor() as cur:
                        await cur.execute(stmt, record)
            except Exception:
                # Best effort: a failed insert must not abort the crawl.
                pass

    async def start_spider(self, pool):
        """Crawl queued urls until self.count records have been collected.

        :param pool: aiomysql connection pool
        :return: the self.spider_data dict
        """
        async with aiohttp.ClientSession() as session:
            self.spider_data["user_id"] = self.user_id
            self.spider_data["keyword"] = self.keyword
            self.spider_data["data"] = []
            # Stop when the queue is empty or enough records were gathered.
            while self.urls_will and len(self.spider_data["data"]) != self.count:
                url = self.urls_will.pop()
                if url not in self.urls_done:
                    await self.get_data(url, session, pool)
            return self.spider_data

    async def main(self, loop):
        """Entry coroutine: open a MySQL pool, seed the url queue, crawl.

        :param loop: event loop the aiomysql pool is bound to
        :return: the crawl result dict
        """
        # Browser-like request headers.
        self.headers = {
            "Accept": "text/html, application/xhtml+xml, "
                      "application/xml;q=0.9,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-Hans-CN, zh-Hans; q=0.5",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063"
        }
        pool = await aiomysql.create_pool(host=Config.DB_HOST,
                                          port=Config.DB_PORT,
                                          user=Config.DB_USER,
                                          password=Config.DB_PASSWORD,
                                          db=Config.DB_NAME, loop=loop,
                                          charset="utf8", autocommit=True)
        async with aiohttp.ClientSession() as session:
            # Fetch the list page once and seed the url queue.
            html = await self.get_html(self.url, session)
            self.get_url(html)
            data = await self.start_spider(pool)
            return data


def get_rules(keyword):
    """Read per-site xpath crawl rules from rules.csv.

    :param keyword: search keyword interpolated into each rule's url template
    :return: list of dicts with keys url/hrule/drule/count/trule
    """
    csv_dict = []
    path = os.path.join(os.path.dirname(__file__), "rules.csv")
    # BUG FIX: mode "rU" was removed in Python 3.11; csv wants newline="".
    with open(path, "r", newline="") as f:
        for line in csv.DictReader(f):
            csv_dict.append({"url": line["url"].format(keyword),
                             "hrule": line["hrule"],
                             "drule": line["drule"],
                             "count": int(line["count"]),
                             "trule": line["trule"]})
    return csv_dict


def start_spider(keyword):
    """Crawl every configured site for *keyword* and return merged records.

    :param keyword: the health question to search for
    :return: list of {"title", "html_data"} dicts
    :raises Exception: when the rule file cannot be read
    """
    try:
        data_list = get_rules(keyword)
    except Exception as e:
        raise Exception("搜索規(guī)則獲取失敗") from e
    spider_data = []
    tasks = []
    loop = asyncio.get_event_loop()
    for rule in data_list:
        spider = HealthSpider(1, keyword, rule["url"], rule["hrule"],
                              rule["drule"], rule["count"], rule["trule"])
        # One task per site, run concurrently on the same loop.
        tasks.append(asyncio.ensure_future(spider.main(loop)))
    loop.run_until_complete(asyncio.wait(tasks))
    try:
        for task in tasks:
            spider_data.extend(task.result()["data"])
    except Exception:
        # A failed site must not discard results from the others.
        pass
    # Brief delay so underlying connections can close before the loop stops.
    loop.run_until_complete(asyncio.sleep(0.250))
    loop.close()
    return spider_data


if __name__ == "__main__":
    # Crawl content related to "what to do about a cold".
    start_spider("感冒了怎么辦")
下面講一下代碼中某些方法的作用:
1.handle_flag()方法用于去掉html字符串中的style樣式標(biāo)簽,保留html中的其他標(biāo)簽,便于前端的展示;
2.get_data()方法用于爬取具體數(shù)據(jù),并使用aiomysql將爬取到的數(shù)據(jù)保存到數(shù)據(jù)庫(kù);
數(shù)據(jù)庫(kù)的配置文件config.py:
# coding=utf-8


class Config(object):
    """Database connection settings for the spider."""

    DB_ENGINE = "mysql"
    DB_HOST = "127.0.0.1"
    DB_PORT = 3306
    DB_USER = "root"
    DB_PASSWORD = "wyzane"
    DB_NAME = "db_tornado"
    DB_OPTIONS = {
        # BUG FIX: the inner quotes were unescaped double quotes and broke
        # the string literal; single quotes keep the same SQL statement.
        "init_command": "SET sql_mode='STRICT_TRANS_TABLES'",
        "charset": "utf8mb4",
    }
3.get_rules()方法用于從rules.csv文件中讀取爬取的規(guī)則。因?yàn)檫@里同時(shí)爬取了3個(gè)不同的網(wǎng)站,由于每個(gè)網(wǎng)站解析html的xpath規(guī)則不同,并且每個(gè)網(wǎng)站提取的數(shù)據(jù)條數(shù)不同,所以把這些規(guī)則寫(xiě)到了rules.csv文件(一個(gè)可以用Excel打開(kāi)的逗號(hào)分隔文本文件)中。先讀取規(guī)則,再爬取數(shù)據(jù)。
以上就是基于asyncio的異步爬蟲(chóng)的代碼,如有錯(cuò)誤,歡迎交流指正!
文章版權(quán)歸作者所有,未經(jīng)允許請(qǐng)勿轉(zhuǎn)載,若此文章存在違規(guī)行為,您可以聯(lián)系管理員刪除。
轉(zhuǎn)載請(qǐng)注明本文地址:http://systransis.cn/yun/42252.html
摘要:這篇文章的題目有點(diǎn)大,但這并不是說(shuō)我自覺(jué)對(duì)爬蟲(chóng)這塊有多大見(jiàn)解,我只不過(guò)是想將自己的一些經(jīng)驗(yàn)付諸于筆,對(duì)于如何寫(xiě)一個(gè)爬蟲(chóng)框架,我想一步一步地結(jié)合具體代碼來(lái)講述如何從零開(kāi)始編寫(xiě)一個(gè)自己的爬蟲(chóng)框架年到如今,我花精力比較多的一個(gè)開(kāi)源項(xiàng)目算是了,這是 showImg(https://segmentfault.com/img/remote/1460000018513379); 這篇文章的題目有點(diǎn)大...
摘要:而的異步非阻塞特性能夠完美的解決這一問(wèn)題。爬蟲(chóng)機(jī)器人功能實(shí)現(xiàn)我使用編寫(xiě)的機(jī)器人是用來(lái)抓取來(lái)自游民星空的圖片。也是使用裝飾器進(jìn)行回調(diào)函數(shù)注冊(cè),使用進(jìn)行消息更新。當(dāng)沒(méi)有指令時(shí),會(huì)顯示一些能夠查看的圖片類(lèi)型。 原文鏈接 前言 aiotg 可以通過(guò)異步調(diào)用telegram api的方式來(lái)構(gòu)建bot,因?yàn)闆Q定開(kāi)發(fā)一個(gè)爬蟲(chóng)功能的bot,所以網(wǎng)絡(luò)請(qǐng)求阻塞是比較嚴(yán)重的性能障礙。而asyncio的異步非...
摘要:一般用進(jìn)程池維護(hù),的設(shè)為數(shù)量。多線(xiàn)程爬蟲(chóng)多線(xiàn)程版本可以在單進(jìn)程下進(jìn)行異步采集,但線(xiàn)程間的切換開(kāi)銷(xiāo)也會(huì)隨著線(xiàn)程數(shù)的增大而增大。異步協(xié)程爬蟲(chóng)引入了異步協(xié)程語(yǔ)法。 Welcome to the D-age 對(duì)于網(wǎng)絡(luò)上的公開(kāi)數(shù)據(jù),理論上只要由服務(wù)端發(fā)送到前端都可以由爬蟲(chóng)獲取到。但是Data-age時(shí)代的到來(lái),數(shù)據(jù)是新的黃金,毫不夸張的說(shuō),數(shù)據(jù)是未來(lái)的一切?;诮y(tǒng)計(jì)學(xué)數(shù)學(xué)模型的各種人工智能的出現(xiàn)...
摘要:開(kāi)始,加入了新的語(yǔ)法,和這兩個(gè)關(guān)鍵字,也成了標(biāo)準(zhǔn)庫(kù),這對(duì)于我們寫(xiě)異步的程序來(lái)說(shuō)就是如虎添翼,讓我們輕而易舉的實(shí)現(xiàn)一個(gè)定向抓取新聞的異步爬蟲(chóng)。網(wǎng)址池異步爬蟲(chóng)的所有流程不能單單用一個(gè)循環(huán)來(lái)完成,它是多個(gè)循環(huán)至少兩個(gè)相互作用共同完成的。 showImg(https://segmentfault.com/img/bVbsjjR?w=742&h=487); Python寫(xiě)爬蟲(chóng)是非常方便的,爬取的...
摘要:蜂鳥(niǎo)網(wǎng)圖片簡(jiǎn)介今天玩點(diǎn)新鮮的,使用一個(gè)新庫(kù),利用它提高咱爬蟲(chóng)的爬取速度。上下文不在提示,自行搜索相關(guān)資料即可創(chuàng)建一個(gè)對(duì)象,然后用該對(duì)象去打開(kāi)網(wǎng)頁(yè)??梢赃M(jìn)行多項(xiàng)操作,比如等代碼中等待網(wǎng)頁(yè)數(shù)據(jù)返回創(chuàng)建線(xiàn)程,方法負(fù)責(zé)安排執(zhí)行中的任務(wù)。 1. 蜂鳥(niǎo)網(wǎng)圖片-簡(jiǎn)介 今天玩點(diǎn)新鮮的,使用一個(gè)新庫(kù) aiohttp ,利用它提高咱爬蟲(chóng)的爬取速度。 安裝模塊常規(guī)套路 pip install aiohtt...
閱讀 2346·2021-11-23 09:51
閱讀 1152·2021-11-22 13:52
閱讀 3623·2021-11-10 11:35
閱讀 1203·2021-10-25 09:47
閱讀 3008·2021-09-07 09:58
閱讀 1073·2019-08-30 15:54
閱讀 2830·2019-08-29 14:21
閱讀 3041·2019-08-29 12:20