Python Web Scraping in Practice (Part 1): Using requests and BeautifulSoup

jokester, published 2019-07-30 15:10 / 1819 reads

Summary: Connect to the database and insert data: create a cursor, execute the statement, and commit the transaction; skip articles that already exist, roll back if an error occurs, then close the cursor and the database connection. Scheduling: a timer re-runs the scraper at a fixed interval.

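Python basics

I previously wrote 《Python 3 極簡教程.pdf》 (a minimal Python 3 tutorial), a quick start for readers who already have some programming background. After working through that series you should be able to write an API endpoint on your own and build small tools without trouble.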

requests

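requests is a Python HTTP library, comparable to Retrofit on Android. Its features include Keep-Alive and connection pooling, cookie persistence, automatic content decompression, HTTP proxies, SSL verification, connection timeouts, sessions, and more. At the time of writing it supported both Python 2 and Python 3. GitHub: https://github.com/requests/r... .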

Installation

Mac：

pip3 install requests

Windows：

pip install requests

Sending requests

The HTTP request methods covered here are GET, POST, PUT, and DELETE.

import requests

# GET request
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all")

# POST request
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert")

# PUT request
response = requests.put("http://127.0.0.1:1024/developer/api/v1.0/update")

# DELETE request
response = requests.delete("http://127.0.0.1:1024/developer/api/v1.0/delete")

Each call returns a Response object, which wraps the data the server sends back over HTTP. Its main elements include the status code, reason phrase, response headers, response URL, response encoding, and response body.

# Status code
print(response.status_code)

# Response URL
print(response.url)

# Reason phrase
print(response.reason)

# Response body parsed as JSON
print(response.json())

Custom request headers

To add HTTP headers to a request, pass a dict to the headers keyword argument.

header = {"Application-Id": "19869a66c6",
          "Content-Type": "application/json"
          }
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all/", headers=header)
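
When every request to the same API needs the same headers, one option is to set them once on a requests.Session. The sketch below is an illustration rather than part of the original code: it reuses the Application-Id value from above and only prepares the request, so nothing is sent over the network.

```python
import requests

# Headers shared by every request made through this session.
session = requests.Session()
session.headers.update({
    "Application-Id": "19869a66c6",
    "Content-Type": "application/json",
})

# prepare_request() merges the session headers into the request
# without sending it, which makes the result easy to inspect.
request = requests.Request("GET", "http://127.0.0.1:1024/developer/api/v1.0/all/")
prepared = session.prepare_request(request)

print(prepared.headers["Application-Id"])
print(prepared.headers["Content-Type"])
```

Calling session.get(...) afterwards would send these headers along with the session's defaults, such as User-Agent.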

Building query parameters

To pass data in the URL's query string, for example http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2 , use the params keyword argument and supply the parameters as a dict of strings.

payload = {"key1": "value1", "key2": "value2"}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

A list can also be passed as a value:

payload = {"key1": "value1", "key2": ["value2", "value3"]}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", params=payload)

# Response URL
print(response.url)  # prints: http://127.0.0.1:1024/developer/api/v1.0/all?key1=value1&key2=value2&key2=value3
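
To check how params is encoded without a running server, the request can be prepared instead of sent. This is a small sketch using the same placeholder URL as above; requests.Request(...).prepare() builds the final URL locally.

```python
import requests

payload = {"key1": "value1", "key2": ["value2", "value3"]}

# Build the request without sending it so the final URL can be inspected offline.
prepared = requests.Request(
    "GET",
    "http://127.0.0.1:1024/developer/api/v1.0/all",
    params=payload,
).prepare()

print(prepared.url)
```

Each item in the list becomes its own key2=... pair in the query string.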

POST request data

If the server expects form data, pass it with the data keyword argument.

payload = {"key1": "value1", "key2": "value2"}
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert", data=payload)

If the server expects a JSON body, use the json keyword argument instead; the value can be passed as a dict.

obj = {
    "article_title": "小公務員之死2"
}
response = requests.post("http://127.0.0.1:1024/developer/api/v1.0/insert", json=obj)
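
The difference between data and json shows up in the request body and the Content-Type header. The following sketch prepares both kinds of request without sending them; the article title is a made-up placeholder.

```python
import json

import requests

url = "http://127.0.0.1:1024/developer/api/v1.0/insert"

# data= encodes the dict as an HTML form body.
form_request = requests.Request(
    "POST", url, data={"key1": "value1", "key2": "value2"}
).prepare()

# json= serializes the dict to JSON and sets the matching Content-Type.
json_request = requests.Request(
    "POST", url, json={"article_title": "Example title"}
).prepare()

print(form_request.headers["Content-Type"], form_request.body)
print(json_request.headers["Content-Type"], json.loads(json_request.body))
```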

Response content

Requests automatically decodes content from the server. Most Unicode character sets are decoded seamlessly. After a request is made, Requests makes an educated guess about the response encoding based on the HTTP headers.

# Response content
# Body decoded to str (text is a property, not a method)
# print(response.text)
# Body parsed as JSON
print(response.json())
# Body as raw bytes (content is a property too)
# print(response.content)
# Raw response stream; requires stream=True on the original request
# response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", stream=True)
# print(response.raw)

Timeouts

Unless a timeout value is specified explicitly, requests does not time out on its own. If the server never responds, the whole program stays blocked and cannot handle anything else.

response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", timeout=5)  # in seconds
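
A timeout only helps if the resulting exception is handled. The helper below is a minimal sketch, not code from the original post: fetch is a hypothetical name, and the demonstration targets a local port with nothing listening so that the failure path runs quickly.

```python
import socket

import requests


def fetch(url, timeout=(3.05, 10)):
    """Return the Response, or None if the request fails or times out.

    timeout is (connect timeout, read timeout) in seconds.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx as failures too
        return response
    except requests.exceptions.Timeout:
        print("timed out: %s" % url)
    except requests.exceptions.RequestException as e:
        print("request failed: %s" % e)
    return None


# Ask the OS for a free port, then release it so nothing is listening there.
with socket.socket() as s:
    s.bind(("127.0.0.1", 0))
    unused_port = s.getsockname()[1]

result = fetch("http://127.0.0.1:%d/" % unused_port, timeout=2)
print(result)
```

A scraper that runs on a schedule can then skip the current round when fetch returns None instead of crashing.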

Proxy settings

If you hit the same site frequently, the server can easily block you. requests supports proxies through the proxies keyword argument.

# Proxies
proxies = {
    "http": "http://127.0.0.1:1024",
    "https": "http://127.0.0.1:4000",
}
response = requests.get("http://127.0.0.1:1024/developer/api/v1.0/all", proxies=proxies)

BeautifulSoup

BeautifulSoup is a Python HTML parsing library, comparable to jsoup in Java.

Installation

BeautifulSoup 3 is no longer being developed, so use BeautifulSoup 4 directly.

Mac：

pip3 install beautifulsoup4

Windows：

pip install beautifulsoup4

Installing a parser

I use html5lib, which is implemented in pure Python.

Mac：

pip3 install html5lib

Windows：

pip install html5lib

Basic usage

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object.

Parsing

from bs4 import BeautifulSoup

def get_html_data():
    html_doc = """
    <html>
    <head>
    <title>WuXiaolong</title>
    </head>
    <body>
    <p>分享 Android 技術，也關注 Python 等熱門技術。</p>
    <p>寫博客的初衷：總結經驗，記錄自己的成長。</p>
    <p>你必須足夠的努力，才能看起來毫不費力！專注！精致！</p>
    <p><a href="http://wuxiaolong.me/">WuXiaolong's blog</a></p>
    <p class="WeChat">公眾號：吳小龍同學</p>
    <p>GitHub</p>
    </body>
    </html>
    """
    soup = BeautifulSoup(html_doc, "html5lib")

tag

tag = soup.head
print(tag)  # the whole <head> element, including <title>WuXiaolong</title>
print(tag.name)  # head
print(tag.title)  # <title>WuXiaolong</title>
print(soup.p)  # <p>分享 Android 技術，也關注 Python 等熱門技術。</p>
print(soup.a["href"])  # href attribute of the first <a> tag: http://wuxiaolong.me/

Note: when several tags match, the first one is returned, as with the p tag here.

Searching

print(soup.find("p"))  # <p>分享 Android 技術，也關注 Python 等熱門技術。</p>

find also returns the first matching tag by default, or None if no node matches. To target a specific element, such as the WeChat official account line here, filter on an attribute value such as class:

# class is a Python keyword, so the argument is spelled class_.
print(soup.find("p", class_="WeChat"))
# prints the <p class="WeChat"> element: 公眾號：吳小龍同學

Find all p tags:

for p in soup.find_all("p"):
    print(p.string)
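
One thing to watch for in the loop above: .string only returns text when a tag has a single child, and returns None otherwise. get_text() collects all the text inside a tag. The sketch below uses made-up markup and the built-in "html.parser" so it runs without html5lib installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <p> mixes a link with plain text.
html = '<p>plain text</p><p class="WeChat"><a href="#">link</a> and more</p>'
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("p")
print(first.string)                      # the only child is a string
print(second.string)                     # None: the tag has two children
print(second.get_text(" ", strip=True))  # all text, joined with spaces
```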

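Hands-on project

A while ago a user reported that my personal app was down. I no longer maintain the app, but I should at least keep it running. Most people know that the app's data is scraped (see 《手把手教你做個人app》, a step-by-step guide to building a personal app). One advantage of scraped data is that I do not have to manage the data myself; the drawback is that when the source site goes down or its HTML nodes change, my parsing breaks and the app has no data. This report made me consider scraping the site's data into my own store, which was also a chance to practice the Python I had been teaching myself. My original plan was to scrape with Python, insert into a local MySQL database, write my own API with Flask, call it from Android with Retrofit, and then insert into bmob with the bmob SDK. That felt far too laborious to be workable. I then learned that bmob provides a RESTful API, which solves the main problem: the Python scraper can insert the data directly. This post demonstrates inserting into a local database; with bmob you would call its RESTful API to insert the data instead.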

Choosing a site

The demo site I chose is https://meiriyiwen.com/random . Every request returns a different article, which is exactly what I need: I just request the page on a schedule, parse the data I want, and insert it into the database.

Creating the database

I created it with Navicat Premium; the command line works too.

Creating the table

The article table is created with pymysql and needs the fields id, article_title, article_author, and article_content. The code is below and only needs to run once.

import pymysql


def create_table():
    # Open the connection
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="root",
                         database="python3learn")
    # SQL statement that creates the article table
    sql = """create table if not exists article (
    id int NOT NULL AUTO_INCREMENT,
    article_title text,
    article_author text,
    article_content text,
    PRIMARY KEY (`id`)
    )"""
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit the transaction
        db.commit()
        print("create table success")
    except Exception as e:  # roll back if an error occurs
        db.rollback()
        print(e)

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


if __name__ == "__main__":
    create_table()

Parsing the site

First request the page with requests, then use BeautifulSoup to extract the nodes you need.

import requests
from bs4 import BeautifulSoup


def get_html_data():
    # GET request; the timeout keeps a stalled server from blocking the scraper
    response = requests.get("https://meiriyiwen.com/random", timeout=10)

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id="article_show")
    article_title = article.h1.string
    print("article_title=%s" % article_title)
    article_author = article.find("p", class_="article_author").string
    print("article_author=%s" % article_author)
    # Join the HTML of every paragraph in the article body
    article_contents = article.find("div", class_="article_text").find_all("p")
    article_content = ""
    for content in article_contents:
        article_content = article_content + str(content)
    print("article_content=%s" % article_content)
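
Because the scraper breaks whenever the site's HTML changes, it can help to separate parsing from fetching so the parser can be exercised on a saved or hand-written page. The sketch below assumes the same ids and classes as the code above; parse_article and the sample markup are hypothetical, and "html.parser" is used only so the example runs without html5lib.

```python
from bs4 import BeautifulSoup


def parse_article(html):
    """Extract title, author, and content from an article page.

    Returns None when the expected nodes are missing.
    """
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("div", id="article_show")
    if article is None or article.h1 is None:
        return None
    author = article.find("p", class_="article_author")
    text = article.find("div", class_="article_text")
    paragraphs = text.find_all("p") if text is not None else []
    return {
        "title": article.h1.get_text(strip=True),
        "author": author.get_text(strip=True) if author is not None else "",
        "content": "".join(str(p) for p in paragraphs),
    }


# Hypothetical sample page using the same ids and classes as the code above.
sample = """
<div id="article_show">
  <h1>Sample title</h1>
  <p class="article_author"><span>Sample author</span></p>
  <div class="article_text"><p>First.</p><p>Second.</p></div>
</div>
"""
parsed = parse_article(sample)
missing = parse_article("<html><body>layout changed</body></html>")
print(parsed)
print(missing)
```

Returning None on a layout change lets the caller skip the insert instead of raising AttributeError on a missing node.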

Inserting into the database

There is one filter here: assuming article titles on this site are unique, a row is not inserted if the same title already exists.

import pymysql


def insert_table(article_title, article_author, article_content):
    # Open the connection
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="root",
                         database="python3learn",
                         charset="utf8")
    # Look up an existing row with the same title, then insert
    query_sql = "select * from article where article_title=%s"
    sql = "insert into article (article_title,article_author,article_content) values (%s, %s, %s)"
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # Execute the lookup
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # Commit the transaction
            db.commit()
            print("--------------《%s》 insert table success-------------" % article_title)
            return True
        else:
            print("--------------《%s》 already exists-------------" % article_title)
            return False

    except Exception as e:  # roll back if an error occurs
        db.rollback()
        print(e)
        return False

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()
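
The select-then-insert check above needs two statements and could race if two scrapers ran at once. An alternative is to let the database enforce uniqueness. The sketch below swaps in the standard library's sqlite3 with an in-memory database purely so that it runs without a MySQL server; it is not the pymysql code from this post, and insert_article is a hypothetical name. With MySQL the rough equivalent would be a unique index on the title plus INSERT IGNORE, and a TEXT column would need a prefix length (or a VARCHAR type) for that index.

```python
import sqlite3


def insert_article(db, title, author, content):
    """Insert the article unless the title already exists; return True if inserted."""
    cursor = db.execute(
        "INSERT OR IGNORE INTO article (article_title, article_author, article_content) "
        "VALUES (?, ?, ?)",
        (title, author, content),
    )
    db.commit()
    # rowcount is 1 for a new row and 0 when the UNIQUE constraint skipped it.
    return cursor.rowcount == 1


db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE IF NOT EXISTS article (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        article_title TEXT UNIQUE,
        article_author TEXT,
        article_content TEXT
    )"""
)

first = insert_article(db, "Sample title", "Sample author", "<p>text</p>")
second = insert_article(db, "Sample title", "Sample author", "<p>text</p>")
count = db.execute("SELECT COUNT(*) FROM article").fetchone()[0]
print(first, second, count)
db.close()
```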

Scheduling

A timer re-runs the scrape at a fixed interval.

import sched
import time


# Create a scheduler from the sched module.
# The first argument is a function that returns a timestamp; the second blocks until the next event is due.
schedule = sched.scheduler(time.time, time.sleep)


# Function run on every scheduled tick
def print_time(inc):
    # Do the actual work here
    print("to do something")
    # Re-schedule itself so it runs again after inc seconds
    schedule.enter(inc, 0, print_time, (inc,))


# Default interval: 60 s
def start(inc=60):
    # enter() takes four arguments: the delay, the priority (orders events due at the same time),
    # the function to call, and that function's arguments (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()


if __name__ == "__main__":
    # Print every 5 s
    start(5)
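
The loop above runs until the process is killed. A variation that stops after a fixed number of runs is easier to try out; the sketch below also injects the clock, so the demonstration finishes instantly. run_periodically and FakeClock are hypothetical helpers for illustration; a real scraper would pass time.time and time.sleep instead.

```python
import sched


def run_periodically(task, interval, max_runs, timefunc, delayfunc):
    """Run task every `interval` seconds, max_runs times in total."""
    scheduler = sched.scheduler(timefunc, delayfunc)
    runs = 0

    def tick():
        nonlocal runs
        task()
        runs += 1
        # Re-schedule only while there are runs left, so run() can return.
        if runs < max_runs:
            scheduler.enter(interval, 0, tick)

    scheduler.enter(0, 0, tick)
    scheduler.run()
    return runs


class FakeClock:
    """Stand-in for time.time/time.sleep that advances instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock()
timestamps = []
total = run_periodically(lambda: timestamps.append(clock.time()), 5, 3,
                         clock.time, clock.sleep)
print(total, timestamps)
```

Note that an exception raised inside the task propagates out of scheduler.run() and ends the schedule, so a long-running scraper may want a try/except around the task.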

Complete code

import pymysql
import requests
from bs4 import BeautifulSoup
import sched
import time


def create_table():
    # Open the connection
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="root",
                         database="python3learn")
    # SQL statement that creates the article table
    sql = """create table if not exists article (
    id int NOT NULL AUTO_INCREMENT,
    article_title text,
    article_author text,
    article_content text,
    PRIMARY KEY (`id`)
    )"""
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit the transaction
        db.commit()
        print("create table success")
    except Exception as e:  # roll back if an error occurs
        db.rollback()
        print(e)

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


def insert_table(article_title, article_author, article_content):
    # Open the connection
    db = pymysql.connect(host="localhost",
                         user="root",
                         password="root",
                         database="python3learn",
                         charset="utf8")
    # Look up an existing row with the same title, then insert
    query_sql = "select * from article where article_title=%s"
    sql = "insert into article (article_title,article_author,article_content) values (%s, %s, %s)"
    # Create a cursor object with cursor()
    cursor = db.cursor()
    try:
        query_value = (article_title,)
        # Execute the lookup
        cursor.execute(query_sql, query_value)
        results = cursor.fetchall()
        if len(results) == 0:
            value = (article_title, article_author, article_content)
            cursor.execute(sql, value)
            # Commit the transaction
            db.commit()
            print("--------------《%s》 insert table success-------------" % article_title)
            return True
        else:
            print("--------------《%s》 already exists-------------" % article_title)
            return False

    except Exception as e:  # roll back if an error occurs
        db.rollback()
        print(e)
        return False

    finally:
        # Close the cursor
        cursor.close()
        # Close the database connection
        db.close()


def get_html_data():
    # GET request; the timeout keeps a stalled server from blocking the scraper
    response = requests.get("https://meiriyiwen.com/random", timeout=10)

    soup = BeautifulSoup(response.content, "html5lib")
    article = soup.find("div", id="article_show")
    article_title = article.h1.string
    print("article_title=%s" % article_title)
    article_author = article.find("p", class_="article_author").string
    print("article_author=%s" % article_author)
    # Join the HTML of every paragraph in the article body
    article_contents = article.find("div", class_="article_text").find_all("p")
    article_content = ""
    for content in article_contents:
        article_content = article_content + str(content)
    print("article_content=%s" % article_content)

    # Insert into the database
    insert_table(article_title, article_author, article_content)


# Create a scheduler from the sched module.
# The first argument is a function that returns a timestamp; the second blocks until the next event is due.
schedule = sched.scheduler(time.time, time.sleep)


# Function run on every scheduled tick
def print_time(inc):
    get_html_data()
    schedule.enter(inc, 0, print_time, (inc,))


# Default interval: 60 s
def start(inc=60):
    # enter() takes four arguments: the delay, the priority (orders events due at the same time),
    # the function to call, and that function's arguments (as a tuple)
    schedule.enter(0, 0, print_time, (inc,))
    schedule.run()
    schedule.run()


if __name__ == "__main__":
    start(60*5)

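Open question: this scrapes a single article. How should a site with an article list, where clicking an entry opens the article detail, be scraped? Presumably fetch the list first, then loop over the entries, parsing each detail page and inserting it into the database. I have not worked out the best approach yet, so I will leave it for a later post.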
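
One possible starting point for the list case, offered only as a sketch: collect the detail-page URLs from the list page first, then fetch and parse each one in a loop. The markup, the article_item class, and the example.com URLs below are all invented for illustration; a real site would need its own selectors.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_article_links(list_html, base_url):
    """Return absolute, de-duplicated article URLs found on a list page."""
    soup = BeautifulSoup(list_html, "html.parser")
    links = []
    # Hypothetical selector: assumes each entry is <li class="article_item"><a href=...>.
    for item in soup.find_all("li", class_="article_item"):
        anchor = item.find("a", href=True)
        if anchor is None:
            continue
        # Resolve relative hrefs against the list page's URL.
        url = urljoin(base_url, anchor["href"])
        if url not in links:
            links.append(url)
    return links


sample_list = """
<ul>
  <li class="article_item"><a href="/article/1">One</a></li>
  <li class="article_item"><a href="/article/2">Two</a></li>
  <li class="article_item"><a href="/article/1">One again</a></li>
</ul>
"""
links = extract_article_links(sample_list, "https://example.com/list")
print(links)
```

Each returned URL could then be requested, parsed, and passed to insert_table, ideally with a short pause between requests so the site is not hammered.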

Closing

Python is purely a hobby for me, but I still want to put what I learn to use, otherwise it is quickly forgotten. Look out for the next Python post.

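Copyright belongs to the author; do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, cite this article's address: http://systransis.cn/yun/41082.html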
