
Scraping every answer under Zhihu's "Versailles quotes" question (I know you look sharp for clicking in, but still not as sharp as me)

fevin / 2653 reads


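Versailles literature has gone viral. This peculiar style of online writing, often seen on WeChat Moments or Weibo, shows off wealth or a happy relationship in a calm, seemingly offhand tone.
Ordinary showing off is just posting photos of a sports car on social media, or casually letting the logo of a designer bag slip into the frame; Versailles literature is less direct than that. One Weibo blogger even made a tutorial video on Versailles literature that explains its three essential elements.

On Douban there is also a group called the Versailles Studies Research Group, whose members define Versailles as "a spirit of performing a high-class life". With that, on to the main topic: today we quickly scrape the Zhihu answers about Versailles quotes. Let's begin.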

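1. The site to scrape

Search Zhihu for "Versailles quotes"; the second result fits best, so we use that one.

Opening it shows that this question has 393 answers in total.

URL: https://www.zhihu.com/question/429548386/answer/1575062220

Drop answer and everything after it to get the URL of the question to scrape: https://www.zhihu.com/question/429548386. The trailing number is the question id, which uniquely identifies a question on Zhihu.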
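The step of dropping the answer part and keeping the numeric id can be sketched as follows. This is a minimal illustration and not part of the original script: the helper name `extract_question_id` is hypothetical, and it assumes the URL follows the `/question/<digits>` pattern shown above.

```python
import re


def extract_question_id(url):
    """Return the numeric question id from a Zhihu question or answer URL.

    Hypothetical helper: assumes the URL contains "/question/<digits>".
    """
    match = re.search(r"/question/(\d+)", url)
    if match is None:
        raise ValueError("no question id found in: " + url)
    return match.group(1)


answer_url = "https://www.zhihu.com/question/429548386/answer/1575062220"
question_id = extract_question_id(answer_url)
# Rebuild the plain question URL from the id.
question_url = "https://www.zhihu.com/question/" + question_id
print(question_id)
print(question_url)
```

Returning the id as a string keeps it ready for the string concatenation used when building URLs.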

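2. Scraping the answers to the question

Looking at the URL above, there are two parts of data to scrape:

  1. The question details, including creation time, number of followers, view count, and the question description
  2. The answers, including each author's username and follower count, plus the answer's content, publication time, comment count, and upvote count

The question details can be scraped directly from the URL above, using bs4 to parse the page. The answers have to be fetched through the API link built by the function below, where the number of answers per page and the starting offset select each page, much like scraping paginated content.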

def init_url(question_id, limit, offset):
    base_url_start = "https://www.zhihu.com/api/v4/questions/"
    base_url_end = "/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit={0}&offset={1}".format(limit, offset)
    return base_url_start + question_id + base_url_end

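Set the number of answers per page to limit=20, so offset takes the values 0, 20, 40, and so on; question_id is the number at the end of the URL mentioned above, here 429548386. With the logic clear, what remains is writing the crawler to fetch the data. The complete crawler code follows; to run it you only need to change the question id.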
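The paging arithmetic can be sketched on its own, without any network access. The helper name `page_offsets` is hypothetical and not part of the original script; it only lists the offset values that a given answer count and page size call for.

```python
def page_offsets(answer_count, limit=20):
    """Return the offset of every page needed to cover answer_count answers."""
    return list(range(0, answer_count, limit))


# The question in this article has 393 answers and uses limit=20.
offsets = page_offsets(393, limit=20)
print(len(offsets))   # number of requests needed
print(offsets[:3])    # first few offsets
print(offsets[-1])    # offset of the last, partially filled page
```

For 393 answers this gives 20 requests, with the last page starting at offset 380 and holding the remaining 13 answers.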

3. Complete code

# Import the required libraries
import json
import random
import re
import time
import warnings
from datetime import datetime
from time import sleep

import pandas as pd
import requests
from bs4 import BeautifulSoup

warnings.filterwarnings("ignore")


def get_ua():
    """
    Pick a random User-Agent from the UA pool.
    :return: a random User-Agent string from the pool
    """
    ua_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
        "Opera/8.0 (Windows NT 5.1; U; en)",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Macintosh; U; IntelMac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1Safari/534.50",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"]
    return random.choice(ua_list)


def filter_emoij(text):
    """
    Strip emoji characters (code points outside the Basic Multilingual Plane).
    @param text:
    @return:
    """
    try:
        co = re.compile("[\U00010000-\U0010ffff]")
    except re.error:
        # Fallback for builds that represent such characters as surrogate pairs
        co = re.compile("[\uD800-\uDBFF][\uDC00-\uDFFF]")
    text = co.sub("", text)
    return text


def get_question_base_info(url):
    """
    Fetch the details of the question.
    @param url:
    @return:
    """
    response = requests.get(url=url, headers={"User-Agent": get_ua()}, timeout=10)
    # Fetch the page and parse it
    soup = BeautifulSoup(response.text, "lxml")
    # Question title
    title = soup.find("h1", {"class": "QuestionHeader-title"}).text
    # Question description
    question = ""
    try:
        question = soup.find("div", {"class": "QuestionRichText--collapsed"}).text.replace("\u200b", "")
    except Exception as e:
        print(e)
    # Number of followers
    follower = int(soup.find_all("strong", {"class": "NumberBoard-itemValue"})[0].text.strip().replace(",", ""))
    # Number of views
    watched = int(soup.find_all("strong", {"class": "NumberBoard-itemValue"})[1].text.strip().replace(",", ""))
    # Number of answers
    answer_str = soup.find_all("h4", {"class": "List-headerText"})[0].span.text.strip()
    # Extract the number from the "xxx answers" text: first run of one or more digits
    answer_count = int(re.search(r"\d+", answer_str.replace(",", "")).group())
    # Question tags
    tag_list = []
    tags = soup.find_all("div", {"class": "QuestionTopic"})
    for tag in tags:
        tag_list.append(tag.text)
    return title, question, follower, watched, answer_count, tag_list


def init_url(question_id, limit, offset):
    """
    Build the URL for one page of answers.
    @param question_id:
    @param limit:
    @param offset:
    @return:
    """
    base_url_start = "https://www.zhihu.com/api/v4/questions/"
    base_url_end = ("/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed"
                    "%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by"
                    "%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count"
                    "%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info"
                    "%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting"
                    "%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B"
                    "%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics"
                    "&limit={0}&offset={1}").format(limit, offset)
    return base_url_start + str(question_id) + base_url_end


def get_time_str(timestamp):
    """
    Convert a timestamp to a standard date string.
    @param timestamp:
    @return:
    """
    datetime_str = ""
    try:
        # Timestamp to datetime
        datetime_time = datetime.fromtimestamp(timestamp)
        # datetime to date string
        datetime_str = datetime_time.strftime("%Y-%m-%d %H:%M:%S")
    except Exception as e:
        print(e)
        print("Date conversion error")
    return datetime_str


def get_answer_info(url, index):
    """
    Parse one page of answers.
    @param url:
    @param index:
    @return:
    """
    response = requests.get(url=url, headers={"User-Agent": get_ua()}, timeout=10)
    text = response.text.replace("\u200b", "")
    per_answer_list = []
    try:
        question_json = json.loads(text)
        # Answers on the current page
        print("Scraping page {0} of answers, got {1} answers on this page".format(index + 1, len(question_json["data"])))
        for data in question_json["data"]:
            # Question: type, id, question type, creation time, update time
            question_type = data["question"]["type"]
            question_id = data["question"]["id"]
            question_question_type = data["question"]["question_type"]
            question_created = get_time_str(data["question"]["created"])
            question_updated_time = get_time_str(data["question"]["updated_time"])
            # Author: username, headline, gender, follower count
            author_name = data["author"]["name"]
            author_headline = data["author"]["headline"]
            author_gender = data["author"]["gender"]
            author_follower_count = data["author"]["follower_count"]
            # Answer: id, creation time, update time, upvotes, comments, content
            answer_id = data["id"]
            created_time = get_time_str(data["created_time"])
            updated_time = get_time_str(data["updated_time"])
            voteup_count = data["voteup_count"]
            comment_count = data["comment_count"]
            content = data["content"]
            per_answer_list.append([question_type, question_id, question_question_type, question_created,
                                    question_updated_time, author_name, author_headline, author_gender,
                                    author_follower_count, answer_id, created_time, updated_time, voteup_count,
                                    comment_count, content])
    except (ValueError, KeyError, TypeError) as e:
        # json.JSONDecodeError is a subclass of ValueError
        print("Failed to parse the JSON response:", e)
    answer_column = ["question_type", "question_id", "question_question_type", "question_created",
                     "question_updated_time", "author_name", "author_headline", "author_gender",
                     "author_follower_count", "answer_id", "answer_created_time", "answer_updated_time",
                     "answer_voteup_count", "answer_comment_count", "answer_content"]
    return pd.DataFrame(per_answer_list, columns=answer_column)


if __name__ == "__main__":
    # question_id = "424516487"
    question_id = "429548386"
    url = "https://www.zhihu.com/question/" + question_id
    # Fetch the question details
    title, question, follower, watched, answer_count, tag_list = get_question_base_info(url)
    print("Question url: " + url)
    print("Question title: " + title)
    print("Question description: " + question)
    print("Tags on this question: " + ", ".join(tag_list))
    print("Followers: {0}, viewed by {1} people".format(follower, watched))
    print("As of {}, this question has {} answers".format(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()), answer_count))
    # Fetch the answers page by page
    limit, offset = 20, 0
    page_cnt = (answer_count + limit - 1) // limit
    pages = []
    for page_index in range(page_cnt):
        # Build the URL
        answer_url = init_url(question_id, limit, offset + page_index * limit)
        # Fetch the data
        data_per_page = get_answer_info(answer_url, page_index)
        pages.append(data_per_page)
        sleep(3)
    # DataFrame.append was removed in pandas 2.0, so the pages are combined with pd.concat
    answer_data = pd.concat(pages, ignore_index=True) if pages else pd.DataFrame()
    answer_data.to_csv("versailles_quotes_{0}.csv".format(question_id), encoding="utf-8", index=False)
    print("\nScraping finished, data saved!")
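To check the parsing logic without hitting the network, a single record can be flattened offline. The sketch below is an illustration only: `flatten_answer` is a hypothetical helper, the sample record is made up, and it assumes the API returns the keys that the script reads (`author`, `created_time`, `voteup_count`, and so on). It formats the timestamp in UTC so the result does not depend on the local time zone, whereas the script uses local time.

```python
from datetime import datetime, timezone


def flatten_answer(data):
    """Pick a few fields of one answer record into a flat dict.

    Hypothetical helper: assumes the record has the keys the script reads.
    """
    created = datetime.fromtimestamp(data["created_time"], tz=timezone.utc)
    return {
        "author_name": data["author"]["name"],
        "author_follower_count": data["author"]["follower_count"],
        "answer_id": data["id"],
        "created_time": created.strftime("%Y-%m-%d %H:%M:%S"),
        "voteup_count": data["voteup_count"],
        "comment_count": data["comment_count"],
        "content": data["content"],
    }


# Made-up record for illustration only; not real Zhihu data.
sample = {
    "id": 1,
    "created_time": 0,
    "voteup_count": 10,
    "comment_count": 2,
    "content": "<p>example</p>",
    "author": {"name": "someone", "follower_count": 5},
}
row = flatten_answer(sample)
print(row["created_time"])
```

A record missing one of these keys raises KeyError, which is the kind of failure the script's try/except around the parsing loop is there to catch.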

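4. Results

In total 393 answers were scraped. Note that the output file is saved as UTF-8; if you see garbled text when reading it, first check that the encoding you read with matches.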
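Reading the file back with an explicit encoding avoids the mismatch. The sketch below uses only the standard library and a throwaway file with made-up content; the helper name `read_rows` and the file name are hypothetical. With pandas, `pd.read_csv` accepts an `encoding` argument in the same way.

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Read a CSV file into a list of dicts using an explicit encoding."""
    with open(path, "r", encoding=encoding, newline="") as f:
        return list(csv.DictReader(f))


# Throwaway file with made-up content, only to demonstrate the round trip.
path = os.path.join(tempfile.mkdtemp(), "answers.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["author_name", "content"])
    writer.writerow(["someone", "凡尔赛"])

# Reading with the same encoding that was used for writing keeps the text intact.
rows = read_rows(path, encoding="utf-8")
print(rows[0]["content"])
```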

A partial screenshot of the scraped results:


Thanks for reading this far. For more Python content, follow me and check my profile; your likes, favorites, and comments are what keep me updating. Thank you.
