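I originally planned to build a community site for WeChat mini programs, so I wrote a crawler to scrape mini program listings. Later I decided that mini programs had no future, so I abandoned that project and built the current site, 觀點, instead. The code was just sitting there unused, so I might as well make it public. I am posting it here; anyone who needs it can copy it and use it.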
#!/usr/bin/env python
# coding: utf-8
# Python 2 script: crawls the mini program listing on www.wechat-cloud.com
# and stores the results in MySQL.
__author__ = "haoning"

import time
import urllib2
import datetime
import requests
import json
import random
import sys
import platform
import uuid
reload(sys)
sys.setdefaultencoding( "utf-8" )
import re
import os
import MySQLdb as mdb
from bs4 import BeautifulSoup

DB_HOST = "127.0.0.1"
DB_USER = "root"
DB_PASS = "root"

#init database
conn = mdb.connect(DB_HOST, DB_USER, DB_PASS, "pybbs-springboot", charset="utf8")
conn.autocommit(False)
curr = conn.cursor()

count=0
how_many=0

base_url="http://www.wechat-cloud.com"
url=base_url+"/index.php?s=/home/article/ajax_get_list.html&category_id={category_id}&page={page}&size={size}"

user_agents = [
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11",
    "Opera/9.25 (Windows NT 5.1; U; en)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12",
    "Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
    "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 ",
]

def fake_header():
    # Build browser-like request headers with a randomly chosen User-Agent.
    agent=random.choice(user_agents)
    cookie="PHPSESSID=p5mokvec7ct1gqe9efcnth9d44; Hm_lvt_c364957e96174b029f292041f7d822b7=1487492811,1487556626; Hm_lpvt_c364957e96174b029f292041f7d822b7=1487564069"
    req_header = {
        "Accept":"application/json, text/javascript, */*; q=0.01",
        #"Accept-Encoding":"gzip, deflate, sdch",
        "Accept-Language":"zh-CN,zh;q=0.8",
        "Cache-Control":"max-age=0",
        "Connection":"keep-alive",
        "Host":"www.wechat-cloud.com",
        #"Cookie":cookie,
        "Referer":"http://www.wechat-cloud.com/index.php?s=/home/index/index.html",
        "Upgrade-Insecure-Requests":"1",
        "User-Agent":agent,
        "X-Requested-With":"XMLHttpRequest",
    }
    return req_header

def gethtml(url):
    # Fetch a page with the fake headers; return None on any error.
    try:
        header=fake_header()
        req = urllib2.Request(url,headers=header)
        response = urllib2.urlopen(req, None,15)
        html = response.read()
        return html
    except Exception as e:
        print "e",e
    return None

def get_img_data(url):
    try:
        # Add header info to mimic a browser and work around 403 Forbidden
        # responses (note: no headers are actually set on this request)
        req = urllib2.Request(url)
        response = urllib2.urlopen(req, None,15)
        dataimg = response.read()
        return dataimg
    except Exception as e:
        print "image data",e
    return None

def makeDateFolder(par,classify):
    # Create <par>//<classify>//<YYYY-MM-DD> if needed and return its path.
    try:
        if os.path.isdir(par):
            newFolderName=par + "//" + str(classify)+ "//" +GetDateString()
            if not os.path.isdir( newFolderName ):
                os.makedirs( newFolderName )
            return newFolderName
        else:
            return par
    except Exception,e:
        print "kk",e
        return par

def map_folder(what):
    return what

def GetDateString():
    when=time.strftime("%Y-%m-%d",time.localtime(time.time()))
    foldername = str(when)
    return foldername

def get_extension(name):
    # Return the file extension including the dot, or "#" if there is none.
    where=name.rfind(".")
    if where!=-1:
        return name[where:len(name)]
    return "#"

def download_img(url,what):
    # Save the image under the storage root; return its relative path or None.
    try:
        #print url
        extention=get_extension(url)
        dataimg=get_img_data(url)
        name=str(uuid.uuid1()).replace("-","")+"-www.weixinapphome.com"
        #print "name",name
        classfiy_folder=map_folder(what)
        top="E://wxapp_store"
        filename =makeDateFolder(top,classfiy_folder)+"//"+name+extention
        try:
            if not os.path.exists(filename):
                file_object = open(filename,"w+b")
                file_object.write(dataimg)
                file_object.close()
                return classfiy_folder+"/"+GetDateString()+"/"+name+extention
            else:
                print "file exist"
                return None
        except IOError,e1:
            print "e1=",e1
            #pass
        return None # If the download failed, fall back to the original site's link
    except Exception,e:
        print "problem",e
        pass
    return None

def work():
    page=0
    global how_many
    while 1:
        try:
            page=page+1
            begin_url=url.format(category_id=0, page=page,size=12).encode("utf-8")
            html=gethtml(begin_url)
            if html is not None:
                #print html
                json_results=json.loads(html)
                is_end=json_results["isEnd"]
                if str(is_end)=="True":
                    break
                results=json_results["list"]
                for result in results:
                    href=result["href"]
                    detail_url=base_url+href
                    #print detail_url
                    detail_html=gethtml(detail_url)
                    if detail_html is not None:
                        soup = BeautifulSoup(detail_html, "html.parser")
                        icon_url=base_url+soup.find("div",{"class":"icon fl"}).find("img").get("src")
                        name=soup.find("div",{"class":"cont fl"}).find("h2").text
                        classify=soup.find("div",{"class":"tab"}).find("span").text
                        # Strip the "Category:" label; the literal must match the page text exactly
                        classify=str(classify).replace("分類: ","")
                        #print classify
                        barcode_path=base_url+soup.find("div",{"id":"install-code"}).find("img").get("src")
                        view_num=soup.find("span",{"class":"views"}).text
                        #view_num=filter(str.isalnum,str(view_num))
                        pic_path=base_url+soup.find("div",{"class":"img-box"}).find("img").get("src")
                        temp = time.time()
                        x = time.localtime(float(temp))
                        acq_time = time.strftime("%Y-%m-%d %H:%M:%S",x) # get time now
                        # Skip detail pages that are already stored
                        curr.execute("select id from pybbs_wxapp_store where `from`=%s",(detail_url,))
                        y= curr.fetchone()
                        if not y:
                            y1=download_img(icon_url,"icon")
                            y2=download_img(barcode_path,"barcode")
                            y3=download_img(pic_path,"pic")
                            if (y1 is not None) and (y2 is not None) and (y3 is not None):
                                name=name
                                author=None
                                classify=classify
                                describe=None
                                view_num=view_num
                                #print view_num
                                logo=y1
                                _from=detail_url
                                barcode=y2
                                acq_time=acq_time
                                hot_weight=-9999
                                pic_uuid=str(uuid.uuid1()).replace("-","")
                                pic_path=y3
                                #print name,author,classify,describe,view_num,logo,_from,barcode,acq_time,hot_weight,pic_uuid
                                curr.execute("INSERT INTO pybbs_wxapp_store(name,author,classify,`describe`,view_num,logo,`from`,barcode,acq_time,hot_weight,pic_path)VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",(name,author,classify,describe,view_num,logo,_from,barcode,acq_time,hot_weight,pic_path))
                                # Record the category if it has not been seen before
                                curr.execute("select id from pybbs_wxapp_classify where `classify_name`=%s",(classify,))
                                yx= curr.fetchone()
                                if not yx:
                                    describe=None
                                    temp = time.time()
                                    x = time.localtime(float(temp))
                                    record_time = time.strftime("%Y-%m-%d %H:%M:%S",x) # get time now
                                    curr.execute("INSERT INTO pybbs_wxapp_classify(classify_name,`describe`,record_time)VALUES(%s,%s,%s)",(classify,describe,record_time))
                                how_many+=1
                                print "new comer:",pic_uuid,">>",how_many
                                if how_many % 10==0:
                                    conn.commit()
                conn.commit()
        except Exception as e:
            print "while error",e

if __name__ == "__main__":
    i=3
    while i>0:
        work()
        i=i-1
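Change some of the parameters to your own values, such as the database password and the drive where the images are stored. You also need to create the database tables yourself. All of this is simple enough that there is not much more to say.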
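For readers who want a starting point for the tables, here is a minimal sketch of a schema. The table and column names are taken from the INSERT statements in the crawler; the `id` columns, every column type and length, and the helper name `create_tables` are assumptions for illustration, not the author's actual schema.

```python
# Hypothetical schema: column names come from the crawler's INSERT statements,
# but every type, length, and the auto-increment id are assumptions.
CREATE_STORE_TABLE = """
CREATE TABLE IF NOT EXISTS pybbs_wxapp_store (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    author VARCHAR(255),
    classify VARCHAR(64),
    `describe` TEXT,
    view_num VARCHAR(32),
    logo VARCHAR(255),
    `from` VARCHAR(255),
    barcode VARCHAR(255),
    acq_time DATETIME,
    hot_weight INT,
    pic_path VARCHAR(255)
) DEFAULT CHARSET=utf8
"""

CREATE_CLASSIFY_TABLE = """
CREATE TABLE IF NOT EXISTS pybbs_wxapp_classify (
    id INT AUTO_INCREMENT PRIMARY KEY,
    classify_name VARCHAR(64),
    `describe` TEXT,
    record_time DATETIME
) DEFAULT CHARSET=utf8
"""


def create_tables(cursor):
    """Run both CREATE TABLE statements on an open DB-API cursor."""
    for ddl in (CREATE_STORE_TABLE, CREATE_CLASSIFY_TABLE):
        cursor.execute(ddl)
```

With the crawler's own connection, calling `create_tables(conn.cursor())` and then `conn.commit()` before the first run would set up both tables. `view_num` is kept as text here because the crawler stores the scraped string without converting it to a number.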
If you repost this article, please cite its address: http://systransis.cn/yun/41197.html