Scraping WeChat Mini Programs with a Crawler

qianfeng / 953 reads

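I originally planned to build a community site for WeChat mini programs, so I wrote a crawler to scrape them. Later I decided that mini programs had no future, abandoned that project, and built my current site, 觀點, instead. The code is just sitting around, though, so I might as well make it public. Here it is; copy it and use it if you need it.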

#!/usr/bin/env python
#coding:utf-8
#Python 2 only: relies on urllib2, print statements and reload(sys)
__author__ = "haoning"
import time
import urllib2
import datetime
import requests
import json
import random
import sys
import platform
import uuid
reload(sys)
sys.setdefaultencoding( "utf-8" )
import re
import os
import MySQLdb as mdb
from bs4 import BeautifulSoup

DB_HOST = "127.0.0.1"
DB_USER = "root"
DB_PASS = "root"
#init database
conn = mdb.connect(DB_HOST, DB_USER, DB_PASS, "pybbs-springboot", charset="utf8")
conn.autocommit(False)
curr = conn.cursor()

count=0
how_many=0

base_url="http://www.wechat-cloud.com"
url=base_url+"/index.php?s=/home/article/ajax_get_list.html&category_id={category_id}&page={page}&size={size}"

user_agents = [
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11",
    "Opera/9.25 (Windows NT 5.1; U; en)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12",
    "Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
    "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 ",
]

def fake_header():
    agent=random.choice(user_agents)
    cookie="PHPSESSID=p5mokvec7ct1gqe9efcnth9d44; Hm_lvt_c364957e96174b029f292041f7d822b7=1487492811,1487556626; Hm_lpvt_c364957e96174b029f292041f7d822b7=1487564069"
    req_header = {
        "Accept":"application/json, text/javascript, */*; q=0.01",
        #"Accept-Encoding":"gzip, deflate, sdch",
        "Accept-Language":"zh-CN,zh;q=0.8",
        "Cache-Control":"max-age=0",
        "Connection":"keep-alive",
        "Host":"www.wechat-cloud.com",
        #"Cookie":cookie,
        "Referer":"http://www.wechat-cloud.com/index.php?s=/home/index/index.html",
        "Upgrade-Insecure-Requests":"1",
        "User-Agent":agent,
        "X-Requested-With":"XMLHttpRequest",
    }
    return req_header

def gethtml(url):
    try:
        header=fake_header()
        req = urllib2.Request(url,headers=header)
        response = urllib2.urlopen(req, None,15)
        html = response.read()
        return html
    except Exception as e:
        print "e",e
    return None


def get_img_data(url):
    try:
        #No browser headers are sent here; pass headers=fake_header() to Request if image URLs answer 403 Forbidden
        req = urllib2.Request(url)
        response = urllib2.urlopen(req, None,15)
        dataimg = response.read()
        return dataimg
    except Exception as e:
        print "image data",e
    return None

def makeDateFolder(par,classify):
    #Create <par>/<classify>/<YYYY-MM-DD> if needed and return it; fall back to par on any problem
    try:
        if os.path.isdir(par):
            newFolderName=os.path.join(par,str(classify),GetDateString())
            if not os.path.isdir( newFolderName ):
                os.makedirs( newFolderName )
            return newFolderName
        else:
            return par
    except Exception as e:
        print "kk",e
    return par

def map_folder(what):
    return what

def GetDateString():
    when=time.strftime("%Y-%m-%d",time.localtime(time.time()))
    foldername = str(when)
    return foldername 

def get_extension(name):  
    where=name.rfind(".")
    if where!=-1:
        return name[where:len(name)]
    return "#"

def download_img(url,what):
    #Save the image under the dated folder and return its relative path, or None on failure
    try:
        #print url
        extention=get_extension(url)
        dataimg=get_img_data(url)
        if dataimg is None:
            return None #download failed, so do not create an empty file
        name=str(uuid.uuid1()).replace("-","")+"-www.weixinapphome.com"
        #print "name",name
        classfiy_folder=map_folder(what)
        top="E://wxapp_store"
        filename=os.path.join(makeDateFolder(top,classfiy_folder),name+extention)
        try:
            if not os.path.exists(filename):
                with open(filename,"wb") as file_object:
                    file_object.write(dataimg)
                return classfiy_folder+"/"+GetDateString()+"/"+name+extention
            else:
                print "file exist"
                return None
        except IOError as e1:
            print "e1=",e1
        return None #nothing was saved; the original note suggests falling back to the source site's link
    except Exception as e:
        print "problem",e
    return None
    
    
def work():
    page=0
    global how_many
    while 1:
        try:
            page=page+1
            begin_url=url.format(category_id=0, page=page,size=12).encode("utf-8")
            html=gethtml(begin_url)
            if html is not None:
                #print html
                json_results=json.loads(html)
                is_end=json_results["isEnd"]
                if str(is_end)=="True":
                    break
                results=json_results["list"]
                for result in results:
                    href=result["href"]
                    detail_url=base_url+href
                    #print detail_url
                    detail_html=gethtml(detail_url)
                    if detail_html is not None:
                        soup = BeautifulSoup(detail_html, "html.parser") #name the parser explicitly
                        icon_url=base_url+soup.find("div",{"class":"icon fl"}).find("img").get("src")
                        name=soup.find("div",{"class":"cont fl"}).find("h2").text
                        classify=soup.find("div",{"class":"tab"}).find("span").text
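                        #Strip the "category: " label; the literal must match the site's text exactly (possibly the Simplified form)
                        classify=str(classify).replace("分類: ","")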
                        #print classify
                        barcode_path=base_url+soup.find("div",{"id":"install-code"}).find("img").get("src")
                        view_num=soup.find("span",{"class":"views"}).text
                        #view_num=filter(str.isalnum,str(view_num))
                        pic_path=base_url+soup.find("div",{"class":"img-box"}).find("img").get("src")
                        temp = time.time()
                        x = time.localtime(float(temp))
                        acq_time = time.strftime("%Y-%m-%d %H:%M:%S",x) # get time now
                        curr.execute("select id from pybbs_wxapp_store where `from`=%s",(detail_url,))
                        y= curr.fetchone()
                        if not y:
                            y1=download_img(icon_url,"icon")
                            y2=download_img(barcode_path,"barcode")
                            y3=download_img(pic_path,"pic")
                            if (y1 is not None) and (y2 is not None) and (y3 is not None):
                                author=None
                                describe=None
                                logo=y1
                                _from=detail_url
                                barcode=y2
                                hot_weight=-9999
                                pic_uuid=str(uuid.uuid1()).replace("-","") #only used in the log line below
                                pic_path=y3
                                curr.execute("INSERT INTO pybbs_wxapp_store(name,author,classify,`describe`,view_num,logo,`from`,barcode,acq_time,hot_weight,pic_path)VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",(name,author,classify,describe,view_num,logo,_from,barcode,acq_time,hot_weight,pic_path))
                                curr.execute("select id from pybbs_wxapp_classify where `classify_name`=%s",(classify,))
                                yx= curr.fetchone()
                                if not yx:
                                    describe=None
                                    temp = time.time()
                                    x = time.localtime(float(temp))
                                    record_time = time.strftime("%Y-%m-%d %H:%M:%S",x) # get time now
                                    curr.execute("INSERT INTO pybbs_wxapp_classify(classify_name,`describe`,record_time)VALUES(%s,%s,%s)",(classify,describe,record_time))
                                how_many+=1
                                print "new comer:",pic_uuid,">>",how_many
                                if how_many % 10==0:
                                    conn.commit()
                conn.commit()
        except Exception as e:
            print "while error",e

if __name__ == "__main__":
    i=3
    while i>0:
        work()
        i=i-1
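
The paging logic in work() can also be sketched on its own, without the Python 2-only parts. The URL template and the isEnd, list and href keys come from the script above; the name iter_detail_urls, the injectable fetch argument, and the max_pages guard are assumptions added for illustration, not part of the original code.

```python
import json

BASE_URL = "http://www.wechat-cloud.com"
LIST_URL = (BASE_URL + "/index.php?s=/home/article/ajax_get_list.html"
            "&category_id={category_id}&page={page}&size={size}")


def iter_detail_urls(fetch, category_id=0, size=12, max_pages=1000):
    """Yield detail-page URLs until the listing reports isEnd.

    fetch is a hypothetical callable: it takes a URL and returns the
    response body as text, or None when the request failed.
    max_pages is an added safety limit that the original loop does not have.
    """
    for page in range(1, max_pages + 1):
        body = fetch(LIST_URL.format(category_id=category_id, page=page, size=size))
        if body is None:
            continue  # like the original: skip a failed page and try the next one
        data = json.loads(body)
        if str(data.get("isEnd")) == "True":
            break  # like the original: stop before reading "list" on the last page
        for item in data.get("list", []):
            yield BASE_URL + item["href"]
```

Passing the fetch function in makes the loop easy to try against canned JSON before pointing it at the real site.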

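Change some of the parameters to your own, such as the database password and the drive where the images are stored, and create the database tables yourself. All of this is simple enough that there is nothing more to say about it.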
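
One way to keep those site-specific values out of the script is to read them from environment variables. This is only a sketch: the WXAPP_* variable names and the load_settings and image_folder helpers are hypothetical, and the defaults simply repeat the values hard-coded in the script above.

```python
import os


def load_settings(env=None):
    """Collect the values the article says to change, with the script's defaults.

    The WXAPP_* variable names are made up for this example.
    """
    env = os.environ if env is None else env
    return {
        "db_host": env.get("WXAPP_DB_HOST", "127.0.0.1"),
        "db_user": env.get("WXAPP_DB_USER", "root"),
        "db_pass": env.get("WXAPP_DB_PASS", "root"),
        "db_name": env.get("WXAPP_DB_NAME", "pybbs-springboot"),
        "image_root": env.get("WXAPP_IMAGE_ROOT", "E://wxapp_store"),
    }


def image_folder(settings, classify, date_string):
    """Build <image_root>/<classify>/<date>, the layout makeDateFolder creates."""
    return os.path.join(settings["image_root"], str(classify), date_string)
```

With something like this in place, the connection call and the top folder in download_img could take their values from the returned dictionary instead of literals.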

When reposting, please cite this article's address: http://systransis.cn/yun/41197.html
