Webmagic+Selenium+PhantomJS in Practice

zhangxiangliang / 2307 reads

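It is more practical to just paste the code and explain it.
The webmagic-selenium module feels somewhat lacking, but there is still something to borrow from it. Drawing on it, I wrote my own SeleniumDownloader, shown below: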

import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.Downloader;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.PlainText;
import us.codecraft.webmagic.utils.UrlUtils;

import java.util.Map;

/**
 * @author taojw
 *
 */
public class SeleniumDownloader  implements Downloader{
    private static final Logger log=LoggerFactory.getLogger(SeleniumDownloader.class);
    private int sleepTime=3000;//3s
    private SeleniumAction action=null;
    private WebDriverPool webDriverPool=new WebDriverPool();
    public SeleniumDownloader(){
    }
    public SeleniumDownloader(int sleepTime,WebDriverPool pool){
        this(sleepTime,pool,null);
    }
    public SeleniumDownloader(int sleepTime,WebDriverPool pool,SeleniumAction action){
        this.sleepTime=sleepTime;
        this.action=action;
        if(pool!=null){
            webDriverPool=pool;
        }
    }
    public SeleniumDownloader setSleepTime(int sleepTime) {
        this.sleepTime = sleepTime;
        return this;
    }
    public void setOperator(SeleniumAction action){
        this.action=action;
    }
    @Override
    public Page download(Request request, Task task) {
        WebDriver webDriver;
        try {
            webDriver = webDriverPool.get();
        } catch (InterruptedException e) {
            log.warn("interrupted", e);
            return null;
        }
        log.info("downloading page " + request.getUrl());
        Page page = new Page();
        try {
            webDriver.get(request.getUrl());
            Thread.sleep(sleepTime);
        } catch (InterruptedException e) {
            e.printStackTrace();
        } catch (Exception e) {
            webDriverPool.close(webDriver);
            page.setSkip(true);
            return page;
        }
//        WindowUtil.changeWindow(webDriver);
        WebDriver.Options manage = webDriver.manage();
        Site site = task.getSite();
        if (site.getCookies() != null) {
            for (Map.Entry<String, String> cookieEntry : site.getCookies()
                    .entrySet()) {
                Cookie cookie = new Cookie(cookieEntry.getKey(),
                        cookieEntry.getValue());
                manage.addCookie(cookie);
            }
        }
        manage.window().maximize();
        if(action!=null){
            action.execute(webDriver);
        }
        SeleniumAction reqAction=(SeleniumAction) request.getExtra("action");
        if(reqAction!=null){
            reqAction.execute(webDriver);
        }

        WebElement webElement = webDriver.findElement(By.xpath("/html"));
        String content = webElement.getAttribute("outerHTML");
        
        page.setRawText(content);
        page.setHtml(new Html(UrlUtils.fixAllRelativeHrefs(content,
                webDriver.getCurrentUrl())));
        page.setUrl(new PlainText(webDriver.getCurrentUrl()));
        page.setRequest(request);
        webDriverPool.returnToPool(webDriver);
        return page;
    }

    @Override
    public void setThread(int thread) {
        
    }

}

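Features:
A SeleniumAction hook can be supplied when calling Spider.setDownloader, to implement custom Selenium operations that apply to every request. This improves flexibility.
Each request can also carry an "action" extra whose value is a SeleniumAction object, so that custom Selenium operations can be defined per request. This improves flexibility further.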

import org.openqa.selenium.WebDriver;

/**
 * @author taojw
 *
 */
public interface SeleniumAction {
    void execute(WebDriver driver);
}
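To make the two extension points concrete, here is a minimal, dependency-free sketch of the ordering that SeleniumDownloader.download applies: the downloader-wide action runs first, then the action stored under the request's "action" extra. HookOrderSketch, Action, and the plain Map standing in for the request extras are hypothetical stand-ins, introduced only so the example runs without Selenium or WebMagic on the classpath.

```java
import java.util.Map;

/** Hypothetical stand-in for SeleniumAction; D plays the role of WebDriver. */
interface Action<D> {
    void execute(D driver);
}

/** Mirrors the hook ordering in SeleniumDownloader.download, without Selenium. */
final class HookOrderSketch<D> {
    private final Action<D> globalAction; // set once, like setOperator(...)

    HookOrderSketch(Action<D> globalAction) {
        this.globalAction = globalAction;
    }

    /** Runs the downloader-wide action first, then the per-request one, if present. */
    @SuppressWarnings("unchecked")
    void runHooks(D driver, Map<String, Object> requestExtras) {
        if (globalAction != null) {
            globalAction.execute(driver);
        }
        // Stands in for request.getExtra("action") in the downloader
        Object perRequest = requestExtras.get("action");
        if (perRequest instanceof Action) {
            ((Action<D>) perRequest).execute(driver);
        }
    }
}
```

In the real downloader the per-request action is read with request.getExtra("action"); on the WebMagic side it would presumably be attached with request.putExtra("action", someAction) before the request is queued.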

WebDriverPool implementation: note that WebDriver instances are pooled to keep performance acceptable.
This is also based on webmagic-selenium, with some modifications.

import com.fh.util.FileUtil;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * @author taojw
 */
public class WebDriverPool {
    private Logger logger = LoggerFactory.getLogger(getClass());

    private int CAPACITY = 5;
    private AtomicInteger refCount = new AtomicInteger(0);
    private static final String DRIVER_PHANTOMJS = "phantomjs";

    /**
     * store webDrivers available
     */
    private BlockingDeque<WebDriver> innerQueue = new LinkedBlockingDeque<WebDriver>(
            CAPACITY);

    private static String PHANTOMJS_PATH;
    private static DesiredCapabilities caps = DesiredCapabilities.phantomjs();
    static {
        PHANTOMJS_PATH = FileUtil.getCommonProp("phantomjs.path");
        caps.setJavascriptEnabled(true);
        caps.setCapability(
                PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,
                PHANTOMJS_PATH);
        caps.setCapability("takesScreenshot", true);
        caps.setCapability(
                PhantomJSDriverService.PHANTOMJS_PAGE_CUSTOMHEADERS_PREFIX
                        + "User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36");
        caps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS,
                "--load-images=no");

    }

    public WebDriverPool() {
    }

    public WebDriverPool(int poolsize) {
        this.CAPACITY = poolsize;
        innerQueue = new LinkedBlockingDeque<WebDriver>(poolsize);
    }

    public WebDriver get() throws InterruptedException {
        WebDriver poll = innerQueue.poll();
        if (poll != null) {
            return poll;
        }
        if (refCount.get() < CAPACITY) {
            synchronized (innerQueue) {
                if (refCount.get() < CAPACITY) {

                    WebDriver mDriver = new PhantomJSDriver(caps);
                    // Tentative workaround for https://github.com/ariya/phantomjs/issues/11526
                    mDriver.manage().timeouts()
                            .pageLoadTimeout(60, TimeUnit.SECONDS);
                    // mDriver.manage().window().setSize(new Dimension(1366,
                    // 768));
                    innerQueue.add(mDriver);
                    refCount.incrementAndGet();
                }
            }
        }
        return innerQueue.take();
    }

    public void returnToPool(WebDriver webDriver) {
        // webDriver.quit();
        // webDriver=null;
        innerQueue.add(webDriver);
    }

    public void close(WebDriver webDriver) {
        refCount.decrementAndGet();
        webDriver.close();
        webDriver.quit();
        webDriver = null;
    }

    public void shutdown() {
        try {
            for (WebDriver driver : innerQueue) {
                close(driver);
            }
            innerQueue.clear();
        } catch (Exception e) {
//            e.printStackTrace();
            logger.warn("failed to shut down WebDriverPool",e);
        }
    }
}

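After the modifications:
Only PhantomJS is supported as the browser driver.
PhantomJS-related configuration has been added.
The logic that controls the queue size has been changed.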
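The pooling logic can be looked at in isolation from Selenium. The following is a minimal sketch, assuming a generic element type T in place of WebDriver and a Supplier in place of `new PhantomJSDriver(caps)`; BoundedLazyPool and its method names are hypothetical. Unlike the class above, it reserves a slot with compare-and-set rather than a synchronized block, and it wraps InterruptedException in an unchecked exception to keep the call site short.

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical generic version of the pooling idea; T stands in for WebDriver. */
final class BoundedLazyPool<T> {
    private final int capacity;
    private final Supplier<T> factory;
    private final AtomicInteger created = new AtomicInteger(0);
    private final BlockingDeque<T> idle;

    BoundedLazyPool(int capacity, Supplier<T> factory) {
        this.capacity = capacity;
        this.factory = factory;
        this.idle = new LinkedBlockingDeque<>(capacity);
    }

    /** Returns an idle instance, creates one while under capacity, otherwise blocks. */
    T borrow() {
        T item = idle.poll();
        if (item != null) {
            return item;
        }
        // Reserve a slot with compare-and-set so two threads cannot both
        // create the "last" instance.
        int n;
        while ((n = created.get()) < capacity) {
            if (created.compareAndSet(n, n + 1)) {
                try {
                    return factory.get();
                } catch (RuntimeException e) {
                    created.decrementAndGet(); // creation failed, free the slot
                    throw e;
                }
            }
        }
        try {
            return idle.take(); // pool exhausted: wait for a giveBack
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a pooled instance", e);
        }
    }

    /** Hands a healthy instance back for reuse. */
    void giveBack(T item) {
        idle.offer(item);
    }

    /** Call after disposing of a broken instance so a replacement may be created. */
    void discard() {
        created.decrementAndGet();
    }

    int createdCount() {
        return created.get();
    }
}
```

The design point is the same as in WebDriverPool: instances are created lazily up to a fixed limit, reused when returned, and the counter is decremented only when an instance is thrown away.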

WindowUtil
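Note that the loadAll method is a neat trick. On pages that load content as you scroll, jumping straight to the bottom can leave the middle of the page unloaded, which would force you to scroll slowly through every page. loadAll instead reads the page's scrollable height, resizes the browser window to that height, and refreshes; after the refresh all of the content is loaded.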

import org.apache.commons.io.FileUtils;
import org.openqa.selenium.*;

import java.io.File;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/**
 * @author taojw
 *
 */
public class WindowUtil {
    
    /**
     * Scrolls the window.
     * @param driver
     * @param height
     */
    public static void scroll(WebDriver driver,int height){
        ((JavascriptExecutor)driver).executeScript("window.scrollTo(0,"+height+" );");    
    }
    /**
     * Resizes the window to fit the page. This takes some time, so wait a reasonable period afterwards.
     * @param driver
     */
    public static void loadAll(WebDriver driver){
        Dimension od=driver.manage().window().getSize();
        int width=driver.manage().window().getSize().width;
        // Tentative workaround for https://github.com/ariya/phantomjs/issues/11526
        driver.manage().timeouts().pageLoadTimeout(60, TimeUnit.SECONDS); 
        long height=(Long)((JavascriptExecutor)driver).executeScript("return document.body.scrollHeight;");
        driver.manage().window().setSize(new Dimension(width, (int)height));
        driver.navigate().refresh();
    }
    public static void taskScreenShot(WebDriver driver,File saveFile){
        File src=((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
        try {
            FileUtils.copyFile(src, saveFile);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
    public static void changeWindow(WebDriver driver){
        // Get the handle of the current window
        String handle = driver.getWindowHandle();
        // Iterate over all window handles and switchTo() any that is not the current one
        for (String handles : driver.getWindowHandles()) {
            if (handles.equals(handle))
                continue;
            driver.switchTo().window(handles);
        }
    }
}

That wraps up the extensions to the crawler framework.

Hands-on: scraping Taobao shop information
/**
 * Shop sales information
 *
 * @author taojw
 */
@Scope("prototype")
@Component
public class TaoBaoShopInfoProcessor implements PageProcessor {
    private static final Logger log = LoggerFactory
            .getLogger(TaoBaoShopInfoProcessor.class);

    @Autowired
    private TaoBaoShopInfoService service;

    private Site site = Site
            .me()
            .setCharset("UTF-8")
            .setCycleRetryTimes(3)
            .setSleepTime(3 * 1000)
            .addHeader("Connection", "keep-alive")
            .addHeader("Cache-Control", "max-age=0")
            .addHeader("User-Agent",
                    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0");

    private AtomicBoolean isPageAdd = new AtomicBoolean(false);
    private static AtomicBoolean running = new AtomicBoolean(false);
    private WebDriverPool pool=new WebDriverPool();
    @Override
    public Site getSite() {
        return this.site;
    }

    @Override
    public void process(Page page) {
        if (islistPage(page)) {
            List<String> urls = page.getHtml()
                    .$("dl.item a.J_TGoldData", "href").all();
            List<String> targetUrls = new ArrayList<String>();
            for (String url : urls) {
                targetUrls.add(url.trim());
            }
            page.addTargetRequests(targetUrls);
            if (isPageAdd.compareAndSet(false, true)) {
                // Handle pagination
                String pageinfo = page.getHtml()
                        .$(".pagination .page-info", "text").get();
                int pageCount = Integer.valueOf(pageinfo.split("/")[1]);
                String cururl = page.getUrl().get();
                // Only crawl the first 5 pages
                if(pageCount>5){
                    pageCount=5;
                }
                for (int i = 1; i < pageCount; i++) {
                    String tmp = cururl + "&pageNo=" + (i + 1);
                    page.addTargetRequest(tmp);
                }
            }
            return;
        }

        // Product page
        String curUrl = page.getUrl().get();
        boolean isTaoBao=curUrl.startsWith("https://item.taobao.com");
        boolean isTmall=curUrl.startsWith("https://detail.tmall.com");

        // "?" is a regex metacharacter, so it must be escaped for String.split
        String tmpspm = curUrl.split("\\?")[1].split("&")[0];
        // spm code
        String spm = tmpspm.split("=")[1];
        // Shop URL
        String shopUrl="";
        // Product name
        String name="";
        // Price
        double price =0;
        // Total transactions in the last 30 days
        int sellCount=0;
        // Total transaction value
        double allPrice=0;
        if(isTaoBao){
            shopUrl= page.getHtml()
                    .xpath("//div[@class=\"tb-shop-name\"]/dl/dd/strong/a/@href")
                    .get();
            shopUrl = shopUrl.split("\\?")[0];

            name = page.getHtml().xpath("//*[@id=\"J_Title\"]/h3/text()")
                    .get();
            try{
                price=Double.valueOf(page.getHtml()
                        .$("#J_PromoPriceNum", "text").get().split("-")[0].trim());
            }catch(Exception e){
                // No promotional price found; fall back to the regular price element
                price=Double.valueOf(page.getHtml()
                        .$("#J_StrPrice .tb-rmb-num", "text").get().split("-")[0].trim());
            }
            sellCount = Integer.valueOf(page.getHtml()
                    .$("#J_SellCounter", "text").get());
            allPrice = price * sellCount;
        }else if(isTmall){
            shopUrl= page.getHtml()
                    .xpath("//*[@id=\"side-shop-info\"]/div/h3/div/a/@href")
                    .get();
            shopUrl = shopUrl.split("\\?")[0];

            name = page.getHtml().$(".tb-detail-hd h1","text")
                    .get().trim();

            price=Double.valueOf(page.getHtml()
                        .$(".tm-price", "text").get().split("-")[0].trim());

            sellCount = Integer.valueOf(page.getHtml()
                    .$(".tm-count", "text").get().trim());
            allPrice = price * sellCount;
        }

        // Collection date
        // Timestamp recordDate=new Timestamp(new Date().getTime());
        String recordDate = DateUtil.formatDate(new Date(), "yyyy-MM-dd");

        log.debug(shopUrl + ":" + spm + ":" + name + ":" + price + ":"
                + sellCount + ":" + allPrice + ":" + recordDate);

        PageData pd = new PageData();
        pd.put("id", UUID.randomUUID().toString());
        pd.put("shopUrl", shopUrl);
        pd.put("spm", spm);
        pd.put("name", name);
        pd.put("price", price);
        pd.put("sellCount", sellCount);
        pd.put("allPrice", allPrice);
        pd.put("recordDate", recordDate);
        service.saveData(pd);
    }

    private boolean islistPage(Page page) {
        String tmp = page.getHtml().$("#J_PromoPrice").get();
        if (StringUtils.isBlank(tmp)) {
            return true;
        }
        return false;
    }

    public void start() {
        if (running.compareAndSet(false, true)) {
            try {
                service.emptyTable();
                List urls = service.getShopUrl();
                if (urls == null) {
                    log.error("failed to fetch shop URLs, aborting crawl");
                }
                String[] urlStrs=null;
                int size=50;
//                int size=urls.size();
                if(urls.size()
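The URL handling in the processor above (pulling out the spm value, stripping the query string, and generating the follow-up page URLs) can be sketched without WebMagic. This is a minimal illustration, assuming the page-info text has the form "current/total"; ShopUrlSketch and its method names are hypothetical. Unlike the processor, it looks the parameter up by name instead of assuming that spm is the first one in the query string.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helpers; not part of the original processor. */
final class ShopUrlSketch {
    private ShopUrlSketch() {}

    /** Returns the raw (not percent-decoded) value of a query parameter, or null. */
    static String queryParam(String url, String name) {
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv[0].equals(name)) {
                return kv.length == 2 ? kv[1] : "";
            }
        }
        return null;
    }

    /** Strips the query string, like shopUrl.split("\\?")[0]. */
    static String withoutQuery(String url) {
        int q = url.indexOf('?');
        return q < 0 ? url : url.substring(0, q);
    }

    /**
     * Builds the URLs for pages 2..min(total, maxPages) from page-info text
     * of the form "current/total", for example "1/12".
     */
    static List<String> pageUrls(String firstPageUrl, String pageInfo, int maxPages) {
        int total = Integer.parseInt(pageInfo.split("/")[1].trim());
        int last = Math.min(total, maxPages);
        String sep = firstPageUrl.contains("?") ? "&" : "?";
        List<String> urls = new ArrayList<>();
        for (int page = 2; page <= last; page++) {
            urls.add(firstPageUrl + sep + "pageNo=" + page);
        }
        return urls;
    }
}
```

With maxPages set to 5 this reproduces the "first 5 pages only" rule: the first page is the one already being processed, so at most four extra URLs are generated.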
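Scraping Maoyan box-office data

Maoyan renders its box-office figures with an obfuscated icon font, and the code mapped to each digit changes on every load. So this time the page is loaded with Selenium, a screenshot is taken, and each number is cropped out of it. Because the Maoyan figures are highly regular, a model trained for Google's Tesseract-OCR is then used to recognize the cropped number images.

ImageUtil handles the cropping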

import java.io.IOException;

import net.coobird.thumbnailator.Thumbnails;
import net.coobird.thumbnailator.geometry.Position;
import net.coobird.thumbnailator.geometry.Size;

/**
 * @author taojw
 *
 */
public class ImageUtil {
    public static void crop(String srcfile,String destfile,ImageRegion region){
        // Crop at the given coordinates
        try {
            Thumbnails.of(srcfile)
                    .sourceRegion(region.x, region.y, region.width, region.height)
                    .size(region.width, region.height).outputQuality(1.0)
                    //.keepAspectRatio(false)  // do not keep the aspect ratio
                    .toFile(destfile);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }  
    }
    public static void main(String[] args) {
        crop("D:data111.png","D:data1112.png",new ImageRegion(66, 264, 422, 426));
    }
}
/**
 * @author taojw
 *
 */
public class ImageRegion {
    public int x;
    public int y;
    public int width;
    public int height;
    public ImageRegion(int x,int y,int width,int height){
        this.x=x;
        this.y=y;
        this.width=width;
        this.height=height;
    }
}
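For comparison, the same crop can be done with the JDK alone. The sketch below swaps Thumbnailator for BufferedImage.getSubimage and additionally clamps the requested rectangle to the image bounds, which may be useful when a region computed from the page runs past the edge of the screenshot. CropSketch is a hypothetical name, and the clamping behaviour is an assumption about what is wanted, not something the original utility does.

```java
import java.awt.image.BufferedImage;

/** Hypothetical JDK-only alternative to the Thumbnailator call above. */
final class CropSketch {
    private CropSketch() {}

    /**
     * Returns the part of the requested rectangle that lies inside src.
     * Throws IllegalArgumentException if the rectangle misses the image entirely.
     */
    static BufferedImage crop(BufferedImage src, int x, int y, int width, int height) {
        int left = Math.max(0, x);
        int top = Math.max(0, y);
        int right = Math.min(src.getWidth(), x + width);
        int bottom = Math.min(src.getHeight(), y + height);
        if (right <= left || bottom <= top) {
            throw new IllegalArgumentException("region lies outside the image");
        }
        // getSubimage shares pixel data with src, which is fine for writing the crop out
        return src.getSubimage(left, top, right - left, bottom - top);
    }
}
```

The screenshot would be read with ImageIO.read(new File(...)) and the result written with ImageIO.write(cropped, "png", new File(...)).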

TesseractOcrUtil invokes the tesseract process and returns the recognition result.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.UUID;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.fh.util.FileUtil;

/**
 * @author taojw
 *
 */
public class TesseractOcrUtil {
    private static final Logger log = LoggerFactory
            .getLogger(TesseractOcrUtil.class);
    private static final String tessPath;
    private static final String basePath;
    static {
        tessPath = FileUtil.getCommonProp("tesseract.path");
        basePath = new File(tessPath).getParentFile().getAbsolutePath();
    }

    public static String getByLangNum(String imagePath) {
        return get(imagePath, "num");
    }

    public static String getByLangChi(String imagePath) {
        return get(imagePath, "chi_sim");
    }

    public static String getByLangEng(String imagePath) {
        return get(imagePath, "eng");
    }

    public static String get(String imagePath, String lang) {
        String outName = UUID.randomUUID().toString();
        String outPath = basePath + File.separator
                + outName + ".txt";
//        String cmd = tessPath + " " + imagePath + " " + outName + " -l " + lang;
        ProcessBuilder pb = new ProcessBuilder();
        pb.directory(new File(basePath));
        
        pb.command(tessPath,imagePath,outName,"-l",lang);
        
        pb.redirectErrorStream(true);
        
        Process process=null;
        String errormsg = "";
        String res = null;
        try {
            process = pb.start();
            // tesseract.exe 1.jpg 1 -l chi_sim
            int excode = process.waitFor();
            
            if (excode == 0) {
                BufferedReader in = new BufferedReader(new InputStreamReader(
                        new FileInputStream(outPath), "UTF-8"));
                res = in.readLine();
                IOUtils.closeQuietly(in);
            } else {
                switch (excode) {
                case 1:
                    errormsg = "Errors accessing files. There may be spaces in your image's filename.";
                    break;
                case 29:
                    errormsg = "Cannot recognize the image or its selected region.";
                    break;
                case 31:
                    errormsg = "Unsupported image format.";
                    break;
                default:
                    errormsg = "Errors occurred.";
                }
                log.error("when ocr picture " + imagePath
                        + " an error occurred. " + errormsg);
            }

        } catch (IOException e) {
            e.printStackTrace();
            log.warn("ocr process hit an io error",e);
        } catch (InterruptedException e) {
            e.printStackTrace();
            log.warn("ocr process was interrupted unexpectedly!",e);
        }finally{
            FileUtils.deleteQuietly(new File(imagePath));
            FileUtils.deleteQuietly(new File(outPath));
        }
        if(res!=null){
            res=res.trim();
        }
        return res;
    }
}
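One thing to watch in the utility above is that the child's output is merged into a single stream but never read, and waitFor() has no time limit. The following is a minimal sketch of one way to harden the process call, assuming Java 9 or later for Redirect.DISCARD; OcrProcessSketch, its method names, and the -1 timeout marker are hypothetical. The command line itself is the same one the utility builds.

```java
import java.io.File;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; the names here are not from the original utility. */
final class OcrProcessSketch {
    private OcrProcessSketch() {}

    /** Builds the same command line as TesseractOcrUtil: tesseract IMAGE OUTBASE -l LANG. */
    static ProcessBuilder prepare(String tessPath, String imagePath, String outBase, String lang) {
        ProcessBuilder pb = new ProcessBuilder(tessPath, imagePath, outBase, "-l", lang);
        pb.redirectErrorStream(true);
        // Nothing reads the child's output, so discard it; an unread pipe can
        // fill up and stall the child process. Redirect.DISCARD needs Java 9+.
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        return pb;
    }

    /** Runs the process; returns its exit code, or -1 if it was killed on timeout. */
    static int run(ProcessBuilder pb, File workDir, long timeoutSeconds)
            throws IOException, InterruptedException {
        pb.directory(workDir);
        Process process = pb.start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            return -1;
        }
        return process.exitValue();
    }
}
```

The recognized text would still be read from the OUTBASE.txt file exactly as in the utility; only the process handling changes.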
/**
 * @author taojw
 *
 */
public class MaoyanTest implements PageProcessor{
    private static Site site=Site.me().setCharset("UTF-8").setUserAgent(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.65 Safari/537.31");
    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {

    }
    public  void start() {
        
        Spider cnSpider = Spider.create(this).setDownloader(new SeleniumDownloader(5000,null,new TestAction()))
//                .addUrl("https://shop34068488.taobao.com/?spm=a230r.7195193.1997079397.2.JLFlPa")
//                .addUrl("http://piaofang.maoyan.com/company/cinema?date=2017-01-18&webCityId=288&cityTier=0&page=1&cityName=%E6%8F%AD%E9%98%B3");
                .addUrl("http://piaofang.maoyan.com/company/cinema?date=2017-01-18&webCityId=84&cityTier=0&page=1&cityName=%E4%BF%9D%E5%AE%9A");
//                .addPipeline(new JsonFilePipeline("D:datawebmagicfile.json"))
        
        //SpiderMonitor.instance().register(cnSpider);
        cnSpider.run();
    }
    public static void main(String[] args) {
        new MaoyanTest().start();
    }
    
    private class TestAction implements SeleniumAction{

        @Override
        public void execute(WebDriver driver) {
            WindowUtil.loadAll(driver);
            try {
                Thread.sleep(5000);
                //WebDriverWait wait = new WebDriverWait(driver, 10);
                //wait.until(ExpectedConditions.presenceOfElementLocated(By.id("J_PromoPriceNum")));
                
                File src=((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
                String srcfile="D:data"+UUID.randomUUID().toString()+".png";
                FileUtils.copyFile(src, new File(srcfile));
                List<WebElement> movielist=driver.findElements(By.xpath("//*[@id=\"cinema-tbody\"]/tr"));
//                movielist.remove(0);
                for(int i=1;i

Reference links:
Selenium article series: http://www.cnblogs.com/TankXi...
Selenium API: http://seleniumhq.github.io/s...
Tesseract-OCR sample training: http://blog.csdn.net/firehood...
Selenium multi-window switching: http://blog.csdn.net/meyoung0...
