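Summary: WebMagic has a selenium module that implements a SeleniumDownloader, but it does not feel very flexible. This post presents a replacement built around a SeleniumAction interface, which receives a WebDriver instance on which you can perform arbitrary Selenium operations.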
WebMagic has a selenium module that implements a SeleniumDownloader, but I found it not very flexible, so I wrote my own implementation using it as a reference.
First, WebDriverPool, which manages a pool of WebDriver instances:
import java.util.ArrayList;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import net.xby1993.common.util.FileUtil;

/**
 * @author taojw
 */
public class WebDriverPool {
    private Logger logger = LoggerFactory.getLogger(getClass());

    private int CAPACITY = 5;
    // Number of drivers created so far (idle or in use)
    private AtomicInteger refCount = new AtomicInteger(0);
    private static final String DRIVER_PHANTOMJS = "phantomjs";

    /**
     * store webDrivers available
     */
    private BlockingDeque<WebDriver> innerQueue = new LinkedBlockingDeque<WebDriver>(CAPACITY);

    private static String PHANTOMJS_PATH;
    private static DesiredCapabilities caps = DesiredCapabilities.phantomjs();

    static {
        PHANTOMJS_PATH = FileUtil.getCommonProp("phantomjs.path");
        caps.setJavascriptEnabled(true);
        caps.setCapability(
                PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,
                PHANTOMJS_PATH);
        caps.setCapability("takesScreenshot", false);
        caps.setCapability(
                PhantomJSDriverService.PHANTOMJS_PAGE_CUSTOMHEADERS_PREFIX + "User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36");
        ArrayList<String> cliArgsCap = new ArrayList<String>();
        // http://phantomjs.org/api/command-line.html
        cliArgsCap.add("--web-security=false");
        cliArgsCap.add("--ssl-protocol=any");
        cliArgsCap.add("--ignore-ssl-errors=true");
        cliArgsCap.add("--load-images=false"); // do not load images
        caps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, cliArgsCap);
        caps.setCapability(
                PhantomJSDriverService.PHANTOMJS_GHOSTDRIVER_CLI_ARGS,
                new String[] {"--logLevel=INFO"});
    }

    public WebDriverPool() {
    }

    public WebDriverPool(int poolsize) {
        this.CAPACITY = poolsize;
        innerQueue = new LinkedBlockingDeque<WebDriver>(poolsize);
    }

    public WebDriver get() throws InterruptedException {
        // Reuse an idle driver if one is available
        WebDriver poll = innerQueue.poll();
        if (poll != null) {
            return poll;
        }
        // Otherwise create a new driver lazily, up to CAPACITY (double-checked under the lock)
        if (refCount.get() < CAPACITY) {
            synchronized (innerQueue) {
                if (refCount.get() < CAPACITY) {
                    WebDriver mDriver = new PhantomJSDriver(caps);
                    // Tentative workaround for https://github.com/ariya/phantomjs/issues/11526
                    mDriver.manage().timeouts()
                            .pageLoadTimeout(60, TimeUnit.SECONDS);
                    // mDriver.manage().window().setSize(new Dimension(1366, 768));
                    innerQueue.add(mDriver);
                    refCount.incrementAndGet();
                }
            }
        }
        // Blocks until a driver is idle
        return innerQueue.take();
    }

    public void returnToPool(WebDriver webDriver) {
        innerQueue.add(webDriver);
    }

    /** Quits a driver for good and frees its slot so a replacement can be created. */
    public void close(WebDriver webDriver) {
        refCount.decrementAndGet();
        webDriver.quit();
    }

    /** Quits every idle driver; drivers still checked out are not affected. */
    public void shutdown() {
        try {
            for (WebDriver driver : innerQueue) {
                close(driver);
            }
            innerQueue.clear();
        } catch (Exception e) {
            logger.warn("failed to shut down WebDriverPool", e);
        }
    }
}
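The borrow/return logic of the pool can be tried out without PhantomJS installed. The following is a minimal sketch, not the author's code: `LazyPool` and `LazyPoolDemo` are hypothetical names, and plain strings stand in for WebDriver instances. It keeps the same idea as `get()` and `returnToPool()` above: reuse an idle item if there is one, otherwise create a new one lazily until the capacity is reached, otherwise block.

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical generic version of the pool's borrow/return logic. */
class LazyPool<T> {
    private final int capacity;
    private final Supplier<T> factory;
    private final AtomicInteger created = new AtomicInteger(0);
    private final BlockingDeque<T> idle;

    LazyPool(int capacity, Supplier<T> factory) {
        this.capacity = capacity;
        this.factory = factory;
        this.idle = new LinkedBlockingDeque<>(capacity);
    }

    T get() throws InterruptedException {
        T item = idle.poll();              // reuse an idle item if possible
        if (item != null) {
            return item;
        }
        if (created.get() < capacity) {    // double-checked lazy creation
            synchronized (idle) {
                if (created.get() < capacity) {
                    idle.add(factory.get());
                    created.incrementAndGet();
                }
            }
        }
        return idle.take();                // blocks until an item is idle
    }

    void returnToPool(T item) {
        idle.add(item);
    }

    int createdCount() {
        return created.get();
    }
}

class LazyPoolDemo {
    /** Borrows two items, returns one, borrows again, and reports what happened. */
    static String run() {
        AtomicInteger ids = new AtomicInteger();
        // Strings stand in for WebDriver instances (an assumption for this sketch)
        LazyPool<String> pool = new LazyPool<>(2, () -> "driver-" + ids.incrementAndGet());
        try {
            String a = pool.get();
            String b = pool.get();
            pool.returnToPool(a);
            String c = pool.get();         // the returned item is reused, nothing new is created
            return a + "," + b + "," + c + "," + pool.createdCount();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With the real WebDriverPool, the caller is responsible for handing the driver back, so wrapping the work in try/finally around `returnToPool` is a reasonable habit.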
Next is SeleniumDownloader:
import org.openqa.selenium.By;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.downloader.Downloader;
import us.codecraft.webmagic.selector.Html;
import us.codecraft.webmagic.selector.PlainText;
import us.codecraft.webmagic.utils.UrlUtils;

import java.util.Map;

/**
 * @author taojw
 */
public class SeleniumDownloader implements Downloader {
    private static final Logger log = LoggerFactory.getLogger(SeleniumDownloader.class);
    private int sleepTime = 3000; // 3s
    private SeleniumAction action = null;
    private WebDriverPool webDriverPool = new WebDriverPool();

    public SeleniumDownloader() {
    }

    public SeleniumDownloader(int sleepTime, WebDriverPool pool) {
        this(sleepTime, pool, null);
    }

    public SeleniumDownloader(int sleepTime, WebDriverPool pool, SeleniumAction action) {
        this.sleepTime = sleepTime;
        this.action = action;
        if (pool != null) {
            webDriverPool = pool;
        }
    }

    public SeleniumDownloader setSleepTime(int sleepTime) {
        this.sleepTime = sleepTime;
        return this;
    }

    public void setOperator(SeleniumAction action) {
        this.action = action;
    }

    @Override
    public Page download(Request request, Task task) {
        WebDriver webDriver;
        try {
            webDriver = webDriverPool.get();
        } catch (InterruptedException e) {
            log.warn("interrupted", e);
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return null;
        }
        log.info("downloading page " + request.getUrl());
        Page page = new Page();
        try {
            webDriver.get(request.getUrl());
            // Give the page's JavaScript time to run
            Thread.sleep(sleepTime);
        } catch (InterruptedException e) {
            log.warn("interrupted while waiting for page", e);
            Thread.currentThread().interrupt();
        } catch (Exception e) {
            // The driver may be in a bad state: discard it and skip this page
            log.warn("failed to load " + request.getUrl(), e);
            webDriverPool.close(webDriver);
            page.setSkip(true);
            return page;
        }
        try {
            // WindowUtil.changeWindow(webDriver);
            WebDriver.Options manage = webDriver.manage();
            Site site = task.getSite();
            if (site.getCookies() != null) {
                for (Map.Entry<String, String> cookieEntry : site.getCookies().entrySet()) {
                    Cookie cookie = new Cookie(cookieEntry.getKey(), cookieEntry.getValue());
                    manage.addCookie(cookie);
                }
            }
            manage.window().maximize();
            // Global action first, then the action attached to this request (if any)
            if (action != null) {
                action.execute(webDriver);
            }
            SeleniumAction reqAction = (SeleniumAction) request.getExtra("action");
            if (reqAction != null) {
                reqAction.execute(webDriver);
            }
            WebElement webElement = webDriver.findElement(By.xpath("/html"));
            String content = webElement.getAttribute("outerHTML");
            page.setRawText(content);
            page.setHtml(new Html(UrlUtils.fixAllRelativeHrefs(content, webDriver.getCurrentUrl())));
            page.setUrl(new PlainText(webDriver.getCurrentUrl()));
            page.setRequest(request);
            return page;
        } finally {
            // Always hand the driver back, even if an action throws
            webDriverPool.returnToPool(webDriver);
        }
    }

    @Override
    public void setThread(int thread) {
    }
}
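The extensibility here comes mainly from the SeleniumAction interface I added. You can configure a global SeleniumAction when initializing SeleniumDownloader, and you can also attach a SeleniumAction to an individual Request (the downloader reads it from the request's "action" extra and runs it after the global one). The SeleniumAction interface is as follows:
import org.openqa.selenium.WebDriver;

public interface SeleniumAction {
    // Called with the driver after the page has been loaded
    void execute(WebDriver driver);
}
It receives a WebDriver instance, on which you can perform arbitrary Selenium operations.
This concludes this part.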
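To make the order of the two hooks concrete, here is a small sketch that compiles without Selenium or WebMagic on the classpath. Everything in it is a stand-in invented for illustration: `StubDriver` plays the role of WebDriver, `StubAction` has the same shape as SeleniumAction, and the `extras` map plays the role of the Request extras. It only mirrors the dispatch order used in `download()`: load the page, run the global action, then run the per-request action stored under the "action" key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for org.openqa.selenium.WebDriver; records what was done to it. */
class StubDriver {
    final List<String> log = new ArrayList<>();

    void get(String url) {
        log.add("get " + url);
    }
}

/** Same shape as SeleniumAction, but typed against the stub driver. */
interface StubAction {
    void execute(StubDriver driver);
}

class ActionDispatchDemo {
    /** Mirrors the order in download(): page load, global action, per-request action. */
    static List<String> download(String url, StubAction global, Map<String, Object> extras) {
        StubDriver driver = new StubDriver();
        driver.get(url);
        if (global != null) {
            global.execute(driver);
        }
        StubAction perRequest = (StubAction) extras.get("action");
        if (perRequest != null) {
            perRequest.execute(driver);
        }
        return driver.log;
    }

    static List<String> run() {
        // Hypothetical actions; a real one would call Selenium APIs on the driver
        StubAction scroll = d -> d.log.add("scroll to bottom");
        StubAction clickMore = d -> d.log.add("click load-more");
        Map<String, Object> extras = new HashMap<>();
        extras.put("action", clickMore);
        return download("http://example.com/list", scroll, extras);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With the real classes, the global action would be passed to the SeleniumDownloader constructor or `setOperator`, and the per-request one would presumably be attached through the request's extras under the key "action", since that is the key `download()` reads.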