WebMagic Crawler Framework Source Code Analysis: Downloader

104828720, published 2019-08-14 17:55 / 2659 reads

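Summary: The getContent method first checks whether a charset is set (this is configured in Site). If so, it calls Apache Commons IOUtils directly to convert the response content into a string in that encoding; otherwise it auto-detects the character encoding of the response content.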

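Downloader is the component that requests a URL and obtains the response (HTML, JSON, JSONP, and so on). It also takes care of POST redirects, HTTPS verification, IP proxies, deciding whether to retry on failure, and similar concerns.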

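Interface: Downloader. Defines the download method, which returns a Page, and the setThread method, which sets the number of threads used for requests.
Abstract class: AbstractDownloader. Defines an overloaded download method that returns Html, the status callbacks onSuccess and onError, and addToCycleRetry, which decides whether a request needs to be retried.
Implementation class: HttpClientDownloader. Downloads pages through HttpClient.
Helper class: HttpClientGenerator. Creates HttpClient instances.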

1. AbstractDownloader

public Html download(String url, String charset) {
        Page page = download(new Request(url), Site.me().setCharset(charset).toTask());
        return (Html) page.getHtml();
    }

The download logic here is simple: it just calls the download method implemented by the subclass.

protected Page addToCycleRetry(Request request, Site site) {
        Page page = new Page();
        Object cycleTriedTimesObject = request.getExtra(Request.CYCLE_TRIED_TIMES);
        if (cycleTriedTimesObject == null) {
            page.addTargetRequest(request.setPriority(0).putExtra(Request.CYCLE_TRIED_TIMES, 1));
        } else {
            int cycleTriedTimes = (Integer) cycleTriedTimesObject;
            cycleTriedTimes++;
            if (cycleTriedTimes >= site.getCycleRetryTimes()) {
                return null;
            }
            page.addTargetRequest(request.setPriority(0).putExtra(Request.CYCLE_TRIED_TIMES, cycleTriedTimes));
        }
        page.setNeedCycleRetry(true);
        return page;
    }

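Retry decision logic: first check whether CYCLE_TRIED_TIMES is null. If it is null, this is the first failure, so the counter is set to 1 and the request is queued again. If it is not null, the cycle retry count is incremented and compared with the limit from site.getCycleRetryTimes(); once the limit is reached, the method returns null and the request is abandoned. Otherwise the needCycleRetry flag is set to mark the page as needing a retry. This came up in the Spider analysis article; here is the relevant Spider snippet again to reinforce the point: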

// for cycle retry
        if (page.isNeedCycleRetry()) {
            extractAndAddRequests(page, true);
            sleep(site.getRetrySleepTime());
            return;
        }
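The counting rule in addToCycleRetry can be traced with a small stand-alone model. This is only a sketch: the class name, the plain Map standing in for the request extras, and the key string are hypothetical and not WebMagic code; only the counting rule is taken from the method shown above.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone model of the counting rule in addToCycleRetry (hypothetical names).
class CycleRetrySketch {
    // Hypothetical key; stands in for Request.CYCLE_TRIED_TIMES.
    static final String CYCLE_TRIED_TIMES = "cycle_tried_times";

    // Returns true if the request should be queued again, false if it is given up.
    static boolean shouldRetry(Map<String, Object> extras, int cycleRetryTimes) {
        Object tried = extras.get(CYCLE_TRIED_TIMES);
        if (tried == null) {
            extras.put(CYCLE_TRIED_TIMES, 1); // first failure: start counting
            return true;
        }
        int next = ((Integer) tried) + 1;
        if (next >= cycleRetryTimes) {
            return false; // limit reached: give up
        }
        extras.put(CYCLE_TRIED_TIMES, next);
        return true;
    }

    // Counts how often a request that always fails is queued again.
    static int countRequeues(int cycleRetryTimes) {
        Map<String, Object> extras = new HashMap<>();
        int requeues = 0;
        while (shouldRetry(extras, cycleRetryTimes)) {
            requeues++;
        }
        return requeues;
    }

    public static void main(String[] args) {
        System.out.println(countRequeues(3));
    }
}
```

With a limit of 3, a request that always fails is queued again twice in this model: the counter goes to 1, then 2, and the third failure gives up.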

2. HttpClientDownloader
Extends AbstractDownloader and downloads pages through HttpClient.
Instance variables:
httpClients: a Map that stores the HttpClient instances created per site domain so that they can be reused.

httpClientGenerator: the HttpClientGenerator instance used to create HttpClient instances.

Main methods:
a. Obtaining an HttpClient instance.

private CloseableHttpClient getHttpClient(Site site, Proxy proxy) {
        if (site == null) {
            return httpClientGenerator.getClient(null, proxy);
        }
        String domain = site.getDomain();
        CloseableHttpClient httpClient = httpClients.get(domain);
        if (httpClient == null) {
            synchronized (this) {
                httpClient = httpClients.get(domain);
                if (httpClient == null) {
                    httpClient = httpClientGenerator.getClient(site, proxy);
                    httpClients.put(domain, httpClient);
                }
            }
        }
        return httpClient;
    }

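The main idea: get the domain from the Site, then check whether the httpClients map already holds an HttpClient instance for that domain. If it does, reuse it; otherwise create a new instance through httpClientGenerator, put it into the httpClients map, and return it.
Note that, to keep this thread-safe, the method uses double-checked locking.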
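For comparison, the same "one client per domain" idea can be written without explicit locking. The sketch below swaps in ConcurrentHashMap.computeIfAbsent, which is a different technique from the double-checked locking in the source; the class name and the generic type C (standing in for CloseableHttpClient) are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Per-domain cache sketch (hypothetical class; C stands in for CloseableHttpClient).
class ClientCacheSketch<C> {
    private final ConcurrentMap<String, C> clients = new ConcurrentHashMap<>();
    private final Function<String, C> factory;

    ClientCacheSketch(Function<String, C> factory) {
        this.factory = factory;
    }

    // Returns the cached client for the domain, creating it at most once.
    C get(String domain) {
        return clients.computeIfAbsent(domain, factory);
    }

    public static void main(String[] args) {
        AtomicInteger created = new AtomicInteger();
        ClientCacheSketch<String> cache = new ClientCacheSketch<>(
                domain -> "client-" + created.incrementAndGet() + "@" + domain);
        System.out.println(cache.get("example.com"));
        System.out.println(cache.get("example.com")); // reused, factory is not called again
        System.out.println(created.get());
    }
}
```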

b. The download method:

public Page download(Request request, Task task) {
    Site site = null;
    if (task != null) {
        site = task.getSite();
    }
    Set<Integer> acceptStatCode;
    String charset = null;
    Map<String, String> headers = null;
    if (site != null) {
        acceptStatCode = site.getAcceptStatCode();
        charset = site.getCharset();
        headers = site.getHeaders();
    } else {
        acceptStatCode = WMCollections.newHashSet(200);
    }
    logger.info("downloading page {}", request.getUrl());
    CloseableHttpResponse httpResponse = null;
    int statusCode=0;
    try {
        HttpHost proxyHost = null;
        Proxy proxy = null; //TODO
        if (site.getHttpProxyPool() != null && site.getHttpProxyPool().isEnable()) {
            proxy = site.getHttpProxyFromPool();
            proxyHost = proxy.getHttpHost();
        } else if(site.getHttpProxy()!= null){
            proxyHost = site.getHttpProxy();
        }
        
        HttpUriRequest httpUriRequest = getHttpUriRequest(request, site, headers, proxyHost);
        httpResponse = getHttpClient(site, proxy).execute(httpUriRequest);
        statusCode = httpResponse.getStatusLine().getStatusCode();
        request.putExtra(Request.STATUS_CODE, statusCode);
        if (statusAccept(acceptStatCode, statusCode)) {
            Page page = handleResponse(request, charset, httpResponse, task);
            onSuccess(request);
            return page;
        } else {
            logger.warn("get page {} error, status code {} ",request.getUrl(),statusCode);
            return null;
        }
    } catch (IOException e) {
        logger.warn("download page {} error", request.getUrl(), e);
        if (site.getCycleRetryTimes() > 0) {
            return addToCycleRetry(request, site);
        }
        onError(request);
        return null;
    } finally {
        request.putExtra(Request.STATUS_CODE, statusCode);
        if (site.getHttpProxyPool()!=null && site.getHttpProxyPool().isEnable()) {
            site.returnHttpProxyToPool((HttpHost) request.getExtra(Request.PROXY), (Integer) request
                    .getExtra(Request.STATUS_CODE));
        }
        try {
            if (httpResponse != null) {
                //ensure the connection is released back to pool
                EntityUtils.consume(httpResponse.getEntity());
            }
        } catch (IOException e) {
            logger.warn("close response fail", e);
        }
    }
}

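Note that the Task argument here is in fact the Spider instance.
First, the charset, the request headers, and the accepted response status codes are taken from the site.
Next comes the proxy setup: the method checks whether the site has a proxy pool configured and whether the pool is enabled. If so, it takes a proxy host from the pool at random; otherwise it checks whether the site has a direct proxy host set.
It then builds the HttpUriRequest (the interface implemented by HttpGet and HttpPost), executes the request, checks the status code, and converts the response into a Page object to return. Along the way it calls the status callbacks onSuccess and onError, but both are empty implementations (probably because Spider already handles status through its Listeners).
If an exception occurs, addToCycleRetry is called to decide whether a retry is needed.
If the Page returned here is null, Spider does not call the PageProcessor, so inside a PageProcessor there is no need to worry about Page being null.
Finally, the finally block releases resources: it returns the proxy to the pool and releases the HttpClient connection (EntityUtils.consume(httpResponse.getEntity());).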
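The branching described above can be condensed into a small model. This is a sketch under assumptions: the Outcome enum and the Fetcher interface are hypothetical, and the status check is assumed to be a plain set-membership test; it only mirrors which path the download method takes, not what WebMagic returns.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.Set;

// Model of the decision paths in download (hypothetical names, not WebMagic classes).
class DownloadOutcomeSketch {
    enum Outcome { PAGE, REJECTED_STATUS, CYCLE_RETRY, FAILED }

    // Hypothetical stand-in for executing the HTTP request and reading the status code.
    interface Fetcher {
        int fetchStatus() throws IOException;
    }

    static Outcome download(Fetcher fetcher, Set<Integer> acceptStatCode, int cycleRetryTimes) {
        try {
            int status = fetcher.fetchStatus();
            // Assumed: acceptance is a membership test on the configured codes.
            return acceptStatCode.contains(status) ? Outcome.PAGE : Outcome.REJECTED_STATUS;
        } catch (IOException e) {
            // Cycle retry is attempted only when the site enables it.
            return cycleRetryTimes > 0 ? Outcome.CYCLE_RETRY : Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        Set<Integer> ok = Collections.singleton(200);
        System.out.println(download(() -> 200, ok, 0));
        System.out.println(download(() -> 404, ok, 0));
        System.out.println(download(() -> { throw new IOException("timeout"); }, ok, 3));
    }
}
```

In the real method, both REJECTED_STATUS and FAILED correspond to returning null, which is why the PageProcessor never sees those requests.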

c. How the HttpUriRequest is built

protected HttpUriRequest getHttpUriRequest(Request request, Site site, Map<String, String> headers, HttpHost proxy) {
        RequestBuilder requestBuilder = selectRequestMethod(request).setUri(request.getUrl());
        if (headers != null) {
            for (Map.Entry<String, String> headerEntry : headers.entrySet()) {
                requestBuilder.addHeader(headerEntry.getKey(), headerEntry.getValue());
            }
        }
        RequestConfig.Builder requestConfigBuilder = RequestConfig.custom()
                .setConnectionRequestTimeout(site.getTimeOut())
                .setSocketTimeout(site.getTimeOut())
                .setConnectTimeout(site.getTimeOut())
                .setCookieSpec(CookieSpecs.BEST_MATCH);
        if (proxy !=null) {
            requestConfigBuilder.setProxy(proxy);
            request.putExtra(Request.PROXY, proxy);
        }
        requestBuilder.setConfig(requestConfigBuilder.build());
        return requestBuilder.build();
    }

The method first calls selectRequestMethod to get the appropriate RequestBuilder (GET or POST, for example) and sets the request parameters at the same time. It then uses the HttpClient API to set the request headers, the timeouts, the proxy, and so on.
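The same sequence (pick the method, copy the headers, apply the timeout) can be tried out with the JDK alone. The sketch below uses java.net.http.HttpRequest (Java 11+) in place of Apache HttpClient's RequestBuilder, so it is an analogy rather than the WebMagic code; the class name and the build parameters are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.Collections;
import java.util.Map;

// Request-building sketch with the JDK HTTP API (hypothetical class, not Apache HttpClient).
class RequestBuildSketch {
    static HttpRequest build(String method, String url, Map<String, String> headers,
                             int timeoutMillis, String body) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofMillis(timeoutMillis)); // must be positive
        if (headers != null) {
            for (Map.Entry<String, String> entry : headers.entrySet()) {
                builder.header(entry.getKey(), entry.getValue());
            }
        }
        // Choose the HTTP method last, as selectRequestMethod does up front.
        if ("POST".equalsIgnoreCase(method)) {
            builder.POST(HttpRequest.BodyPublishers.ofString(body == null ? "" : body));
        } else {
            builder.GET();
        }
        return builder.build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("GET", "http://example.com/",
                Collections.singletonMap("Accept", "text/html"), 5000, null);
        System.out.println(request.method() + " " + request.uri());
    }
}
```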

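A note on changes to selectRequestMethod: after WebMagic 0.6.2 (not yet released at the time of writing), setting POST request parameters is expected to become much simpler, because the author has merged and reworked a PR.
Previously, setting POST parameters required
request.putExtra("nameValuePair",NameValuePair[]); the NameValuePair[] had to be filled by adding BasicNameValuePair entries one after another, and a UrlEncodedFormEntity was needed as well. Setting parameters was tedious; the whole process looked like this: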

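// Build the form parameters one by one (i is the caller's page index).
List<NameValuePair> formparams = new ArrayList<NameValuePair>();
formparams.add(new BasicNameValuePair("channelCode", "0008"));
formparams.add(new BasicNameValuePair("pageIndex", i + ""));
formparams.add(new BasicNameValuePair("pageSize", "15"));
formparams.add(new BasicNameValuePair("sitewebName", "廣東省"));
// Pass a typed array; a plain toArray() would yield Object[].
request.putExtra("nameValuePair", formparams.toArray(new NameValuePair[0]));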

Afterwards, this is all that is needed:

request.putParam("sitewebName", "廣東省");
request.putParam("xxx", "xxx");
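Either way, the parameters end up as a URL-encoded form body. As a rough illustration of what that body looks like, here is a JDK-only sketch; the class name is hypothetical, and it does not use UrlEncodedFormEntity or any WebMagic API.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Form-encoding sketch (hypothetical class): key=value pairs joined with '&'.
class FormEncodeSketch {
    static String encode(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            joiner.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("channelCode", "0008");
        params.put("pageSize", "15");
        System.out.println(encode(params)); // channelCode=0008&pageSize=15
    }
}
```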

d. How the downloaded content is converted into a Page object:

protected Page handleResponse(Request request, String charset, HttpResponse httpResponse, Task task) throws IOException {
        String content = getContent(charset, httpResponse);
        Page page = new Page();
        page.setRawText(content);
        page.setUrl(new PlainText(request.getUrl()));
        page.setRequest(request);
        page.setStatusCode(httpResponse.getStatusLine().getStatusCode());
        return page;
    }

There is not much to say about this method; the only thing worth noting is that it calls getContent.

protected String getContent(String charset, HttpResponse httpResponse) throws IOException {
    if (charset == null) {
        byte[] contentBytes = IOUtils.toByteArray(httpResponse.getEntity().getContent());
        String htmlCharset = getHtmlCharset(httpResponse, contentBytes);
        if (htmlCharset != null) {
            return new String(contentBytes, htmlCharset);
        } else {
            logger.warn("Charset autodetect failed, use {} as charset. Please specify charset in Site.setCharset()", Charset.defaultCharset());
            return new String(contentBytes);
        }
    } else {
        return IOUtils.toString(httpResponse.getEntity().getContent(), charset);
    }
}

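The getContent method first checks whether a charset is set (this is configured in Site). If so, it calls Apache Commons IOUtils directly to convert the response content into a string in that encoding; otherwise it auto-detects the character encoding of the response content.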

protected String getHtmlCharset(HttpResponse httpResponse, byte[] contentBytes) throws IOException {
    return CharsetUtils.detectCharset(httpResponse.getEntity().getContentType().getValue(), contentBytes);
}

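getHtmlCharset calls CharsetUtils to detect the character encoding. The approach is to first check whether httpResponse.getEntity().getContentType().getValue() contains something like charset=utf-8.
If it does not, the content is parsed with Jsoup, the meta tags are extracted, and the encoding is worked out by handling the HTML 4.01 and HTML5 forms of the tag separately.
Of course, if the server returns incomplete HTML (without a head), or content that is not HTML at all (JSON, for example), detection fails and the JVM default encoding is used.
So, where possible, it is best to set the charset on the Site manually.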
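The detection order (header first, then meta tags, then give up) can be approximated with regular expressions. This is a simplified sketch, not the CharsetUtils implementation: the class name is hypothetical, and it uses regexes where the library, as described above, uses Jsoup.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Charset detection sketch (hypothetical class, regex-based approximation).
class CharsetDetectSketch {
    // Matches charset=xxx in a Content-Type value or inside a meta tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*['\"]?([\\w-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META = Pattern.compile("<meta[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the detected charset name, or null if nothing was found.
    static String detect(String contentType, byte[] contentBytes) {
        String fromHeader = find(contentType);
        if (fromHeader != null) {
            return fromHeader;
        }
        // ISO-8859-1 maps every byte to a char, so the ASCII markup survives decoding.
        String html = new String(contentBytes, StandardCharsets.ISO_8859_1);
        Matcher meta = META.matcher(html);
        while (meta.find()) {
            // Covers both <meta charset="..."> and <meta ... content="...; charset=...">.
            String fromMeta = find(meta.group());
            if (fromMeta != null) {
                return fromMeta;
            }
        }
        return null;
    }

    private static String find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"gbk\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect("text/html", html));
    }
}
```

A JSON body has neither a charset parameter nor a meta tag, so the sketch returns null for it, which matches the failure case described above.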

3. HttpClientGenerator
Creates HttpClient instances; it is effectively a factory.

public HttpClientGenerator() {
        Registry<ConnectionSocketFactory> reg = RegistryBuilder.<ConnectionSocketFactory>create()
                .register("http", PlainConnectionSocketFactory.INSTANCE)
                .register("https", buildSSLConnectionSocketFactory())
                .build();
        connectionManager = new PoolingHttpClientConnectionManager(reg);
        connectionManager.setDefaultMaxPerRoute(100);
    }

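The constructor mainly registers the socket factory instances for http and https. For https, a custom factory is needed so that validation of untrusted certificates is skipped (in other words, all certificates are trusted). Before webmagic 0.6, validation of untrusted certificates would fail; webmagic later merged a PR for this problem, and the current strategy is to skip certificate validation and trust every certificate (which is what a crawler ought to do anyway: we crawl out of loneliness, not for security).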
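The article does not show the body of buildSSLConnectionSocketFactory, so the following is only a sketch of the general "trust everything" technique using the JDK's javax.net.ssl classes; the class and method names are hypothetical. It disables certificate validation entirely, which is only defensible for crawling public pages.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Trust-all SSLContext sketch (hypothetical class; not the WebMagic implementation).
class TrustAllSketch {
    static SSLContext trustAllContext() {
        // A trust manager whose checks never throw, so every certificate chain is accepted.
        X509TrustManager trustAll = new X509TrustManager() {
            @Override
            public void checkClientTrusted(X509Certificate[] chain, String authType) {
            }

            @Override
            public void checkServerTrusted(X509Certificate[] chain, String authType) {
            }

            @Override
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
        };
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] { trustAll }, new SecureRandom());
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("could not create SSL context", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(trustAllContext().getProtocol());
    }
}
```

With Apache HttpClient, a context like this would typically be wrapped in an SSLConnectionSocketFactory before being registered for "https".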

private CloseableHttpClient generateClient(Site site, Proxy proxy) {
    CredentialsProvider credsProvider = null;
    HttpClientBuilder httpClientBuilder = HttpClients.custom();
    
    if(proxy!=null && StringUtils.isNotBlank(proxy.getUser()) && StringUtils.isNotBlank(proxy.getPassword()))
    {
        credsProvider= new BasicCredentialsProvider();
        credsProvider.setCredentials(
                new AuthScope(proxy.getHttpHost().getAddress().getHostAddress(), proxy.getHttpHost().getPort()),
                new UsernamePasswordCredentials(proxy.getUser(), proxy.getPassword()));
        httpClientBuilder.setDefaultCredentialsProvider(credsProvider);
    }

    if(site!=null&&site.getHttpProxy()!=null&&site.getUsernamePasswordCredentials()!=null){
        credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(
                new AuthScope(site.getHttpProxy()),//可以訪問(wèn)的范圍
                site.getUsernamePasswordCredentials());//用戶名和密碼
        httpClientBuilder.setDefaultCredentialsProvider(credsProvider);
    }
    
    httpClientBuilder.setConnectionManager(connectionManager);
    if (site != null && site.getUserAgent() != null) {
        httpClientBuilder.setUserAgent(site.getUserAgent());
    } else {
        httpClientBuilder.setUserAgent("");
    }
    if (site == null || site.isUseGzip()) {
        httpClientBuilder.addInterceptorFirst(new HttpRequestInterceptor() {

            public void process(
                    final HttpRequest request,
                    final HttpContext context) throws HttpException, IOException {
                if (!request.containsHeader("Accept-Encoding")) {
                    request.addHeader("Accept-Encoding", "gzip");
                }
            }
        });
    }
    //解決post/redirect/post 302跳轉(zhuǎn)問(wèn)題
    httpClientBuilder.setRedirectStrategy(new CustomRedirectStrategy());
    
    SocketConfig socketConfig = SocketConfig.custom().setSoTimeout(site.getTimeOut()).setSoKeepAlive(true).setTcpNoDelay(true).build();
    httpClientBuilder.setDefaultSocketConfig(socketConfig);
    connectionManager.setDefaultSocketConfig(socketConfig);
    if (site != null) {
        httpClientBuilder.setRetryHandler(new DefaultHttpRequestRetryHandler(site.getRetryTimes(), true));
    }
    generateCookie(httpClientBuilder, site);
    return httpClientBuilder.build();
}

The first part sets up the proxy and the proxy's username and password.
Two points deserve attention here:
1. The post/redirect/post 302 redirect problem: this is handled by setting a custom redirect strategy class. (It was broken before version 0.6; a PR was merged after 0.6.)

httpClientBuilder.setRedirectStrategy(new CustomRedirectStrategy());

CustomRedirectStrategy extends HttpClient's built-in LaxRedirectStrategy (which supports redirects for GET, POST, HEAD, and DELETE requests) and adds special handling for POST requests. For a POST request, the code is as follows:

HttpRequestWrapper httpRequestWrapper = (HttpRequestWrapper) request;
httpRequestWrapper.setURI(uri);
httpRequestWrapper.removeHeaders("Content-Length");

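As shown, for a POST request the original request object is reused: its URI is reset to the new redirect URL, and the header that the new request does not need is removed. The benefit of reusing the request object is that a post/redirect/post 302 redirect carries the original POST parameters along, which prevents the parameters from being lost.
Otherwise, the default implementation is: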

if (status == HttpStatus.SC_TEMPORARY_REDIRECT) {
                return RequestBuilder.copy(request).setUri(uri).build();
            } else {
                return new HttpGet(uri);
            }

SC_TEMPORARY_REDIRECT is status code 307; in other words, by default the parameters are carried along on a redirect only when the status code is 307.
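The difference between the two rules for a POST request can be shown with a tiny model. This is a sketch: the Req class and both method names are hypothetical, and it models only the POST branch quoted above, not the full Apache HttpClient redirect logic.

```java
// Redirect rule sketch for POST requests (hypothetical names, not Apache HttpClient classes).
class RedirectSketch {
    static final class Req {
        final String method;
        final String uri;
        final String body;

        Req(String method, String uri, String body) {
            this.method = method;
            this.uri = uri;
            this.body = body;
        }
    }

    // Default rule quoted above: only a 307 copies the request; anything else becomes a bare GET.
    static Req defaultRedirect(Req post, int status, String location) {
        if (status == 307) {
            return new Req(post.method, location, post.body);
        }
        return new Req("GET", location, null);
    }

    // Customised rule: the POST is reused and only its URI changes, so the body survives.
    static Req customRedirect(Req post, String location) {
        return new Req(post.method, location, post.body);
    }

    public static void main(String[] args) {
        Req post = new Req("POST", "http://example.com/login", "user=a");
        Req byDefault = defaultRedirect(post, 302, "http://example.com/home");
        Req custom = customRedirect(post, "http://example.com/home");
        System.out.println(byDefault.method + " body=" + byDefault.body);
        System.out.println(custom.method + " body=" + custom.body);
    }
}
```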

2. HttpClient retries: this is handled by setting a default retry handler, with the retry count taken from the retryTimes configured in Site.

httpClientBuilder.setRetryHandler(new DefaultHttpRequestRetryHandler(site.getRetryTimes(), true));

Next comes the cookie configuration.

private void generateCookie(HttpClientBuilder httpClientBuilder, Site site) {
    CookieStore cookieStore = new BasicCookieStore();
    for (Map.Entry<String, String> cookieEntry : site.getCookies().entrySet()) {
        BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
        cookie.setDomain(site.getDomain());
        cookieStore.addCookie(cookie);
    }
    for (Map.Entry<String, Map<String, String>> domainEntry : site.getAllCookies().entrySet()) {
        for (Map.Entry<String, String> cookieEntry : domainEntry.getValue().entrySet()) {
            BasicClientCookie cookie = new BasicClientCookie(cookieEntry.getKey(), cookieEntry.getValue());
            cookie.setDomain(domainEntry.getKey());
            cookieStore.addCookie(cookie);
        }
    }
    httpClientBuilder.setDefaultCookieStore(cookieStore);
}

A CookieStore instance is created first, the cookies from Site are added to it, and it is then set on the httpClientBuilder. Every request executed by this HttpClient instance will use this cookieStore. Keeping a login session alive, for example, can be done by configuring cookies in Site.
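The two loops (default-domain cookies, then per-domain cookies) can be reproduced with the JDK's own cookie classes. This sketch uses java.net.CookieManager and HttpCookie as stand-ins for Apache's BasicCookieStore and BasicClientCookie; the class name and the build parameters are hypothetical.

```java
import java.net.CookieManager;
import java.net.CookieStore;
import java.net.HttpCookie;
import java.net.URI;
import java.util.Collections;
import java.util.Map;

// Cookie store sketch with JDK classes (hypothetical class, not Apache HttpClient).
class CookieStoreSketch {
    static CookieStore build(String defaultDomain, Map<String, String> cookies,
                             Map<String, Map<String, String>> perDomain) {
        CookieStore store = new CookieManager().getCookieStore();
        // Cookies without an explicit domain go to the site's own domain.
        addAll(store, defaultDomain, cookies);
        // Cookies registered for specific domains keep those domains.
        for (Map.Entry<String, Map<String, String>> entry : perDomain.entrySet()) {
            addAll(store, entry.getKey(), entry.getValue());
        }
        return store;
    }

    private static void addAll(CookieStore store, String domain, Map<String, String> cookies) {
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            HttpCookie cookie = new HttpCookie(entry.getKey(), entry.getValue());
            cookie.setDomain(domain);
            cookie.setPath("/");
            store.add(URI.create("http://" + domain + "/"), cookie);
        }
    }

    public static void main(String[] args) {
        CookieStore store = build("example.com",
                Collections.singletonMap("sessionid", "abc"),
                Collections.singletonMap("other.example", Collections.singletonMap("token", "xyz")));
        System.out.println(store.getCookies().size());
    }
}
```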

4. About the Page object:
A Page object represents the result of one request, or, put differently, it corresponds to a page (although that description is a bit of a stretch when the response is JSON).

public Html getHtml() {
        if (html == null) {
            html = new Html(UrlUtils.fixAllRelativeHrefs(rawText, request.getUrl()));
        }
        return html;
    }

In the Html obtained this way, links in the original page that do not include a domain are automatically converted into full links starting with http[s].
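The effect of that rewriting can be imitated with a regex and URI.resolve. This is a simplified sketch and not the UrlUtils implementation: the class name is hypothetical, it shares only the method name with the library, and it handles only double-quoted href attributes.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Relative link rewriting sketch (hypothetical class, simplified compared with UrlUtils).
class RelativeHrefSketch {
    private static final Pattern HREF =
            Pattern.compile("(href\\s*=\\s*\")([^\"]*)(\")", Pattern.CASE_INSENSITIVE);

    static String fixAllRelativeHrefs(String html, String pageUrl) {
        URI base = URI.create(pageUrl);
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String target = m.group(2);
            try {
                // Resolve against the page URL; absolute links come back unchanged.
                target = base.resolve(target).toString();
            } catch (IllegalArgumentException e) {
                // Not a valid URI reference: leave the link as it was.
            }
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + target + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"/list\">list</a> <a href=\"detail.html\">detail</a>";
        System.out.println(fixAllRelativeHrefs(html, "http://example.com/news/index.html"));
    }
}
```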

That concludes the analysis of Downloader. More may be added later; the topic of the next article is yet to be decided.

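Copyright of this article belongs to the author; do not reproduce it without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reproducing, please cite the address of this article: http://systransis.cn/yun/66880.html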
