##1. Preface

The previous article used a NetEase Weibo crawler as an example to walk through a very simple crawl, showing that a web crawler is really nothing mysterious. If the example seemed complicated at first glance, don't worry: it was only meant to give you a feel for the overall crawling process. In the rest of this series, each step will be dissected one by one.

The overall crawler workflow was already laid out in the previous article; if you haven't read it, see: [[Web Crawler][java] Weibo Crawler (1): NetEase Weibo crawler (crawl weibo data by custom keyword)](http://blog.csdn.net/dianacody/article/details/39584977)

To recap the process:

step1: Request the url and obtain the html as a string, using the httpclient-4.3.1 library with both the socket timeout and the connect timeout (connectTimeout) set. This step is covered in detail in this article.

step2: Validate the html obtained in step 1: check that it is legal HTML and an actual search-result page, because some requested html pages do not exist.

step3: Save the html string locally by writing it to a txt file.

step4: Parse the weibo data out of the txt file: userid, timestamp, and so on. Parsing is the real core; analyzing different page structures and extracting their features will be covered in detail in part three of this series.

step5: Store the parsed data in txt and xml, mainly using jsoup to parse the html and dom4j to read and write the xml; this will be covered in part four.

Part five will then present some ways to avoid being blocked, such as going through proxy IPs or resolving against a local IP database (provided you have one stored); more on that later.

##2. The HttpClient Toolkit

Anyone who has done web development should be familiar with this already, so not much needs to be said: it is a very basic toolkit, a programmatic HTTP client that can simulate a browser sending requests to an http server. HttpClient is one part of the HttpComponents (hc for short) project, and the components can be downloaded directly. Using HttpClient also requires HttpCore, which contains the code that encapsulates http requests and responses. It makes sending http requests from client code easy, and working with it also deepens your understanding of the http protocol.

The HttpComponents components can be downloaded here: [http://hc.apache.org/](http://hc.apache.org/). The directory structure after downloading:

![](https://box.kancloud.cn/2016-02-18_56c5641ba27ea.jpg)

A few points to pay attention to first:

1. **Releasing connections** after use matters in httpclient, just as releasing resources matters with database connections.

2. https sites use SSL-encrypted transport, so be careful when importing certificates.

3. Have a basic grasp of the **http protocol**: what return codes like 200, 301, 302, 400, 404, and 500 mean (this is the bare minimum), plus the cookie and session mechanisms (the "simulated login" method in part three of the later Python crawler series requires capturing and analyzing packets, mostly to inspect cookies and the like, so learn to analyze packets).

4. httpclient follows redirects automatically by default, which is a great convenience for developers most of the time (e.g. picking up cookies granted during authorization), but sometimes it must be controlled manually. For example, you may run into a CircularRedirectException; this happens when the Location value in the returned headers points back to a previously visited address (the port may differ), which can lead to an infinite loop of recursive redirects. In that case switch automatic redirects off manually, e.g. method.setFollowRedirects(false) in the old API (see the sketch after example (2) below for the 4.3-style equivalent).

5. Simulating a browser is vital for a crawler. Some sites first check whether a request comes from a browser and reject it outright if not; simply disguising the request as a browser is enough. So when fetching with httpclient, add something like this to the headers: header.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36");

6. When submitting data via a POST request, change the default encoding, otherwise the submitted data will come out garbled. Overriding postMethod's setContentCharSet() method is enough.

A few examples:

(1) Send a POST request to a local application and get different results depending on the parameters passed

~~~
public void post() {
    // Create a default httpClient instance
    CloseableHttpClient httpclient = HttpClients.createDefault();
    // Create an httpPost
    HttpPost httppost = new HttpPost("http://localhost:8088/weibo/Ajax/service.action");
    // Build the parameter list
    List<NameValuePair> formparams = new ArrayList<NameValuePair>();
    formparams.add(new BasicNameValuePair("name", "alice"));
    UrlEncodedFormEntity uefEntity;
    try {
        uefEntity = new UrlEncodedFormEntity(formparams, "utf-8");
        httppost.setEntity(uefEntity);
        System.out.println("executing request " + httppost.getURI());
        CloseableHttpResponse response = httpclient.execute(httppost);
        try {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                System.out.println("Response content: " + EntityUtils.toString(entity, "utf-8"));
            }
        } finally {
            response.close();
        }
    } catch (ClientProtocolException e) {
        e.printStackTrace();
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Close the connection and release resources
        try {
            httpclient.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
~~~

(2) Send a GET request

~~~
public void get() {
    CloseableHttpClient httpclient = HttpClients.createDefault();
    try {
        // Create an httpget
        HttpGet httpget = new HttpGet("http://www.baidu.com");
        System.out.println("executing request " + httpget.getURI());
        // Execute the GET request
        CloseableHttpResponse response = httpclient.execute(httpget);
        try {
            // Get the response entity
            HttpEntity entity = response.getEntity();
            // Response status
            System.out.println(response.getStatusLine());
            if (entity != null) {
                // Response content length
                System.out.println("response length: " + entity.getContentLength());
                // Response content
                System.out.println("response content: " + EntityUtils.toString(entity));
            }
        } finally {
            response.close();
        }
    } catch (ClientProtocolException e) {
        e.printStackTrace();
    } catch (ParseException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Close the connection and release resources
        try {
            httpclient.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
~~~
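As promised in note 4: in the 4.3-style API used throughout this article, automatic redirect handling is configured through RequestConfig rather than the old 3.x method.setFollowRedirects(false). A minimal sketch, with a placeholder URL standing in for a page that redirects:

~~~
import java.io.IOException;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class NoRedirectDemo {
    public static void main(String[] args) throws IOException {
        // Turn off automatic redirect handling, so 301/302 responses are
        // returned to the caller instead of being followed (this avoids
        // CircularRedirectException on misbehaving sites).
        RequestConfig config = RequestConfig.custom()
                .setRedirectsEnabled(false)
                .build();
        CloseableHttpClient httpClient = HttpClients.custom()
                .setDefaultRequestConfig(config)
                .build();
        // Placeholder URL -- substitute a page that actually redirects.
        HttpGet httpGet = new HttpGet("http://example.com/redirecting-page");
        CloseableHttpResponse response = httpClient.execute(httpGet);
        try {
            // With redirects disabled, the 301/302 status line and its
            // Location header are visible here.
            System.out.println(response.getStatusLine());
            if (response.getFirstHeader("Location") != null) {
                System.out.println("Location: " + response.getFirstHeader("Location").getValue());
            }
        } finally {
            response.close();
            httpClient.close();
        }
    }
}
~~~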
(3) Setting headers

For example, search for the keyword "httpclient" on Baidu, send the request, then press F12 in Chrome to open the developer tools and inspect the traffic in the Network tab. You can see the packet details there, such as the Request Headers of this request.

![](https://box.kancloud.cn/2016-02-18_56c5641bba3be.jpg)

Sometimes you need to **simulate a browser login**; setting the headers is all it takes, so adapt the ones shown here.

~~~
public void header() {
    HttpClient httpClient = new DefaultHttpClient();
    try {
        HttpGet httpget = new HttpGet("http://www.baidu.com");
        httpget.setHeader("Accept", "text/html, */*; q=0.01");
        httpget.setHeader("Accept-Encoding", "gzip, deflate, sdch");
        httpget.setHeader("Accept-Language", "zh-CN,zh;q=0.8");
        httpget.setHeader("Connection", "keep-alive");
        httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36");
        HttpResponse response = httpClient.execute(httpget);
        HttpEntity entity = response.getEntity();
        System.out.println(response.getStatusLine()); // status code
        if (entity != null) {
            System.out.println(entity.getContentLength());
            System.out.println(entity.getContent());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
~~~

##3. Getting the HTML Page from a URL

Everything above was preparation, mostly basic HttpClient usage; there is much more to it, covered in greater detail elsewhere online, but it is not the focus here. Now let's look at how to get the html page from a url. The method was in fact already described in the previous article: [[Web Crawler][java] Weibo Crawler (1): NetEase Weibo crawler (crawl weibo data by custom keyword)](http://blog.csdn.net/dianacody/article/details/39584977)

Sina Weibo and NetEase Weibo: **(pay particular attention to the addresses and parameters here!)**

Sina Weibo topic search address: http://s.weibo.com/weibo/蘋果手機&nodup=1&page=50

NetEase Weibo topic search address: http://t.163.com/tag/蘋果手機

The parameters &nodup and &page=50 refer to the html pages returned by the search: with page=50 the crawl starts from the 50th result page. Change the parameter values to crawl a different number of pages.

Three methods are written here: one that sets a user cookie policy, a plain default method, and a proxy-IP method. The basic idea is the same for all three; the key point is that RequestConfig and CloseableHttpClient's custom() allow you to customize the configuration.

~~~
/**
 * @note Three ways to connect to a url and fetch its html
 *       (a plain method, a custom-cookie method, and a proxy-IP method)
 * @author DianaCody
 * @since 2014-09-26 16:03
 */
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.URISyntaxException;
import java.text.ParseException;

import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.config.CookieSpecs;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.cookie.Cookie;
import org.apache.http.cookie.CookieOrigin;
import org.apache.http.cookie.CookieSpec;
import org.apache.http.cookie.CookieSpecProvider;
import org.apache.http.cookie.MalformedCookieException;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.DefaultProxyRoutePlanner;
import org.apache.http.impl.cookie.BestMatchSpecFactory;
import org.apache.http.impl.cookie.BrowserCompatSpec;
import org.apache.http.impl.cookie.BrowserCompatSpecFactory;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;

public class HTML {

    /** Default method */
    public String[] getHTML(String url) throws ClientProtocolException, IOException {
        String[] html = new String[2];
        html[1] = "null";
        RequestConfig requestConfig = RequestConfig.custom()
                .setSocketTimeout(5000)   // socket timeout
                .setConnectTimeout(5000)  // connect timeout
                .build();
        CloseableHttpClient httpClient = HttpClients.custom()
                .setDefaultRequestConfig(requestConfig)
                .build();
        HttpGet httpGet = new HttpGet(url);
        try {
            CloseableHttpResponse response = httpClient.execute(httpGet);
            html[0] = String.valueOf(response.getStatusLine().getStatusCode());
            html[1] = EntityUtils.toString(response.getEntity(), "utf-8");
            //System.out.println(html);
        } catch (IOException e) {
            System.out.println("----------Connection timeout--------");
        }
        return html;
    }

    /** getHTML() with a cookie policy, to prevent the "cookie rejected" problem
        where cookies are refused. Overload with 3 parameters: url, hostName, port */
    public String getHTML(String url, String hostName, int port) throws URISyntaxException, ClientProtocolException, IOException {
        // Use a user-defined cookie policy
        HttpHost proxy = new HttpHost(hostName, port);
        DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);
        CookieSpecProvider cookieSpecProvider = new CookieSpecProvider() {
            public CookieSpec create(HttpContext context) {
                return new BrowserCompatSpec() {
                    @Override
                    public void validate(Cookie cookie, CookieOrigin origin) throws MalformedCookieException {
                        // Accept every cookie without validation
                    }
                };
            }
        };
        Registry<CookieSpecProvider> r = RegistryBuilder
                .<CookieSpecProvider> create()
                .register(CookieSpecs.BEST_MATCH, new BestMatchSpecFactory())
                .register(CookieSpecs.BROWSER_COMPATIBILITY, new BrowserCompatSpecFactory())
                .register("easy", cookieSpecProvider)
                .build();
        RequestConfig requestConfig = RequestConfig.custom()
                .setCookieSpec("easy")
                .setSocketTimeout(5000)   // socket timeout
                .setConnectTimeout(5000)  // connect timeout
                .build();
        CloseableHttpClient httpClient = HttpClients.custom()
                .setDefaultCookieSpecRegistry(r)
                .setRoutePlanner(routePlanner)
                .build();
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(requestConfig);
        String html = "null"; // used to check whether the html was actually fetched
        try {
            CloseableHttpResponse response = httpClient.execute(httpGet);
            html = EntityUtils.toString(response.getEntity(), "utf-8");
        } catch (IOException e) {
            System.out.println("----Connection timeout----");
        }
        return html;
    }

    /** Proxy-IP method */
    public String getHTMLbyProxy(String targetUrl, String hostName, int port) throws ClientProtocolException, IOException {
        HttpHost proxy = new HttpHost(hostName, port);
        String html = "null";
        DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);
        RequestConfig requestConfig = RequestConfig.custom()
                .setSocketTimeout(5000)   // socket timeout
                .setConnectTimeout(5000)  // connect timeout
                .build();
        CloseableHttpClient httpClient = HttpClients.custom()
                .setRoutePlanner(routePlanner)
                .setDefaultRequestConfig(requestConfig)
                .build();
        HttpGet httpGet = new HttpGet(targetUrl);
        try {
            CloseableHttpResponse response = httpClient.execute(httpGet);
            int statusCode = response.getStatusLine().getStatusCode();
            if (statusCode == HttpStatus.SC_OK) { // status code 200: OK
                html = EntityUtils.toString(response.getEntity(), "gb2312");
            }
            response.close();
            //System.out.println(html); // print the returned html
        } catch (IOException e) {
            System.out.println("----Connection timeout----");
        }
        return html;
    }
}
~~~
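To tie the three methods together, here is a minimal usage sketch. It is not from the original article: the keyword, proxy host, and port are placeholders, and URL-encoding the keyword is my own assumption (the article passes the raw topic URL).

~~~
import java.net.URLEncoder;

public class HTMLDemo {
    public static void main(String[] args) throws Exception {
        HTML fetcher = new HTML();

        // Default method: returns {statusCode, body}.
        String keyword = URLEncoder.encode("蘋果手機", "utf-8"); // placeholder keyword
        String[] result = fetcher.getHTML("http://t.163.com/tag/" + keyword);
        System.out.println("status: " + result[0]);

        // Proxy-IP method: host and port are placeholder values.
        String viaProxy = fetcher.getHTMLbyProxy(
                "http://s.weibo.com/weibo/" + keyword + "&nodup=1&page=50",
                "127.0.0.1", 8087);
        System.out.println("proxy fetch length: " + viaProxy.length());
    }
}
~~~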
##4. Checking Whether an HTML Page Exists

Sometimes the requested html does not exist, as in the situation mentioned in the previous article, so add a check function here.

~~~
// Requires java.util.regex.Pattern and java.util.regex.Matcher.
private boolean isExistHTML(String html) throws InterruptedException {
    boolean isExist = false;
    // Matches the page's "no results" message, which the html embeds as
    // escaped unicode: \u6ca1\u6709... decodes to
    // "没有找到相关的微博呢,换个关键词试试吧!"
    // ("No related weibo found; try another keyword!")
    Pattern pNoResult = Pattern.compile("\\\\u6ca1\\\\u6709\\\\u627e\\\\u5230\\\\u76f8"
            + "\\\\u5173\\\\u7684\\\\u5fae\\\\u535a\\\\u5462\\\\uff0c\\\\u6362\\\\u4e2a"
            + "\\\\u5173\\\\u952e\\\\u8bcd\\\\u8bd5\\\\u5427\\\\uff01");
    Matcher mNoResult = pNoResult.matcher(html);
    if (!mNoResult.find()) {
        isExist = true;
    }
    return isExist;
}
~~~
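A hedged sketch of how these pieces might fit together: check the status code returned by getHTML() first, and only run the pattern check on pages that actually came back with 200. The method name fetchAndCheck is mine, not the article's, and it assumes the method sits in the same class as isExistHTML() above.

~~~
// Hypothetical glue code; assumes this method lives alongside isExistHTML().
public boolean fetchAndCheck(HTML fetcher, String url) throws IOException, InterruptedException {
    String[] result = fetcher.getHTML(url); // {statusCode, body}
    // A timed-out request leaves result[0] null, so guard before parsing.
    if (result[0] == null || Integer.parseInt(result[0]) != 200) {
        return false; // request failed or page missing
    }
    // Only a 200 page is worth scanning for the "no results" marker.
    return isExistHTML(result[1]);
}
~~~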
##5. Storing the HTML Strings Returned by the Weibo Crawl

Write all the html to local txt files.

~~~
/** Write each html string to a local txt file for storage */
public static void writeHTML2txt(String html, int num) throws IOException {
    String savePath = "e:/weibo/weibohtml/" + num + ".txt";
    File f = new File(savePath);
    FileWriter fw = new FileWriter(f);
    BufferedWriter bw = new BufferedWriter(fw);
    bw.write(html);
    bw.close(); // also flushes and closes the underlying FileWriter
}
~~~

The crawled html files:

![](https://box.kancloud.cn/2016-02-18_56c5641be67f9.jpg)

Looking at an individual html page, some data from the header:

![](https://box.kancloud.cn/2016-02-18_56c5641c0c680.jpg)

The weibo body data is in json format and contains the relevant fields:

![](https://box.kancloud.cn/2016-02-18_56c5641c1d063.jpg)

How to parse and extract the key data from it will be covered in the next article.

Original article; please credit the source when reprinting: [http://blog.csdn.net/dianacody/article/details/39695285](http://blog.csdn.net/dianacody/article/details/39695285)