Crawling the Zongheng Novel Reading Page · Lucene Case Development

Please credit the source when reposting: [http://blog.csdn.net/xiaojimanman/article/details/44937073](http://blog.csdn.net/xiaojimanman/article/details/44937073) [http://www.llwjy.com/blogdetail/29bd8de30e8d17871c707b76ec3212b0.html](http://www.llwjy.com/blogdetail/29bd8de30e8d17871c707b76ec3212b0.html)

My personal blog is now online at [www.llwjy.com](http://www.llwjy.com). Feedback is welcome.

---

The previous three posts covered crawling Zongheng's update list page, introduction page, and chapter list page. This post focuses on the most important one: the reading page. As before, we work from a single example URL: http://book.zongheng.com/chapter/362857/6001264.html .

**Page analysis**

The page at the URL above looks like this:

![img](https://box.kancloud.cn/2016-02-22_56ca7bf1d830c.jpg)

As with the chapter list page, the source of the reading page cannot be obtained with a simple right-click --> View Page Source, so we again use **F12 --> Network --> Ctrl+F5** to find it. The result looks like this:

![img](https://box.kancloud.cn/2016-02-22_56ca7bf217a75.jpg)

A quick search through the page source shows that the title, word count, and chapter content sit on **lines 47, 141, and 145** respectively (the line numbers may differ slightly from page to page, so check them against the page you are actually working with). The regular expressions for these three fields are much like the earlier ones, and the method for writing them was covered in the previous posts, so only the final results are given here:

~~~
// Regex for the chapter content
private static final String CONTENT = "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">(.*?)</div>";
// Regex for the title
private static final String TITLE = "chapterName=\"(.*?)\"";
// Regex for the word count
private static final String WORDCOUNT = "itemprop=\"wordCount\">(\\d*)</span>";
~~~

**Run result**

![img](https://box.kancloud.cn/2016-02-22_56ca7bf25c180.jpg)

Looking at the screenshot of the run result, you may notice that the chapter content still contains some HTML tags. Because this case study ultimately displays the content in a web page, we take a shortcut and leave them in. If you need to remove them, you can do so directly with String's replaceAll method.

**Source code**

For the latest source code, visit: http://www.llwjy.com/source/com.lulei.crawl.novel.zongheng.ReadPage.html

~~~
/**
 * @Description: reading page
 */
package com.lulei.crawl.novel.zongheng;

import java.io.IOException;
import java.util.HashMap;

import com.lulei.crawl.CrawlBase;
import com.lulei.util.DoRegex;
import com.lulei.util.ParseUtil;

public class ReadPage extends CrawlBase {
    // Regex for the chapter content
    private static final String CONTENT = "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">(.*?)</div>";
    // Regex for the title
    private static final String TITLE = "chapterName=\"(.*?)\"";
    // Regex for the word count
    private static final String WORDCOUNT = "itemprop=\"wordCount\">(\\d*)</span>";
    private String pageUrl;
    private static HashMap<String, String> params;

    /**
     * Add request headers so the request looks like it comes from a browser
     */
    static {
        params = new HashMap<String, String>();
        params.put("Referer", "http://book.zongheng.com");
        params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
    }

    public ReadPage(String url) throws IOException {
        readPageByGet(url, "utf-8", params);
        this.pageUrl = url;
    }

    /**
     * @return
     * @Author:lulei
     * @Description: chapter title
     */
    private String getTitle() {
        return DoRegex.getFirstString(getPageSourceCode(), TITLE, 1);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: word count (0 if it cannot be parsed)
     */
    private int getWordCount() {
        String wordCount = DoRegex.getFirstString(getPageSourceCode(), WORDCOUNT, 1);
        return ParseUtil.parseStringToInt(wordCount, 0);
    }

    /**
     * @return
     * @Author:lulei
     * @Description: chapter body
     */
    private String getContent() {
        return DoRegex.getFirstString(getPageSourceCode(), CONTENT, 1);
    }

    public static void main(String[] args) throws IOException {
        ReadPage readPage = new ReadPage("http://book.zongheng.com/chapter/362857/6001264.html");
        System.out.println(readPage.pageUrl);
        System.out.println(readPage.getTitle());
        System.out.println(readPage.getWordCount());
        System.out.println(readPage.getContent());
    }
}
~~~

---

ps: I have recently noticed that other sites may repost this blog without a link back to the source. For more on [Lucene-based case development](http://www.llwjy.com/blogtype/lucene.html), please [click here](http://blog.csdn.net/xiaojimanman/article/category/2841877), or visit http://blog.csdn.net/xiaojimanman/article/category/2841877 or http://www.llwjy.com/blogtype/lucene.html
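
The post mentions that leftover HTML tags in the chapter content can be removed with String's replaceAll method, but does not show how. Below is a minimal sketch of one way to do it, together with a plain `java.util.regex` version of the first-match extraction. The class name `ReadPageRegexDemo`, the helpers `firstGroup` and `stripTags`, and the sample HTML are all hypothetical and are not part of the author's `com.lulei` code. The sketch also assumes the chapter body may span several lines, so it compiles the pattern with `Pattern.DOTALL`; how the author's `DoRegex` handles this is not shown in the post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReadPageRegexDemo {
    // The same three patterns used in the post.
    private static final String CONTENT = "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">(.*?)</div>";
    private static final String TITLE = "chapterName=\"(.*?)\"";
    private static final String WORDCOUNT = "itemprop=\"wordCount\">(\\d*)</span>";

    /**
     * Returns capture group 1 of the first match, or "" when nothing matches.
     * DOTALL lets '.' match line breaks (an assumption about the page layout).
     */
    public static String firstGroup(String source, String regex) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(source);
        return m.find() ? m.group(1) : "";
    }

    /**
     * Removes HTML tags with String.replaceAll: <br> and </p> become line
     * breaks first, then every remaining tag is dropped.
     */
    public static String stripTags(String html) {
        return html
                .replaceAll("(?i)<br\\s*/?>|</p>", "\n")
                .replaceAll("<[^>]+>", "")
                .trim();
    }

    public static void main(String[] args) {
        // Made-up sample page, shaped like the fields the post extracts.
        String page = "<h1 chapterName=\"Chapter One\"></h1>\n"
                + "<span itemprop=\"wordCount\">2048</span>\n"
                + "<div id=\"chapterContent\" class=\"content\" itemprop=\"acticleBody\">"
                + "<p>First paragraph.</p>\n<p>Second paragraph.</p></div>";

        System.out.println(firstGroup(page, TITLE));
        System.out.println(firstGroup(page, WORDCOUNT));
        System.out.println(stripTags(firstGroup(page, CONTENT)));
    }
}
```

A tag-stripping regex like this is fine for simple, well-formed chapter markup; for messier HTML (comments, scripts, `>` inside attribute values) an HTML parser would be the safer choice.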