Crawling the Zongheng Novel Chapter List · Lucene Case Development

Please credit the source when reposting: [http://blog.csdn.net/xiaojimanman/article/details/44854719](http://blog.csdn.net/xiaojimanman/article/details/44854719) [http://www.llwjy.com/blogdetail/ddcad68eeb91034247ffa331eb461213.html](http://www.llwjy.com/blogdetail/ddcad68eeb91034247ffa331eb461213.html)

My personal blog site is now online at [www.llwjy.com](http://www.llwjy.com). Feedback and criticism are welcome.

-------------------------------------------------------------------------------------------------

The previous two posts covered crawling the update list page and the introduction page of Zongheng Chinese novels. This post covers the chapter list page, which is the next hop obtained from the introduction page. Example URL: http://book.zongheng.com/showchapter/362857.html

**Page analysis**

Looking at the page, we can see that the part shown in the figure below holds the information we need to collect, along with the next-hop addresses.

![img](https://box.kancloud.cn/2016-02-22_56ca7bf11d7d5.jpg)

When we try right-click > View page source, we find that the page has disabled the right-click menu, so we need another way to view the source and analyze the page. On the current page, press **F12**. A new window opens, the same one that Inspect Element opens, as mentioned in earlier posts. Select the Network tab and press Ctrl + F5 to get the following screen:

![img](https://box.kancloud.cn/2016-02-22_56ca7bf13bdf8.jpg)

Click the entry highlighted in red to view the page source, as shown below:

![img](https://box.kancloud.cn/2016-02-22_56ca7bf15796a.jpg)

A quick look through the page source makes it easy to find where the chapter information lives, as shown below:

![img](https://box.kancloud.cn/2016-02-22_56ca7bf1861fe.jpg)

Each chapter's information is stored in a td tag, so the final regular expression for this part is `<td class="chapterBean" chapterId="\d*" chapterName="(.*?)" chapterLevel="\d*" wordNum="(.*?)" updateTime="(.*?)"><a href="(.*?)" title=".*?">`.

**Code implementation**

To collect the chapter list page information, we use the same approach as for the introduction page: create a CrawlBase subclass and let it do the crawling. For request disguising and related operations, see the post on the update list page. Here I only introduce one method of the DoRegex class:

~~~
List<String[]> getListArray(String dealStr, String regexStr, int[] array)
~~~

The first parameter is the string to search, the second is the regular expression, and the third gives the positions (capture group indices) within the regular expression of the information to extract. The method returns every match in the string that satisfies the expression.

**Results**

![img](https://box.kancloud.cn/2016-02-22_56ca7bf1ad3d6.jpg)

**Source code**

For the latest source code, visit: http://www.llwjy.com/source/com.lulei.crawl.novel.zongheng.ChapterPage.html

~~~
/**
 *@Description: chapter list page
 */
package com.lulei.crawl.novel.zongheng;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;

import com.lulei.crawl.CrawlBase;
import com.lulei.util.DoRegex;

public class ChapterPage extends CrawlBase {
    // Groups: 1 = chapter name, 2 = word count, 3 = update time, 4 = chapter URL
    private static final String CHAPTER = "<td class=\"chapterBean\" chapterId=\"\\d*\" chapterName=\"(.*?)\" chapterLevel=\"\\d*\" wordNum=\"(.*?)\" updateTime=\"(.*?)\"><a href=\"(.*?)\" title=\".*?\">";
    private static final int[] ARRAY = {1, 2, 3, 4};
    private static HashMap<String, String> params;

    /**
     * Add request headers to disguise the request
     */
    static {
        params = new HashMap<String, String>();
        params.put("Referer", "http://book.zongheng.com");
        params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
    }

    public ChapterPage(String url) throws IOException {
        readPageByGet(url, "utf-8", params);
    }

    /**
     * Return one String[] per chapter, in the order given by ARRAY
     */
    public List<String[]> getChaptersInfo() {
        return DoRegex.getListArray(getPageSourceCode(), CHAPTER, ARRAY);
    }

    public static void main(String[] args) throws IOException {
        ChapterPage chapterPage = new ChapterPage("http://book.zongheng.com/showchapter/362857.html");
        for (String[] ss : chapterPage.getChaptersInfo()) {
            for (String s : ss) {
                System.out.println(s);
            }
            System.out.println("----------------------------------------------------");
        }
    }
}
~~~

----------------------------------------------------------------------------------------------------

P.S. I recently noticed that other sites may repost this blog without a link to the source. For more on [Lucene-based case development](http://www.llwjy.com/blogtype/lucene.html), [click here](http://blog.csdn.net/xiaojimanman/article/category/2841877), or visit http://blog.csdn.net/xiaojimanman/article/category/2841877 or http://www.llwjy.com/blogtype/lucene.html
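
**Appendix: a sketch of getListArray**

The post describes `DoRegex.getListArray` only by its signature and parameters, so here is a minimal sketch of how such a method could work. This is not the actual `DoRegex` source. The class name `RegexListSketch` and the sample HTML row are made up for illustration, and the real method may differ, for example in its regex flags or null handling.

~~~java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for DoRegex; only the method signature comes from the post.
class RegexListSketch {

    /**
     * Collect, for every match of regexStr in dealStr, the capture groups
     * whose indices are listed in array (in that order).
     */
    static List<String[]> getListArray(String dealStr, String regexStr, int[] array) {
        List<String[]> result = new ArrayList<>();
        if (dealStr == null || regexStr == null || array == null) {
            return result; // assumption: empty list rather than an exception
        }
        Matcher matcher = Pattern.compile(regexStr).matcher(dealStr);
        while (matcher.find()) {
            String[] row = new String[array.length];
            for (int i = 0; i < array.length; i++) {
                row[i] = matcher.group(array[i]);
            }
            result.add(row);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample row shaped like the td tags described in the post.
        String html = "<td class=\"chapterBean\" chapterId=\"1\" chapterName=\"Chapter One\" "
                + "chapterLevel=\"0\" wordNum=\"3021\" updateTime=\"2015-04-01\">"
                + "<a href=\"http://example.com/1.html\" title=\"Chapter One\">";
        String regex = "<td class=\"chapterBean\" chapterId=\"\\d*\" chapterName=\"(.*?)\" "
                + "chapterLevel=\"\\d*\" wordNum=\"(.*?)\" updateTime=\"(.*?)\">"
                + "<a href=\"(.*?)\" title=\".*?\">";
        for (String[] row : getListArray(html, regex, new int[] {1, 2, 3, 4})) {
            System.out.println(String.join(" | ", row));
        }
    }
}
~~~

Passing the group indices as an array lets the caller choose both which groups to extract and in what order, which is why `ChapterPage` can reuse a single method for name, word count, update time, and URL.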