Crawling Zongheng Novel Intro Pages · Lucene Case Development

Please credit the source when reposting: http://blog.csdn.net/xiaojimanman/article/details/44851419

http://www.llwjy.com/blogdetail/1b5ae17c513d127838c2e02102b5bb87.html

My personal blog is now online at [www.llwjy.com](http://www.llwjy.com). Comments and criticism are welcome.

-------------------------------------------------------------------------------------------------

In the previous post we did a simple crawl of the update list page of Zongheng Chinese novels and obtained the URLs of the novel intro pages. This post therefore covers how to collect the information on a Zongheng intro page. Example address: http://book.zongheng.com/book/362857.html

**Page analysis**

Before starting, I suggest you take a look at the intro page yourself. The screenshot below shows only the region that holds the information we want to collect.

![img](https://box.kancloud.cn/2016-02-22_56ca7bf05a44f.jpg)

From this region we need the book title, author name, category, word count, description, latest chapter name, chapter list page URL, and tags. Right-clicking the page and choosing "View page source" reveals the following:

![img](https://box.kancloud.cn/2016-02-22_56ca7bf0800e6.jpg)

For SEO on 360 search, Zongheng places some of a novel's key information in the `head`, which greatly reduces the complexity of the regular expressions we have to write. Because these expressions are nearly identical, only the book title is explained here; the others can be found in the source code further down. The book title is on **line 33** of the screenshot above, and we need to extract the **飛仙訣** value in the middle, so the regular expression is `<meta name="og:novel:book_name" content="(.*?)"/>`. The other fields use expressions of the same shape. From this part of the source we can easily get the book title, author name, latest chapter, description, category, and chapter list page URL. For the tags and the word count we have to look further down the page source. A quick search turns up the snippet in the screenshot below, which contains both attributes.

![img](https://box.kancloud.cn/2016-02-22_56ca7bf0aa64f.jpg)

The word count can be extracted with the simple regular expression `<span itemprop="wordCount">(\d*?)</span>`. The tags take two steps:

**Step 1**: get the HTML fragment that contains the keywords, which is **line 234** in the screenshot above. The regular expression for this step is `<div class="keyword">(.*?)</div>`.

**Step 2**: extract the wanted content from the fragment obtained in step 1. The regular expression for this step is `<a.*?>(.*?)</a>`.

**Code implementation**

All pages other than the update list page are crawled by classes that extend `CrawlBase`. For how to disguise the request, see the previous post. Here we focus on two methods of the `DoRegex` class.

**Method 1:**

~~~
String getFirstString(String dealStr, String regexStr, int n)
~~~

The first parameter is the string to process, here the page source. The second parameter is the regular expression to search for. The third parameter is the index of the capture group in that regular expression whose content should be extracted. The method finds the first match of the regular expression in the given string and returns the content of the specified group.

**Method 2:**

~~~
String getString(String dealStr, String regexStr, String splitStr, int n)
~~~

Parameters 1, 2, and 4 correspond to parameters 1, 2, and 3 of method 1, and `splitStr` is a separator. The method finds every match of the regular expression in the given string and returns the extracted contents joined by the separator.

**Run result**

![](https://box.kancloud.cn/2016-02-22_56ca7bf0cfe92.jpg)

**Source code**

With the two methods above explained, the source code below should be easy to follow.

~~~
/**
 * @Description: intro page
 * @Author: lulei
 */
package com.lulei.crawl.novel.zongheng;

import java.io.IOException;
import java.util.HashMap;

import com.lulei.crawl.CrawlBase;
import com.lulei.util.DoRegex;
import com.lulei.util.ParseUtil;

public class IntroPage extends CrawlBase {
    // Note: the trailing space after "/>" is part of each meta pattern as published.
    private static final String NAME = "<meta name=\"og:novel:book_name\" content=\"(.*?)\"/> ";
    private static final String AUTHOR = "<meta name=\"og:novel:author\" content=\"(.*?)\"/> ";
    private static final String DESC = "<meta property=\"og:description\" content=\"(.*?)\"/> ";
    private static final String TYPE = "<meta name=\"og:novel:category\" content=\"(.*?)\"/> ";
    private static final String LATESTCHAPTER = "<meta name=\"og:novel:latest_chapter_name\" content=\"(.*?)\"/> ";
    private static final String CHAPTERLISTURL = "<meta name=\"og:novel:read_url\" content=\"(.*?)\"/> ";
    private static final String WORDCOUNT = "<span itemprop=\"wordCount\">(\\d*?)</span>";
    // KEYWORDS selects the tag container; KEYWORD selects each tag inside it.
    private static final String KEYWORDS = "<div class=\"keyword\">(.*?)</div>";
    private static final String KEYWORD = "<a.*?>(.*?)</a>";

    private String pageUrl;
    private static HashMap<String, String> params;

    /**
     * Add request headers so the request looks like it comes from a browser.
     */
    static {
        params = new HashMap<String, String>();
        params.put("Referer", "http://book.zongheng.com");
        params.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36");
    }

    public IntroPage(String url) throws IOException {
        readPageByGet(url, "utf-8", params);
        this.pageUrl = url;
    }

    /** @Description: get the book title */
    private String getName() {
        return DoRegex.getFirstString(getPageSourceCode(), NAME, 1);
    }

    /** @Description: get the author name */
    private String getAuthor() {
        return DoRegex.getFirstString(getPageSourceCode(), AUTHOR, 1);
    }

    /** @Description: book description */
    private String getDesc() {
        return DoRegex.getFirstString(getPageSourceCode(), DESC, 1);
    }

    /** @Description: book category */
    private String getType() {
        return DoRegex.getFirstString(getPageSourceCode(), TYPE, 1);
    }

    /** @Description: latest chapter */
    private String getLatestChapter() {
        return DoRegex.getFirstString(getPageSourceCode(), LATESTCHAPTER, 1);
    }

    /** @Description: chapter list page URL */
    private String getChapterListUrl() {
        return DoRegex.getFirstString(getPageSourceCode(), CHAPTERLISTURL, 1);
    }

    /** @Description: word count (0 if it cannot be parsed) */
    private int getWordCount() {
        String wordCount = DoRegex.getFirstString(getPageSourceCode(), WORDCOUNT, 1);
        return ParseUtil.parseStringToInt(wordCount, 0);
    }

    /** @Description: tags, joined by a single space */
    private String keyWords() {
        // Step 1: cut out the keyword container; step 2: pull each tag out of it.
        String keyHtml = DoRegex.getFirstString(getPageSourceCode(), KEYWORDS, 1);
        return DoRegex.getString(keyHtml, KEYWORD, " ", 1);
    }

    public static void main(String[] args) throws IOException {
        IntroPage intro = new IntroPage("http://book.zongheng.com/book/362857.html");
        System.out.println(intro.pageUrl);
        System.out.println(intro.getName());
        System.out.println(intro.getAuthor());
        System.out.println(intro.getDesc());
        System.out.println(intro.getType());
        System.out.println(intro.getLatestChapter());
        System.out.println(intro.getChapterListUrl());
        System.out.println(intro.getWordCount());
        System.out.println(intro.keyWords());
    }
}
~~~

----------------------------------------------------------------------------------------------------

P.S. I recently noticed that other sites may repost this blog without a link to the source. For more posts on [Lucene-based case development](http://www.llwjy.com/blogtype/lucene.html), [click here](http://blog.csdn.net/xiaojimanman/article/category/2841877), or visit http://blog.csdn.net/xiaojimanman/article/category/2841877 or http://www.llwjy.com/blogtype/lucene.html
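Supplementary sketch: the `DoRegex` class itself is not listed in this post, so the following is only a guess at how the two methods described under "Code implementation" could be written with `java.util.regex`. The class name `DoRegexSketch`, the use of `Pattern.DOTALL`, the trimming of results, and returning an empty string when nothing matches are all assumptions of this sketch, not the author's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for com.lulei.util.DoRegex; behavior is assumed.
final class DoRegexSketch {

    /** Returns capture group n of the FIRST match, or "" if there is none. */
    static String getFirstString(String dealStr, String regexStr, int n) {
        if (dealStr == null || regexStr == null || n < 0) {
            return "";
        }
        // DOTALL lets "." cross line breaks in the page source (an assumption).
        Matcher m = Pattern.compile(regexStr, Pattern.DOTALL).matcher(dealStr);
        if (m.find() && n <= m.groupCount()) {
            String group = m.group(n);
            return group == null ? "" : group.trim();
        }
        return "";
    }

    /** Returns capture group n of EVERY match, joined by splitStr. */
    static String getString(String dealStr, String regexStr, String splitStr, int n) {
        if (dealStr == null || regexStr == null || n < 0) {
            return "";
        }
        String separator = splitStr == null ? "" : splitStr;
        Matcher m = Pattern.compile(regexStr, Pattern.DOTALL).matcher(dealStr);
        StringBuilder joined = new StringBuilder();
        boolean first = true;
        while (m.find() && n <= m.groupCount()) {
            String group = m.group(n);
            if (group == null) {
                continue; // the group did not take part in this match
            }
            if (!first) {
                joined.append(separator);
            }
            joined.append(group.trim());
            first = false;
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Made-up HTML fragment shaped like the keyword block described in the post.
        String html = "<div class=\"keyword\"><a href=\"/t/1\">tagA</a> <a href=\"/t/2\">tagB</a></div>";
        // Step 1: isolate the container; step 2: collect every tag inside it.
        String keyHtml = getFirstString(html, "<div class=\"keyword\">(.*?)</div>", 1);
        System.out.println(getString(keyHtml, "<a.*?>(.*?)</a>", " ", 1)); // prints "tagA tagB"
    }
}
```

Running the tag regex only on the fragment returned by step 1, rather than on the whole page, keeps `<a.*?>(.*?)</a>` from picking up every unrelated link on the page.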