# Introduction to HtmlParser
<div><p style="color: rgb(54, 46, 43); font-family: Arial;"><span style="font-size:24px;">1. Related resources</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
Official documentation: http://htmlparser.sourceforge.net/samples.html</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
API reference: http://htmlparser.sourceforge.net/javadoc/index.html</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
Other HTML parsers: jsoup and others. HtmlParser has not been updated since 2006, so many people now recommend jsoup as a replacement.</p><p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><span style="font-size:24px;">2. Key steps for using HtmlParser</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(1) Create a parser with the Parser class</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(2) Create a Filter or a Visitor</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(3) Use the parser with the filter or visitor to collect all matching nodes</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(4) Process the contents of those nodes</p><p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><span style="font-size:24px;">3. Creating a parser with the Parser constructors</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;"></p><table border="1" cellpadding="2" cellspacing="0" width="100%" style="color: rgb(0, 0, 0); font-family: Simsun; font-size: 14px;"><tbody><tr style="background-color:rgb(238,238,238);"><td style="height: 41px;"><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>()</code> <br>
Zero argument constructor.</td></tr><tr style="background-color:rgb(238,238,238);"><td><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28org.htmlparser.lexer.Lexer%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>(<a title="class in org.htmlparser.lexer" href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/lexer/Lexer.html" target="_blank" style="color:rgb(106,57,6);">Lexer</a> lexer)</code> <br>
Construct a parser using the provided lexer.</td></tr><tr style="background-color:rgb(238,238,238);"><td><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28org.htmlparser.lexer.Lexer,%20org.htmlparser.util.ParserFeedback%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>(<a title="class in org.htmlparser.lexer" href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/lexer/Lexer.html" target="_blank" style="color:rgb(106,57,6);">Lexer</a> lexer, <a title="interface in org.htmlparser.util" href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/util/ParserFeedback.html" target="_blank" style="color:rgb(106,57,6);">ParserFeedback</a> fb)</code> <br>
Construct a parser using the provided lexer and feedback object.</td></tr><tr style="background-color:rgb(238,238,238);"><td><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28java.lang.String%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>(<a title="class or interface in java.lang" href="http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html" target="_blank" style="color:rgb(106,57,6);">String</a> resource)</code> <br>
Creates a Parser object with the location of the resource (URL or file).</td></tr><tr style="background-color:rgb(238,238,238);"><td><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28java.lang.String,%20org.htmlparser.util.ParserFeedback%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>(<a title="class or interface in java.lang" href="http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html" target="_blank" style="color:rgb(106,57,6);">String</a> resource, <a title="interface in org.htmlparser.util" href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/util/ParserFeedback.html" target="_blank" style="color:rgb(106,57,6);">ParserFeedback</a> feedback)</code> <br>
Creates a Parser object with the location of the resource (URL
or file). You would typically create a DefaultHTMLParserFeedback object
and pass it in.</td></tr><tr style="background-color:rgb(238,238,238);"><td><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28java.net.URLConnection%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>(<a title="class or interface in java.net" href="http://java.sun.com/j2se/1.4.2/docs/api/java/net/URLConnection.html" target="_blank" style="color:rgb(106,57,6);">URLConnection</a> connection)</code> <br>
Construct a parser using the provided URLConnection.</td></tr><tr style="background-color:rgb(238,238,238);"><td><code><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/Parser.html#Parser%28java.net.URLConnection,%20org.htmlparser.util.ParserFeedback%29" target="_blank" style="color:rgb(106,57,6);">Parser</a></strong>(<a title="class or interface in java.net" href="http://java.sun.com/j2se/1.4.2/docs/api/java/net/URLConnection.html" target="_blank" style="color:rgb(106,57,6);">URLConnection</a> connection, <a title="interface in org.htmlparser.util" href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/util/ParserFeedback.html" target="_blank" style="color:rgb(106,57,6);">ParserFeedback</a> fb)</code> <br>
Constructor for custom HTTP access.</td></tr></tbody></table><span style="color: rgb(54, 46, 43); font-family: Arial;">Most users initialize a Parser from a </span><span style="color: blue; font-family: Arial;">URLConnection</span><span style="color: rgb(54, 46, 43); font-family: Arial;"> or from a string, or create the Parser object with a static factory method such as Parser.createParser, which takes the page content as a string. Note that, per the table above, the String constructor expects a resource location (a URL or file). </span><span style="color: blue; font-family: Arial;">ParserFeedback</span><span style="color: rgb(54, 46, 43); font-family: Arial;"> is simple code intended for debugging and for tracing the parsing process, and it rarely needs to be changed. Using a </span><span style="color: green; font-family: Arial;">Lexer</span><span style="color: rgb(54, 46, 43); font-family: Arial;"> is a more advanced topic and is left for later.</span><br style="color: rgb(54, 46, 43); font-family: Arial;"><span style="color: rgb(54, 46, 43); font-family: Arial;">One interesting point: if the page encoding must be specified when the parser is created, the static factory method is the only way to do so without a Lexer. For most Chinese-language pages this is probably the approach used most often.</span><br style="color: rgb(54, 46, 43); font-family: Arial;"><p style="color: rgb(54, 46, 43); font-family: Arial;"></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><span style="font-size:24px;">4. HtmlParser stores each node's information in Node objects</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><img src="http://note.youdao.com/yws/res/10738/977917BD60E34D578F9EB0747420F7BB" data-media-type="image" /><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(1) Methods for navigating between nodes<br>
Node <span style="color:blue;">getParent</span>(): returns the parent node<br>
NodeList <span style="color:blue;">getChildren</span>(): returns the list of child nodes<br>
Node <span style="color:blue;">getFirstChild</span>(): returns the first child node<br>
Node <span style="color:blue;">getLastChild</span>(): returns the last child node<br>
Node <span style="color:blue;">getPreviousSibling</span>(): returns the previous sibling node<br>
Node <span style="color:blue;">getNextSibling</span>(): returns the next sibling node<br>
(2) Methods for reading the content of a <span style="color:fuchsia;">Node</span><br>
String <span style="color:blue;">getText</span>(): returns the node's text<br>
String <span style="color:blue;">toPlainTextString</span>(): returns the plain-text content<br>
String <span style="color:blue;">toHtml</span>(): returns the <span style="color:green;">HTML</span> (the raw <span style="color:green;">HTML</span>)<br>
String <span style="color:blue;">toHtml</span>(boolean verbatim): returns the <span style="color:green;">HTML</span> (the raw <span style="color:green;">HTML</span>)<br>
String <span style="color:blue;">toString</span>(): returns a string representation of the node, intended mainly for printing and debugging rather than as raw <span style="color:green;">HTML</span><br>
Page <span style="color:blue;">getPage</span>(): returns the <span style="color:green;">Page</span> object that this <span style="color:green;">Node</span> belongs to<br>
int <span style="color:blue;">getStartPosition</span>(): returns the start position of this <span style="color:green;">Node</span> in the <span style="color:green;">HTML</span> page<br>
int <span style="color:blue;">getEndPosition</span>(): returns the end position of this <span style="color:green;">Node</span> in the <span style="color:green;">HTML</span> page</p><p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><span style="font-size:24px;">5. Using Filters to access Nodes and their content</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><span style="font-size:18px;">(1) Types of Filter</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
As the name suggests, a Filter filters the results so that only the content you need is kept.</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
Every Filter implements the NodeFilter interface, which declares a single method, boolean accept(Node node), that decides whether a given node passes the Filter.</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
HTMLParser defines 16 different Filters in the org.htmlparser.filters package, which fall into a few groups.<br><span style="color:green;"><a href="http://www.baizeju.com/html/HTMLParser/200807/07-121.html#%E5%88%A4%E6%96%AD%E7%B1%BBFilter" target="_blank" style="color:rgb(16,138,198);"><strong>Predicate <span style="color:green;">Filter</span>s:</strong></a></span><br><span style="color:blue;">TagNameFilter</span><span style="color:blue;"><br>
HasAttributeFilter</span><br>
HasChildFilter<br>
HasParentFilter<br>
HasSiblingFilter<br>
IsEqualFilter<br><span style="color:green;"><a href="http://www.baizeju.com/html/HTMLParser/200807/07-121.html#%E9%80%BB%E8%BE%91%E8%BF%90%E7%AE%97Filter" target="_blank" style="color:rgb(16,138,198);"><strong>Logical <span style="color:green;">Filter</span>s:</strong></a></span><br><span style="color:blue;">AndFilter</span><span style="color:blue;"><br>
NotFilter</span><br>
OrFilter<br>
XorFilter<br><span style="color:green;"><a href="http://www.baizeju.com/html/HTMLParser/200807/07-121.html#%E5%85%B6%E4%BB%96Filter" target="_blank" style="color:rgb(16,138,198);"><strong>Other <span style="color:green;">Filter</span>s:</strong></a></span><br><span style="color:blue;">NodeClassFilter</span><span style="color:blue;"><br>
StringFilter</span><br>
LinkStringFilter<br>
LinkRegexFilter<br>
RegexFilter<br>
CssSelectorNodeFilter</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
Beyond these, you can define your own Filters for special filtering needs.<br><span style="font-size:18px;">(2) Filter usage example</span></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
The following example extracts the links from an HTML document.</p><div style="background-color:rgb(231,229,220);color:rgb(54,46,43);font-family:Consolas,'Courier New',Courier,mono,serif;"><pre><code>package org.ljh.search.html;
import java.util.HashSet;
import java.util.Set;
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
// Utility class for parsing HTML documents.
public class HtmlParserTool {
    // Extracts the links embedded in the HTML document at the given URL.
    public static Set&lt;String&gt; extractLinks(String url, LinkFilter filter) {
        Set&lt;String&gt; links = new HashSet&lt;String&gt;();
        try {
            // 1. Construct a Parser and set its properties.
            Parser parser = new Parser(url);
            parser.setEncoding("gb2312");
            // 2.1 Custom Filter that matches &lt;frame&gt; tags; their src attribute is read in step 4.
            NodeFilter frameNodeFilter = new NodeFilter() {
                @Override
                public boolean accept(Node node) {
                    return node.getText().startsWith("frame src=");
                }
            };
            // 2.2 Second Filter that matches &lt;a&gt; tags.
            NodeFilter aNodeFilter = new NodeClassFilter(LinkTag.class);
            // 2.3 Combine the two Filters into one logical OR Filter.
            OrFilter linkFilter = new OrFilter(frameNodeFilter, aNodeFilter);
            // 3. Use the parser with the filter to collect all matching nodes.
            NodeList nodeList = parser.extractAllNodesThatMatch(linkFilter);
            // 4. Process each Node.
            for (int i = 0; i &lt; nodeList.size(); i++) {
                Node node = nodeList.elementAt(i);
                String linkURL = "";
                if (node instanceof LinkTag) {
                    // The link is an &lt;a&gt; tag.
                    LinkTag link = (LinkTag) node;
                    linkURL = link.getLink();
                } else {
                    // The link is a &lt;frame&gt; tag: cut the quoted value out of src="...".
                    String nodeText = node.getText();
                    int beginPosition = nodeText.indexOf("src=");
                    nodeText = nodeText.substring(beginPosition);
                    int endPosition = nodeText.indexOf(" ");
                    if (endPosition == -1) {
                        endPosition = nodeText.indexOf("&gt;");
                    }
                    if (endPosition == -1) {
                        // Nothing follows the value, so it runs to the end of the text.
                        endPosition = nodeText.length();
                    }
                    linkURL = nodeText.substring(5, endPosition - 1);
                }
                // Keep only the URLs that fall within the scope of this crawl.
                if (filter.accept(linkURL)) {
                    links.add(linkURL);
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}</code></pre></div><p style="color: rgb(54, 46, 43); font-family: Arial;">
Notes on the program:</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(1) Node#getText() returns the node's text as a String.</p><p style="color: rgb(54, 46, 43); font-family: Arial;">
(2) node instanceof LinkTag matches an &lt;a&gt; node. There are many similar node classes, such as TableTag; nearly every common HTML tag has a corresponding tag class. The official documentation describes the packages as follows:</p><p style="color: rgb(54, 46, 43); font-family: Arial;"></p><table border="1" cellpadding="2" cellspacing="0" width="100%" style="color: rgb(0, 0, 0); font-family: Simsun; font-size: 14px;"><tbody><tr><td><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/nodes/package-summary.html" target="_blank" style="color:rgb(106,57,6);">org.htmlparser.nodes</a></strong></td><td>The nodes package has the concrete node implementations.</td></tr><tr><td><strong><a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/tags/package-summary.html" target="_blank" style="color:rgb(106,57,6);">org.htmlparser.tags</a></strong></td><td>The tags package contains specific tags.</td></tr></tbody></table><span style="color: rgb(54, 46, 43); font-family: Arial;">An instanceof check can therefore be used to test directly whether a node is a particular tag.</span><p style="color: rgb(54, 46, 43); font-family: Arial;"></p><p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
The LinkFilter interface used above is defined as follows:</p><div style="background-color:rgb(231,229,220);color:rgb(54,46,43);font-family:Consolas,'Courier New',Courier,mono,serif;"><pre><code>package org.ljh.search.html;
// Filter that decides whether a URL falls within the scope of this crawl.
public interface LinkFilter {
    public boolean accept(String url);
}</code></pre></div><p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
The test program is as follows:</p><div style="background-color:rgb(231,229,220);color:rgb(54,46,43);font-family:Consolas,'Courier New',Courier,mono,serif;"><pre><code>package org.ljh.search.html;
import java.util.Set;
import org.junit.Test;
public class HtmlParserToolTest {
    @Test
    public void testExtractLinks() {
        String url = "http://www.baidu.com";
        // Accept only URLs that contain "baidu".
        LinkFilter linkFilter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                return url.contains("baidu");
            }
        };
        Set&lt;String&gt; urlSet = HtmlParserTool.extractLinks(url, linkFilter);
        // Print every extracted link.
        for (String link : urlSet) {
            System.out.println(link);
        }
    }
}</code></pre></div><p style="color: rgb(54, 46, 43); font-family: Arial;"><br></p><span style="color: rgb(54, 46, 43); font-family: Arial;">The output is as follows:</span><p style="color: rgb(54, 46, 43); font-family: Arial;"></p><p style="color: rgb(54, 46, 43); font-family: Arial;">
http://www.hao123.com<br>
http://www.baidu.com/<br>
http://www.baidu.com/duty/<br>
http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=<br>
http://music.baidu.com<br>
http://ir.baidu.com<br>
http://www.baidu.com/gaoji/preferences.html<br>
http://news.baidu.com<br>
http://map.baidu.com<br>
http://music.baidu.com/search?fr=ps&key=<br>
http://image.baidu.com<br>
http://zhidao.baidu.com<br>
http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=<br>
http://www.baidu.com/more/<br>
http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w<br>
http://wenku.baidu.com<br>
http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=<br>
https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F<br>
http://www.baidu.com/cache/sethelp/index.html<br>
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt<br>
http://tieba.baidu.com/f?kw=&fr=wwwt<br>
http://home.baidu.com<br>
https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F<br>
http://v.baidu.com<br>
http://e.baidu.com/?refer=888<br>
http://tieba.baidu.com<br>
http://baike.baidu.com<br>
http://wenku.baidu.com/search?word=&lm=0&od=0<br>
http://top.baidu.com<br>
http://map.baidu.com/m?word=&fr=ps01000</p></div>
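
The test above scopes the crawl with `url.contains("baidu")`, which matches the text anywhere in the URL, including the path and the query string. If the intent is to keep only links on a given site, one possible alternative is to compare the host name instead. The sketch below is an illustration only, not part of HtmlParser or of the original article: `HostLinkFilter` and `HostLinkFilterDemo` are hypothetical names, the `LinkFilter` interface is repeated only so that the block compiles on its own, and it uses nothing beyond the JDK's `java.net.URI`.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Same shape as the LinkFilter interface shown earlier in the article.
interface LinkFilter {
    boolean accept(String url);
}

// Hypothetical helper: accepts a URL only when its host is the given
// domain or one of its subdomains.
final class HostLinkFilter implements LinkFilter {
    private final String domain;

    HostLinkFilter(String domain) {
        this.domain = domain.toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean accept(String url) {
        if (url == null) {
            return false;
        }
        try {
            String host = new URI(url.trim()).getHost();
            if (host == null) {
                return false; // relative links, "javascript:" links, and so on
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.equals(domain) || host.endsWith("." + domain);
        } catch (URISyntaxException e) {
            return false; // malformed URLs are treated as out of scope
        }
    }
}

class HostLinkFilterDemo {
    public static void main(String[] args) {
        LinkFilter filter = new HostLinkFilter("baidu.com");
        System.out.println(filter.accept("http://news.baidu.com"));          // true
        System.out.println(filter.accept("http://www.hao123.com"));          // false
        System.out.println(filter.accept("http://example.com/?from=baidu")); // false
    }
}
```

An instance of this filter could be passed to `HtmlParserTool.extractLinks(url, filter)` in place of the anonymous class used in the test. Whether host matching or substring matching is the better fit depends on how the crawl scope is defined.
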