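# Example program: list all links
<div><div><div><p>This example program shows how to fetch a page from a URL and then extract all of its links, images, and other auxiliary content, printing the URL and text of each one.</p>
<p>To run the program, supply a single URL as its only argument.</p>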
<pre><code>package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        print("Fetching %s...", url);

        // Fetch and parse the page; the fetched URL becomes the document's base URI.
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");      // anchors that have an href
        Elements media = doc.select("[src]");        // any element with a src (img, script, ...)
        Elements imports = doc.select("link[href]"); // stylesheets, icons, ...

        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            // The "abs:" prefix resolves the attribute value against the base URI.
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }

        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(), link.attr("abs:href"), link.attr("rel"));
        }

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s> (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }

    // printf-style helper that always ends the line.
    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    // Shorten s to at most width characters, ending a shortened string with ".".
    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }
}
</code></pre>
<p><a href="http://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/ListLinks.java">org/jsoup/examples/ListLinks.java</a></p>
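<p>The program reads <code>abs:src</code> and <code>abs:href</code> so that relative references on the page are printed as absolute URLs. The sketch below illustrates the idea with the standard <code>java.net.URI</code> class instead of jsoup, so it runs without any dependency. The class name <code>AbsUrlSketch</code>, the helper <code>resolve</code>, and the example.com addresses are illustrative assumptions, and jsoup's own resolution may differ in edge cases.</p>

```java
import java.net.URI;

class AbsUrlSketch {
    // Resolve a possibly relative reference against the page's base URI.
    // Hypothetical helper: this is not jsoup's implementation.
    static String resolve(String baseUri, String ref) {
        return URI.create(baseUri).resolve(ref).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/news/index.html"; // assumed page URL
        System.out.println(resolve(base, "item?id=1"));              // relative to the page's directory
        System.out.println(resolve(base, "/images/s.gif"));          // relative to the site root
        System.out.println(resolve(base, "http://other.example/x")); // already absolute, left unchanged
    }
}
```

<p>With jsoup itself, the base URI is the URL that was fetched, which is why every entry in the program's output is a complete URL.</p>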
<h2>Example output</h2>
<pre><code>Fetching http://news.ycombinator.com/...
Media: (38)
* img: <http://ycombinator.com/images/y18.gif> 18x18 ()
* img: <http://ycombinator.com/images/s.gif> 10x1 ()
* img: <http://ycombinator.com/images/grayarrow.gif> x ()
* img: <http://ycombinator.com/images/s.gif> 0x10 ()
* script: <http://www.co2stats.com/propres.php?s=1138>
* img: <http://ycombinator.com/images/s.gif> 15x1 ()
* img: <http://ycombinator.com/images/hnsearch.png> x ()
* img: <http://ycombinator.com/images/s.gif> 25x1 ()
* img: <http://mixpanel.com/site_media/images/mixpanel_partner_logo_borderless.gif> x (Analytics by Mixpan.)
Imports: (2)
* link <http://ycombinator.com/news.css> (stylesheet)
* link <http://ycombinator.com/favicon.ico> (shortcut icon)
Links: (141)
* a: <http://ycombinator.com> ()
* a: <http://news.ycombinator.com/news> (Hacker News)
* a: <http://news.ycombinator.com/newest> (new)
* a: <http://news.ycombinator.com/newcomments> (comments)
* a: <http://news.ycombinator.com/leaders> (leaders)
* a: <http://news.ycombinator.com/jobs> (jobs)
* a: <http://news.ycombinator.com/submit> (submit)
* a: <http://news.ycombinator.com/x?fnid=JKhQjfU7gW> (login)
* a: <http://news.ycombinator.com/vote?for=1094578&dir=up&whence=%6e%65%77%73> ()
* a: <http://www.readwriteweb.com/archives/facebook_gets_faster_debuts_homegrown_php_compiler.php?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+readwriteweb+%28ReadWriteWeb%29&utm_content=Twitter> (Facebook speeds up PHP)
* a: <http://news.ycombinator.com/user?id=mcxx> (mcxx)
* a: <http://news.ycombinator.com/item?id=1094578> (9 comments)
* a: <http://news.ycombinator.com/vote?for=1094649&dir=up&whence=%6e%65%77%73> ()
* a: <http://groups.google.com/group/django-developers/msg/a65fbbc8effcd914> ("Tough. Django produces XHTML.")
* a: <http://news.ycombinator.com/user?id=andybak> (andybak)
* a: <http://news.ycombinator.com/item?id=1094649> (3 comments)
* a: <http://news.ycombinator.com/vote?for=1093927&dir=up&whence=%6e%65%77%73> ()
* a: <http://news.ycombinator.com/x?fnid=p2sdPLE7Ce> (More)
* a: <http://news.ycombinator.com/lists> (Lists)
* a: <http://news.ycombinator.com/rss> (RSS)
* a: <http://ycombinator.com/bookmarklet.html> (Bookmarklet)
* a: <http://ycombinator.com/newsguidelines.html> (Guidelines)
* a: <http://ycombinator.com/newsfaq.html> (FAQ)
* a: <http://ycombinator.com/newsnews.html> (News News)
* a: <http://news.ycombinator.com/item?id=363> (Feature Requests)
* a: <http://ycombinator.com> (Y Combinator)
* a: <http://ycombinator.com/w2010.html> (Apply)
* a: <http://ycombinator.com/lib.html> (Library)
* a: <http://www.webmynd.com/html/hackernews.html> ()
* a: <http://mixpanel.com/?from=yc> ()
</code></pre>
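<p>Some labels in the output end with a period, such as <code>(Analytics by Mixpan.)</code>. That comes from the <code>trim</code> helper, which cuts long text to the given width and marks the cut with ".". The standalone sketch below repeats that logic; the class name <code>TrimSketch</code> and the full alt text passed in are assumptions, since the output shows only the shortened form.</p>

```java
class TrimSketch {
    // Same logic as the trim helper in ListLinks: keep width-1 characters
    // and append "." when the text is longer than width.
    static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        return s;
    }

    public static void main(String[] args) {
        // Hypothetical original alt text, 21 characters long.
        System.out.println(trim("Analytics by Mixpanel", 20)); // prints "Analytics by Mixpan."
        // Shorter than the limit, so it is returned unchanged.
        System.out.println(trim("Hacker News", 35));           // prints "Hacker News"
    }
}
```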
</div>
<div>
<div>
<h2><a href="http://jsoup.org/cookbook">Cookbook contents</a></h2>
<h3>Introduction</h3>
<ol start="1">
<li><a href="http://www.open-open.com/jsoup/parsing-a-document.htm">Parsing and traversing an HTML document</a></li></ol>
<h3>Input</h3>
<ol start="2">
<li><a href="http://www.open-open.com/jsoup/parse-document-from-string.htm">Parse a document from an HTML string</a></li>
<li><a href="http://www.open-open.com/jsoup/parse-body-fragment.htm">Parse a body fragment</a></li>
<li><a href="http://www.open-open.com/jsoup/load-document-from-url.htm">Load a Document from a URL</a></li>
<li><a href="http://www.open-open.com/jsoup/load-document-from-file.htm">Load a Document from a file</a></li></ol>
<h3>Extracting data</h3>
<ol start="6">
<li><a href="http://www.open-open.com/jsoup/dom-navigation.htm">Use DOM methods to navigate a Document</a></li>
<li><a href="http://www.open-open.com/jsoup/selector-syntax.htm">Use selector syntax to find elements</a></li>
<li><a href="http://www.open-open.com/jsoup/attributes-text-html.htm">Extract attributes, text, and HTML from elements</a></li>
<li><a href="http://www.open-open.com/jsoup/working-with-urls.htm">Working with URLs</a></li>
<li>Example program: list all links</li></ol>
<h3>Modifying data</h3>
<ol start="11">
<li><a href="http://www.open-open.com/jsoup/set-attributes.htm">Set attribute values</a></li>
<li><a href="http://www.open-open.com/jsoup/set-html.htm">Set the HTML content of an element</a></li>
<li><a href="http://www.open-open.com/jsoup/set-text.htm">Set the text content of an element</a></li></ol>
<h3>Cleaning HTML</h3>
<ol start="14">
<li><a href="http://www.open-open.com/jsoup/whitelist-sanitizer.htm">Sanitize untrusted HTML (to prevent XSS attacks)</a></li></ol></div></div></div>
<div><b>jsoup</b> HTML parser: copyright © 2009 - 2011 <a href="http://www.open-open.com/"><b>Jonathan Hedley</b></a></div></div>