BeautifulSoup tutorial · ZetCode Chinese tutorial series

# BeautifulSoup tutorial

> Original: [http://zetcode.com/python/beautifulsoup/](http://zetcode.com/python/beautifulsoup/)

The BeautifulSoup tutorial is an introduction to the BeautifulSoup Python library. The examples find tags, traverse the document tree, modify the document, and scrape web pages.

## BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web scraping. BeautifulSoup transforms a complex HTML document into a tree of Python objects, such as tags, navigable strings, or comments.

## Installing BeautifulSoup

We use the `pip3` command to install the necessary modules.

```py
$ sudo pip3 install lxml
```

We need to install the `lxml` module, which BeautifulSoup uses as its parser in these examples.

```py
$ sudo pip3 install bs4
```

The above command installs BeautifulSoup.

## The HTML file

In the examples, we will use the following HTML file:

`index.html`

```py
<!DOCTYPE html>
<html>
    <head>
        <title>Header</title>
        <meta charset="utf-8">
    </head>

    <body>
        <h2>Operating systems</h2>

        <ul id="mylist" style="width:150px">
            <li>Solaris</li>
            <li>FreeBSD</li>
            <li>Debian</li>
            <li>NetBSD</li>
            <li>Windows</li>
        </ul>

        <p>
          FreeBSD is an advanced computer operating system used to
          power modern servers, desktops, and embedded platforms.
        </p>

        <p>
          Debian is a Unix-like computer operating system that is
          composed entirely of free software.
        </p>
    </body>
</html>
```

## BeautifulSoup simple example

In the first example, we use the BeautifulSoup module to get three tags.

`simple.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print(soup.h2)
    print(soup.head)
    print(soup.li)
```

The code example prints the HTML code of three tags.

```py
from bs4 import BeautifulSoup
```

We import the `BeautifulSoup` class from the `bs4` module. `BeautifulSoup` is the main class that does the work.

```py
with open("index.html", "r") as f:

    contents = f.read()
```

We open the `index.html` file and read its contents with the `read()` method.

```py
soup = BeautifulSoup(contents, 'lxml')
```

A `BeautifulSoup` object is created; the HTML data is passed to the constructor. The second argument specifies the parser.

```py
print(soup.h2)
print(soup.head)
```

Here we print the HTML code of two tags: `h2` and `head`.

```py
print(soup.li)
```

There are multiple `li` elements; this line prints the first one.

```py
$ ./simple.py
<h2>Operating systems</h2>
<head>
<title>Header</title>
<meta charset="utf-8"/>
</head>
<li>Solaris</li>
```

This is the output.

## BeautifulSoup tags, name, text

The `name` attribute of a tag gives its name, and the `text` attribute gives its text content.

`tags_names.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print("HTML: {0}, name: {1}, text: {2}".format(soup.h2,
        soup.h2.name, soup.h2.text))
```

The code example prints the HTML code, name, and text of the `h2` tag.

```py
$ ./tags_names.py
HTML: <h2>Operating systems</h2>, name: h2, text: Operating systems
```

This is the output.

## BeautifulSoup traverse tags

With the `recursiveChildGenerator()` method, we traverse the HTML document. This is the older BeautifulSoup 3 name; in BeautifulSoup 4 the same traversal is available through the `descendants` attribute, which is covered later in this tutorial.

`traverse_tree.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    for child in soup.recursiveChildGenerator():

        # Text nodes have no name, so only tags are printed
        if child.name:

            print(child.name)
```

The example traverses the document tree and prints the names of all HTML tags.

```py
$ ./traverse_tree.py
html
head
title
meta
body
h2
ul
li
li
li
li
li
p
p
```

These are the tags in the HTML document.

## BeautifulSoup element children

With the `children` attribute, we can get the children of a tag.

`get_children.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    root = soup.html

    # Whitespace strings between tags have name None and are skipped
    root_childs = [e.name for e in root.children if e.name is not None]
    print(root_childs)
```

The example retrieves the children of the `html` tag, places them into a Python list, and prints them to the console. Since the `children` attribute also returns the whitespace between the tags, we add a condition to include only the tag names.

```py
$ ./get_children.py
['head', 'body']
```

The `html` tag has two children: `head` and `body`.

## BeautifulSoup element descendants

With the `descendants` attribute, we can get all descendants of a tag (children at all levels).

`get_descendants.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    root = soup.body

    root_childs = [e.name for e in root.descendants if e.name is not None]
    print(root_childs)
```

The example retrieves all descendants of the `body` tag.

```py
$ ./get_descendants.py
['h2', 'ul', 'li', 'li', 'li', 'li', 'li', 'p', 'p']
```

These are all the descendants of the `body` tag.

## BeautifulSoup web scraping

Requests is a simple Python HTTP library. It provides methods for accessing web resources over HTTP.

`scraping.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup
import requests as req

resp = req.get("http://www.something.com")

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title)
print(soup.title.text)
print(soup.title.parent)
```

The example retrieves the title of a simple web page. It also prints its parent.

```py
resp = req.get("http://www.something.com")

soup = BeautifulSoup(resp.text, 'lxml')
```

We get the HTML data of the page.

```py
print(soup.title)
print(soup.title.text)
print(soup.title.parent)
```

We retrieve the HTML code of the title, its text, and the HTML code of its parent.

```py
$ ./scraping.py
<title>Something.</title>
Something.
<head><title>Something.</title></head>
```

This is the output.

## BeautifulSoup prettify code

With the `prettify()` method, we can make the HTML code look better.

`prettify.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup
import requests as req

resp = req.get("http://www.something.com")

soup = BeautifulSoup(resp.text, 'lxml')

print(soup.prettify())
```

We prettify the HTML code of a simple web page.

```py
$ ./prettify.py
<html>
 <head>
  <title>
   Something.
  </title>
 </head>
 <body>
  Something.
 </body>
</html>
```

This is the output.

## BeautifulSoup find elements by Id

With the `find()` method, we can find elements by various means, including the element id.

`find_by_id.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    #print(soup.find("ul", attrs={ "id" : "mylist"}))
    print(soup.find("ul", id="mylist"))
```

The code example finds the `ul` tag that has the `mylist` id. The commented line is an alternative way of doing the same task.

## BeautifulSoup find all tags

With the `find_all()` method, we can find all elements that meet some criteria.

`find_all.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    for tag in soup.find_all("li"):
        print("{0}: {1}".format(tag.name, tag.text))
```

The code example finds and prints all `li` tags.

```py
$ ./find_all.py
li: Solaris
li: FreeBSD
li: Debian
li: NetBSD
li: Windows
```

This is the output.

The `find_all()` method can take a list of elements to search for.

`find_all2.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    tags = soup.find_all(['h2', 'p'])

    for tag in tags:
        # Collapse runs of whitespace inside the text into single spaces
        print(" ".join(tag.text.split()))
```

The example finds all `h2` and `p` elements and prints their text.

The `find_all()` method can also take a function that determines which elements should be returned.

`find_by_fun.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

def myfun(tag):

    return tag.is_empty_element

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    tags = soup.find_all(myfun)
    print(tags)
```

The example prints the empty elements.

```py
$ ./find_by_fun.py
[<meta charset="utf-8"/>]
```

The only empty element in the document is `meta`.

It is also possible to find elements by using regular expressions.

`regex.py`

```py
#!/usr/bin/python3

import re

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    strings = soup.find_all(string=re.compile('BSD'))

    for txt in strings:

        print(" ".join(txt.split()))
```

The example prints the strings in the document that contain `"BSD"`.

```py
$ ./regex.py
FreeBSD
NetBSD
FreeBSD is an advanced computer operating system used to power modern servers, desktops, and embedded platforms.
```

This is the output.

## BeautifulSoup CSS selectors

With the `select()` and `select_one()` methods, we can use some CSS selectors to find elements.

`select_nth_tag.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print(soup.select("li:nth-of-type(3)"))
```

This example uses a CSS selector to print the HTML code of the third `li` element. Because `select()` returns a list of all matches, the element is printed inside a list.

```py
$ ./select_nth_tag.py
[<li>Debian</li>]
```

This is the third `li` element.

In CSS, the `#` character is used to select tags by their id attributes.

`select_by_id.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print(soup.select_one("#mylist"))
```

The example prints the element that has the `mylist` id.

## BeautifulSoup append element

The `append()` method appends a new tag to the HTML document.

`append_tag.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    newtag = soup.new_tag('li')
    newtag.string='OpenBSD'

    ultag = soup.ul

    ultag.append(newtag)

    print(ultag.prettify())
```

The example appends a new `li` tag.

```py
newtag = soup.new_tag('li')
newtag.string='OpenBSD'
```

First, we create a new tag with the `new_tag()` method.

```py
ultag = soup.ul
```

We get a reference to the `ul` tag.

```py
ultag.append(newtag)
```

We append the newly created tag to the `ul` tag.

```py
print(ultag.prettify())
```

We print the `ul` tag in a neat format.

## BeautifulSoup insert element
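The `insert()` method inserts a tag at the specified location.

`insert_tag.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    newtag = soup.new_tag('li')
    newtag.string='OpenBSD'

    ultag = soup.ul

    # Index 2 counts all children, including whitespace strings
    ultag.insert(2, newtag)

    print(ultag.prettify())
```

The example inserts a `li` tag at the third child position (index 2) of the `ul` tag. Because the whitespace between tags also counts as children, the new tag ends up right after the first `li` element.

## BeautifulSoup replace text

The `replace_with()` method replaces a tag or a string in the tree with another one; here we use it to replace the text of an element.

`replace_text.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    # `string` is the current name of the older `text` argument
    tag = soup.find(string="Windows")
    tag.replace_with("OpenBSD")

    print(soup.ul.prettify())
```

The example finds the string `"Windows"` with the `find()` method and replaces it with `"OpenBSD"` using the `replace_with()` method.

## BeautifulSoup remove element

The `decompose()` method removes a tag from the tree and destroys it.

`decompose_tag.py`

```py
#!/usr/bin/python3

from bs4 import BeautifulSoup

with open("index.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    ptag2 = soup.select_one("p:nth-of-type(2)")

    ptag2.decompose()

    print(soup.body.prettify())
```

The example removes the second `p` element.

In this tutorial, we have worked with the Python BeautifulSoup library.

You may also be interested in the following related tutorials: [Pyquery tutorial](/python/pyquery), [Python tutorial](/lang/python/), [Python list comprehensions](/articles/pythonlistcomprehensions/), [OpenPyXL tutorial](/articles/openpyxl/), Python Requests tutorial, and [Python CSV tutorial](/python/csv/).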
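The modification methods covered above (`insert()`, `replace_with()`, and `decompose()`) can also be tried without the `index.html` file. The following is a minimal sketch under a few assumptions that are not part of the original examples: the list is parsed from an inline string with no whitespace between tags, Python's built-in `html.parser` is used instead of `lxml` so that nothing extra needs to be installed, and the `DragonFly` item is made up for illustration.

```python
#!/usr/bin/python3

from bs4 import BeautifulSoup

# Inline copy of the list from index.html, written without whitespace
# between tags so that child indexes match the <li> positions
# (an assumption made for this sketch).
html = ('<ul id="mylist"><li>Solaris</li><li>FreeBSD</li>'
        '<li>Debian</li><li>NetBSD</li><li>Windows</li></ul>')

# 'html.parser' is the parser bundled with Python; the tutorial uses 'lxml'
soup = BeautifulSoup(html, 'html.parser')
ultag = soup.find("ul", id="mylist")

# Replace the string "Windows" with "OpenBSD"
soup.find(string="Windows").replace_with("OpenBSD")

# Insert a new <li> as the third child (index 2)
newtag = soup.new_tag('li')
newtag.string = 'DragonFly'  # hypothetical item, for illustration only
ultag.insert(2, newtag)

# Remove the first <li> from the tree
ultag.li.decompose()

items = [li.text for li in ultag.find_all("li")]
print(items)
```

With this input the script prints `['FreeBSD', 'DragonFly', 'Debian', 'NetBSD', 'OpenBSD']`. If the markup contained line breaks between the `li` tags, as `index.html` does, the whitespace strings would also count as children and index 2 would refer to a different place in the list.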