Navigating the tree · Beautiful Soup 4.2.0 documentation (Chinese edition)

# Navigating the tree

Here is the "Alice in Wonderland" document again, used as the running example:

```
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
```

The examples below use this document to show how to move from one part of a document to another.

## Child nodes

A tag may contain several strings or other tags. These are the tag's children. Beautiful Soup provides many attributes for working with and iterating over a tag's children.

Note: string nodes in Beautiful Soup do not support these attributes, because a string cannot have children.

### Tag names

The simplest way to navigate the document tree is to name the tag you want. To get the `<head>` tag, just write `soup.head`:

```
soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>
```

This trick can be applied repeatedly to zoom in on a part of the tree. The following code gets the first `<b>` tag inside the `<body>` tag:

```
soup.body.b
# <b>The Dormouse's story</b>
```

Accessing a tag name as an attribute returns only the first tag with that name:

```
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```

To get all the `<a>` tags, or anything more than the first tag with a given name, use one of the methods described in `Searching the tree`, such as `find_all()`:

```
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```

### .contents and .children

A tag's `.contents` attribute returns the tag's children as a list:

```
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]
```

The `BeautifulSoup` object itself has children. In this case, the `<html>` tag is the child of the `BeautifulSoup` object:

```
len(soup.contents)
# 1
soup.contents[0].name
# 'html'
```

A string has no `.contents` attribute, because a string cannot have children:

```
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
```

Instead of getting the children as a list, you can iterate over them with the `.children` generator:

```
for child in title_tag.children:
    print(child)
# The Dormouse's story
```

### .descendants

The `.contents` and `.children` attributes cover only a tag's direct children. For example, the `<head>` tag has a single direct child, the `<title>` tag:

```
head_tag.contents
# [<title>The Dormouse's story</title>]
```

But the `<title>` tag has a child of its own: the string "The Dormouse's story". That string is therefore also a descendant of the `<head>` tag. The `.descendants` attribute iterates recursively over all of a tag's descendants \[5\]:

```
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
```

In the example above, the `<head>` tag has only one child but two descendants: the `<title>` tag and the `<title>` tag's child. The `BeautifulSoup` object has only one direct child (the `<html>` tag), but it has many descendants:

```
len(list(soup.children))
# 1
len(list(soup.descendants))
# 25
```

### .string

If a tag has only one child, and that child is a `NavigableString`, the child is available as `.string`:

```
title_tag.string
# "The Dormouse's story"
```

If a tag's only child is another tag, the outer tag also has a `.string`, and its value is the same as the `.string` of that only child:

```
head_tag.contents
# [<title>The Dormouse's story</title>]

head_tag.string
# "The Dormouse's story"
```

If a tag contains more than one child, it is not clear which child `.string` should refer to, so `.string` is `None`:

```
print(soup.html.string)
# None
```

### .strings and stripped_strings

If a tag contains more than one string \[2\], you can iterate over them with `.strings`:

```
for string in soup.strings:
    print(repr(string))
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'
```

These strings tend to carry a lot of extra whitespace and blank lines. Use `.stripped_strings` to remove it:

```
for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'
```

Strings consisting entirely of whitespace are skipped, and whitespace at the beginning and end of each string is removed.

## Parent nodes

Continuing through the document tree: every tag and every string has a parent, the tag that contains it.

### .parent

Use the `.parent` attribute to get an element's parent. In the Alice document, the `<head>` tag is the parent of the `<title>` tag:

```
title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
```

The title string also has a parent, the `<title>` tag that contains it:

```
title_tag.string.parent
# <title>The Dormouse's story</title>
```

The parent of a top-level tag such as `<html>` is the `BeautifulSoup` object itself:

```
html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>
```

The `.parent` of the `BeautifulSoup` object is `None`:

```
print(soup.parent)
# None
```

### .parents

An element's `.parents` attribute iterates over all of its ancestors. The following example uses `.parents` to walk from an `<a>` tag up to the root of the document:

```
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]
```

## Sibling nodes

Consider a simple example:

```
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>
```

The `<b>` tag and the `<c>` tag are at the same level: both are direct children of the same tag, so `<b>` and `<c>` are called siblings. When a document is pretty-printed, siblings appear at the same indentation level. You can use this relationship in code as well.

### .next_sibling and .previous_sibling

Use the `.next_sibling` and `.previous_sibling` attributes to move between sibling nodes in the tree:

```
sibling_soup.b.next_sibling
# <c>text2</c>

sibling_soup.c.previous_sibling
# <b>text1</b>
```

The `<b>` tag has a `.next_sibling` but no `.previous_sibling`, because nothing comes before `<b>` at the same level; its `.previous_sibling` is `None`. By the same reasoning, the `<c>` tag has a `.previous_sibling` but no `.next_sibling`:

```
print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None
```

The strings "text1" and "text2" are not siblings, because they do not share a parent:

```
sibling_soup.b.string
# 'text1'

print(sibling_soup.b.string.next_sibling)
# None
```

In real documents, the `.next_sibling` or `.previous_sibling` of a tag is usually a string, often one containing only whitespace.
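The point about whitespace siblings can be tried on a tiny fragment. The following is a minimal sketch, not part of the original example set: the fragment, the variable names, and the choice of the `html.parser` backend are assumptions made for illustration. It also uses `find_next_sibling()`, one of the search methods, to step over the string and land on the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two <a> tags separated only by a newline.
fragment = "<p><a id='first'>one</a>\n<a id='second'>two</a></p>"
soup_ws = BeautifulSoup(fragment, "html.parser")

first = soup_ws.find("a", id="first")

# The immediate sibling is the string between the tags, not the next tag.
whitespace = first.next_sibling
print(repr(whitespace))

# find_next_sibling() considers only tags, so it skips the string.
second = first.find_next_sibling("a")
print(second["id"])
```

Calling `.next_sibling` twice would reach the same tag here, but that depends on exactly one string sitting between the two tags.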
Consider the Alice document:

```
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
```

You might expect the `.next_sibling` of the first `<a>` tag to be the second `<a>` tag. It is not. The actual result is the string that separates the first `<a>` tag from the second: a comma and a newline:

```
link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

link.next_sibling
# ',\n'
```

The second `<a>` tag is the `.next_sibling` of that comma string:

```
link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
```

### .next_siblings and .previous_siblings

The `.next_siblings` and `.previous_siblings` attributes iterate over the siblings that follow or precede the current node:

```
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# ';\nand they lived at the bottom of a well.'

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# ' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 'Once upon a time there were three little sisters; and their names were\n'
```

## Going back and forth

Look at the beginning of the Alice document:

```
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
```

An HTML parser turns this string into a series of events: "open an `<html>` tag", "open a `<head>` tag", "open a `<title>` tag", "add a string", "close the `<title>` tag", "open a `<p>` tag", and so on. Beautiful Soup offers tools for reconstructing that initial parse of the document.

### .next_element and .previous_element

The `.next_element` attribute points to whatever object (string or tag) was parsed immediately after the current one. The result may be the same as `.next_sibling`, but it is usually different.

Here is the last `<a>` tag in the Alice document. Its `.next_sibling` is a string: the rest of the sentence \[2\] that was interrupted by the start of the `<a>` tag:

```
last_a_tag = soup.find("a", id="link3")
last_a_tag
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_a_tag.next_sibling
# ';\nand they lived at the bottom of a well.'
```

But the `.next_element` of that `<a>` tag is whatever was parsed immediately after the opening `<a>` tag. That is not the rest of the sentence; it is the string "Tillie":

```
last_a_tag.next_element
# 'Tillie'
```

This is because in the original markup the string "Tillie" appears before the semicolon. The parser entered the `<a>` tag, then read the string "Tillie", then the closing `</a>` tag, then the semicolon and the rest of the sentence. The semicolon is at the same level as the `<a>` tag, but the string "Tillie" was parsed first.

The `.previous_element` attribute is the exact opposite of `.next_element`. It points to the object that was parsed immediately before the current one:

```
last_a_tag.previous_element
# ' and\n'
last_a_tag.previous_element.next_element
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```

### .next_elements and .previous_elements

The `.next_elements` and `.previous_elements` iterators move forward or backward through the document in the order it was parsed, as if the parse were happening again:

```
for element in last_a_tag.next_elements:
    print(repr(element))
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# <p class="story">...</p>
# '...'
# '\n'
```
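
The difference between sibling order and parse order can be reproduced on a much smaller fragment. The following is a minimal sketch under stated assumptions: the fragment, the variable names, and the `html.parser` backend are chosen for illustration and do not come from the Alice example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a tag with text before it, inside it, and after it.
fragment = "<p>see <a id='x'>Tillie</a>; the end</p>"
soup_el = BeautifulSoup(fragment, "html.parser")
a_tag = soup_el.find("a", id="x")

# Sibling order: the next node at the same level as <a>.
sibling = a_tag.next_sibling

# Parse order: the first thing parsed after the opening <a> tag.
element = a_tag.next_element

# Walk forward in parse order until the document ends.
trail = [str(node) for node in a_tag.next_elements]

print(repr(sibling))
print(repr(element))
print(trail)
```

Here `sibling` is the text after the closing `</a>`, while `element` is the text inside the tag, which mirrors the "Tillie" case above.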