[Python Knowledge] Web Scraping Basics: Installing and Introducing the BeautifulSoup Library · Python Learning Series

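## 1. Introduction

In the previous few posts I showed how to scrape blogs, Wikipedia InfoBoxes, and images by analyzing page source code with Python. The posts are:

[[Python Learning] A simple scraper for Wikipedia programming-language infoboxes](http://blog.csdn.net/eastmount/article/details/44342559)

[[Python Learning] A simple web crawler for blog posts, and the ideas behind it](http://blog.csdn.net/eastmount/article/details/39770543)

[[Python Learning] A simple scraper for the images in an image site's gallery](http://blog.csdn.net/eastmount/article/details/44492787)

The core code was as follows:

~~~
# coding=utf-8
import urllib
import re

# Download the static HTML page
url = 'http://www.csdn.net/'
content = urllib.urlopen(url).read()
open('csdn.html', 'w+').write(content)

# Get the title
title_pat = r'(?<=<title>).*?(?=</title>)'
title_ex = re.compile(title_pat, re.M | re.S)
title_obj = re.search(title_ex, content)
title = title_obj.group()
print title

# Get the text of the hyperlinks
href = r'<a href=.*?>(.*?)</a>'
m = re.findall(href, content, re.S | re.M)
for text in m:
    print unicode(text, 'utf-8')
    break  # print only the first match
~~~

The output is:

~~~
>>>
CSDN.NET - 全球最大中文IT社區，為IT專業技術人員提供最全面的信息傳播和服務平臺
登錄
>>>
~~~

The core code for downloading an image was:

~~~
import os
import urllib

# Send a browser-like User-Agent with every request
class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0"

urllib._urlopener = AppURLopener()

url = "http://creatim.allyes.com.cn/imedia/csdn/20150228/15_41_49_5B9C9E6A.jpg"
filename = os.path.basename(url)
urllib.urlretrieve(url, filename)
~~~

However, scraping a site by analyzing its raw HTML this way has many drawbacks, for example:

1. The regular expressions are tied to the HTML source itself rather than to a more abstract structure, so a small change in the page layout can break the program.

2. The program has to analyze content against the actual HTML source. It may run into HTML features such as character entities like `&amp;`, and it needs dedicated handling for different kinds of content such as icon hyperlinks and subscripts.

3. Regular expressions are not fully readable; with more complex HTML and more complex queries they quickly become a mess.

*Beginning Python* (2nd edition; the Chinese edition is 《Python基礎教程(第2版)》) offers two solutions: the first is to use Tidy (a Python library) together with XHTML parsing; the second is to use the BeautifulSoup library.

## 2. Installing and Introducing the Beautiful Soup Library

Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it. It provides simple, commonly used operations for navigating, searching, and modifying the parse tree, which can save you a great deal of programming time. As the book puts it: "You didn't write those awful pages; you're just trying to get some data out of them. Now you don't need to care what the HTML looks like; the parser takes care of it for you."

Download links:

[http://www.crummy.com/software/BeautifulSoup/](http://www.crummy.com/software/BeautifulSoup/)

[http://www.crummy.com/software/BeautifulSoup/bs4/download/4.3/](http://www.crummy.com/software/BeautifulSoup/bs4/download/4.3/)

Install it by running `python setup.py install`, as shown in the screenshot below:

![](https://box.kancloud.cn/2016-02-23_56cc2eb445c9e.jpg)

For detailed usage, I recommend the Chinese documentation:

[http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html](http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html)

Here is a brief look at how BeautifulSoup is used, based on the official "Alice in Wonderland" example:

~~~
#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Build the BeautifulSoup object and print it with standard indentation
soup = BeautifulSoup(html_doc)
print(soup.prettify())
~~~

The document is printed as a structure with standard indentation:

~~~
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
~~~

Below is a quick-start tour of the BeautifulSoup library (see the [official documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html)):

~~~
'''Get the title'''
print soup.title
# <title>The Dormouse's story</title>
print soup.title.name
# title
print unicode(soup.title.string)
# The Dormouse's story

'''Get the first <p> and the first <a>'''
print soup.p
# <p class="title"><b>The Dormouse's story</b></p>
print soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

'''Find all <a> tags in the document'''
print soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

print soup.find(id='link3')
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
~~~

To get all the text in the document, use:

~~~
'''Get all the text in the document'''
print soup.get_text()
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
~~~

Along the way you may run into two typical errors:

1. ImportError: No module named BeautifulSoup

After successfully installing BeautifulSoup 4, the statement `from BeautifulSoup import BeautifulSoup` may raise this error.

![](https://box.kancloud.cn/2016-02-23_56cc2eb4611b2.jpg)

The reason is that the BeautifulSoup 4 package was renamed to bs4, so it has to be imported with `from bs4 import BeautifulSoup`.

2. TypeError: an integer is required

Using `print soup.title.string` to get the title's value may raise this error, as shown below:

![](https://box.kancloud.cn/2016-02-23_56cc2eb472a33.jpg)

This appears to be a bug in IDLE; running the same code from the command line produces no error. See [stackoverflow](http://stackoverflow.com/questions/28849615/why-they-didnt-work-when-i-scrape-the-string-in-html-by-using-beautifulsoup). The problem can also be worked around by converting the value explicitly, with `print unicode(soup.title.string)` or `print str(soup.title.string)`.

## 3. Common Beautiful Soup Methods

Beautiful Soup turns a complex HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

1. Tag

A Tag object corresponds to a tag in the original XML or HTML document and has many methods and attributes. The most important ones are its name and its attributes. Usage:

~~~
#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="start"><b>The Dormouse's story</b></p>
"""

soup = BeautifulSoup(html)
tag = soup.p
print tag
# <p class="title" id="start"><b>The Dormouse's story</b></p>
print type(tag)
# <class 'bs4.element.Tag'>
print tag.name
# p  (the tag's name)
print tag['class']
# [u'title']
print tag.attrs
# {u'class': [u'title'], u'id': u'start'}
~~~

In BeautifulSoup every tag has a name, available through `.name`. A tag may also have any number of attributes; they are accessed the same way as dictionary entries, and the whole set is available through `.attrs`. For modifying and deleting attributes, please refer to the documentation.

2. NavigableString

Strings usually sit inside tags, and BeautifulSoup wraps them in the NavigableString class. A NavigableString behaves like a Python Unicode string and additionally supports some of the features described under navigating and searching the tree. It can be converted directly to a Unicode string with `unicode()`.

~~~
print unicode(tag.string)
# The Dormouse's story
print type(tag.string)
# <class 'bs4.element.NavigableString'>
tag.string.replace_with("No longer bold")
print tag
# <p class="title" id="start"><b>No longer bold</b></p>
~~~

This reads the string of `tag = soup.p`, that is, of `<p class="title" id="start"><b>The Dormouse's story</b></p>`. The string contained in a tag cannot be edited in place, but it can be swapped out with `replace_with()`.

A NavigableString supports most, but not all, of the attributes defined for navigating and searching the tree. In particular, a string cannot contain anything else (a tag can contain strings or other tags), so strings do not support `.contents`, `.string`, or `find()`. If you want to use a NavigableString outside Beautiful Soup, call `unicode()` on it to turn it into a plain Unicode string. Otherwise the string keeps a reference to the entire parse tree even after you have finished with Beautiful Soup, which wastes memory.

3. BeautifulSoup

This object represents the document as a whole. Most of the time you can treat it like a Tag object; it supports most of the methods for navigating and searching the tree. Note that because the BeautifulSoup object is not a real HTML or XML tag, it has no name and no attributes. It is sometimes useful to look at its `.name` anyway, so the object carries a special `.name` whose value is "[document]", available as `soup.name`.

The other types defined by Beautiful Soup may show up in XML documents: CData, ProcessingInstruction, Declaration, and Doctype. Like Comment, these classes are subclasses of NavigableString; they are string objects with some extra behavior added.

4. Comment

Tag, NavigableString, and BeautifulSoup cover almost everything in HTML and XML, but a few special objects still need attention, namely comments. A Comment object is a special type of NavigableString.

~~~
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
print type(comment)
# <class 'bs4.element.Comment'>
print unicode(comment)
# Hey, buddy. Want to buy a used parser?
~~~

With these four object types covered, the rest of this section briefly introduces navigating and searching the tree and the commonly used functions.

5. Navigating the tree

A Tag may contain several strings or other Tags; these are all children of that Tag. BeautifulSoup provides many attributes for working with and iterating over children. Using the "Alice" example from the official documentation again: the simplest way to navigate the document is to name the tag you want, as follows:

~~~
soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
soup.body.b
# <b>The Dormouse's story</b>
~~~

Note: attribute-style access returns only the first Tag with that name. It can be chained on tags in the tree, so `soup.body.b` returns the first `<b>` tag inside `<body>`. To get all the `<a>` tags, use `find_all()`, a lookup we often combined with regular expressions in the earlier posts on scraping Wikipedia and other HTML pages.

~~~
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
~~~

Children: when analyzing HTML you usually need to look at a tag's children. A tag's `.contents` attribute returns its children as a list. Strings have no `.contents` attribute, because a string has no children.

~~~
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']
~~~

A tag's `.children` generator lets you loop over its children:

~~~
for child in title_tag.children:
    print(child)
# The Dormouse's story
~~~

Descendants: similarly, the `.descendants` attribute iterates recursively over all of a tag's descendants:

~~~
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
~~~

Parent: use the `.parent` attribute to get an element's parent. In the "Alice" document, the `<head>` tag is the parent of the `<title>` tag; in other words, `.parent` moves up one level. Note: the parent of a top-level node such as `<html>` is the BeautifulSoup object, and the `.parent` of the BeautifulSoup object is None.

~~~
title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
title_tag.string.parent
# <title>The Dormouse's story</title>
~~~

Siblings: in the document below, the `<b>` and `<c>` tags are at the same level. They are children of the same element, so `<b>` and `<c>` are called siblings. When a document is pretty-printed, siblings appear at the same indentation level. The same relationship can be used in code.

~~~
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>
~~~

In the tree, use `.next_sibling` and `.previous_sibling` to look up siblings. The `<b>` tag has a `.next_sibling` but no `.previous_sibling` (it is None), because `<b>` is the first node at its level. Likewise, the `<c>` tag has a `.previous_sibling` but no `.next_sibling`:

~~~
sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>
~~~

What has been covered so far is basically enough to scrape page content with the BeautifulSoup library; for modifying and deleting parts of a page, I recommend reading the documentation. In the next post, let's scrape the Wikipedia programming-language content once more. I hope this post is helpful; if there are mistakes or shortcomings, please bear with me. I recommend reading the official documentation and the book *Beginning Python*.

(By: Eastmount, 2015-3-25, 6 p.m. [http://blog.csdn.net/eastmount/](http://blog.csdn.net/eastmount/))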
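
As a recap, the calls covered in this post can be gathered into one short sketch. It is not from the original post: it assumes Python 3 (so `str()` takes the place of `unicode()`) and passes `"html.parser"` explicitly so that bs4 does not have to guess a parser. The names `HTML_DOC` and `summarize` are made up for illustration, and the document is a trimmed version of the "Alice" example.

```python
from bs4 import BeautifulSoup

# Trimmed version of the "Alice" example document used in this post.
HTML_DOC = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""


def summarize(html):
    """Hypothetical helper: collect a few values using the calls shown above."""
    # The parser is named explicitly; this is a choice made for this sketch.
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a")
    return {
        # NavigableString converted to a plain str (unicode() in Python 2)
        "title": str(soup.title.string),
        # href attribute of every <a> tag
        "hrefs": [a.get("href") for a in links],
        # dictionary-style attribute access on the first <p>
        "first_p_class": soup.p["class"],
        # .parent moves one level up the tree
        "title_parent": soup.title.parent.name,
        # nearest earlier sibling that is an <a> tag
        "link2_prev": links[1].find_previous_sibling("a").get("id"),
    }


summary = summarize(HTML_DOC)
for key, value in summary.items():
    print(key, "=", value)
```

`find_previous_sibling("a")` is used here instead of `.previous_sibling` because in a real page the node immediately before a tag is often a whitespace or punctuation string rather than another tag.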