Advanced Python Crawling, Part 4: Using PySpider

## Sizing Up the Situation

PySpider is a crawler framework that I personally find very convenient and powerful. It supports multithreaded crawling and dynamic JS rendering, and it provides a web UI, retry on failure, scheduled crawling, and more, which makes it very pleasant to use. In this post we will build a fun PySpider project together to understand how PySpider runs.

## Recruiting Troops

For the installation steps, see [Installation](http://cuiqingcai.com/2443.html).

Once it is installed, let's get to work.

## Grand Ambitions

I previously wrote a post, [Scraping Taobao MM Photos](http://cuiqingcai.com/1001.html). Because the site has since been redesigned, the URLs needed during crawling are now generated dynamically by JS, so the urllib2 approach used there no longer works. Here we reimplement it with PySpider.

So the goal is to scrape the profile information and pictures of Taobao MMs (Taobao models) and store them locally.

## Sizing Up the Situation

Target site: [https://mm.taobao.com/json/request_top_list.htm?page=1](https://mm.taobao.com/json/request_top_list.htm?page=1). Open it and you will see a list of many Taobao MMs.

How long is the list? [https://mm.taobao.com/json/request_top_list.htm?page=10000](https://mm.taobao.com/json/request_top_list.htm?page=10000): even page 10000 exists, so take as many as you like. I know nothing.

Click the name of any MM to see her basic profile.

[![](https://box.kancloud.cn/2016-05-29_574a8e657956f.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160326-4@2x.png)

The screenshot shows a personal domain. Copy it into the browser and open it: [mm.taobao.com/tyy6160](https://mm.taobao.com/tyy6160)

[![](https://box.kancloud.cn/2016-05-29_574a8e65a6b69.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160326-5@2x.png)

Scroll down and a huge number of MM pictures are all here. You know what to do: we want to save their photos and profile information.

**P.S. Note the scrollbar in the screenshot! Guess how many pictures there are!**

## Drawing the Sword

With the installation done, follow me step by step through crawling one site and you will understand the basic usage of PySpider.

Run this on the command line:

~~~
pyspider all
~~~

This command runs pyspider and starts all of its components.

[![](https://box.kancloud.cn/2016-05-29_574a8e65d178e.jpg)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/E6632A0A-9067-4B97-93A2-5DEF23FB4CD8.jpg)

You can see that the program has started normally and is running on port 5000.

## On the Brink

Next, open [http://localhost:5000](http://localhost:5000/) in the browser to see the PySpider main interface. Click Create in the lower right corner, name the project taobaomm (any name will do), and click Create again.

[![](https://box.kancloud.cn/2016-05-29_574a8e65f34a5.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160325-0@2x.png)

This takes us to the crawl editing page.

[![](https://box.kancloud.cn/2016-05-29_574a8e66208d6.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160325-1@2x.png)

The page is split into two panes: the left is the preview area for crawled pages, and the right is the code editing area. The individual parts are:

* Green area on the left: the JSON variable for this request. In PySpider every request has a corresponding JSON variable, which includes the callback function, method name, request URL, request data, and so on.
* Run, at the top right of the green area: click it to execute the request; the result appears in the white area on the left.
* enable css selector helper (left): after a page has been fetched, click this button to conveniently get the CSS selector of an element on the page.
* web (left): a live preview of the fetched page.
* html (left): the HTML source of the fetched page.
* follows (left): if the current callback creates new crawl requests, those requests appear under follows.
* messages (left): messages printed during the crawl.
* Code area on the right: write your code here and click the Save button at the top right to save it.
* WebDAV Mode (right): switches to a debugging layout with the left pane maximized, which makes it easier to observe and debug.

## Pressing the Advantage

We are still using the URL from the previous section, [https://mm.taobao.com/json/request_top_list.htm?page=1](https://mm.taobao.com/json/request_top_list.htm?page=1), where the page parameter is the page number. For now we crawl the first 30 pages; the page count can be adjusted freely at the end.

First we define the base URL, then the current page number and the total number of pages.

~~~
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    def __init__(self):
        self.base_url = 'https://mm.taobao.com/json/request_top_list.htm?page='
        self.page_num = 1
        self.total_num = 30

    @every(minutes=24 * 60)
    def on_start(self):
        # Queue one request per list page, from page 1 through page 30.
        while self.page_num <= self.total_num:
            url = self.base_url + str(self.page_num)
            print(url)
            self.crawl(url, callback=self.index_page)
            self.page_num += 1

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
~~~

Click save to save the code, then click run on the left to run it.

[![](https://box.kancloud.cn/2016-05-29_574a8e66439bc.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160325-2@2x.png)

After running, the number 30 appears on follows, which means 30 new requests are waiting. Click it to view the full crawl list. The console also prints every URL that will be crawled.

Then click any green arrow on the left to continue crawling that page. For example, click the first URL to crawl it.

[![](https://box.kancloud.cn/2016-05-29_574a8e6666080.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160325-3@2x.png)

After clicking, look at the web tab below to preview the live page. The page has been fetched and handed to the index_page callback. We have not customized index_page yet, so it simply builds a request for every link on the page.

[![](https://box.kancloud.cn/2016-05-29_574a8e668c90a.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160325-4@2x.png)

What next? Go into each MM's personal page and crawl it, of course.

## In Full Swing

We have the MM list, so the next step is to enter each MM's detail page. Modify the index_page method:

~~~
def index_page(self, response):
    for each in response.doc('.lady-name').items():
        self.crawl(each.attr.href, callback=self.detail_page)
~~~

Here response is the list page we just crawled; it is essentially the HTML of the list page. The doc function calls PyQuery under the hood, so we use a CSS selector to get the link of each MM and then issue a new request.

For example, the each.attr.href we get here might be [mm.taobao.com/self/model_card.htm?user_id=687471686](http://mm.taobao.com/self/model_card.htm?user_id=687471686). We call the crawl method again to fetch the details behind this link.

~~~
self.crawl(each.attr.href, callback=self.detail_page)
~~~

The callback is detail_page, and the fetched result is passed to it as the response variable. detail_page receives it and continues with the analysis below.

[![](https://box.kancloud.cn/2016-05-29_574a8e66bb06a.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160325-7@2x.png)

Click the run button again to crawl the next page. The result looks like this.

[![](https://box.kancloud.cn/2016-05-29_574a8e66db4dc.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160325-5@2x.png)

Part of the page did not load. Why?

As mentioned in an earlier post, this page is special: the right side of it is rendered by JS, and a plain fetch cannot obtain the JS-rendered page. That is a problem.

Fortunately, PySpider provides a mechanism for rendering JS dynamically.

Tip: if you are not familiar with PhantomJS, see [Dynamic JS Parsing for Crawlers](http://cuiqingcai.com/2599.html).

Since we installed PhantomJS earlier, now is its turn. When we first started PySpider we used the `pyspider all` command, which starts all PySpider components, PhantomJS included.

So how do we change the code? It is simple.

~~~
def index_page(self, response):
    for each in response.doc('.lady-name').items():
        self.crawl(each.attr.href, callback=self.detail_page, fetch_type='js')
~~~

We only added `fetch_type='js'`. Click the green back arrow and run it again.

The page now loads successfully. It could hardly be better!

[![](https://box.kancloud.cn/2016-05-29_574a95d7e2ab5.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160325-9@2x.png)

Look at the personal domain below: all the MM pictures we need are behind it, so we need to go on and crawl that page.

## Victory in Sight

Next, modify the detail_page method again and add a domain_page method to handle each MM's personal domain.

~~~
def detail_page(self, response):
    # The page gives a protocol-relative address, so prepend the scheme.
    domain = 'https:' + response.doc('.mm-p-domain-info li > span').text()
    print(domain)
    self.crawl(domain, callback=self.domain_page)

def domain_page(self, response):
    pass
~~~

Run it again and preview the page. At last we can see all of the MM's pictures.

[![](https://box.kancloud.cn/2016-05-29_574a95da9507f.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160326-0@2x.png)

You know what comes next!

## All That's Missing Is the East Wind

The photos are all there, so let's quietly download them.

We now complete domain_page so that it saves the profile text and iterates over the images to save them.

PySpider has a useful property here: every request is stored in a queue, with deduplication and automatic retry. The best approach is therefore to turn each image into its own request and write the file once that request succeeds. This avoids the problem of images being only partially loaded.

I have covered image downloading and folder creation in earlier posts, so I will not repeat the principles here. Below is the finished helper class; the complete code comes later.

~~~
import os


class Deal:
    def __init__(self):
        # DIR_PATH is the configurable save directory defined in the full script.
        self.path = DIR_PATH
        if not self.path.endswith('/'):
            self.path = self.path + '/'
        if not os.path.exists(self.path):
            os.makedirs(self.path)

    def mkDir(self, path):
        path = path.strip()
        dir_path = self.path + path
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        return dir_path

    def saveImg(self, content, path):
        with open(path, 'wb') as f:
            f.write(content)

    def saveBrief(self, content, dir_path, name):
        file_name = dir_path + "/" + name + ".txt"
        # Write UTF-8 bytes in binary mode so the output is the same on Python 2 and 3.
        with open(file_name, 'wb') as f:
            f.write(content.encode('utf-8'))

    def getExtension(self, url):
        return url.split('.')[-1]
~~~

The class contains four methods.

> mkDir: creates a folder, used to create the folder named after each MM.
>
> saveBrief: saves the MM's text profile.
>
> saveImg: takes the binary image content and a save path, and stores the image.
>
> getExtension: gets the file extension from the image URL.

The implementation in domain_page is then as follows:

~~~
    def domain_page(self, response):
        name = response.doc('.mm-p-model-info-left-top dd > a').text()
        dir_path = self.deal.mkDir(name)
        brief = response.doc('.mm-aixiu-content').text()
        if dir_path:
            imgs = response.doc('.mm-aixiu-content img').items()
            count = 1
            self.deal.saveBrief(brief, dir_path, name)
            for img in imgs:
                url = img.attr.src
                if url:
                    extension = self.deal.getExtension(url)
                    file_name = name + str(count) + '.' + extension
                    count += 1
                    # One request per image; `save` carries the target path to the callback.
                    self.crawl(img.attr.src, callback=self.save_img,
                               save={'dir_path': dir_path, 'file_name': file_name})

    def save_img(self, response):
        content = response.content
        dir_path = response.save['dir_path']
        file_name = response.save['file_name']
        file_path = dir_path + '/' + file_name
        self.deal.saveImg(content, file_path)
~~~

This method first grabs all the text on the page and calls saveBrief to store the profile.

It then iterates over all of the MM's images, derives the extension from each link, and combines the MM's name with an incrementing counter to form a new file name. Each image is requested separately, and the save_img callback calls saveImg to write it to disk.

## Mastery

The basics are all written. After a bit more polishing, version one is complete.

**Version 1 features: creates one folder per Taobao MM name and stores the MM's txt profile and all of her pictures locally.**

Configurable items:

> * PAGE_START: first list page
> * PAGE_END: last list page
> * DIR_PATH: directory where resources are saved

~~~
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2016-03-25 00:59:45
# Project: taobaomm

import os

from pyspider.libs.base_handler import *

PAGE_START = 1
PAGE_END = 30
DIR_PATH = '/var/py/mm'


class Handler(BaseHandler):
    crawl_config = {
    }

    def __init__(self):
        self.base_url = 'https://mm.taobao.com/json/request_top_list.htm?page='
        self.page_num = PAGE_START
        self.total_num = PAGE_END
        self.deal = Deal()

    def on_start(self):
        # Queue one request per list page, PAGE_START through PAGE_END inclusive.
        while self.page_num <= self.total_num:
            url = self.base_url + str(self.page_num)
            self.crawl(url, callback=self.index_page)
            self.page_num += 1

    def index_page(self, response):
        for each in response.doc('.lady-name').items():
            self.crawl(each.attr.href, callback=self.detail_page, fetch_type='js')

    def detail_page(self, response):
        domain = response.doc('.mm-p-domain-info li > span').text()
        if domain:
            page_url = 'https:' + domain
            self.crawl(page_url, callback=self.domain_page)

    def domain_page(self, response):
        name = response.doc('.mm-p-model-info-left-top dd > a').text()
        dir_path = self.deal.mkDir(name)
        brief = response.doc('.mm-aixiu-content').text()
        if dir_path:
            imgs = response.doc('.mm-aixiu-content img').items()
            count = 1
            self.deal.saveBrief(brief, dir_path, name)
            for img in imgs:
                url = img.attr.src
                if url:
                    extension = self.deal.getExtension(url)
                    file_name = name + str(count) + '.' + extension
                    count += 1
                    self.crawl(img.attr.src, callback=self.save_img,
                               save={'dir_path': dir_path, 'file_name': file_name})

    def save_img(self, response):
        content = response.content
        dir_path = response.save['dir_path']
        file_name = response.save['file_name']
        file_path = dir_path + '/' + file_name
        self.deal.saveImg(content, file_path)


class Deal:
    def __init__(self):
        self.path = DIR_PATH
        if not self.path.endswith('/'):
            self.path = self.path + '/'
        if not os.path.exists(self.path):
            os.makedirs(self.path)

    def mkDir(self, path):
        path = path.strip()
        dir_path = self.path + path
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        return dir_path

    def saveImg(self, content, path):
        with open(path, 'wb') as f:
            f.write(content)

    def saveBrief(self, content, dir_path, name):
        file_name = dir_path + "/" + name + ".txt"
        # Write UTF-8 bytes in binary mode so the output is the same on Python 2 and 3.
        with open(file_name, 'wb') as f:
            f.write(content.encode('utf-8'))

    def getExtension(self, url):
        return url.split('.')[-1]
~~~

Paste it into your PySpider and run it! It touches on a few points that I will summarize in detail in a later post. For now, get a feel for the code.

[![](https://box.kancloud.cn/2016-05-29_574a95dabb81b.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160326-1@2x.png)
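One detail of the helper class is worth a closer look. getExtension takes whatever follows the last dot in the URL, which works for links ending in a plain file name but would also pick up a query string or fragment if the link had one. The following is a minimal sketch of a more defensive variant, not part of the project code. It assumes Python 3 (for `urllib.parse`); the function name `get_extension` and the `jpg` fallback are illustrative choices of mine.

```python
import posixpath
from urllib.parse import urlparse


def get_extension(url, default="jpg"):
    """Return the lowercase extension of the URL's path, without the dot.

    `default` is a hypothetical fallback used when the path has no extension.
    """
    path = urlparse(url).path            # drops any ?query and #fragment
    ext = posixpath.splitext(path)[1]    # e.g. ".JPG", or "" if there is none
    return ext[1:].lower() if ext else default


# Works for absolute and protocol-relative URLs alike.
print(get_extension("https://example.com/img/photo.JPG?size=large"))
print(get_extension("//example.com/a/b.png"))
```

Parsing the URL first keeps dots in the host name or in directory names from being mistaken for an extension separator.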
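After saving, click run below and you will find a flood of MM pictures pouring onto your computer!

[![](image/574a87c1abc52.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160326-2@2x.png)[![](https://box.kancloud.cn/2016-05-29_574a95dad78f2.png)](http://qiniu.cuiqingcai.com/wp-content/uploads/2016/03/QQ20160326-3@2x.png)

Need an explanation? Even if you do, I'm not giving one!

## **More from the Front**

P.S. More versions will follow. Here is a preview:

**Version 2:** store each MM's personal details (name, age, height, and so on) in a MySQL or MongoDB database, together with the image links.

**Version 3:** store the images directly in Qiniu cloud storage and build an information display platform on top of the database.

To follow the development, watch my GitHub project [TaobaoMM – GitHub](https://github.com/cqcre/TaobaoMM).

## The Imperial Sword

If you want to learn more about PySpider, see the official documentation.

[Official documentation](http://docs.pyspider.org/en/latest/Quickstart/)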
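Returning to the Version 2 preview: the idea of storing profile data together with image links can be sketched with a small relational schema. This is only a rough illustration under stated assumptions, not the author's Version 2. It uses Python's built-in `sqlite3` as a stand-in for MySQL or MongoDB, and the table names, column names, and the `init_db` and `save_profile` helpers are all hypothetical.

```python
import sqlite3


def init_db(conn):
    """Create the two illustrative tables if they do not exist yet."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS profiles (
            id     INTEGER PRIMARY KEY,
            name   TEXT NOT NULL,
            domain TEXT
        );
        CREATE TABLE IF NOT EXISTS images (
            id         INTEGER PRIMARY KEY,
            profile_id INTEGER NOT NULL REFERENCES profiles(id),
            url        TEXT NOT NULL
        );
        """
    )


def save_profile(conn, name, domain, image_urls):
    """Insert one profile row plus one row per image URL; return the profile id."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO profiles (name, domain) VALUES (?, ?)", (name, domain)
        )
        conn.executemany(
            "INSERT INTO images (profile_id, url) VALUES (?, ?)",
            [(cur.lastrowid, url) for url in image_urls],
        )
    return cur.lastrowid


# Usage with an in-memory database and made-up sample data.
conn = sqlite3.connect(":memory:")
init_db(conn)
pid = save_profile(
    conn,
    "demo",
    "//mm.taobao.com/demo",
    ["https://example.com/1.jpg", "https://example.com/2.jpg"],
)
```

In a crawler, a function like `save_profile` would most naturally be called from the callback that already has the name and image links in hand, such as domain_page in the script above.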