Rendering JavaScript-Driven Pages with PhantomJS · pyspider documentation

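[TOC]

In the previous two tutorials, we learned how to extract information from HTML and how to handle pages that need more complex requests. Some pages, however, are simply too complicated: whether the problem is working out the API request URLs or encryption applied during rendering, fetching the underlying requests directly becomes very troublesome. This is where [PhantomJS](http://phantomjs.org/) shines.

Before using `PhantomJS`, you need to install it ([installation docs](http://phantomjs.org/download.html)). Once it is installed, it is enabled automatically when you run `pyspider` in `all` mode. You can also try it out on demo.pyspider.org.

# Using PhantomJS

Once `pyspider` is connected to the `PhantomJS` proxy, you can fetch a page through `PhantomJS` by adding the `fetch_type='js'` argument to `self.crawl`. For example, http://movie.douban.com/explore, the page we tried to crawl in the second tutorial, can be fetched directly with `PhantomJS`:

~~~
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        # fetch_type='js' asks pyspider to render the page with PhantomJS
        self.crawl('http://movie.douban.com/explore',
                   fetch_type='js', callback=self.phantomjs_parser)

    def phantomjs_parser(self, response):
        # One dict per movie link; the title is the bare text inside <p>,
        # excluding child elements such as the <strong> rating.
        # Note: basestring exists on Python 2 only; use str on Python 3.
        return [{
            "title": "".join(
                s for s in x('p').contents() if isinstance(s, basestring)
            ).strip(),
            "rate": x('p strong').text(),
            "url": x.attr.href,
        } for x in response.doc('a.item').items()]
~~~

* I used some of the `PyQuery` `API` here; the full `API` reference is available in `PyQuery complete API`.

# Running a custom script on the page

You will notice that the crawl above returns only 20 of Douban's popular movies. Clicking "Load more" on the page reveals more of them. To get more movies, we can use the `js_script` parameter of `self.crawl` to run a script on the page that clicks "Load more":

~~~
    def on_start(self):
        self.crawl('http://movie.douban.com/explore#more',
                   fetch_type='js', js_script="""
                   function() {
                       setTimeout("$('.more').click()", 1000);
                   }""", callback=self.phantomjs_parser)
~~~

* By default this script runs after the page has finished loading; you can change this behavior with the `js_run_at` parameter.
* Because the content is loaded asynchronously via AJAX, the first page of movies may not have finished loading when the page load completes, so we use `setTimeout` to delay the click by 1 second.
* You can click several times at intervals to load more pages.
* Tasks with the same URL (strictly, the same taskid) are deduplicated, so `#more` is appended to the URL here.

Both examples above can be found at http://demo.pyspider.org/debug/tutorial_douban_explore.

Original Chinese article: http://blog.binux.me/2015/01/pyspider-tutorial-level-3-render-with-phantomjs/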
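The tutorial's suggestion of clicking the "load more" button several times at intervals can be sketched as a small helper that generates the `js_script` string. This is only a minimal sketch: `build_click_script` is a hypothetical name, not part of pyspider, and it assumes (as the tutorial's own script does) that the page exposes jQuery as `$` and that the button matches the `.more` selector. Whether the later clicks fire before PhantomJS hands the page back depends on the fetcher's timing, which this helper does not control.

```python
import json


def build_click_script(selector, clicks, interval_ms=1000):
    """Return a JavaScript function (as a string) for pyspider's js_script.

    The generated script schedules `clicks` clicks on `selector`, spaced
    `interval_ms` apart, starting `interval_ms` after the script runs.
    Hypothetical helper: it assumes the page provides jQuery as `$`.
    """
    if clicks < 1:
        raise ValueError("clicks must be at least 1")
    lines = ["function() {"]
    for i in range(1, clicks + 1):
        # json.dumps yields a safely quoted JavaScript string literal.
        lines.append(
            "    setTimeout(function() { $(%s).click(); }, %d);"
            % (json.dumps(selector), i * interval_ms)
        )
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Show the script that would be passed as js_script.
    print(build_click_script(".more", 3))
```

Inside a handler, the result would be passed as `js_script=build_click_script('.more', 3)` together with `fetch_type='js'`. Because tasks with the same URL are deduplicated, a crawl with a different click count still needs a distinct URL, for example a different fragment in place of `#more`.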