7.2 Using Splash · python3 crawler notes

## 1. Overview

Lua scripting syntax: [http://www.runoob.com/lua/lua-basic-syntax.html](http://www.runoob.com/lua/lua-basic-syntax.html)

Lua download: [https://github.com/rjpcomputing/luaforwindows/releases](https://github.com/rjpcomputing/luaforwindows/releases)

Official documentation:

* [https://splash.readthedocs.io/en/stable/scripting-ref.html](https://splash.readthedocs.io/en/stable/scripting-ref.html)
* [https://splash.readthedocs.io/en/stable/scripting-element-object.html](https://splash.readthedocs.io/en/stable/scripting-element-object.html)
* [https://splash.readthedocs.io/en/stable/api.html\#render-json](https://splash.readthedocs.io/en/stable/api.html#render-json)
* [https://splash.readthedocs.io/en/stable/api.html\#render-png](https://splash.readthedocs.io/en/stable/api.html#render-png)
* [https://splash.readthedocs.io/en/stable/api.html\#render-html](https://splash.readthedocs.io/en/stable/api.html#render-html)

### 2. Example

![](/assets/7.2.2.png)

Script content:

```
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
```

* wait: pauses for the given number of seconds
* html: returns the page source
* png: returns a screenshot of the page
* har: returns the page's HAR information

### 3. Splash Lua scripts

#### Entry point and return value

```
function main(splash, args)
  assert(splash:go("https://www.baidu.com"))
  assert(splash:wait(0.5))
  local title = splash:evaljs("document.title")
  return {
    title = title
  }
end
```

JavaScript code is passed in through the splash:evaljs method.

Splash calls the main function by default. Its return value can be a table (dictionary) or a string; either way it is converted into a Splash HTTP response.

```
function main(splash, args)
  return {
    hello = "world"
  }
end
```

#### Asynchronous processing

```
function main(splash, args)
  local example_urls = {"www.baidu.com", "www.taobao.com", "www.zhihu.com"}
  local urls = args.urls or example_urls
  local results = {}
  for index, url in ipairs(urls) do
    local ok, reason = splash:go("http://" ..
      url)
    if ok then
      splash:wait(2)
      results[url] = splash:png()
    end
  end
  return results
end
```

![](/assets/7.2.3.png)

* wait: the number of seconds to wait
* ..: the string concatenation operator
* ipairs: iterates over an array-style table in index order

### 4. Splash object attributes

#### args

The args attribute of the splash object gives access to the parameters supplied when the script is loaded.

```
function main(splash, args)
  local url = args.url
  return url
end
```

Result:

```
Splash Response: "https://www.qidian.com/"
```

#### js\_enabled

The js\_enabled attribute is Splash's JavaScript execution switch. Set it to true or false to control whether JavaScript code may run; the default is true.

```
function main(splash,args)
  splash:go(args.url)
  splash.js_enabled = false
  local title = splash:evaljs("document.title")
  return {
    title = title,
  }
end
```

Once JavaScript is disabled, calling evaljs to run JavaScript code raises an error:

```
{
  "description": "Error happened while executing Lua script",
  "info": {
    "message": "[string \"function main(splash,args)\r...\"]:4: unknown JS error: None",
    "line_number": 4,
    "source": "[string \"function main(splash,args)\r...\"]",
    "splash_method": "evaljs",
    "type": "JS_ERROR",
    "error": "unknown JS error: None",
    "js_error_message": null
  },
  "error": 400,
  "type": "ScriptError"
}
```

In most cases it is simply left at its default, enabled.

#### resource\_timeout

Sets the timeout for loading, in seconds. Setting it to 0 or nil disables timeout detection.

```
function main(splash,args)
  splash.resource_timeout = 0.1
  -- assert turns a failed load (ok == nil) into a script error
  assert(splash:go(args.url))
  return splash:png()
end
```

The timeout is set to 0.1 seconds; if no response arrives within 0.1 seconds, an error is raised.

![](/assets/7.2.4.png)

#### images\_enabled

Controls whether images are loaded.

```
function main(splash,args)
  splash.images_enabled = false
  splash:go(args.url)
  return splash:png()
end
```

#### plugins\_enabled

Controls whether browser plugins such as Flash are enabled. The default is false (disabled).

```
splash.plugins_enabled = true  -- or false
```

#### scroll\_position

Controls the scroll offset of the page.

splash.scroll\_position = {x=x,y=y}

```
function main(splash,args)
  splash.images_enabled = true
  splash:go(args.url)
  splash:wait(3)
  splash.scroll_position = {y=200}
  return splash:png()
end
```

### 5. Splash object methods

#### go

go requests a URL. It can simulate both GET and POST requests and accepts headers, form data and similar inputs.

```
ok, reason = splash:go{url, baseurl=nil, headers=nil, http_method="GET", body=nil, formdata=nil}
```

Parameters:

* url: the URL to request
* baseurl: the base URL used to resolve relative resource paths
* headers: the request headers
* http\_method: GET by default; POST is also supported
* body: the request body for a POST request, given as a string
* formdata: the form data for a POST request, sent with Content-Type application/x-www-form-urlencoded

The call returns the pair ok and reason. If ok is nil, the page failed to load and reason holds the cause of the error; otherwise the page loaded successfully.

```
function main(splash,args)
  local ok,reason = splash:go{"http://httpbin.org/post",http_method="POST",body="name-Germey"}
  if ok then
    return splash:html()
  end
end
```

This simulates a POST request and passes in the POST data; on success it returns the page source. Result:

```
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name-Germey": ""
  },
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en,*",
    "Connection": "close",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "Origin": "null",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/602.1 (KHTML, like Gecko) splash Version/9.0 Safari/602.1"
  },
  "json": null,
  "origin": "58.16.27.216",
  "url": "http://httpbin.org/post"
}
</pre></body></html>
```

#### wait

Controls how long to wait on the page.

```
ok, reason = splash:wait{time, cancel_on_redirect=false, cancel_on_error=true}
```

Parameters:

* time: the number of seconds to wait
* cancel\_on\_redirect: false by default; if true, waiting stops when a redirect occurs and the redirect result is returned
* cancel\_on\_error: true by default; if true, waiting stops when a loading error occurs

```
function main(splash,args)
  splash:go(args.url)
  splash:wait(2)
  return splash:html()
end
```

#### jsfunc

Wraps a JavaScript function so that it can be called directly from Lua. The JavaScript source must be enclosed in double square brackets; in effect this converts a JavaScript function into a Lua function.

```
function main(splash, args)
  local get_div_count = splash:jsfunc([[
    function () {
      var body = document.body;
      var divs = body.getElementsByTagName('div');
      return divs.length;
    }
  ]])
  splash:go("https://www.baidu.com")
  return ("There are %s DIVs"):format(get_div_count())
end
```

Result:

```
Splash Response: "There are 23 DIVs"
```

#### evaljs

Executes JavaScript code and returns the result of the last statement.

```
result = splash:evaljs(js)
```

#### runjs

Executes JavaScript code, much like evaljs. runjs is better suited to performing actions or declaring functions, while evaljs is better suited to retrieving a result.

```
function main(splash,args)
  splash:go("https://www.baidu.com")
  splash:runjs("foo = function(){return 'bar'}")
  local result = splash:evaljs("foo()")
  return result
end
```

Result:

```
Splash Response: "bar"
```

#### autoload

Registers JavaScript that is loaded automatically on every page visit.

```
ok,reason = splash:autoload{source_or_url,source=nil,url=nil}
```

Parameters:

* source\_or\_url: JavaScript code or the URL of a JavaScript library
* source: JavaScript code
* url: the URL of a JavaScript library

autoload only loads the JavaScript code or library; it does not execute anything. To run something, call evaljs or runjs.

```
function main(splash,args)
  splash:autoload([[
    function get_document_title(){
      return document.title;
    }
  ]])
  splash:go("https://www.baidu.com")
  return splash:evaljs("get_document_title()")
end
```

Result:

```
Splash Response: "百度一下，你就知道"
```

It can also load libraries such as jQuery:

```
function main(splash,args)
  splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js")
  splash:go("https://www.baidu.com")
  local version = splash:evaljs("$.fn.jquery")
  return "JQuery version: " .. version
end
```

Result:

```
Splash Response: "JQuery version: 1.10.2"
```

#### call\_later

Schedules a task to run after a given delay. Before the task runs, the returned timer can be cancelled with its cancel method.

```
function main(splash,args)
  local snapshots = {}
  local timer = splash:call_later(function()
    snapshots['a'] = splash:png()
    splash:wait(1.0)
    snapshots["b"] = splash:png()
  end,0.2)
  splash:go("https://www.taobao.com")
  splash:wait(3.0)
  return snapshots
end
```

![](/assets/7.2.5.png)

#### http\_get {#httpget}

This method simulates sending an HTTP GET request. Usage:

```
response = splash:http_get{url, headers=nil, follow_redirects=true}
```

Parameters:

* url: the request URL.
* headers: optional, empty by default; the request headers.
* follow\_redirects: optional, true by default; whether redirects are followed automatically.

#### http\_post {#httppost}

Like http\_get, but this method sends a POST request and takes one extra parameter, body. Usage:

```
response = splash:http_post{url, headers=nil, follow_redirects=true, body=nil}
```

Parameters:

* url: the request URL.
* headers: optional, empty by default; the request headers.
* follow\_redirects: optional, true by default; whether redirects are followed automatically.
* body: optional, empty by default; the request body, for example form data.

```
function main(splash, args)
  local treat = require("treat")
  local json = require("json")
  local response = splash:http_post{"http://httpbin.org/post",
    body=json.encode({name="Germey"}),
    headers={["content-type"]="application/json"}
  }
  return {
    html=treat.as_string(response.body),
    url=response.url,
    status=response.status
  }
end
```

#### set\_content {#setcontent}

This method sets the content of the page.

```
function main(splash)
  assert(splash:set_content("<html><body><h1>hello Angle</h1></body></html>"))
  return splash:png()
end
```

#### html {#html}

This method returns the source code of the page.

#### png {#png}

This method returns a screenshot of the page in PNG format.

#### jpeg {#jpeg}

This method returns a screenshot of the page in JPEG format.

#### har {#har}

This method returns a description of the page loading process (HAR).

#### url {#url}

This method returns the URL currently being visited.

#### get\_cookies {#getcookies}

This method returns the cookies of the current page.

#### add\_cookie {#addcookie}

This method adds a cookie to the current page.

```
cookies = splash:add_cookie{name, value, path=nil, domain=nil, expires=nil, httpOnly=nil, secure=nil}
```

```
function main(splash)
  splash:add_cookie{"sessionid", "asdadasd", "/", domain="http://example.com"}
  splash:go("http://example.com/")
  return splash:html()
end
```

#### clear\_cookies {#clearcookies}

This method clears all cookies.

```
splash:clear_cookies()
```

#### get\_viewport\_size {#getviewportsize}

This method returns the size (width and height) of the current browser page.

#### set\_viewport\_size {#setviewportsize}

This method sets the size (width and height) of the current browser page.

```
splash:set_viewport_size(width, height)
```

#### set\_viewport\_full {#setviewportfull}

This method makes the browser viewport cover the whole page (full-page display).

#### set\_user\_agent {#setuseragent}

This method sets the browser's User-Agent.

```
splash:set_user_agent("Splash")
```

#### set\_custom\_headers {#setcustomheaders}

This method sets the request headers.

```
-- headers is a Lua table mapping header names to values
splash:set_custom_headers(headers)
```

#### select {#select}

The select method picks the first node that matches the given CSS selector. If several nodes match, only one is returned.

```
function main(splash,args)
  splash:go(args.url)
  local input = splash:select("#kw")
  input:send_text('Splash')
  splash:wait(3)
  return splash:png()
end
```

![](/assets/7.2.6.png)

#### select\_all {#selectall}

This method picks all nodes that match the given CSS selector.

```
function main(splash)
  local treat = require('treat')
  assert(splash:go("http://quotes.toscrape.com/"))
  assert(splash:wait(0.5))
  local texts = splash:select_all('.quote .text')
  local results = {}
  for index, text in ipairs(texts) do
    results[index] = text.node.innerHTML
  end
  return treat.as_array(results)
end
```

#### mouse\_click {#mouseclick}

This method simulates a mouse click. It takes the coordinates x and y as arguments, or it can be called directly on a selected node.

```
function main(splash,args)
  splash:go(args.url)
  local input = splash:select("#kw")
  input:send_text('Splash')
  local submit = splash:select("#su")
  submit:mouse_click()
  splash:wait(3)
  return splash:png()
end
```

![](/assets/7.2.7.png)

### 6. Calling the Splash API

#### render.html

Returns the HTML of the page after JavaScript rendering.

```
curl "http://localhost:8050/render.html?url=https://www.baidu.com"
```

```
import requests

url = "http://192.168.99.100:8050/render.html?url=https://www.baidu.com&wait=5"
response = requests.get(url)
print(response.text)
```

#### render.png

Returns a screenshot of the page.

Parameters:

* height: the height
* width: the width

The URL is quoted so that the shell does not interpret the & characters:

```
curl "http://localhost:8050/render.png?url=https://www.taobao.com&wait=5&width=1000&height=700"
```

```
import requests

url = "http://192.168.99.100:8050/render.png?url=https://www.taobao.com&wait=10&width=1000&height=700"
response = requests.get(url)
# The response body is binary PNG data, so write it in binary mode
with open('taobao.png', 'wb') as f:
    f.write(response.content)
```

Image parameters: [https://splash.readthedocs.io/en/stable/api.html\#render-png](https://splash.readthedocs.io/en/stable/api.html#render-png)

#### render.har {#renderhar}

This endpoint returns the HAR data for the page load.

```
curl "http://localhost:8050/render.har?url=https://www.jd.com&wait=5"
```

The response is large: a JSON document containing the HAR data recorded while the page was loading.

#### render.json {#renderjson}

This endpoint covers the functionality of all the endpoints above and returns its result as JSON.

```
curl "http://localhost:8050/render.json?url=https://httpbin.org"
```

More parameters: [https://splash.readthedocs.io/en/stable/api.html\#render-json](https://splash.readthedocs.io/en/stable/api.html#render-json)

#### execute {#execute}

This endpoint runs Lua scripts.

```
function main(splash)
  return "hello"
end
```

```
curl "http://localhost:8050/execute?lua_source=function+main%28splash%29%0D%0A++return+%27hello%27%0D%0Aend"
```

```
import requests
from urllib.parse import quote

lua = '''
function main(splash)
  return 'hello'
end
'''

# The Lua source must be URL-encoded before it is placed in the query string
url = "http://192.168.99.100:8050/execute?lua_source=" + quote(lua)
response = requests.get(url)
print(response.text)
```

```
import requests
from urllib.parse import quote

lua = '''
function main(splash, args)
  local treat = require("treat")
  local response = splash:http_get("http://httpbin.org/get")
  return {
    html=treat.as_string(response.body),
    url=response.url,
    status=response.status
  }
end
'''

url = 'http://localhost:8050/execute?lua_source=' + quote(lua)
response = requests.get(url)
print(response.text)
```
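
The execute examples build the request URL by hand with quote. As a rough sketch of how that step could be wrapped up, the helper below encodes the Lua source together with any extra query parameters. build_execute_url and SPLASH_BASE are hypothetical names introduced here for illustration, not part of Splash, and the base address assumes a local Splash instance on port 8050. Extra parameters such as url are sent in the query string, which is how the scripts earlier in this section receive args.url.

```python
from urllib.parse import urlencode

# Assumption: Splash is running locally on port 8050.
SPLASH_BASE = "http://localhost:8050"


def build_execute_url(lua_source, base=SPLASH_BASE, **params):
    """Return an /execute URL with the Lua script and extra arguments URL-encoded."""
    # urlencode escapes newlines, spaces and punctuation in the Lua source
    query = urlencode({"lua_source": lua_source, **params})
    return base + "/execute?" + query


lua = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return splash:html()
end
"""

url = build_execute_url(lua, url="https://www.baidu.com")
print(url)

# To actually run the script (needs the requests package and a running Splash):
# import requests
# response = requests.get(url)
# print(response.text)
```

Building the URL in a separate function keeps the encoding step testable without a running Splash server; the network call stays commented out for that reason.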