3.1.1 Sending Requests · Python 3 Web Scraping Notes

# 3.1.1 Sending Requests

## Related links

* Official documentation
  * [https://docs.python.org/3/library/urllib.request.html](https://docs.python.org/3/library/urllib.request.html)
  * [https://docs.python.org/3/library/urllib.request.html\#basehandler-objects](https://docs.python.org/3/library/urllib.request.html#basehandler-objects)

Test site: [http://httpbin.org/post](http://httpbin.org/post)

## 1. urlopen\(\) {#1-urlopen}

The urllib.request module provides the most basic way to construct HTTP requests. With it you can simulate the way a browser issues a request, and it also handles authentication, redirections, cookies, and more.

Example: fetching the Baidu home page

```text
import urllib.request

# Send a GET request and read the whole response body
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode('utf-8'))
print(type(response))
```

Result:

![](https://box.kancloud.cn/962981e2512174a2a6842ecc71fdebc4_676x366.png)

The type that is printed:

```text
<class 'http.client.HTTPResponse'>
```

The output shows that the result is an HTTPResponse object. Its main methods include read\(\), readinto\(\), getheader\(name\), getheaders\(\), and fileno\(\), and its attributes include msg, version, status, reason, debuglevel, and closed. All of these can be accessed through the response object. Example:

```text
import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")
print(response.status)               # status code
print(response.getheaders())         # response headers
print(response.getheader('Server'))  # value of the Server header
```

Output:

```text
200
[('Bdpagetype', '1'), ('Bdqid', '0xef591dd800056531'), ('Cache-Control', 'private'), ('Content-Type', 'text/html'), ('Cxy_all', 'baidu+16f7eb85af21b1161c1ef2120b208c5a'), ('Date', 'Mon, 30 Jul 2018 06:01:58 GMT'), ('Expires', 'Mon, 30 Jul 2018 06:01:28 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BAIDUID=ADA3E646B4E0A6AE8F0BA17B395B76A3:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BIDUPSID=ADA3E646B4E0A6AE8F0BA17B395B76A3; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1532930518; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'delPer=0; expires=Wed, 22-Jul-2048 06:01:28 GMT'), ('Set-Cookie', 'BDSVRTM=0; path=/'), ('Set-Cookie', 'BD_HOME=0; path=/'), ('Set-Cookie', 'H_PS_PSSID=1441_25810_26458_21121_18559_26350_26920_22160; path=/; domain=.baidu.com'), ('Vary', 'Accept-Encoding'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close'), ('Transfer-Encoding', 'chunked')]
BWS/1.1
```

API of the urlopen\(\) function:

`urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)`

## data parameter

The data parameter is optional. If you pass it, it must be a byte stream, that is, a bytes object; bytes\(\) can do the conversion. Also, once data is passed, the request is no longer sent as GET but as POST.

Simulating a POST request:

```text
import urllib.parse
import urllib.request

# URL-encode the form fields, then convert the string to bytes
data = bytes(urllib.parse.urlencode({"name": "angle"}), encoding="utf-8")
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read().decode('utf-8'))
```

Result:

```text
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "angle"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.6"
  },
  "json": null,
  "origin": "220.197.208.229",
  "url": "http://httpbin.org/post"
}
```

## timeout parameter

The timeout parameter sets a timeout in seconds: if no response has arrived once that time has passed, an exception is raised. If it is not specified, the global default timeout is used. It applies to HTTP, HTTPS, and FTP requests.

Example:

```text
import urllib.request

# 0.1 seconds is deliberately too short, to force a timeout
response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.1)
print(response.read().decode("utf-8"))
```

Result:

```text
....
During handling of the above exception, another exception occurred:
....
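  File "E:\Python36\lib\urllib\request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error timed out>
```

The timeout is set to 0.1 seconds. If there is still no response after 0.1 seconds, a URLError exception is raised. It belongs to the urllib.error module, and the cause of the error is the timeout.

A try/except statement can be used to skip pages that take too long to respond:

```text
import urllib.request
import urllib.error
import socket

try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.1)
except urllib.error.URLError as e:
    # e.reason holds the underlying error; check whether it is a timeout
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT")
```

Result:

```text
TIME OUT
```

## Other parameters

There is also the context parameter, which must be an ssl.SSLContext object and specifies the SSL settings.

The cafile and capath parameters specify a CA certificate file and a directory of CA certificates; they are useful when requesting HTTPS links.

The cadefault parameter is deprecated and defaults to False. Note that cafile, capath, and cadefault were removed in Python 3.13; use context instead.

## 2. Request

Example:

```text
import urllib.request

# Build a Request object first, then pass it to urlopen()
request = urllib.request.Request("https://python.org")
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
```

Here the argument to urlopen\(\) is no longer a URL but a Request object. Building this data structure lets us treat the request as an independent object, and it makes the configurable parameters richer and more flexible.

API of the Request class:

```text
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
```

* url is the request URL. It is the only required parameter; all the others are optional.
* data, if passed, must be of type bytes (a byte stream). If you have a dictionary, encode it first with urlencode\(\) from the urllib.parse module.
* headers is a dictionary holding the request headers. You can supply it through the headers parameter when constructing the Request, or add headers later by calling add\_header\(\) on the Request instance. The most common use is changing User-Agent to pose as a browser. The default User-Agent is Python-urllib/x.y (for example Python-urllib/3.6). To pose as Firefox, for instance, you can set it to:

  ```text
  Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
  ```

* origin\_req\_host is the host name or IP address of the requesting party.
* unverifiable indicates whether the request is unverifiable; the default is False. An unverifiable request is one whose URL the user had no option to approve. For example, if we request an image inside an HTML document but had no chance to approve the automatic fetching of that image, unverifiable is True.
* method is a string naming the request method, such as GET, POST, or PUT.

Example:

```text
from urllib import request, parse

url = 'http://httpbin.org/post'
# Forged request headers
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Form parameters (named params so the built-in dict is not shadowed)
params = {
    'name': "angle",
}
# Convert to a byte stream
data = bytes(parse.urlencode(params), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```

Result:

```text
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "angle"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
  },
  "json": null,
  "origin": "220.197.208.229",
  "url": "http://httpbin.org/post"
}
```

The Request is built from four arguments: url is the request URL, headers specifies User-Agent and Host, data is converted to a byte stream with urlencode\(\) and bytes\(\), and method sets the request method to POST.

Another way to add headers is the add\_header\(\) method:

```text
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
```

Note the signature: add\_header\(key, value\)

## 3. Advanced usage {#3-高級用法}

Handlers are used for tasks such as dealing with cookies and configuring proxies.

The BaseHandler class in the urllib.request module is the parent class of all other Handlers. It provides the most basic Handler methods, such as default\_open\(\) and protocol\_request\(\).

Various Handler subclasses inherit from BaseHandler. A few examples:

* HTTPDefaultErrorHandler handles HTTP error responses; errors are raised as HTTPError exceptions.
* HTTPRedirectHandler handles redirects.
* HTTPCookieProcessor handles cookies.
* ProxyHandler sets proxies; when no proxies are passed to it, the proxy settings are read from the environment.
* HTTPPasswordMgr manages passwords; it maintains a table of usernames and passwords.
* HTTPBasicAuthHandler manages authentication; if a link requires authentication when opened, it can be used to handle that.
* There are other Handler classes that are not listed here one by one. For details see the official documentation: [https://docs.python.org/3/library/urllib.request.html\#urllib.request.BaseHandler](https://docs.python.org/3/library/urllib.request.html#urllib.request.BaseHandler)

Another important class is OpenerDirector, which we can simply call an Opener. The urlopen\(\) function used earlier is in fact an Opener that urllib provides for us.

Why introduce the Opener? Because we need more advanced features. Request and urlopen\(\) are the library's ready-made wrappers for the most common kinds of request, and those two are enough for basic requests. To go further we have to configure things one level deeper and work with lower-level instances. That is where the Opener comes in: it is a more general object than urlopen\(\).

An Opener has an open\(\) method, and what it returns is the same type that urlopen\(\) returns. How does it relate to Handlers? In short, Handlers are used to build an Opener.

### Authentication {#認證}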
Some sites pop up a dialog as soon as they are opened, asking for a username and password; the page can be viewed only after authentication succeeds:

![](https://box.kancloud.cn/704a30d1c927252998e78b3ee839d1df_1530x1238.png)

Pages like this can be requested with the help of HTTPBasicAuthHandler.

Example:

```text
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'
# Instantiate an HTTPPasswordMgrWithDefaultRealm object
p = HTTPPasswordMgrWithDefaultRealm()
# Register the credentials with it, then build the authentication Handler
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
# Build an opener that supplies the credentials when it sends requests
opener = build_opener(auth_handler)
try:
    # Open the URL
    result = opener.open(url)
    # Page source
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
```

The code instantiates an HTTPBasicAuthHandler whose argument is an HTTPPasswordMgrWithDefaultRealm object; the username and password are added to that object with add\_password\(\). This gives us a Handler that deals with authentication.

Next, build\_opener\(\) builds an Opener from this Handler, so the Opener behaves as if it were already authenticated when it sends requests.

Finally, opening the link with the Opener's open\(\) method completes the authentication, and the result is the source of the page behind the authentication.

### Proxy {#代理}

Adding a proxy:

```text
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# Set up the proxies, one entry per protocol
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:5000',
    'https': 'https://127.0.0.1:5000'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
```

This assumes a proxy has been set up locally; in the code above it listens on port 5000.

ProxyHandler takes a dictionary that maps a protocol type to a proxy address, and several proxies can be added.

build\_opener\(\) then builds an Opener from this Handler, and the request is sent through it.

## Cookies

Getting a site's cookies:

```text
import http.cookiejar, urllib.request

# Declare a CookieJar object
cookie = http.cookiejar.CookieJar()
# Build a handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build an opener
opener = urllib.request.build_opener(handler)
# Open the site
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
```

Result:

```text
BAIDUID=6579690C07419CE00E162042A638AEAE:FG=1
BIDUPSID=6579690C07419CE00E162042A638AEAE
H_PS_PSSID=26524_1436_26909_21078_26925_20928
PSTM=1532934005
BDSVRTM=0
BD_HOME=0
delPer=0
```

Saving cookies in text format.

Example:

```text
import urllib.request, http.cookiejar

filename = "cookies.txt"
# Saving in text format requires MozillaCookieJar
cookies = http.cookiejar.MozillaCookieJar(filename)
# Build a handler
handler = urllib.request.HTTPCookieProcessor(cookies)
# Build an opener
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
# Also write session cookies and expired cookies to the file
cookies.save(ignore_discard=True, ignore_expires=True)
```

CookieJar has to be replaced with MozillaCookieJar, which is what generates the file. It is a subclass of CookieJar that handles cookies together with files: it can read and save cookies, and it stores them in the cookie file format used by Mozilla-type browsers.

After the script runs, a cookies.txt file is generated with the following content:

```text
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	3680417901	BAIDUID	2DD475CE73FF8C93B7B1F0798D7A5426:FG=1
.baidu.com	TRUE	/	FALSE	3680417901	BIDUPSID	2DD475CE73FF8C93B7B1F0798D7A5426
.baidu.com	TRUE	/	FALSE		H_PS_PSSID	1420_21101_26350_26923_26809
.baidu.com	TRUE	/	FALSE	3680417901	PSTM	1532934257
www.baidu.com	FALSE	/	FALSE		BDSVRTM	0
www.baidu.com	FALSE	/	FALSE		BD_HOME	0
www.baidu.com	FALSE	/	FALSE	2479014200	delPer	0
```

There is also LWPCookieJar, which can likewise read and save cookies, but in a different format from MozillaCookieJar: it uses the cookies file format of libwww-perl \(LWP\).

To save cookies in LWP format, change the declaration to:

```text
cookies = http.cookiejar.LWPCookieJar(filename)
```

Result:

```text
#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="16458736324AC0E3ECA4EECD48D8DC8C:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2086-08-17 10:20:34Z"; version=0
Set-Cookie3: BIDUPSID=16458736324AC0E3ECA4EECD48D8DC8C; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2086-08-17 10:20:34Z"; version=0
Set-Cookie3: H_PS_PSSID=1432_26458_21099_20930; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1532934390; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2086-08-17 10:20:34Z"; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: delPer=0; path="/"; domain="www.baidu.com"; expires="2048-07-22 07:05:45Z"; version=0
```

Reading the stored cookies back from cookies.txt, using LWPCookieJar as the example:

```text
import urllib.request, http.cookiejar

cookies = http.cookiejar.LWPCookieJar()
# Load the file saved earlier, keeping session and expired cookies too
cookies.load(filename='cookies.txt', ignore_expires=True, ignore_discard=True)
handler = urllib.request.HTTPCookieProcessor(cookies)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# print(response.read().decode('utf-8'))
for cookie in cookies:
    print(cookie)
```

The load\(\) method reads the local cookies file and gives us the cookies it contains.

Result:

```text
<Cookie BAIDUID=16458736324AC0E3ECA4EECD48D8DC8C:FG=1 for .baidu.com/>
<Cookie BIDUPSID=16458736324AC0E3ECA4EECD48D8DC8C for .baidu.com/>
<Cookie H_PS_PSSID=1432_26458_21099_20930 for .baidu.com/>
<Cookie PSTM=1532934390 for .baidu.com/>
<Cookie BDSVRTM=0 for www.baidu.com/>
<Cookie BD_HOME=0 for www.baidu.com/>
<Cookie delPer=0 for www.baidu.com/>
```
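
The save\(\) and load\(\) steps above can also be tried without any network access. The following is only a minimal sketch under stated assumptions: the helper names `make_cookie` and `round_trip`, the domain `example.com`, and the cookie names and values are all hypothetical and chosen for illustration. Cookies are built by hand with `http.cookiejar.Cookie` instead of being received from a server, written to a temporary file, and read back into a fresh jar, once for MozillaCookieJar and once for LWPCookieJar.

```python
import http.cookiejar
import os
import tempfile
import time


def make_cookie(name, value, domain="example.com"):
    """Hypothetical helper: build a Cookie by hand (normally a server sets it)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False,
        expires=int(time.time()) + 3600,  # one hour from now
        discard=False,
        comment=None, comment_url=None,
        rest={},
    )


def round_trip(jar_class, pairs):
    """Save cookies with jar_class, reload them into a new jar, return {name: value}."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        jar = jar_class(path)
        for name, value in pairs.items():
            jar.set_cookie(make_cookie(name, value))
        jar.save(ignore_discard=True, ignore_expires=True)

        # A fresh jar of the same class reads the file back
        loaded = jar_class()
        loaded.load(path, ignore_discard=True, ignore_expires=True)
        return {cookie.name: cookie.value for cookie in loaded}
    finally:
        os.remove(path)


pairs = {"session": "abc123", "theme": "dark"}
mozilla_result = round_trip(http.cookiejar.MozillaCookieJar, pairs)
lwp_result = round_trip(http.cookiejar.LWPCookieJar, pairs)
print(mozilla_result)
print(lwp_result)
```

Each jar class can only load files written in its own format, which is why the sketch loads with the same class it saved with.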