Basic Usage of the urllib3 Library · 蟲師de江湖

[TOC]

# Basic Usage of the urllib3 Library

> urllib3 is a feature-rich and powerful HTTP client library for Python.

## Installing urllib3

```bash
pip install urllib3
```

Check the installed `urllib3` version:

```Python
#!/usr/bin/env python3
import urllib3

print(urllib3.__version__)
```

## Writing urllib3 examples

> The examples below cover the most common uses of the urllib3 library.

### Requesting a page over HTTP

```Python
#!/usr/bin/env python3
import urllib3

# A PoolManager handles connection pooling for all requests
http = urllib3.PoolManager()
url = 'http://www.baidu.com'
resp = http.request('GET', url)
print(resp.status)
if resp.status == 200:
    print(resp.data.decode('utf-8'))
```

### Downloading a binary file in `stream` mode

```Python
import urllib3
import certifi

url = 'https://docs.oracle.com/javase/specs/jls/se14/jls14.pdf'
filename = url.split('/')[-1]
http = urllib3.PoolManager(ca_certs=certifi.where())

# Setting preload_content=False enables streaming mode.
# The request is made before the try block so that resp is always
# defined when the finally clause runs.
resp = http.request('GET', url, preload_content=False)
try:
    with open(filename, 'wb') as f:
        # Read the body in 4096-byte chunks
        for chunk in resp.stream(4096):
            f.write(chunk)
finally:
    # In streaming mode the connection must be released manually
    resp.release_conn()
```

### Setting a `timeout`

> The timeout is given in seconds, either as a float or as a `urllib3.Timeout` object with separate connect and read limits.

```Python
import urllib3

http = urllib3.PoolManager()
url = 'https://www.baidu.com'
try:
    # resp = http.request('GET', url, timeout=0.5, retries=False)
    # retries=False makes urllib3 raise the timeout error directly
    # instead of retrying and wrapping it in MaxRetryError
    resp = http.request('GET', url,
                        timeout=urllib3.Timeout(connect=0.5, read=3.0),
                        retries=False)
    print(resp.status)
    if resp.status == 200:
        print(resp.data.decode('utf-8'))
except urllib3.exceptions.ConnectTimeoutError:
    print('Connection timed out')
```

### Requesting a page over HTTPS

> urllib3 provides client-side TLS/SSL verification. To use it, we need the certifi module, which supplies a carefully curated collection of root certificates for verifying the identity of TLS hosts and the trustworthiness of SSL certificates.

Install the certifi module:

```
pip install certifi
```

Show the location of the certificate file:

```Python
import certifi

print(certifi.where())
```

As before, here is an example that requests an HTTPS page:

```Python
#!/usr/bin/env python3
import urllib3
import certifi

url = 'https://httpbin.org/anything'
http = urllib3.PoolManager(ca_certs=certifi.where())
resp = http.request('GET', url)
print(resp.status)
```

### Passing query parameters with GET

```Python
#!/usr/bin/env python3
import urllib3
import certifi

http = urllib3.PoolManager(ca_certs=certifi.where())
payload = {
    'name': 'Peter',
    'age': 23
}
url = 'https://httpbin.org/get'
# For GET requests, fields are encoded into the URL query string
resp = http.request('GET', url, fields=payload)
print(resp.data.decode('utf-8'))
```

Result:

```JSON
{
  "args": {
    "age": "23",
    "name": "Peter"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Host": "httpbin.org",
    "X-Amzn-Trace-Id": "Root=1-5f0bbbdc-915aa646d14ee320d98bc4e3"
  },
  "origin": "127.0.0.1",
  "url": "https://httpbin.org/get?name=Peter&age=23"
}
```

### Submitting a web form with POST

```Python
#!/usr/bin/env python3
import urllib3
import certifi

http = urllib3.PoolManager(ca_certs=certifi.where())
payload = {
    'name': 'Peter',
    'age': 23
}
url = 'https://httpbin.org/post'
# For POST requests, fields are sent as a multipart/form-data body
resp = http.request('POST', url, fields=payload)
print(resp.data.decode('utf-8'))
```

Result:

```sh
$ python ./post_request.py
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "23",
    "name": "Peter"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "214",
    "Content-Type": "multipart/form-data; boundary=281e3c05a41e5ec834f98cf2b673113a",
    "Host": "httpbin.org",
    "X-Amzn-Trace-Id": "Root=1-5f0c0f9d-f490eb9ad301417e59f7b6a2"
  },
  "json": null,
  "origin": "127.0.0.1",
  "url": "https://httpbin.org/post"
}
```

The `form` field now contains the data we submitted with `POST`.

### Sending JSON data with POST

```Python
import urllib3
import certifi
import json

http = urllib3.PoolManager(ca_certs=certifi.where())
payload = {
    'name': 'Peter',
    'age': 23
}
# Serialize the payload to JSON and encode it as UTF-8 bytes
encoded_data = json.dumps(payload).encode('utf-8')
header = {
    'Content-Type': 'application/json'
}
url = 'https://httpbin.org/post'
resp = http.request('POST', url, headers=header, body=encoded_data)
print(resp.data.decode('utf-8'))
```

Result:

```
$ python ./post_json.py
{
  "args": {},
  "data": "{\"name\": \"Peter\", \"age\": 23}",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "28",
    "Content-Type": "application/json",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.24.0",
    "X-Amzn-Trace-Id": "Root=1-5f0c1120-a6c6b65c1eb9dbdc18e21420"
  },
  "json": {
    "age": 23,
    "name": "Peter"
  },
  "origin": "127.0.0.1",
  "url": "https://httpbin.org/post"
}
```

We can see that `headers` now contains the `"Content-Type": "application/json"` field and that `json` is populated. The captured output shows a `python-requests` User-Agent, so the `Accept`, `Accept-Encoding`, and `User-Agent` headers will likely differ when the request is sent with urllib3 as in the code above.

### Using a proxy

> urllib3 supports reaching servers through a proxy. HTTP proxies use the `ProxyManager` class, while `SOCKS4` and `SOCKS5` proxies use `SOCKSProxyManager`.

Note: the public IP addresses shown here are not real and are only used to illustrate the effect.

#### Using an HTTP/HTTPS proxy

> First I obtain a proxy address from the proxy pool at `http://localhost:5010/get` (hard-coded in the example below), then use it to access `https://httpbin.org/ip`.

The code is as follows:

```Python
import urllib3

proxy_addr = 'http://88.198.201.112:8888'
print(f'Proxy address: {proxy_addr}')
# ProxyManager routes every request through the given HTTP proxy
proxy = urllib3.ProxyManager(proxy_addr)
resp = proxy.request('GET', 'https://httpbin.org/ip')
print(resp.data.decode('utf-8'))
```

**Result:**

```
$ python ./get_proxy.py
Proxy address: http://88.198.201.112:8888
{
  "origin": "88.198.201.112"
}
```

The source IP address returned by `httpbin.org` is the proxy address rather than my own public address. This helps avoid being rate-limited for sending too many requests from the same IP address.

#### Using a SOCKS5 proxy

> You may need to install the `PySocks` package before using this.

```Bash
pip install 'urllib3[socks]'
```

Example code:

```Python
from urllib3.contrib.socks import SOCKSProxyManager

proxy_addr = 'socks5://127.0.0.1:1080'
print(f'SOCKS5 proxy address: {proxy_addr}')
# SOCKSProxyManager routes every request through the SOCKS proxy
proxy = SOCKSProxyManager(proxy_addr)
resp = proxy.request('GET', 'https://httpbin.org/ip')
print(resp.data.decode('utf-8'))

url = 'https://www.google.com'
resp = proxy.request('GET', url)
print(f'Status code: {resp.status}')
```

**Result:**

```Bash
$ python ./get_socks.py
SOCKS5 proxy address: socks5://127.0.0.1:1080
{
  "origin": "88.198.201.112"
}
Status code: 200
```

---
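As an offline illustration of the GET-parameter and JSON examples above, the following minimal sketch reproduces, with the standard library only, the query string seen in the `url` field of the GET result and the 28-byte JSON body reported as `Content-Length` in the POST result. The helper names `build_get_url` and `build_json_body` are hypothetical and not part of urllib3; the sketch assumes the parameters are form-encoded the way `urllib.parse.urlencode` does it.

```python
import json
from urllib.parse import urlencode

payload = {'name': 'Peter', 'age': 23}


def build_get_url(base_url, fields):
    """Append fields to base_url as a query string (hypothetical helper)."""
    return base_url + '?' + urlencode(fields)


def build_json_body(data):
    """Serialize data to UTF-8 encoded JSON bytes (hypothetical helper)."""
    return json.dumps(data).encode('utf-8')


get_url = build_get_url('https://httpbin.org/get', payload)
json_body = build_json_body(payload)

print(get_url)         # URL with the encoded query string
print(json_body)       # raw bytes that would be passed as body=...
print(len(json_body))  # value the server would see as Content-Length
```

Building these values locally is a quick way to check what a request will carry before sending it through `PoolManager.request`.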