1. Common Requests Methods · Python3 Web Crawling in Practice

## **Common Requests Methods**

### Commonly used functions in the Requests library

```
requests.get()      the main method for fetching HTML; sends a GET request
requests.post()     sends a POST request to a page
requests.put()      sends a PUT request to a page
requests.patch()    sends a partial-modification (PATCH) request to a page
requests.delete()   sends a DELETE request to a page
```

### 1. GET Request

~~~
import requests
import json
r = requests.get('http://httpbin.org/get')
html = r.text             # response body as a str
html2 = json.loads(html)  # the same body parsed into a dict
print(html)
print(type(html), type(html2))
print(html["url"])   # raises TypeError: html is a str and cannot be indexed by key
print(html2["url"])  # not reached in this run; works on its own because html2 is a dict

# Output (printed text first, then the traceback):
# {
#   "args": {},
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.22.0"
#   },
#   "origin": "114.248.162.218, 114.248.162.218",
#   "url": "https://httpbin.org/get"
# }
# <class 'str'> <class 'dict'>
# Traceback (most recent call last):
#   File "F:/Desktop/Project/課件代碼/1.py", line 8, in <module>
#     print(html["url"])
# TypeError: string indices must be integers
~~~

The run shows that `r.text` is a plain string: indexing it with `"url"` fails, while the dict produced by `json.loads()` can be indexed by key.

### 2. POST Request

~~~
import requests

data = {'name': 'germey', 'age': '22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)

# Output:
# {
#   "args": {},
#   "data": "",
#   "files": {},
#   "form": {
#     "age": "22",
#     "name": "germey"
#   },
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Content-Length": "18",
#     "Content-Type": "application/x-www-form-urlencoded",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.22.0"
#   },
#   "json": null,
#   "origin": "114.248.162.218, 114.248.162.218",
#   "url": "https://httpbin.org/post"
# }
~~~

### 3. Adding Headers

~~~
import requests

# Without a browser-like User-Agent the site rejects the request
r1 = requests.get("https://www.zhihu.com/explore")
print(r1.text)
print("==============")

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko)'
}
r2 = requests.get("https://www.zhihu.com/explore", headers=headers)
print(r2.text)

# Output:
# <html>
# <head><title>400 Bad Request</title></head>
# <body bgcolor="white">
# <center><h1>400 Bad Request</h1></center>
# <hr><center>openresty</center>
# </body>
# </html>
# ==============
# <!doctype html>
# <html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react-helmet="true">發現 - 知乎</title><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"/>
# ... (remaining HTML omitted)
~~~

### 4. File Upload

~~~
import requests

# Open the file in binary mode; the with block closes it after the upload
with open('favicon.png', 'rb') as f:
    files = {'file': f}
    r = requests.post("http://httpbin.org/post", files=files)
print(r.text)

# Output (base64 data truncated):
# {
#   "args": {},
#   "data": "",
#   "files": {
#     "file": "data:application/octet-stream;base64,iVBORw0KGgoAAAANSUhEUgAAAhwAAAECCAMAAACCFP44..."
#   },
#   "form": {},
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Content-Length": "8024",
#     "Content-Type": "multipart/form-data; boundary=ae576c1072214f7675389b19c437283d",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.22.0"
#   },
#   "json": null,
#   "origin": "114.248.162.218, 114.248.162.218",
#   "url": "https://httpbin.org/post"
# }
~~~

### 5. Proxy Settings

Some sites return content normally when you send a few test requests. Once you start crawling at scale, though, large volumes of frequent requests may trigger a CAPTCHA, a redirect to a login page, or even an outright ban of the client IP that blocks access for a period of time. To avoid this, route requests through a proxy by passing the `proxies` parameter. It can be set like this:

~~~
import requests

proxies = {
    "http": "http://sun:qq123456.@192.168.66.211:520",
}
r1 = requests.get('http://httpbin.org/get')                   # direct request
r2 = requests.get('http://httpbin.org/get', proxies=proxies)  # request through the proxy
print(r1.text)
print(r2.text)

# Output (note that "origin" differs between the two responses):
# {
#   "args": {},
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.22.0"
#   },
#   "origin": "114.248.162.218, 114.248.162.218",
#   "url": "https://httpbin.org/get"
# }
# {
#   "args": {},
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.22.0"
#   },
#   "origin": "175.98.194.165, 175.98.194.165",
#   "url": "https://httpbin.org/get"
# }
~~~

### 6. Timeout Settings

When the local network is poor, or the server responds slowly or not at all, a request may wait a very long time for a response, or eventually fail without ever receiving one. To guard against this, set a timeout with the `timeout` parameter: if no response arrives within that many seconds, an exception is raised. The clock starts when the request is sent and stops when the server returns a response. More precisely, in Requests a single number is applied separately to establishing the connection and to each wait for data from the server, not to the total download time; a `(connect, read)` tuple sets the two limits independently. For example:

~~~
# Set a timeout (deliberately tiny, so the request fails)
import requests

r = requests.get("https://www.taobao.com", timeout=0.0001)
print(r.status_code)

# Output:
# requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.taobao.com', port=443): Read timed out. (read timeout=0.0001)

# Never time out
import requests

# A 1-second timeout, for comparison
r = requests.get("https://www.taobao.com", timeout=1)
print(r.status_code)

# timeout=None (the default) disables the timeout, so the call waits indefinitely
r = requests.get('https://www.google.com', timeout=None)
print(r.text)
~~~

### 7. Session Persistence

In requests, calling `get()` or `post()` directly does simulate a page request, but each call is effectively a separate session, as if you had opened each page in a different browser. Imagine this scenario: the first request uses `post()` to log in to a site, and a second request uses `get()` to fetch your profile page after logging in. In practice this is like opening two browsers; the two sessions are completely unrelated, so the profile request will not succeed.

You could set the same cookies on both requests, and that works, but it is tedious. The simpler approach is to keep both requests in the same session, which is like opening a new tab in the same browser rather than launching a second browser. The tool for this, without setting cookies by hand each time, is the `Session` object. It maintains a session for you and handles cookies automatically.

~~~
# Test with plain get() calls:
import requests

requests.get('http://httpbin.org/cookies/set/number/123456789')
r = requests.get('http://httpbin.org/cookies')
print(r.text)

# Output:
# {
#   "cookies": {}
# }

# Test with a Session:
import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)

# Output:
# {
#   "cookies": {
#     "number": "123456789"
#   }
# }
~~~
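
To see what a `Session` actually carries from one request to the next, the sketch below builds a request without sending it and inspects the headers the session would attach. It is only an illustration under stated assumptions: the `USER_AGENT` string is a made-up value, and the cookie is added by hand (a server would normally set it via `Set-Cookie`) so that the example needs no network access.

```python
import requests

# Hypothetical value, chosen for illustration only.
USER_AGENT = "Mozilla/5.0 (example-crawler)"

session = requests.Session()

# Headers stored on the session are merged into every request it prepares.
session.headers.update({"User-Agent": USER_AGENT})

# Assumption: the cookie is set manually here instead of by a server response,
# so nothing has to be sent over the network.
session.cookies.set("number", "123456789")

# Build the request the session would send, without sending it.
request = requests.Request("GET", "http://httpbin.org/cookies")
prepared = session.prepare_request(request)

# The prepared request carries both the session headers and the session cookies.
print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])
```

The printed `Cookie` header should contain `number=123456789`, which is what the server saw in the Session test above. To actually send the request, `session.send(prepared, timeout=5)` or simply `session.get(url, timeout=5)` would do; the 5-second timeout is an arbitrary choice for this sketch.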