6.4 Scraping Toutiao Street-Snap Photos by Analyzing Ajax Requests · Python 3 Web Scraping Notes

### 1. Overview

Scrape street-snap (街拍) photos from Toutiao (今日頭條) and save them. The stated goal is to store the results in MongoDB; the script in section 4 saves the image files to local folders, one folder per article title.

### 2. Preparation

[Install the requests library](/1kai-fa-huan-jing-pei-zhi/12-qing-qiu-ku-de-an-zhuang/121-requestsde-an-zhuang.md)

### 3. Analyzing the requests

URL: [https://www.toutiao.com/search/?keyword=街拍](https://www.toutiao.com/search/?keyword=街拍)

![](/assets/6.4-1.png)

Open the browser's developer tools, switch to the Network panel, and select XHR to show only the Ajax requests.

Looking at the request [https://www.toutiao.com/search\_content/?offset=0&format=json&keyword=街拍&autoload=true&count=20&cur\_tab=1&from=search\_tab](https://www.toutiao.com/search_content/?offset=0&format=json&keyword=街拍&autoload=true&count=20&cur_tab=1&from=search_tab), you can see that it carries several query parameters. Keep scrolling down so the page loads more results, then compare the successive requests to work out what the important parameters mean:

- offset: the offset, i.e. the index of the first result returned by each load
- keyword: the search keyword
- count: the number of results returned per request

### 4. Hands-on practice {#3-實戰演練}

```
import os
from hashlib import md5
from multiprocessing import Pool
from urllib.parse import urlencode

import requests

baseurl = "https://www.toutiao.com/search_content/?"
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
}

# Pages GROUP_START..GROUP_END are fetched; page n starts at offset n * 20
GROUP_START = 1
GROUP_END = 20


def get_page(offset=0):
    """Request one page of search results; return the parsed JSON, or None on failure."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab',
    }
    url = baseurl + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print(e.args)
    return None


def get_images(response):
    """Yield one {'image', 'title'} dict per image found in a page of results."""
    if response:
        items = response.get("data")
        if items:
            for item in items:
                title = item.get("title")
                images = item.get("image_list")
                if title and images:
                    for image in images:
                        yield {
                            'image': image.get("url"),
                            'title': title,
                        }


def save_images(item):
    """Download one image into a folder named after the article title."""
    title = item.get("title")
    # exist_ok avoids an error when several worker processes create the same folder
    os.makedirs(title, exist_ok=True)
    try:
        image = item.get("image")
        # The code assumes the image URL has no scheme, so "http:" is prepended
        response = requests.get("http:" + image)
        if response.status_code == 200:
            # Read the binary content
            content = response.content
            # Name the file after the MD5 of its content so duplicates are detected
            filepath = "{0}/{1}.{2}".format(title, md5(content).hexdigest(), 'jpg')
            if not os.path.exists(filepath):
                with open(filepath, 'wb') as f:
                    f.write(content)
            else:
                print("Already downloaded {}".format(filepath))
    except requests.ConnectionError:
        print("Failed to Save Image")


def main(offset=0):
    response = get_page(offset)
    for item in get_images(response):
        print(item)
        save_images(item)


if __name__ == "__main__":
    # Start a process pool and hand each offset to a worker
    pool = Pool()
    groups = [x * 20 for x in range(GROUP_START, GROUP_END + 1)]
    print(groups)
    pool.map(main, groups)
    pool.close()
    pool.join()
```

![](/assets/6.4-10.png)
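
To make the `offset` and `count` paging from section 3 concrete, here is a minimal sketch that only builds the request URLs and sends nothing over the network. The helper names `build_search_url` and `page_offsets` are hypothetical and are not part of the script above; the parameter values are copied from the request analyzed earlier.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = "https://www.toutiao.com/search_content/?"


def build_search_url(offset, keyword="街拍", count=20):
    """Build the Ajax URL for one page of results (hypothetical helper)."""
    params = {
        "offset": offset,
        "format": "json",
        "keyword": keyword,
        "autoload": "true",
        "count": count,
        "cur_tab": 1,
        "from": "search_tab",
    }
    # urlencode percent-encodes the non-ASCII keyword as UTF-8
    return BASE_URL + urlencode(params)


def page_offsets(first_page, last_page, count=20):
    """Offsets for pages first_page..last_page inclusive; page n starts at n * count."""
    return [page * count for page in range(first_page, last_page + 1)]


if __name__ == "__main__":
    for offset in page_offsets(0, 2):
        url = build_search_url(offset)
        # Decode the query string again to confirm the keyword survives the round trip
        query = parse_qs(urlparse(url).query)
        print(offset, query["keyword"][0], url)
```

With `GROUP_START = 1` and `GROUP_END = 20`, `page_offsets(1, 20)` gives the same list as `groups` in the script: 20, 40, up to 400. The first page (offset 0) is therefore not fetched unless `GROUP_START` is set to 0.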