chapter28_Stage Assessment 4_Crawler for Downloading NetEase Auto Images · Python Quick Start

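## Project requirements

>[info] In the earlier chapters we learned how to make HTTP requests, parse HTML, read and write files, and use multithreading.
>
>Building on these skills, let's write a crawler that downloads the images from the NetEase Auto (網易汽車) website to local disk!
>
>NetEase Auto new-energy vehicle selection home page: http://product.auto.163.com/newpower/#newindex
>1. For each new-energy model, download the (single) introductory image from its detail page, using the naming format: brand_model_minprice_maxprice.jpg
>2. Use multithreading to make the program more efficient

## Notes

* Complete the project on your own first, then refer to the sample code below
* If you think your code is *more elegant and more efficient*, feel free to **leave a comment** and **share** it with everyone~

:-: Take up the challenge~

## Reference code:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from pyquery import PyQuery as pq
import requests
import queue
import math
import time
import threading


class WyCard(object):
    def __init__(self, card_type):
        self.list_url = {
            "newpower": "http://product.auto.163.com/energy_api/getEnergySeriesList.action?orderType=0&size=20&page="
        }.get(card_type)
        self.q = queue.Queue()

    def put_q_list(self):
        """
        Put car information into the queue.
        :return:
        """
        try:
            resp = requests.get(self.list_url + "1")
            page_count = 1
            if resp.status_code == 200:
                data = resp.json()
                total = data.get("total")
                card_list = data.get("list")
                page_count = math.ceil(int(total) / 20)
                self.q.put(card_list)
            else:
                print("Failed to fetch list page:", self.list_url + "1")
            # Page 1 is already queued; fetch the remaining pages 2..page_count
            for i in range(2, page_count + 1):
                resp = requests.get(self.list_url + str(i))
                if resp.status_code == 200:
                    card_list = resp.json().get("list")
                    self.q.put(card_list)
                else:
                    print("Failed to fetch list page:", self.list_url + str(i))
        finally:
            # Append a special marker at the end of the queue so that the consumer
            # threads know no more cars will be added. Doing this in `finally`
            # keeps the consumers from waiting forever if a request raises.
            self.q.put("END")

    def get_q_list(self):
        """
        Take car information from the queue, read the brand, model, prices and
        detail-page URL, then fetch the image on the detail page and download it.
        :return:
        """
        while True:
            data = self.q.get()
            if data == "END":
                print("END....")
                # A small trick: each thread puts "END" back into the queue before
                # it exits, so the next thread that takes "END" also knows it should
                # exit. In effect, a finishing thread tells the others to finish too.
                self.q.put("END")
                break
            for item in data:
                url = item.get("url")
                brand_name = item.get("brand_name")
                name = item.get("name")
                price_min = item.get("price_min")
                price_max = item.get("price_max")
                # 萬 is the price unit (10,000 CNY); the f-string also copes with
                # non-string values
                title = f"{brand_name}_{name}_{price_min}萬_{price_max}萬.jpg"
                self._download_card(url, title)

    def _download_card(self, series_url, title):
        """
        Download the image to local disk.
        :param series_url:
        :param title:
        :return:
        """
        resp = requests.get(series_url)
        if resp.status_code == 200:
            d = pq(resp.text)
            img_src = d('#car_pic img').attr("src")
            if not img_src:
                # Without this check, requests.get(None) would raise and kill the thread
                print("No image found on page:", series_url)
                return
            resp = requests.get(img_src)
            if resp.status_code == 200:
                with open(title, "wb") as fp:
                    print("Downloading:", title, flush=True)
                    fp.write(resp.content)
            else:
                print("Failed to fetch image:", img_src)
        else:
            print("Failed to fetch page:", series_url)


def main(card_type):
    """
    Main function: one thread fetches the car list while several threads download images.
    :param card_type:
    :return:
    """
    card = WyCard(card_type)
    thread_list = []
    # Create one thread that fetches the car list
    thread_list.append(threading.Thread(target=card.put_q_list, args=()))
    # Create several threads that download images; 10 threads here
    get_thread_count = 10
    for i in range(get_thread_count):
        thread_list.append(threading.Thread(target=card.get_q_list, args=()))
    # Start all threads
    for t in thread_list:
        t.start()
    # Wait for all threads to finish
    for t in thread_list:
        t.join()


if __name__ == '__main__':
    start_time = time.time()
    main("newpower")
    print("total_time:", time.time() - start_time)
```

**Logic analysis:**

1. Visit the home page http://product.auto.163.com/newpower/#newindex. The first thing to find out is how many list pages there are. When we click to another page, the URL does not change, which means we cannot page through the list by requesting different URLs directly.
2. Look for the page count. In the source of the list page, the pagination element is nowhere to be found.
   ![](https://box.kancloud.cn/9e5be56e8401d016e99f20610b3c4b10_999x332.jpg)
   ![](https://box.kancloud.cn/a2087ef133590d4947e7f112c4b0ee4b_853x353.jpg)
   From this we can conclude that parsing the page HTML will not give us the car list data or the page count.
3. So how does the site paginate? The browser's network panel shows that clicking a page link sends an XHR request that returns the car list, and that the request has a parameter named `page`. We can therefore use this endpoint to get the total number of cars, and use the `page` parameter to fetch each page of the list.
   `GET http://product.auto.163.com/energy_api/getEnergySeriesList.action?orderType=0&size=20&page=1`
4. The car list response directly contains the key information: brand, model, detail-page URL, and so on.
5. Using the detail-page URL, request the detail page and parse its HTML to find the image, then download the image to local disk.
6. One thread fetches the car list and puts it into the queue. Several threads take car lists from the queue, loop over each list to get the individual cars, and save the corresponding images locally.
7. Vary the number of download threads (there are 156 images in total). A single thread is slow, but more threads is not always faster. On my machine, 1 thread takes 52s, 10 threads take 7s, and 100 threads take 8s.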
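
The page-count arithmetic and the queue hand-off described in the logic analysis can be tried out without touching the network. The following is a minimal sketch rather than the reference implementation: `page_count`, `producer`, `consumer` and `run` are hypothetical names, the "cars" are plain integers, and doubling each number stands in for downloading an image. It does keep the same trick as the reference code, where a worker that sees the end marker puts it back so that the remaining workers can exit too.

```python
import math
import queue
import threading

SENTINEL = "END"  # end-of-queue marker, same idea as in the reference code


def page_count(total, page_size=20):
    """Number of pages needed to list `total` items, `page_size` per page."""
    return math.ceil(total / page_size)


def producer(q, pages):
    """Put every page (a list of items) on the queue, then the end marker."""
    for page in pages:
        q.put(page)
    q.put(SENTINEL)


def consumer(q, results, lock):
    """Process pages until the end marker appears, then pass it on and exit."""
    while True:
        page = q.get()
        if page == SENTINEL:
            q.put(SENTINEL)  # let the next worker see the marker as well
            break
        for item in page:
            with lock:
                results.append(item * 2)  # stand-in for downloading one image


def run(total=156, page_size=20, workers=10):
    """Split fake items into pages and process them with `workers` threads."""
    items = list(range(total))
    pages = [items[i:i + page_size] for i in range(0, total, page_size)]
    q = queue.Queue()
    results = []
    lock = threading.Lock()
    threads = [threading.Thread(target=producer, args=(q, pages))]
    threads += [
        threading.Thread(target=consumer, args=(q, results, lock))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


if __name__ == "__main__":
    print("pages:", page_count(156))
    print("items processed:", len(run()))
```

With 156 items and 20 per page, `page_count` gives 8, which matches the number of requests the list thread has to make. One end marker is left in the queue after all workers exit; that is harmless here because the queue is discarded afterwards.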