### **Step 1: Create the project**
~~~
scrapy startproject douyu
~~~
### **Step 2: Create the spider**
~~~
scrapy genspider douyucdn capi.douyucdn.cn
~~~
### **Step 3: Write items.py to declare the fields to extract**
~~~
import scrapy

class DouyuItem(scrapy.Item):
    nickname = scrapy.Field()  # the anchor's nickname
    headimg = scrapy.Field()   # URL of the anchor's cover image
~~~
### **Step 4: Write spiders/xxx.py, the spider file that handles requests and responses and extracts the data (yield item)**
~~~
import scrapy
import json
from douyu.items import DouyuItem

class DouyucdnSpider(scrapy.Spider):
    name = 'douyucdn'
    allowed_domains = ['douyucdn.cn']
    baseUrl = 'http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset='
    offset = 0
    start_urls = [baseUrl + str(offset)]

    def parse(self, response):
        data_list = json.loads(response.body)['data']
        # Stop paginating once the API returns an empty list
        if not len(data_list):
            return
        for data in data_list:
            item = DouyuItem()
            item['headimg'] = data['vertical_src']
            item['nickname'] = data['nickname']
            yield item
        # Request the next page of 20 rooms
        self.offset += 20
        yield scrapy.Request(self.baseUrl + str(self.offset), callback=self.parse)
~~~
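The field extraction in `parse()` can be sanity-checked standalone against a hand-made payload shaped like the API response (the URLs and nicknames below are made up; the keys `data`, `vertical_src`, and `nickname` are the ones the spider reads):

~~~
import json

# Made-up payload mimicking the structure returned by the getVerticalRoom API
sample_body = json.dumps({
    "data": [
        {"vertical_src": "https://example.com/a.jpg", "nickname": "anchor_a"},
        {"vertical_src": "https://example.com/b.jpg", "nickname": "anchor_b"},
    ]
})

# Same extraction the spider performs, but on plain dicts instead of DouyuItem
data_list = json.loads(sample_body)["data"]
items = [{"headimg": d["vertical_src"], "nickname": d["nickname"]} for d in data_list]
print(items)
~~~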
### **Step 5: Write pipelines.py, the pipeline that processes the items returned by the spider**
~~~
import os
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from douyu.settings import IMAGES_STORE as images_store

class DouyuPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        img_url = item['headimg']
        yield scrapy.Request(img_url)

    def item_completed(self, results, item, info):
        # Pull the saved file paths out of results
        image_path = [x["path"] for ok, x in results if ok]
        # Join with IMAGES_STORE (imported from settings.py) to build the full paths
        old_path = images_store + image_path[0]
        new_path = images_store + 'named/' + item['nickname'] + '.jpg'
        # Make sure the target directory exists, or os.rename will fail
        os.makedirs(images_store + 'named/', exist_ok=True)
        os.rename(old_path, new_path)
        return item
~~~
### **Step 6: Write settings.py to enable the pipeline and add the other related settings**
> Since we need to masquerade as a phone, set a mobile user-agent; http://www.fynas.com/ua lists UA strings for the phone you want to impersonate
~~~
USER_AGENT = 'Mozilla/5.0 (iPhone 84; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.0 MQQBrowser/7.8.0 Mobile/14G60 Safari/8536.25 MttCustomUA/2 QBWebViewType/1 WKType/1'
~~~
> Since we save the anchors' photos locally, specify where to store them
~~~
IMAGES_STORE = "C:/Users/Administrator/Desktop/douyu/images/"
~~~
> Image handling relies on the third-party library Pillow, so install it first if you haven't; otherwise you'll get PIL-related errors
~~~
pip install Pillow
~~~
Some sites enforce robots.txt filtering, so turn robots.txt obedience off:
~~~
ROBOTSTXT_OBEY = False
~~~
Then register the pipeline:
~~~
ITEM_PIPELINES = {
    'douyu.pipelines.DouyuPipeline': 300,
}
~~~
### **Step 7: Run the spider**
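From the project root, start the crawl with the spider's name (the `name` attribute set in the spider class):

~~~
scrapy crawl douyucdn
~~~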
Note:
How do we extract the `path` values from the `results` data below?
~~~
results = [(True, {'url': 'https://rpic.douyucdn.cn/live-cover/appCovers/2018/02/01/4189383_20180201171138_big.jpg',
                   'path': 'full/811a893386a55177f36abcde290eaf16933e5888.jpg',
                   'checksum': '0fd2746c8711d9eb6c7bc3db138f0ac4'})]
~~~
Use the following:
~~~
path = [x["path"] for ok, x in results if ok]
~~~
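The `if ok` filter also drops failed downloads, whose tuples carry `(False, failure)` instead of a result dict. A minimal runnable check with made-up paths (in a real run the second element of a failed tuple is a Twisted Failure, not a plain Exception):

~~~
results = [
    (True, {"url": "https://example.com/a.jpg", "path": "full/aaa.jpg", "checksum": "x1"}),
    (False, Exception("download failed")),  # stand-in for a failed download
    (True, {"url": "https://example.com/b.jpg", "path": "full/bbb.jpg", "checksum": "x2"}),
]

# Only the successful tuples contribute a path
path = [x["path"] for ok, x in results if ok]
print(path)
~~~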