Simulated Login · Python Crawler

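There are two ways to log in with Scrapy:

1. Send requests with an existing cookie. This fits when:
   (1) the cookie takes a long time to expire, which is common on poorly maintained sites;
   (2) all the data can be fetched before the cookie expires;
   (3) another program supplies the cookie, for example selenium logs in and saves the cookie locally, and Scrapy reads the local cookie before sending requests.
2. Find the login URL and send a POST request, letting Scrapy store the returned cookie.

Example: logging in to GitHub.

**1. Logging in with an existing cookie**

(1) Create the crawler project

```shell
> scrapy startproject git
> cd git
> scrapy genspider git1 github.com
```

(2) Configure `settings.py`

```python
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```

(3) Log in to GitHub manually and copy the cookie

![](https://img.kancloud.cn/7a/ae/7aae9e676664575c584101b6874be222_1303x445.jpg)

(4) Override the `start_requests` method

```python
import scrapy


class Git1Spider(scrapy.Spider):
    name = 'git1'
    allowed_domains = ['github.com']
    # Note: the requested url should be https://github.com/<your-github-username>
    start_urls = ['https://github.com/<your-github-username>']

    def parse(self, response):
        # Before login, the page title is "GitHub · GitHub".
        # After a successful login it is "<username> · GitHub",
        # so printing "<username> · GitHub" means the login worked.
        print(response.xpath('/html/head/title/text()').extract_first())

    def start_requests(self):
        """Override the default so the first request carries the cookie."""
        url = self.start_urls[0]
        cookie = '_ga=GA1.2.534025100(rest of the cookie omitted because it is too long)...3D'
        # 1. Convert the cookie string into a dict.
        #    Split each pair on the first '=' only, since values may contain '='.
        cookies = {}
        for data in cookie.split(';'):
            name, _, value = data.strip().partition('=')
            cookies[name] = value

        # 2. Send the request with the cookies attached
        yield scrapy.Request(
            url=url,
            callback=self.parse,
            cookies=cookies
        )
```

**2. Finding the login URL and sending a POST request with the required parameters**

The analysis of the login form is omitted here. The code below only shows how to send a POST request in Scrapy.

```python
import scrapy


class Git2Spider(scrapy.Spider):
    name = 'git2'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # 1. Parse out every parameter the login needs (omitted here)
        post_data = {}

        # 2. Submit the request to the login url.
        #    A POST request can be sent with scrapy.FormRequest
        #    or with scrapy.Request(url, method='POST').
        yield scrapy.FormRequest(
            url='https://github.com/session',  # address the GitHub form submits to
            callback=self.login_github,        # parsing function run after the login
            formdata=post_data                 # parameters required for the login
        )

    def login_github(self, response):
        # Minimal callback so the spider runs: print the title, as in git1
        print(response.xpath('/html/head/title/text()').extract_first())
```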
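
For approach 1, the cookie usually arrives in one of two shapes: a raw `name=value; name=value` string copied from the browser, or a file written by another program such as selenium. The following is a minimal sketch of turning both into the dict that `scrapy.Request(cookies=...)` accepts. The function names `parse_cookie_header` and `load_selenium_cookies` are hypothetical. The sketch also assumes the other program saved the list returned by selenium's `driver.get_cookies()` as JSON, which is one possible storage format, not something the original specifies.

```python
import json
from pathlib import Path


def parse_cookie_header(raw: str) -> dict:
    """Turn a 'k1=v1; k2=v2' string copied from the browser into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        pair = pair.strip()
        # Skip empty fragments and fragments without a value
        if not pair or '=' not in pair:
            continue
        # Split on the first '=' only: values may themselves contain '='
        name, value = pair.split('=', 1)
        cookies[name.strip()] = value.strip()
    return cookies


def load_selenium_cookies(path) -> dict:
    """Read cookies saved as JSON (assumed: a list of dicts with
    'name' and 'value' keys, as returned by driver.get_cookies())."""
    entries = json.loads(Path(path).read_text(encoding='utf-8'))
    return {entry['name']: entry['value'] for entry in entries}
```

Inside `start_requests`, either result could then be passed as `cookies=` to `scrapy.Request`, for example `cookies=load_selenium_cookies('cookies.json')`, where `cookies.json` is a placeholder file name.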