CrawlSpider · Python Crawler

CrawlSpider makes pagination requests simpler: `Rule(LinkExtractor...)` captures the URLs that match a rule, and a parser callback is then invoked for each captured URL.

<br/>

The steps are as follows:

**1. Create the CrawlSpider from the project directory**

```
# scrapy genspider -t crawl <spider name> <domain>
> scrapy genspider -t crawl ct_liks www.wxapp-union.com
```

Running the command above generates the following `ct_liks.py` file:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CtLiksSpider(CrawlSpider):
    name = 'ct_liks'
    allowed_domains = ['www.wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    # This method name can be changed freely
    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item
```

You can also create `ct_liks.py` by hand, but that is more tedious.

<br/>

**2. Define the rules in `ct_liks.py`**

```python
"""
@Date 2021/4/9
"""
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CtLiksSpider(CrawlSpider):
    name = 'ct_liks'
    allowed_domains = ['www.wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/']

    rules = (
        # 1. Capture links on http://www.wxapp-union.com/ that look like
        #    https://www.wxapp-union.com/article-7002-1.html
        #    If several Rules match the same URL, the first matching Rule in
        #    rules is the one that handles it.
        #    The dots are escaped so that they match a literal "." only.
        Rule(LinkExtractor(allow=r'www\.wxapp-union\.com/article-\d+-1\.html'), callback='parse_item'),
        # You can define more than one rule
        # Rule(LinkExtractor(allow=r'www\.wxapp-union\.com/article-\d+-1\.html'), callback='parse_item2'),
    )

    # 2. Every time a URL matches www\.wxapp-union\.com/article-\d+-1\.html,
    #    parse_item is called once automatically
    def parse_item(self, response):
        # .get() returns the first match (or None), like extract_first()
        title = response.xpath("//title").get()
        print(title)
```

<br/>

`LinkExtractor` and `Rule` also accept the following optional parameters:

```python
# Signatures as they appear in the Scrapy source; bodies are omitted here.
class LxmlLinkExtractor(FilteringLinkExtractor):
    def __init__(
        self,
        allow=(),            # Allowed URLs. Every URL matching this regular expression is extracted.
        deny=(),             # Denied URLs. No URL matching this regular expression is extracted.
        allow_domains=(),    # Allowed domains. Only URLs on the domains listed here are extracted.
        deny_domains=(),     # Denied domains. URLs on the domains listed here are never extracted.
        restrict_xpaths=(),  # Restricting XPath. Filters links together with allow.
        tags=('a', 'area'),
        attrs=('href',),
        canonicalize=False,
        unique=True,
        process_value=None,
        deny_extensions=None,
        restrict_css=(),
        strip=True,
        restrict_text=None,
    ):
        ...


class Rule:
    def __init__(
        self,
        link_extractor=None,   # A LinkExtractor object.
        callback=None,         # The callback to run for URLs that satisfy this rule.
                               # CrawlSpider uses parse as its own callback, so do not
                               # override parse as your callback.
        cb_kwargs=None,
        follow=None,           # Whether links extracted from the response by this rule
                               # should be followed. When no callback is given and follow
                               # is True, URLs matching the rule keep being requested.
        process_links=None,    # Receives the links obtained from link_extractor; used to
                               # filter out links that should not be crawled.
        process_request=None,
        errback=None,
    ):
        ...
```
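
The `allow` argument is a regular expression that is searched for anywhere inside each candidate URL, so an unescaped `.` matches any character. The sketch below uses only the standard `re` module (not Scrapy itself) to illustrate that matching behaviour and the "first matching rule wins" idea described above. The sample URLs and the helper names `allowed` and `first_matching_callback` are hypothetical and exist only for this illustration. It is a simplified model, not Scrapy's actual implementation.

```python
import re

# The article pattern with unescaped dots: "." matches ANY character.
LOOSE = re.compile(r'www.wxapp-union.com/article-\d+-1.html')
# A stricter variant with escaped dots, matching literal "." only.
STRICT = re.compile(r'www\.wxapp-union\.com/article-\d+-1\.html')

# Hypothetical sample URLs, for illustration only.
urls = [
    'https://www.wxapp-union.com/article-7002-1.html',  # a real-looking article page
    'https://www.wxapp-union.com/portal.php',           # not an article page
    'https://wwwXwxapp-union.com/article-1-1xhtml',     # look-alike without real dots
]


def allowed(pattern, url):
    """Model the allow check as a regex search anywhere in the URL."""
    return pattern.search(url) is not None


def first_matching_callback(url, rules):
    """Return the callback name of the first (pattern, callback) pair that matches."""
    for pattern, callback in rules:
        if allowed(pattern, url):
            return callback
    return None


# Two rules with the same pattern, mirroring the commented-out second Rule.
rules = [(STRICT, 'parse_item'), (STRICT, 'parse_item2')]

loose_hits = [allowed(LOOSE, u) for u in urls]
strict_hits = [allowed(STRICT, u) for u in urls]
chosen = first_matching_callback(urls[0], rules)

print(loose_hits)   # the look-alike URL slips through the unescaped pattern
print(strict_hits)  # only the genuine article URL matches
print(chosen)       # the first rule in the list handles the URL
```

Escaping the dots costs nothing and keeps the rule from matching unintended URLs. Ordering the rules from most specific to most general is a reasonable habit when several patterns can overlap.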