Sending Requests to the Scheduler · TUNA-daily

## 1. The Spider class

1. Spider class

~~~
# Call back into parse to handle the response
scrapy.Request(url=self.url+str(self.offset),callback=self.parse)
~~~

~~~
# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import tecentItem


class TunaSpider(scrapy.Spider):
    name = 'tecent'  # Spider identifier; must be unique, so every spider has a different name
    allowed_domains = ['hr.tencent.com']  # Restrict crawling to this domain
    url = 'http://hr.tencent.com/position.php?&start='
    offset = 0
    start_urls = [url + str(offset)]  # Entry URL of the spider

    # Parse method: called once each start URL has been downloaded, with the
    # response object returned for that URL as its only argument
    def parse(self, response):
        teacher_list = response.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        for each in teacher_list:
            item = tecentItem()
            # Without extract(), the result is a list of XPath selector objects
            try:
                position_name = each.xpath('./td[1]/a/text()').extract()[0]
                position_type = each.xpath('./td[2]/text()').extract()[0]  # title
                location = each.xpath('./td[4]/text()').extract()[0]  # info
                time = each.xpath('./td[5]/text()').extract()[0]
                detail = each.xpath('./td[1]/a/@href').extract()[0]
            except IndexError:
                # A cell is missing in this row, so skip it
                continue

            print("Position name: " + position_name)
            print("Position type: " + position_type)
            print("Location: " + location)
            print("Published: " + time)
            print("Detail: " + detail)

            item['position_type'] = position_type
            item['position_name'] = position_name
            item['location'] = location
            item['publish_time'] = time
            item['detail'] = "http://hr.tencent.com/" + detail
            yield item

        # Request the next page until the last offset is reached. Once no new
        # request is yielded the spider closes by itself, so the process does
        # not need to be killed with os._exit(), which would skip pipeline cleanup.
        if self.offset < 2250:
            self.offset += 10
            print("Next page: " + self.url + str(self.offset))
            yield scrapy.Request(url=self.url+str(self.offset), callback=self.parse)
~~~

## 2. The CrawlSpider class

### 2.1 Example

* Generate a spider subclass from the CrawlSpider template:

~~~
scrapy genspider -t crawl dongguan 'wz.sun0769.com'
# scrapy genspider -t crawl    fixed part of the command
# dongguan 'wz.sun0769.com'    spider name, then the domain it is restricted to
~~~

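CrawlSpider is similar to Spider, with one important difference: the parsing callback must not be named `parse()`, because CrawlSpider implements its own crawling logic in `parse()`. Callbacks with other names are therefore referenced from `rules`.

~~~
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from mySpider.items import dongguan


class DongguanSpider(CrawlSpider):
    name = 'dongguan'  # Spider identifier
    allowed_domains = ['wz.sun0769.com']  # Restrict crawling to this domain
    # Start page. Unlike Spider, only the links that match the rules are
    # extracted from the start page; the actual data extraction is driven by rules
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=0']

    # Link extraction starts from rules
    rules = (
        # The first rule only extracts the listing-page links that match and
        # does not parse them, so it has no callback. follow controls whether
        # the crawl goes one level deeper: False means the links on the fetched
        # page are neither extracted nor requested
        Rule(LinkExtractor(allow=r'type=4&page=\d+'), follow=True),
        Rule(LinkExtractor(allow=r'question/\d+/\d+\.shtml'), callback='parse_item', follow=False)
    )

    def parse_item(self, response):
        item = dongguan()
        question = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0]
        state = response.xpath('//div[@class="audit"]//span/text()').extract()[0]
        print("Question: " + question)
        print("Status: " + state)
        item['question'] = question
        item['state'] = state
        yield item
~~~

* Spider version, handling multiple requests:

~~~
# -*- coding: utf-8 -*-
import scrapy
from newdongguan.items import NewdongguanItem


class DongdongSpider(scrapy.Spider):
    name = 'xixi'
    allowed_domains = ['wz.sun0769.com']
    url = 'http://wz.sun0769.com/index.php/question/questionType?type=4&page='
    offset = 0
    start_urls = [url + str(offset)]

    # parse handles each listing page
    def parse(self, response):
        # All post links on the current page
        links = response.xpath('//div[@class="greyframe"]/table//td/a[@class="news14"]/@href').extract()
        # Iterate over the extracted links
        for link in links:
            # Queue a request for each post and let self.parse_item handle the response
            yield scrapy.Request(link, callback = self.parse_item)

        # Until the stop condition is met, keep increasing offset and request
        # the next listing page, which is again handled by parse
        if self.offset <= 71160:
            self.offset += 30
            # Queue the request; self.parse handles the response
            yield scrapy.Request(self.url + str(self.offset), callback = self.parse)

    # Handles the page behind each post link
    def parse_item(self, response):
        item = NewdongguanItem()
        # Title
        item['title'] = response.xpath('//div[contains(@class, "pagecenter p3")]//strong/text()').extract()[0]
        # Number
        item['number'] = item['title'].split(' ')[-1].split(":")[-1]
        # Content: first try the rule for posts that contain images; if it
        # matches, it returns the list of all text fragments
        content = response.xpath('//div[@class="contentext"]/text()').extract()
        # An empty list means no match, so fall back to the rule for posts without images
        if len(content) == 0:
            content = response.xpath('//div[@class="c1 text14_2"]/text()').extract()
        item['content'] = "".join(content).strip()
        # URL
        item['url'] = response.url
        # Hand the item to the pipeline
        yield item
~~~

* The items and pipelines are the same as for Spider, and so is the command that starts the crawl.

### 2.2 LinkExtractors

> class scrapy.linkextractors.LinkExtractor
>
> The purpose of Link Extractors is simple: extracting links.
>
> Each LinkExtractor has a single public method, extract_links(), which receives a Response object and returns a list of scrapy.link.Link objects.
>
> A Link Extractor is instantiated once, and its extract_links method is then called many times, once per response, to extract links.

~~~
class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a','area'),
    attrs = ('href',),
    canonicalize = True,
    unique = True,
    process_value = None
)
~~~

Main parameters:

* allow: URLs that match the given regular expression (or list of regular expressions) are extracted; if empty, every URL matches.
* deny: URLs that match this regular expression (or list of regular expressions) are never extracted. It takes precedence over allow.
* allow_domains: the domains whose links are extracted.
* deny_domains: the domains whose links are never extracted.
* restrict_xpaths: XPath expressions selecting the regions of the response from which links are extracted; works together with allow to filter links.

### 2.3 rules

`rules` contains one or more Rule objects, and each Rule defines a particular behaviour for crawling the site. If several rules match the same link, the first one is used, following the order in which they are defined in `rules`.

~~~
class scrapy.spiders.Rule(
    link_extractor,
    callback = None,
    cb_kwargs = None,
    follow = None,
    process_links = None,
    process_request = None
)
~~~

> link_extractor: a Link Extractor object that defines which links are extracted.
>
> callback: the callback invoked for every link obtained from link_extractor. The callback receives a response as its first argument.
>
> Note: when writing crawl rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if parse is overridden the crawl spider stops working.
>
> follow: a boolean that specifies whether the links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
>
> process_links: names the spider method that is called with the list of links obtained from link_extractor. It is mainly used for filtering. Note: some sites return bogus URLs to crawlers, so a method is needed to rewrite them into the correct ones.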
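>
> process_request: names the spider method that is called for every request extracted by this rule. (Used to filter requests.)

For example, the site below swaps `?` and `&` as an anti-crawling measure, so they have to be swapped back:

~~~
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DongdongSpider(CrawlSpider):
    name = 'dongdong'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=']

    # Rule matching each listing page
    pagelink = LinkExtractor(allow=("type=4"))
    # Rule matching each post within a listing page
    contentlink = LinkExtractor(allow=(r"/html/question/\d+/\d+\.shtml"))

    rules = (
        # The URLs in this case are tampered with by the web server, so
        # process_links is used to repair the extracted URLs.
        # There is no callback, so follow defaults to True
        Rule(pagelink, process_links = "deal_links"),
        # parse_item is omitted from this excerpt
        Rule(contentlink, callback = "parse_item")
    )

    # links is the list of links extracted from the current response
    def deal_links(self, links):
        for each in links:
            each.url = each.url.replace("?","&").replace("Type&","Type?")
        return links  # Return the repaired links
~~~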
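
The string operations in `deal_links` can be checked without running a crawl. The following is a minimal sketch, assuming a tampered link has the form `...questionType&type=4?page=30`. The sample URL, the `repair_url` helper, and the use of `SimpleNamespace` as a stand-in for `scrapy.link.Link` are illustrative assumptions, not taken from the site or from Scrapy.

```python
from types import SimpleNamespace


def repair_url(url):
    """Undo the swap of '?' and '&' with the same replacements as deal_links."""
    return url.replace("?", "&").replace("Type&", "Type?")


def deal_links(links):
    """Repair the url attribute of every link-like object in place."""
    for each in links:
        each.url = repair_url(each.url)
    return links


# Hypothetical tampered link; SimpleNamespace only mimics the .url attribute
# of a scrapy.link.Link object
tampered = SimpleNamespace(
    url="http://wz.sun0769.com/index.php/question/questionType&type=4?page=30"
)
fixed = deal_links([tampered])
print(fixed[0].url)
```

Because the first `replace` turns every `?` into `&` and the second restores the single `?` after `questionType`, a URL that was already correct passes through unchanged, so the method is safe to apply to every extracted link.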