Crawling the entire Kuaidaili (快代理) site · TUNA-daily

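## 1. Spider

Subclass `CrawlSpider` and define link-extraction rules so that the "next page" links are followed.

~~~
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from xinlang.items import ProxyIpItem


class KuaiDaiLi(CrawlSpider):
    name = 'kuaidaili'                    # spider identifier
    allowed_domains = ['kuaidaili.com']   # restrict crawling to this domain
    # Start page. Unlike a plain Spider, a CrawlSpider uses the start page only to
    # extract links that match the rules; items are scraped by the rule callbacks.
    start_urls = ['https://www.kuaidaili.com/free/inha/1/']

    rules = (
        # Follow every pagination link matching free/inha/<n> and
        # hand each matching page to parse_item.
        Rule(LinkExtractor(allow=r'free/inha/\d+'), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        for row in response.xpath("//tbody//tr"):
            # .get() returns None instead of raising when a cell is empty
            item = ProxyIpItem()
            item['ip'] = row.xpath("./td[1]/text()").get()
            item['port'] = row.xpath("./td[2]/text()").get()
            item['type'] = row.xpath("./td[4]/text()").get()
            item['position'] = row.xpath("./td[5]/text()").get()
            # 'reponseTime' (sic) must match the field name declared in ProxyIpItem
            item['reponseTime'] = row.xpath("./td[6]/text()").get()
            yield item
~~~

## 2. Downloader middleware: User-Agent

Use `fake_useragent` to pick a random User-Agent for each request.

### 2.1 Custom middleware

~~~
from fake_useragent import UserAgent


class RandomAgentMiddleWare(object):
    """This middleware sets a random User-Agent on every request."""

    def __init__(self, crawler):
        self.ua = UserAgent()
        # Name of the UserAgent attribute to read, e.g. "random", "chrome", "firefox"
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    # Scrapy calls this to build the middleware instance
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        user_agent = getattr(self.ua, self.ua_type)
        spider.logger.debug("userAgent:{0}".format(user_agent))
        # Assign directly rather than setdefault(): at order 543 this runs after
        # Scrapy's built-in UserAgentMiddleware (order 500), which has already set
        # the header, so setdefault() would leave the default value in place.
        request.headers['User-Agent'] = user_agent
~~~

### 2.2 Configuring the downloader middleware

In `settings.py`. The order value (the number 543 attached to the middleware) can be set fairly high so that what this middleware sets is not overwritten by Scrapy's built-in middlewares: for `process_request`, a larger number runs later.

~~~
DOWNLOADER_MIDDLEWARES = {
    'xinlang.middlewares.RandomAgentMiddleWare': 543,
}
~~~

## 3. Pipeline: storing into MySQL

### 3.1 Custom pipeline

~~~
import pymysql


class MysqlPipeline(object):
    # Writes to MySQL synchronously
    def __init__(self):
        # Keyword arguments: recent PyMySQL releases no longer accept positional ones
        self.conn = pymysql.connect(host='192.168.56.130', user='root', password='tuna',
                                    database='proxyip', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    # Handle one item
    def process_item(self, item, spider):
        insert_sql = """
            insert into kuaidaili(ip, port, ip_position, ip_type, response_time)
            VALUES (%s, %s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql, (item["ip"], item["port"], item["position"],
                                         item["type"], item["reponseTime"]))
        self.conn.commit()
        # Return the item so that any later pipeline still receives it
        return item

    def close_spider(self, spider):
        # Release the connection when the crawl ends
        self.cursor.close()
        self.conn.close()
~~~

### 3.2 Configuring the pipeline

In `settings.py`. The number (300) sets the order in which pipelines run: lower values run first. Scrapy enables no item pipelines by default, so with a single pipeline the exact value does not matter.

~~~
ITEM_PIPELINES = {
    'xinlang.pipelines.MysqlPipeline': 300,
}
~~~

## 4. Things to watch out for

When I first started crawling Kuaidaili, the spider kept failing for no obvious reason. After a round of debugging I noticed that it crawled fine while stepping through in the debugger. Then I remembered the most basic anti-crawling measure: limiting how many requests a single IP may make within a given time. The fix is to slow the crawl down by setting `DOWNLOAD_DELAY` in `settings.py`, which delays each request.

~~~
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
~~~

With that in place, the crawl runs.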
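
The generated settings template also points to AutoThrottle. As a rough sketch of how the fixed delay above could be combined with it, the fragment below uses standard Scrapy setting names, but every value other than `DOWNLOAD_DELAY = 3` is an illustrative assumption and has not been tuned against kuaidaili.com.

```python
# settings.py (sketch): all values except DOWNLOAD_DELAY are illustrative assumptions

DOWNLOAD_DELAY = 3                     # base delay between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True        # wait 0.5x to 1.5x DOWNLOAD_DELAY (Scrapy's default)
CONCURRENT_REQUESTS_PER_DOMAIN = 1     # at most one in-flight request per domain

AUTOTHROTTLE_ENABLED = True            # adapt the delay to observed response latency
AUTOTHROTTLE_START_DELAY = 3           # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 30            # upper bound when the site responds slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests AutoThrottle aims for
```

With AutoThrottle enabled, `DOWNLOAD_DELAY` acts as a lower bound: the delay can grow when the site slows down, but it never drops below 3 seconds.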