                ??一站式輕松地調用各大LLM模型接口,支持GPT4、智譜、豆包、星火、月之暗面及文生圖、文生視頻 廣告
                ## 1. spider 繼承CrawlSpider類,定義網頁提取規則,對“下一頁進行提取” ~~~ # -*- coding: utf-8 -*- from scrapy.linkextractors import LinkExtractor from scrapy.spiders import CrawlSpider, Rule from xinlang.items import ProxyIpItem class KuaiDaiLi(CrawlSpider): name = 'kuaidaili' # 爬蟲標識名 allowed_domains = ['kuaidaili.com'] # 限定爬取網頁的域 # 爬蟲開始頁,與Spider類不同的是,它的首頁值提取符合規則的連接,真正開始爬取數據從rules爬取 start_urls = ['https://www.kuaidaili.com/free/inha/1/'] # 從rlues開始提取 rules = ( # 只提取復合規則的頁面鏈接,不做分析,所以跟頁面但是沒有 Rule(LinkExtractor(allow=r'free/inha/\d+'), follow=True,callback='parse_item'), ) def parse_item(self, response): trList = response.xpath("//tbody//tr") for i in trList: ip = i.xpath("./td[1]/text()").extract()[0] port = i.xpath("./td[2]/text()").extract()[0] type = i.xpath("./td[4]/text()").extract()[0] position = i.xpath("./td[5]/text()").extract()[0] response_time = i.xpath("./td[6]/text()").extract()[0] item = ProxyIpItem() item['ip'] = ip item['port'] = port item['type'] = type item['position'] = position item['reponseTime'] = response_time yield item ~~~ ## 2. 下載中間件 User-Agent 使用fake_useragent隨機獲取agent ### 2.1 自定義中間件 ~~~ class RandomAgentMiddleWare(object): """This middleware allows spiders to override the user_agent""" def __init__(self,crawler ): self.ua = UserAgent() # 取到其定義的獲取Useragent的方法 self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random") # 返回一個中間件對象 @classmethod def from_crawler(cls, crawler): return cls(crawler) def process_request(self, request, spider): def getAgent(): userAgent = getattr(self.ua,self.ua_type) print("userAgent:{0}".format(userAgent)) return userAgent # 對request設置useragent request.headers.setdefault(b'User-Agent', getAgent()) ~~~ ### 2.2 配置下載中間件 settings.py文件,這里的權值(中間件對應的數字543)設置大一點是可以的,以免中間件的設置被scrapy默認的中間件覆蓋(大就后執行唄!) ~~~ DOWNLOADER_MIDDLEWARES = { 'xinlang.middlewares.RandomAgentMiddleWare': 543, } ~~~ ## 3. 
pipeline 存入mysql ### 3.1 自定義pipeline ~~~ class MysqlPipeline(object): # 采用同步的機制寫入mysql def __init__(self): self.conn = pymysql.connect('192.168.56.130', 'root', 'tuna', 'proxyip', charset="utf8", use_unicode=True) self.cursor = self.conn.cursor() # 處理item def process_item(self, item, spider): insert_sql = """ insert into kuaidaili(ip, port, ip_position, ip_type,response_time) VALUES (%s, %s, %s, %s, %s) """ self.cursor.execute(insert_sql, (item["ip"], item["port"], item["position"], item["type"],item["reponseTime"])) self.conn.commit() ~~~ ### 3.2 配置pipeline settings.py文件,這里的權值設置大一點是可以的,以免中間件的設置被scrapy默認的中間件覆蓋(大就后執行唄!) ~~~ ITEM_PIPELINES = { 'xinlang.pipelines.MysqlPipeline': 300, } ~~~ ## 4. 注意的問題 在剛開始爬取快代理的時候,不知道為啥老報錯,就一頓debug,發現debug時可以正常爬取;突然想到爬蟲最基礎的一條反爬蟲策略:限制ip在一定時間內的訪問次數。 咋整: 只能配置一下下載速度了,在settings.py文件中對DOWNLOAD_DELAY進行配置,它的意思就是延遲我的請求(request) ~~~ # Configure a delay for requests for the same website (default: 0) # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay # See also autothrottle settings and docs DOWNLOAD_DELAY = 3 # The download delay setting will honor only one of: #CONCURRENT_REQUESTS_PER_DOMAIN = 16 #CONCURRENT_REQUESTS_PER_IP = 16 ~~~ 好了 可以爬了
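The middleware above gets its User-Agent from `fake_useragent`. As a dependency-free sketch of the same idea, the random choice can be made over a small hardcoded list; the list and the helper name below are illustrative, not part of the original project:

```python
import random

# Illustrative User-Agent strings (not from the original project)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
]


def get_agent():
    """Pick a random User-Agent, like getattr(self.ua, "random") does."""
    return random.choice(USER_AGENTS)


print(get_agent())
```

A hardcoded list avoids the network fetch `fake_useragent` performs on first use, at the cost of the strings going stale over time.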
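The pipeline's parameterized INSERT can be exercised without a MySQL server by swapping in the standard library's `sqlite3`. The in-memory table and sample item below are illustrative; note that `sqlite3` uses `?` placeholders where `pymysql` uses `%s`:

```python
import sqlite3

# In-memory stand-in for the MySQL table used by the pipeline
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE kuaidaili("
    "ip TEXT, port TEXT, ip_position TEXT, ip_type TEXT, response_time TEXT)"
)

# Sample item mimicking what parse_item yields (values are made up)
item = {"ip": "1.2.3.4", "port": "8080", "position": "unknown",
        "type": "HTTP", "reponseTime": "2s"}

insert_sql = """
    insert into kuaidaili(ip, port, ip_position, ip_type, response_time)
    VALUES (?, ?, ?, ?, ?)
"""
cursor.execute(insert_sql, (item["ip"], item["port"], item["position"],
                            item["type"], item["reponseTime"]))
conn.commit()

print(cursor.execute("select ip, port from kuaidaili").fetchone())
# → ('1.2.3.4', '8080')
```

Passing values as a parameter tuple (rather than formatting them into the SQL string) lets the driver do the quoting, which is also what protects the real pipeline from malformed scraped values.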