Rewrite the [CrawlSpider](http://www.hmoore.net/king_om/py_1/2229599) as a distributed spider.

Steps:

**1. Create an ordinary CrawlSpider first**

```
# scrapy genspider -t crawl <spider name> <domain>
> scrapy genspider -t crawl ct_liks www.wxapp-union.com
```

**2. Convert the ordinary CrawlSpider into a distributed one**

```python
"""ct_liks.py"""
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# 1. Import RedisCrawlSpider
from scrapy_redis.spiders import RedisCrawlSpider


# 2. Inherit from RedisCrawlSpider
# class CtLiksSpider(CrawlSpider):
class CtLiksSpider(RedisCrawlSpider):
    name = 'ct_liks'
    # 3. Comment out allowed_domains and start_urls
    # allowed_domains = ['www.wxapp-union.com']
    # start_urls = ['http://www.wxapp-union.com/']
    # 4. Add redis_key
    redis_key = "ct_start_url"

    rules = (
        # the trailing comma matters: rules must be an iterable of Rule objects
        Rule(LinkExtractor(allow=r'www.wxapp-union.com/article-\d+-1.html'),
             callback='parse_item'),
    )

    # 5. Define allowed_domains in __init__
    def __init__(self, *args, **kwargs):
        domain = kwargs.pop('domain', '')
        # multiple allowed domains are separated with ,
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(CtLiksSpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        title = response.xpath("//title").extract_first()
        print(title)
```

**3. Set the distributed options in `settings.py`**

```python
###### add the following settings #########
# Duplicate filter backed by redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Scheduler that knows how to talk to the redis database
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Whether to keep the dedup set and request queue in redis when the spider closes
# True:  keep them
# False: don't keep them; the database is cleared when the run ends
SCHEDULER_PERSIST = True

#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"

ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
    # with this pipeline enabled, scraped items are stored in redis automatically
    'scrapy_redis.pipelines.RedisPipeline': 400,
}

# Redis connection
REDIS_URL = "redis://localhost:6379"
# or, equivalently:
# REDIS_HOST = 'localhost'
# REDIS_PORT = 6379

LOG_LEVEL = 'DEBUG'

# Introduce an artificial delay so the parallelism is actually exercised
DOWNLOAD_DELAY = 1
```

**4. Start the spider**

```shell
# pass the allowed domains with -a; separate multiple domains with ,
> scrapy runspider ct_liks.py -a domain='www.wxapp-union.com'
```

**5. Push the `start_urls` into the redis database**

```shell
> redis-cli lpush ct_start_url http://www.wxapp-union.com/
```

Once the spider reads the `start_urls` via its `redis_key`, it starts working.
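Equivalently, the start URL can be seeded from Python instead of redis-cli. A minimal sketch, assuming the `redis` (redis-py) package and the same Redis instance as `REDIS_URL` above:

```python
# seed.py -- hypothetical helper, equivalent to the redis-cli lpush in step 5
import redis

r = redis.Redis(host="localhost", port=6379)
# RedisCrawlSpider idles until a URL appears under its redis_key,
# then pops it and starts crawling from there.
r.lpush("ct_start_url", "http://www.wxapp-union.com/")
print(r.llen("ct_start_url"))  # start URLs still queued
```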
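With `RedisPipeline` enabled in step 3, each scraped item is serialized and pushed to a redis list (keyed `<spider name>:items` by scrapy-redis's default), so progress can be checked from another terminal:

```shell
# assumes scrapy-redis's default items key format "<spider>:items"
> redis-cli llen ct_liks:items        # items scraped so far
> redis-cli lrange ct_liks:items 0 0  # peek at the first stored item
```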