CrawlSpider makes pagination-style crawling simpler: a `Rule(LinkExtractor(...))` captures every URL matching a pattern, then hands each one to a parsing callback.

The steps are as follows:

**1. Create a crawl spider in the project directory**

```
# scrapy genspider -t crawl <spider name> <domain>
> scrapy genspider -t crawl ct_liks www.wxapp-union.com
```

Running the command above auto-generates the following `ct_liks.py` file:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CtLiksSpider(CrawlSpider):
    name = 'ct_liks'
    allowed_domains = ['www.wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    # This method name can be changed freely
    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item
```

`ct_liks.py` can also be created by hand; it is just more tedious.

**2. Define `rules` in `ct_liks.py`**

```python
"""
@Date 2021/4/9
"""
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CtLiksSpider(CrawlSpider):
    name = 'ct_liks'
    allowed_domains = ['www.wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/']

    rules = (
        # 1. Capture links on http://www.wxapp-union.com/ that look like
        #    https://www.wxapp-union.com/article-7002-1.html
        # If several Rules match the same URL, the first matching one in rules is used
        Rule(LinkExtractor(allow=r'www.wxapp-union.com/article-\d+-1.html'), callback='parse_item'),
        # You can define multiple rules
        # Rule(LinkExtractor(allow=r'www.wxapp-union.com/article-\d+-1.html'), callback='parse_item2'),
    )

    # 2. Whenever a URL matches www.wxapp-union.com/article-\d+-1.html,
    #    parse_item is called automatically for it
    def parse_item(self, response):
        title = response.xpath("//title").extract_first()
        print(title)
```

LinkExtractor and Rule also accept the following optional parameters:

```python
class LxmlLinkExtractor(FilteringLinkExtractor):
    def __init__(
        self,
        allow=(),            # allowed URLs: every URL matching this regex is extracted
        deny=(),             # denied URLs: URLs matching this regex are never extracted
        allow_domains=(),    # allowed domains: only URLs on these domains are extracted
        deny_domains=(),     # denied domains: URLs on these domains are never extracted
        restrict_xpaths=(),  # restricting XPath; filters links together with allow
        tags=('a', 'area'),
        attrs=('href',),
        canonicalize=False,
        unique=True,
        process_value=None,
        deny_extensions=None,
        restrict_css=(),
        strip=True,
        restrict_text=None,
    ):


class Rule:
    def __init__(
        self,
        link_extractor=None,  # a LinkExtractor object
        callback=None,        # the callback to run for URLs matching this rule.
                              # CrawlSpider itself uses parse as a callback,
                              # so do not override parse as your own callback
        cb_kwargs=None,
        follow=None,          # whether links extracted from the response under
                              # this rule should be followed. Without a callback,
                              # if follow is True, URLs matching this rule keep
                              # being requested
        process_links=None,   # called with the links extracted by link_extractor,
                              # to filter out links you do not want to crawl
        process_request=None,
        errback=None,
    ):
```
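To make the `follow` parameter concrete, here is a minimal sketch (not from the original project) that combines two Rules for pagination: one rule follows list pages without a callback, the other parses article pages. The pagination regex `forum-\d+-\d+\.html` is a hypothetical example, not a pattern confirmed for this site.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PagingSpider(CrawlSpider):
    name = 'paging_demo'
    allowed_domains = ['www.wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/']

    rules = (
        # Follow list/pagination pages: no callback, follow=True keeps
        # the crawl walking through every matching page.
        # (the forum-\d+-\d+.html pattern is a hypothetical example)
        Rule(LinkExtractor(allow=r'forum-\d+-\d+\.html'), follow=True),
        # Parse article pages: the callback runs once per matching URL.
        # When a callback is given, follow defaults to False.
        Rule(LinkExtractor(allow=r'article-\d+-1\.html'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'title': response.xpath('//title/text()').get()}
```

Because the first matching rule wins, put the more specific patterns before the broader ones when they can overlap.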
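Similarly, a hypothetical sketch of `restrict_xpaths` and `process_links`: the container XPath `//div[@id="ct"]` and the "draft" filtering logic are illustrative assumptions, not taken from the original tutorial.

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FilteredSpider(CrawlSpider):
    name = 'filtered_demo'
    allowed_domains = ['www.wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/']

    rules = (
        Rule(
            # restrict_xpaths narrows extraction to links inside one
            # container, on top of the allow regex.
            LinkExtractor(
                allow=r'article-\d+-1\.html',
                restrict_xpaths='//div[@id="ct"]',  # hypothetical container id
            ),
            callback='parse_item',
            # can be a callable, or the name of a spider method
            process_links='drop_unwanted_links',
        ),
    )

    # Receives the list of Link objects the LinkExtractor produced;
    # returns only the ones that should actually be requested.
    def drop_unwanted_links(self, links):
        return [link for link in links if 'draft' not in link.url]

    def parse_item(self, response):
        yield {'title': response.xpath('//title/text()').get()}
```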