<ruby id="bdb3f"></ruby>

    <p id="bdb3f"><cite id="bdb3f"></cite></p>

      <p id="bdb3f"><cite id="bdb3f"><th id="bdb3f"></th></cite></p><p id="bdb3f"></p>
        <p id="bdb3f"><cite id="bdb3f"></cite></p>

          <pre id="bdb3f"></pre>
          <pre id="bdb3f"><del id="bdb3f"><thead id="bdb3f"></thead></del></pre>

          <ruby id="bdb3f"><mark id="bdb3f"></mark></ruby><ruby id="bdb3f"></ruby>
          <pre id="bdb3f"><pre id="bdb3f"><mark id="bdb3f"></mark></pre></pre><output id="bdb3f"></output><p id="bdb3f"></p><p id="bdb3f"></p>

          <pre id="bdb3f"><del id="bdb3f"><progress id="bdb3f"></progress></del></pre>

                <ruby id="bdb3f"></ruby>

                ??一站式輕松地調用各大LLM模型接口,支持GPT4、智譜、豆包、星火、月之暗面及文生圖、文生視頻 廣告
The general logic of a crawler is: start from a given page, request it, extract every other link the page contains, put those links into a queue, then visit the queued pages one by one until a stop condition ends the crawl. To handle the list-page + detail-page pattern, the link-extraction logic has to be narrowed down. Fortunately Scrapy already provides this; the key is knowing the interface and using it flexibly.

## 1. Restricting which URLs get extracted

~~~
rules = (
    Rule(SgmlLinkExtractor(allow=('category/20/index_\d+\.html'),
                           restrict_xpaths=("//div[@class='left']"))),
    Rule(SgmlLinkExtractor(allow=('a/\d+/\d+\.html'),
                           restrict_xpaths=("//div[@class='left']")),
         callback='parse_item'),
)
~~~

> 1. A Rule defines how links are extracted. The two rules above match, respectively, the pagination pages of the list and the detail pages. The key point is that **restrict_xpaths** limits link extraction to a specific part of the page.
> 2. What follow does. First, this is the rule I used to crawl Douban's new books: `rules = (Rule(LinkExtractor(allow=(r'^https://book.douban.com/subject/[0-9]*/'),), callback='parse_item', follow=False),)`. With this rule the spider only crawls matching links found on the start_urls pages. If follow is set to True, the spider also looks for matching URLs inside the pages it has already crawled, and so on, until the whole site is covered. Second, whether or not a Rule has a callback, it is handled by the same _parse_response function, which simply checks whether follow and callback are set.

## 2. CrawlSpider in detail

CrawlSpider is built on top of Spider, but it was practically made for whole-site crawling. In brief:

> CrawlSpider is the usual choice for crawling sites that follow a regular structure. It is based on Spider and adds a few attributes of its own:
> rules: a collection of Rule objects, used to match the target pages and filter out the noise.
> parse_start_url: processes the responses of the start URLs and must return an Item or a Request.
> Since rules is a collection of Rule objects, Rule deserves a short introduction. Its parameters are link_extractor, callback, cb_kwargs, follow, process_links and process_request.
> link_extractor can be something you define yourself, or an instance of the existing LinkExtractor class, whose main parameters are:
> allow: URLs matching the given regular expression(s) are extracted; if empty, everything matches.
> deny: URLs matching this regular expression (or list of regular expressions) are never extracted.
> allow_domains: domains whose links will be extracted.
> deny_domains: domains whose links will never be extracted.
> restrict_xpaths: an XPath expression that works together with allow to filter links; there is a similar restrict_css.

Below is the example from the official documentation; I will use it to walk through some common questions from the point of view of the source code:

~~~
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
~~~

**Question: how does CrawlSpider work?**

Because CrawlSpider inherits from Spider, it has all of Spider's methods. First, start_requests issues a request for every URL in start_urls (via make_requests_from_url), and the response is received by parse. In a plain Spider we have to write parse ourselves, but CrawlSpider already defines parse to dispatch the response: `self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)`. _parse_response then does different things depending on whether callback, follow and self._follow_links are set:

~~~
def _parse_response(self, response, callback, cb_kwargs, follow=True):
    # If a callback was passed in, use it to parse the page and collect
    # the requests or items it produces.
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item
    # Then, if follow is enabled, let _requests_to_follow check whether the
    # response contains links that match the rules.
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item
~~~

_requests_to_follow in turn asks the link_extractor (the LinkExtractor we passed in) for the links it finds on the page (`link_extractor.extract_links(response)`), post-processes the URLs with process_links (which we define ourselves), issues a Request for every matching link, and finally passes each request through process_request (also ours to define).
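Both process_links and process_request are hooks we have to supply ourselves, either as callables or by method name (the strings are resolved by _compile_rules, discussed below). Here is a minimal sketch under my own assumptions: the spider, the filtering rule and the meta key are made up for illustration, and the optional second argument of the request hook only exists because newer Scrapy versions also pass the response to process_request.

~~~
# Sketch only: illustrative process_links / process_request hooks.
# Spider name, URLs, filter condition and meta key are assumptions.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class HookedSpider(CrawlSpider):
    name = 'hooked_example'
    start_urls = ['http://www.example.com']

    rules = (
        Rule(LinkExtractor(allow=(r'a/\d+/\d+\.html',)),
             callback='parse_item',
             process_links='drop_print_links',   # resolved to methods by _compile_rules
             process_request='mark_request'),
    )

    def drop_print_links(self, links):
        # Discard "printable version" links before any request is built.
        return [link for link in links if 'print' not in link.url]

    def mark_request(self, request, response=None):
        # Tag the request so later middleware can tell it came from a Rule.
        # The response argument is only provided by newer Scrapy releases.
        request.meta['from_rule'] = True
        return request

    def parse_item(self, response):
        yield {'url': response.url}
~~~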
**Question: how does CrawlSpider obtain its rules?**

The CrawlSpider class calls _compile_rules from its __init__ method. There it makes a shallow copy of each Rule in rules and resolves the callback, the process_links hook and the process_request hook:

~~~
def _compile_rules(self):
    def get_method(method):
        if callable(method):
            return method
        elif isinstance(method, six.string_types):
            return getattr(self, method, None)

    self._rules = [copy.copy(r) for r in self.rules]
    for rule in self._rules:
        rule.callback = get_method(rule.callback)
        rule.process_links = get_method(rule.process_links)
        rule.process_request = get_method(rule.process_request)
~~~

So how is Rule itself defined?

~~~
class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None,
                 follow=None, process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow
~~~

So the LinkExtractor we create is what gets passed in as link_extractor.

**Question: responses with a callback are handled by the specified function, but which function handles the ones without?**

From the walkthrough above we know that _parse_response handles the responses that do have a callback: `cb_res = callback(response, **cb_kwargs) or ()`. For the URLs matched on a page, _requests_to_follow issues the requests with self._response_downloaded as their callback: `r = Request(url=link.url, callback=self._response_downloaded)`.

**How to simulate a login in a CrawlSpider**

Because a CrawlSpider, like a Spider, issues its initial requests through start_requests, the code below (borrowed from Andrew_liu) shows how to simulate a login. It replaces the original start_requests and sets its callback to post_login:

~~~
from scrapy.http import Request, FormRequest
from scrapy.selector import Selector

def start_requests(self):
    return [Request("http://www.zhihu.com/#signin",
                    meta={'cookiejar': 1},
                    callback=self.post_login)]

def post_login(self, response):
    print('Preparing login')
    # Grab the _xsrf field from the page that came back; it is needed for
    # the form submission to succeed.
    xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
    print(xsrf)
    # FormRequest.from_response is a Scrapy helper for posting forms.
    # After a successful login the after_login callback is invoked.
    return [FormRequest.from_response(response,  # "http://www.zhihu.com/login",
                                      meta={'cookiejar': response.meta['cookiejar']},
                                      headers=self.headers,
                                      formdata={
                                          '_xsrf': xsrf,
                                          'email': '1527927373@qq.com',
                                          'password': '321324jia'
                                      },
                                      callback=self.after_login,
                                      dont_filter=True)]

# make_requests_from_url ends up being handled by parse, which hooks the
# responses back into CrawlSpider's rule machinery.
def after_login(self, response):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)
~~~
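One caveat when reusing after_login on a newer Scrapy: make_requests_from_url was deprecated and eventually removed, so the requests have to be built explicitly. A minimal sketch of an equivalent, assuming we still want the responses to go through CrawlSpider's default parsing so the rules keep working:

~~~
# Sketch: an after_login that does not rely on make_requests_from_url.
# Leaving callback unset lets CrawlSpider's default parsing handle the
# response, so the Rule machinery still runs; dont_filter=True mirrors
# what make_requests_from_url used to do.
from scrapy.http import Request

def after_login(self, response):
    for url in self.start_urls:
        yield Request(url, dont_filter=True)
~~~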
Finally, here is the source code of scrapy.spiders.CrawlSpider, so you can check the details:

~~~
"""
This modules implements the CrawlSpider which is the recommended spider to use
for scraping typical web sites that requires crawling pages.

See documentation in docs/topics/spiders.rst
"""

import copy
import six

from scrapy.http import Request, HtmlResponse
from scrapy.utils.spider import iterate_spider_output
from scrapy.spiders import Spider


def identity(x):
    return x


class Rule(object):

    def __init__(self, link_extractor, callback=None, cb_kwargs=None,
                 follow=None, process_links=None, process_request=identity):
        self.link_extractor = link_extractor
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}
        self.process_links = process_links
        self.process_request = process_request
        if follow is None:
            self.follow = False if callback else True
        else:
            self.follow = follow


class CrawlSpider(Spider):

    rules = ()

    def __init__(self, *a, **kw):
        super(CrawlSpider, self).__init__(*a, **kw)
        self._compile_rules()

    def parse(self, response):
        return self._parse_response(response, self.parse_start_url,
                                    cb_kwargs={}, follow=True)

    def parse_start_url(self, response):
        return []

    def process_results(self, response, results):
        return results

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)

    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CrawlSpider, cls).from_crawler(crawler, *args, **kwargs)
        spider._follow_links = crawler.settings.getbool(
            'CRAWLSPIDER_FOLLOW_LINKS', True)
        return spider

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)
~~~
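One detail worth noting in the from_crawler classmethod above: _follow_links is read from the CRAWLSPIDER_FOLLOW_LINKS setting, so rule-based link following can be switched off project-wide. A small sketch of the setting (putting it in settings.py is just the usual project convention):

~~~
# settings.py (sketch): disable rule-based link following globally.
# _parse_response will still run the Rule callbacks, but the
# `if follow and self._follow_links` branch never fires, so
# _requests_to_follow is never invoked.
CRAWLSPIDER_FOLLOW_LINKS = False
~~~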
                  <ruby id="bdb3f"></ruby>

                  <p id="bdb3f"><cite id="bdb3f"></cite></p>

                    <p id="bdb3f"><cite id="bdb3f"><th id="bdb3f"></th></cite></p><p id="bdb3f"></p>
                      <p id="bdb3f"><cite id="bdb3f"></cite></p>

                        <pre id="bdb3f"></pre>
                        <pre id="bdb3f"><del id="bdb3f"><thead id="bdb3f"></thead></del></pre>

                        <ruby id="bdb3f"><mark id="bdb3f"></mark></ruby><ruby id="bdb3f"></ruby>
                        <pre id="bdb3f"><pre id="bdb3f"><mark id="bdb3f"></mark></pre></pre><output id="bdb3f"></output><p id="bdb3f"></p><p id="bdb3f"></p>

                        <pre id="bdb3f"><del id="bdb3f"><progress id="bdb3f"></progress></del></pre>

                              <ruby id="bdb3f"></ruby>

                              哎呀哎呀视频在线观看