## 1. Common strategies to keep a crawler from being banned

> 1. Set the User-Agent (rotate it randomly to simulate different users' browsers)
> 2. Disable cookies (i.e. do not enable the cookies middleware and do not send cookies to the server; some sites detect crawler behavior through cookie usage)
> 3. Use COOKIES_ENABLED to turn the CookiesMiddleware on or off
> 4. Delay downloads (to avoid hitting the site too frequently; set it to 2 seconds or more)
> 5. Google Cache and Baidu Cache: if possible, fetch page data from the page caches of search engines such as Google or Baidu
> 6. IP pools: VPNs and proxy IPs; most sites nowadays ban by IP
> 7. Crawlera (a proxy component dedicated to crawlers): once the downloader middleware is correctly configured, every request in the project goes out through Crawlera

(A minimal settings sketch of strategies 1–4 appears at the end of this post.)

## 2. Problems encountered while crawling

### 2.1 The HTML Scrapy fetches differs from the browser's

An XPath expression matched the data in the browser, but not in the program. After much fiddling it turned out that the HTML in the browser's devtools is not the same as the HTML Scrapy fetched. Print what was actually received with `print(response.text)`; if it differs from the browser, rewrite the XPath rules against it (the Scrapy shell makes this comparison easy; see the snippet at the end of this post).

### 2.2 Getting the text of multiple tags

Text of a tr tag and all of its children:

~~~
content = response.xpath("//div[@class='pcb']//div[@class='t_fsz']/table[1]//tr")[0].xpath('string(.)').extract()[0]
~~~

Example: get all the text under a table tag

![](https://box.kancloud.cn/c10a5c1fba0b7a35f249bb4643059da6_1634x769.png)

~~~
info_table = response.xpath("//table[@class='specificationBox']")
if len(info_table) >= 1:
    instructions = info_table[0].xpath('string(.)').extract()[0]
~~~

### 2.3 Redirects

Some sites redirect their pages. The crawler should follow the redirect, which takes the following settings:

~~~
HTTPERROR_ALLOWED_CODES = [301]
REDIRECT_ENABLED = True
~~~

### 2.4 JS rendering

![](https://box.kancloud.cn/3a3ab484a9d919cefed75e6e91d3fddc_484x346.png)

For content like the product price above, Scrapy gets nothing, because there is no browser to render the JavaScript for us. So we turn to Selenium + PhantomJS (a headless browser).

1. Define a custom downloader middleware to handle the request:

~~~
from selenium import webdriver
from scrapy.http import HtmlResponse


class PhantomJSMiddleware(object):

    def __init__(self, crawler):
        self.webDriver = webdriver.PhantomJS()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        if 'product' in request.url:
            self.webDriver.get(request.url)
            content = self.webDriver.page_source.encode('utf-8')
            print("=========== downloading with Selenium!")
            # Return a response here so the framework does not download the page again
            return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)
~~~

2. Register the custom downloader middleware in settings.py:

~~~
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapyredis.middlewares.RandomAgentMiddleWare': 543,
    'scrapyredis.middlewares.PhantomJSMiddleware': 545,
}
~~~

(A sketch of what a RandomAgentMiddleWare like the one referenced here might look like appears at the end of this post.)

### 2.5 Selenium + PhantomJS: setting the request headers

The User-Agent set by other components does not apply to the requests PhantomJS itself makes, so its request headers have to be set separately. Otherwise the crawler gets identified, like this:

![](https://box.kancloud.cn/1f679ed88d5d7c613b1f61080c299544_1685x595.png)

~~~
from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from scrapy.http import HtmlResponse


class PhantomJSMiddleware(object):

    def __init__(self, crawler):
        self.ua = UserAgent()
        # Which fake_useragent attribute to use when picking a User-Agent (e.g. "random")
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")
        self.fdfs = FDFSClient()  # custom FastDFS client, used for page snapshots
        crawler.spider.logger.info("========= PhantomJS browser initialized!")

        def getAgent():
            userAgent = getattr(self.ua, self.ua_type)
            crawler.spider.logger.info("===== setting userAgent: {0}".format(userAgent))
            return userAgent

        # Write the User-Agent into PhantomJS's own page settings
        dcap = dict(DesiredCapabilities.PHANTOMJS)
        dcap["phantomjs.page.settings.userAgent"] = getAgent()
        self.webDriver = webdriver.PhantomJS(desired_capabilities=dcap)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        # Only render detail pages
        if len(spider.detailList) >= 1:
            for detail in spider.detailList:
                if detail in request.url:
                    spider.logger.info("=========== downloading with Selenium")
                    self.webDriver.get(request.url)
                    content = self.webDriver.page_source.encode('utf-8')
                    # spider.logger.info("=========== saving snapshot")
                    # screenshotData = self.webDriver.get_screenshot_as_png()
                    # screenshotUri = self.fdfs.save(screenshotData)
                    # request.meta['screenshot'] = screenshotUri
                    return HtmlResponse(request.url, encoding='utf-8', body=content, request=request)

    def process_response(self, request, response, spider):
        # print(response.text)
        return response
~~~

Fixed:

![](https://box.kancloud.cn/1b13f15960869e8051841059c5618e5d_1866x639.png)
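
* * *

**Appendix: a settings sketch for strategies 1–4 in section 1.** The following is a minimal, illustrative `settings.py` fragment; `COOKIES_ENABLED`, `DOWNLOAD_DELAY`, and `DOWNLOADER_MIDDLEWARES` are real Scrapy settings, while the `myproject.middlewares.RandomAgentMiddleWare` path is a hypothetical name standing in for whatever random User-Agent middleware your project actually uses.

~~~
# settings.py -- a minimal sketch of strategies 1-4 from section 1

# 1. Random User-Agent: disable Scrapy's built-in UserAgentMiddleware and
#    enable a rotating one (hypothetical path; point it at your own middleware)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myproject.middlewares.RandomAgentMiddleWare': 543,
}

# 2/3. Do not send cookies to the server (turns the CookiesMiddleware off)
COOKIES_ENABLED = False

# 4. Delay downloads to avoid hitting the site too frequently
DOWNLOAD_DELAY = 2
~~~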
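**Appendix: inspecting the HTML Scrapy actually receives (section 2.1).** A quick way to debug browser/crawler mismatches is the Scrapy shell, which fetches the page exactly the way the crawler would; the URL below is a placeholder.

~~~
$ scrapy shell "http://example.com/some/page"
>>> print(response.text)      # the HTML Scrapy actually received
>>> view(response)            # open that HTML in your browser for comparison
>>> response.xpath("//div[@class='price']/text()").extract()   # test XPath rules here, not in devtools
~~~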
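**Appendix: a possible RandomAgentMiddleWare (section 2.4).** The settings in section 2.4 reference a `RandomAgentMiddleWare`, but its code never appears in the post. A minimal sketch of what such a middleware might look like, built on the same fake_useragent library and `RANDOM_UA_TYPE` setting used in section 2.5:

~~~
from fake_useragent import UserAgent


class RandomAgentMiddleWare(object):
    """Downloader middleware that puts a random User-Agent on every request."""

    def __init__(self, crawler):
        self.ua = UserAgent()
        # Same setting as in section 2.5: which fake_useragent attribute to use
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        # getattr(self.ua, "random") returns a fresh User-Agent string each call
        request.headers['User-Agent'] = getattr(self.ua, self.ua_type)
~~~

Registered at priority 543 (as in the section 2.4 settings), it runs before the PhantomJS middleware, so plain Scrapy requests get the rotated header while PhantomJS requests still need the separate treatment shown in section 2.5.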