Scrapy Tutorial · php筆記

## Scrapy Tutorial

**last update: 2022-06-06 10:23:11**

----

[TOC=3,8]

----

[Scrapy | A Fast and Powerful Scraping and Web Crawling Framework](https://scrapy.org/)

https://github.com/scrapy/scrapy

https://github.com/scrapy-plugins

[Scrapy Tutorial - Scrapy 2.5.0 documentation](https://www.osgeo.cn/scrapy/intro/tutorial.html)

----

### Preparing a virtual environment (venv)

> A virtual environment gives each application its own isolated Python runtime, so different applications can depend on conflicting package versions without interfering with each other.

```shell
# Create the virtual environment
python -m venv venv
# Activate the virtual environment
source venv/bin/activate
```

[12. Virtual Environments and Packages - Python 3.11.3 documentation](https://docs.python.org/zh-cn/3/tutorial/venv.html#tut-venv)

[virtualenv Lives!](https://hynek.me/articles/virtualenv-lives/)

**Windows**: run Windows PowerShell **as administrator**:

```shell
PS D:\web\tutorial-env> set-executionpolicy remotesigned
PS D:\web\tutorial-env> get-executionpolicy
RemoteSigned
PS D:\web\tutorial-env> .\Scripts\activate
```

To make the PyCharm terminal activate the virtual environment automatically: Tools > Terminal, then check "Activate virtualenv".

----

### conda

venv solves package version conflicts between projects, but what if we also need different Python versions? conda makes it easy to install and manage multiple versions of Python and pip. Download: https://www.anaconda.com/products/individual (mirror in China: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/?C=M&O=D)

[Managing multiple Python versions with conda - 蒲公英云](https://dandelioncloud.cn/article/details/1526009310379524098)

[Anaconda installation and usage tutorial: solving the multiple Python versions problem (CSDN blog)](https://blog.csdn.net/qq_50048105/article/details/113859376)

[Installing Python3 | 靜覓](https://cuiqingcai.com/30035.html)

[Related environment setup](https://setup.scrape.center/)

----

### Installing and using DecryptLogin

```shell
pip3 install DecryptLogin
```

todo ...

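Before installing packages with pip, it can be worth confirming that the virtual environment created earlier is really active. The following is a minimal sketch, assuming Python 3.3 or later (where `sys.base_prefix` exists). The helper name `in_virtualenv` is made up for illustration, and the check only recognizes venv-style environments, not conda environments.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment.

    Inside such an environment sys.prefix points at the environment directory,
    while sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    # Hypothetical usage: run this with the interpreter you plan to use for pip.
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

If this reports `False` right after running `source venv/bin/activate`, the shell is probably still resolving a different `python` than the one inside `venv/`.
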
----

### Installing Scrapy

```shell
pip3 install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
scrapy -V
Scrapy 2.8.0 - no active project
```

see: [Installing Scrapy | 靜覓](https://setup.scrape.center/scrapy)

----

### Using Scrapy

https://github.com/orgs/Python3WebSpider/repositories?q=scrapy&type=all&language=&sort=

https://github.com/orgs/Python3WebSpider/repositories?q=Pyppeteer+&type=all&language=&sort=

[[2022] Cui Qingcai's Python3 web crawler tutorial | 靜覓](https://cuiqingcai.com/17777.html)

#### Creating a project

```shell
scrapy startproject tutorial
```

~~~
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
~~~

----

#### Creating a spider

~~~shell
cd tutorial
scrapy genspider quotes quotes.toscrape.com
~~~

The command above generates the following file, tutorial/tutorial/spiders/quotes.py:

~~~python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        pass
~~~

----

#### Using items

Items let us define a fixed set of data fields for the scraped data.

tutorial/tutorial/items.py

```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```

#### Running the spider

Now modify the spider in tutorial/tutorial/spiders/quotes.py:

```python
import scrapy

from ..items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Each quote on the page lives in its own <div class="quote">.
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield item
```

Run it:

```shell
scrapy crawl quotes -O quotes.json
```

The result is written to tutorial/quotes.json:

~~~json
[
{"text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},
{"text": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
{"text": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”", "author": "Jane Austen", "tags": ["aliteracy", "books", "classic", "humor"]},
{"text": "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", "author": "Marilyn Monroe", "tags": ["be-yourself", "inspirational"]},
{"text": "“Try not to become a man of success. Rather become a man of value.”", "author": "Albert Einstein", "tags": ["adulthood", "success", "value"]},
{"text": "“It is better to be hated for what you are than to be loved for what you are not.”", "author": "André Gide", "tags": ["life", "love"]},
{"text": "“I have not failed. I've just found 10,000 ways that won't work.”", "author": "Thomas A. Edison", "tags": ["edison", "failure", "inspirational", "paraphrased"]},
{"text": "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", "author": "Eleanor Roosevelt", "tags": ["misattributed-eleanor-roosevelt"]},
{"text": "“A day without sunshine is like, you know, night.”", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]}
]
~~~

----

#### Using an Item Loader

[Scrapy's ItemLoader - Zhihu](https://zhuanlan.zhihu.com/p/59905612/)

Create a new file, tutorial/tutorial/itemloader.py:

```python
from scrapy.loader import ItemLoader
# The processors live in the itemloaders package;
# scrapy.loader.processors is the older, deprecated import path.
from itemloaders.processors import TakeFirst, Join, Compose


class BaseLoader(ItemLoader):
    pass


class QuoteLoader(BaseLoader):
    pass
```

Modify tutorial/tutorial/spiders/quotes.py:

```python
import scrapy

from ..items import QuoteItem
from ..itemloader import QuoteLoader


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com"]

    # def parse(self, response):
    #     for quote in response.css('div.quote'):
    #         item = QuoteItem()
    #         item['text'] = quote.css('span.text::text').get()
    #         item['author'] = quote.css('small.author::text').get()
    #         item['tags'] = quote.css('div.tags a.tag::text').getall()
    #         yield item

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            # The loader extracts values relative to the current quote selector.
            loader = QuoteLoader(item=QuoteItem(), selector=quote)
            loader.add_css('text', '.text::text')
            loader.add_css('author', '.author::text')
            loader.add_css('tags', '.tag::text')
            yield loader.load_item()
```

For every field, an Item Loader has one input processor and one output processor.

Run the spider again and you will see that `text` is now a list, because the loader collects every value extracted for a field into a list. The next step is to use the Item Loader's input/output processors to shape the result.

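To see why `text` turns into a list, here is a small pure-Python model of what a loader does. It is a sketch under stated assumptions, not Scrapy's implementation: `MiniLoader`, `take_first`, and `identity` are hypothetical names invented for this illustration. The model only captures two ideas: values are accumulated per field, and an output processor decides what finally lands in the item.

```python
from collections import defaultdict


def identity(values):
    """Output processor that returns the collected values unchanged."""
    return values


def take_first(values):
    """Output processor that returns the first value that is not None or ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


class MiniLoader:
    """Hypothetical, simplified stand-in for an Item Loader (not Scrapy code)."""

    def __init__(self, default_output=identity):
        self._values = defaultdict(list)      # field name -> collected values
        self.default_output = default_output  # used when a field has no own processor
        self.output_processors = {}           # field name -> output processor

    def add_value(self, field, value):
        # Every added value ends up in a per-field list, even a single string.
        if isinstance(value, (list, tuple)):
            self._values[field].extend(value)
        else:
            self._values[field].append(value)

    def load_item(self):
        # Apply the field's output processor, or the default one, to each list.
        item = {}
        for field, values in self._values.items():
            processor = self.output_processors.get(field, self.default_output)
            item[field] = processor(values)
        return item


raw = MiniLoader()
raw.add_value("text", "quote text")
print(raw.load_item())  # {'text': ['quote text']}

shaped = MiniLoader(default_output=take_first)
shaped.output_processors["tags"] = identity
shaped.add_value("text", "quote text")
shaped.add_value("tags", ["life", "love"])
print(shaped.load_item())  # {'text': 'quote text', 'tags': ['life', 'love']}
```

Without any output processor the single quote text comes back wrapped in a list, which matches what the spider produces at this point. Choosing a "take the first value" default and overriding it for `tags` is the same shape of fix that the loader's own processors provide.
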
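Modify tutorial/tutorial/itemloader.py:

```python
class BaseLoader(ItemLoader):
    default_output_processor = TakeFirst()
```

Now `text` is a plain string again, as before. The built-in processors are described below.

**Identity**

Identity is the simplest processor: it does no processing and returns the original data unchanged.

**TakeFirst**

TakeFirst returns the first non-empty value of a list, similar to `extract_first()`. It is commonly used as an output processor.

```python
processor = TakeFirst()
print(processor(['', 1, 2, 3]))
# 1
```

**Join**

Join works like the string `join()` method: it combines a list into a single string. The default separator is a space.

```python
processor = Join(',')
print(processor(['one', 'two', 'three']))
# one,two,three
```

**Compose**

Compose builds a processor from a chain of functions. The input value is passed to the first function, its output is passed to the second, and so on, until the last function returns the output of the whole processor.

```python
processor = Compose(str.upper, lambda s: s.strip())
print(processor(' hello world'))
# HELLO WORLD
```

**MapCompose**

MapCompose is similar to Compose, but it iterates over a list input and processes each element.

```python
processor = MapCompose(str.upper, lambda s: s.strip())
print(processor(['Hello', 'World', 'Python']))
# ['HELLO', 'WORLD', 'PYTHON']
# The input is an iterable; MapCompose walks through it and processes each element in turn.
```

**SelectJmes**

SelectJmes queries JSON: pass in a key and it returns the matching value. The `jmespath` library must be installed first (`pip install jmespath`):

```python
# scrapy.loader.processors is the older, deprecated import path for the same class.
from itemloaders.processors import SelectJmes

processor = SelectJmes('foo')
print(processor({'foo': 'bar'}))
# bar
```

**There are two ways to attach processors:**

1. `xxx_in` declares the input processor for field `xxx`, and `xxx_out` declares the output processor for field `xxx`.
2. The `default_input_processor` and `default_output_processor` attributes declare the default input/output processors.

Modify tutorial/tutorial/itemloader.py:

```python
from scrapy.loader import ItemLoader
# scrapy.loader.processors is the older, deprecated import path.
from itemloaders.processors import Identity, TakeFirst, Join, Compose


class BaseLoader(ItemLoader):
    # Every field keeps only its first non-empty value by default.
    default_output_processor = TakeFirst()


class QuoteLoader(BaseLoader):
    # tags should stay a list, so override the default for this field.
    tags_out = Identity()
```

Run the spider again: only `tags` keeps multiple values, and every other field keeps a single value.

----

### Scrapy features

**Filtering duplicate requests**

By default, Scrapy filters out duplicate requests to URLs it has already visited, which avoids hitting the server too often because of a programming mistake. This behavior can be configured with the [DUPEFILTER_CLASS](https://www.osgeo.cn/scrapy/topics/settings.html#std-setting-DUPEFILTER_CLASS) setting.

----

### FAQ

**How do I crawl more links?**

Although a spider **starts from a single entry URL**, that does not limit it to a simple one-off crawl. In `parse()` we can, as needed, use `yield scrapy.Request(next_page, callback=self.parse)`, `yield response.follow(next_page, self.parse)`, or `yield from response.follow_all(anchors, callback=self.parse)` to **keep generating further requests until all the other pages have been crawled.**

----

**How do I process and save the scraped data?**

```shell
# Run a standalone spider file; -o appends to the output file
scrapy runspider quotes_spider.py -o quotes.jl

# Inside a project: -O overwrites the output file, -o appends to it
cd project
scrapy crawl quotes -O quotes.json
scrapy crawl quotes -o quotes.jl
```

----

**How do I use a proxy?**

----

**How do I crawl at large scale in a distributed way?**

----

**How do I handle login?**

[Scrapy in depth: middleware](https://mp.weixin.qq.com/s?__biz=MzAxMjUyNDQ5OA==&mid=2653557181&idx=1&sn=c62810ab78f40336cb721212ab83f7bd&chksm=806e3f00b719b616286ec1a07f9a5b9eeaba105781f93491685fbe732c60db0118852cfeeec8&scene=27)

[Pitfalls when adding cookies in Scrapy (51CTO blog)](https://blog.51cto.com/u_11949039/2859241)

[Scrapy core components series (12): simulating login with Scrapy](https://www.bbsmax.com/A/l1dy7YAxJe/)

> Setting [`COOKIES_DEBUG=True`](https://www.osgeo.cn/scrapy/topics/downloader-middleware.html#std-setting-COOKIES_DEBUG) in `settings.py` lets you watch cookies being sent and received in the terminal.

[Settings - Scrapy 2.5.0 documentation](https://www.osgeo.cn/scrapy/topics/settings.html#topics-settings-ref)

----

**How do I handle CAPTCHAs?**

----

**How do I handle slider puzzles and other anti-bot human verification?**

----

**How do I handle encryption-based anti-scraping measures?**

----

**How do I use a headless browser?**

----

**How do I integrate selenium with Scrapy?**

----

**How do I manage and control spiders?**

[Scrapyd 1.4.1 documentation](https://scrapyd.readthedocs.io/en/latest/)

----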
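Two points from earlier in this document, following further links from `parse()` and filtering duplicate requests, can be modelled together in a few lines of plain Python. This is only a sketch of the idea, not how Scrapy implements its scheduler or dupe filter: the `SITE` dictionary, its `example.test` URLs, and the `crawl` function are hypothetical stand-ins for real pages and real requests.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> (items on the page, links found on the page).
SITE = {
    "https://example.test/": (["q1", "q2"], ["/page/2/"]),
    "https://example.test/page/2/": (["q3"], ["/page/3/", "/"]),
    "https://example.test/page/3/": (["q4"], ["/page/2/"]),
}


def crawl(start_url, site):
    """Start from one entry URL, follow links, and visit each URL at most once."""
    seen = {start_url}          # plays the role of the duplicate filter
    queue = deque([start_url])  # pending "requests"
    items, visited = [], []
    while queue:
        url = queue.popleft()
        visited.append(url)
        page_items, links = site[url]
        items.extend(page_items)  # what parse() would yield as items
        for href in links:
            # Relative links are resolved against the current page URL.
            next_url = urljoin(url, href)
            if next_url in seen or next_url not in site:
                continue  # skip already-seen URLs and pages this toy site lacks
            seen.add(next_url)
            queue.append(next_url)  # what parse() would yield as a new request
    return items, visited


items, visited = crawl("https://example.test/", SITE)
print(items)         # ['q1', 'q2', 'q3', 'q4']
print(len(visited))  # 3
```

Pages 2 and 3 link back to each other and to the start page, yet each URL is fetched once. That is the property the default duplicate filter provides, and it is why a `parse()` callback can keep yielding "next page" requests without looping forever.
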