Item Pipeline · Scrapy 1.6 Chinese documentation

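# Item Pipeline

> Translator: [OSGeo 中國](https://www.osgeo.cn/)

After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

Each item pipeline component (sometimes called just an "item pipeline") is a Python class that implements a simple method. It receives an item and performs an action on it, and it also decides whether the item should continue through the pipeline or be dropped and no longer processed.

Typical uses of item pipelines are:

* cleansing HTML data
* validating scraped data (checking that the items contain certain fields)
* checking for duplicates (and dropping them)
* storing the scraped item in a database

## Writing your own item pipeline

Each item pipeline component is a Python class that must implement the following method:

```py
process_item(self, item, spider)
```

This method is called for every item pipeline component. [`process_item()`](#process_item "process_item") must do one of the following: return a dict with data, return an [`Item`](items.html#scrapy.item.Item "scrapy.item.Item") (or any descendant class) object, return a [Twisted Deferred](https://twistedmatrix.com/documents/current/core/howto/defer.html), or raise a [`DropItem`](exceptions.html#scrapy.exceptions.DropItem "scrapy.exceptions.DropItem") exception. Dropped items are no longer processed by further pipeline components.

| Parameter | Type | Description |
| --- | --- | --- |
| **item** | [`Item`](items.html#scrapy.item.Item "scrapy.item.Item") object or a dict | the scraped item |
| **spider** | [`Spider`](spiders.html#scrapy.spiders.Spider "scrapy.spiders.Spider") object | the spider that scraped the item |

Additionally, components may implement the following methods:

```py
open_spider(self, spider)
```

This method is called when the spider is opened.

| Parameter | Type | Description |
| --- | --- | --- |
| **spider** | [`Spider`](spiders.html#scrapy.spiders.Spider "scrapy.spiders.Spider") object | the spider that was opened |

```py
close_spider(self, spider)
```

This method is called when the spider is closed.

| Parameter | Type | Description |
| --- | --- | --- |
| **spider** | [`Spider`](spiders.html#scrapy.spiders.Spider "scrapy.spiders.Spider") object | the spider that was closed |

```py
from_crawler(cls, crawler)
```

If present, this classmethod is called to create a pipeline instance from a [`Crawler`](api.html#scrapy.crawler.Crawler "scrapy.crawler.Crawler"). It must return a new instance of the pipeline. The crawler object provides access to all Scrapy core components, such as settings and signals; it is the way for a pipeline to access them and hook its functionality into Scrapy.

| Parameter | Type | Description |
| --- | --- | --- |
| **crawler** | [`Crawler`](api.html#scrapy.crawler.Crawler "scrapy.crawler.Crawler") object | the crawler that uses this pipeline |

## Item pipeline example

### Price validation and dropping items with no prices

Consider the following hypothetical pipeline. It adjusts the `price` attribute of items whose price does not include VAT (the `price_excludes_vat` attribute), and it drops items that contain no price:

```py
from scrapy.exceptions import DropItem


class PricePipeline(object):

    # Multiplier applied to prices that exclude VAT.
    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            # A missing (or falsy) price removes the item from the pipeline.
            raise DropItem("Missing price in %s" % item)
```

### Write items to a JSON file

The following pipeline stores all scraped items (from all spiders) in a single `items.jl` file, containing one item per line serialized in JSON format:

```py
import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Open the output file once, when the spider starts.
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line (JSON Lines).
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
```

Note: the purpose of `JsonWriterPipeline` is only to introduce how to write item pipelines. If you really want to store all scraped items in a JSON file, you should use [Feed exports](feed-exports.html#topics-feed-exports).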
### Write items to MongoDB

In this example we write items to MongoDB using pymongo. The MongoDB address and database name are specified in the Scrapy settings; all items are stored in the collection named by the `collection_name` class attribute. The main point of this example is to show how to use the [`from_crawler()`](#from_crawler "from_crawler") method and how to clean up resources properly.

```py
import pymongo


class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection parameters from the Scrapy settings.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # Release the connection when the spider finishes.
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
```

### Take screenshot of item

This example demonstrates how to return a Deferred from the [`process_item()`](#process_item "process_item") method. It uses Splash to render a screenshot of the item URL. The pipeline makes a request to a locally running instance of Splash. After the request has been downloaded and the Deferred callback fires, it saves the screenshot to a file and adds the filename to the item.

```py
import scrapy
import hashlib
from urllib.parse import quote


class ScreenshotPipeline(object):
    """Pipeline that uses Splash to render screenshot of
    every Scrapy item."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    def process_item(self, item, spider):
        encoded_item_url = quote(item["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        dfd = spider.crawler.engine.download(request, spider)
        dfd.addBoth(self.return_item, item)
        return dfd

    def return_item(self, response, item):
        # addBoth also routes download failures here. A Failure has no
        # ``status`` attribute, so getattr() treats it like a non-200 response.
        if getattr(response, "status", None) != 200:
            # Error happened, return item.
            return item

        # Save screenshot to file, filename will be hash of url.
        url = item["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = "{}.png".format(url_hash)
        with open(filename, "wb") as f:
            f.write(response.body)

        # Store filename in item.
        item["screenshot_filename"] = filename
        return item
```

### Duplicates filter

A filter that looks for duplicate items and drops those that were already processed. Let's say that our items have a unique ID, but our spider returns multiple items with the same ID:

```py
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):

    def __init__(self):
        # IDs of the items that have already passed through this component.
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
```

## Activating an item pipeline component

To activate an item pipeline component you must add its class to the [`ITEM_PIPELINES`](settings.html#std:setting-ITEM_PIPELINES) setting, as in the following example:

```py
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
```

The integer values you assign to classes in this setting determine the order in which they run: items go from lower valued to higher valued classes. It is customary to define these numbers in the 0-1000 range.
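To make the ordering rule concrete, the sketch below models it in plain Python. It is a simplified illustration under stated assumptions, not Scrapy's actual implementation: `DropItem` here is a local stand-in for `scrapy.exceptions.DropItem`, `TagPipeline` and `run_pipelines()` are hypothetical names, and the mapping uses class objects as keys instead of the import-path strings shown above so that the example is self-contained.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class PricePipeline:
    """Drops items that have no price."""

    def process_item(self, item, spider):
        if not item.get("price"):
            raise DropItem("Missing price")
        return item


class TagPipeline:
    """Hypothetical component that records that it ran."""

    def process_item(self, item, spider):
        item["tagged"] = True
        return item


# Same shape as the ITEM_PIPELINES setting, but keyed by class objects.
ITEM_PIPELINES = {TagPipeline: 800, PricePipeline: 300}


def run_pipelines(item, pipelines, spider=None):
    """Pass the item through components in ascending order of their value.

    Returns the processed item, or None if a component dropped it.
    """
    ordered = sorted(pipelines, key=pipelines.get)
    for component_cls in ordered:
        try:
            item = component_cls().process_item(item, spider)
        except DropItem:
            # A dropped item is not seen by later components.
            return None
    return item
```

With this mapping, `PricePipeline` (300) runs before `TagPipeline` (800), so an item without a price is dropped before it is ever tagged, regardless of the order in which the entries appear in the dict.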