Link Extractors · Scrapy 1.6 Chinese Documentation

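# Link Extractors

> Translator: [OSGeo China](https://www.osgeo.cn/)

Link extractors are objects whose only purpose is to extract links from web pages ([`scrapy.http.Response`](request-response.html#scrapy.http.Response "scrapy.http.Response") objects) that will eventually be followed.

Scrapy provides `scrapy.linkextractors.LinkExtractor`, but you can also create your own custom link extractors to suit your needs by implementing a simple interface.

The only public method every link extractor has is `extract_links`, which receives a [`Response`](request-response.html#scrapy.http.Response "scrapy.http.Response") object and returns a list of `scrapy.link.Link` objects. A link extractor is meant to be instantiated once, with its `extract_links` method called many times on different responses to extract the links to follow.

Link extractors are used in the [`CrawlSpider`](spiders.html#scrapy.spiders.CrawlSpider "scrapy.spiders.CrawlSpider") class (available in Scrapy) through a set of rules, but you can also use them in your own spiders, even if you do not subclass [`CrawlSpider`](spiders.html#scrapy.spiders.CrawlSpider "scrapy.spiders.CrawlSpider"), because their purpose is very simple: to extract links.

## Built-in link extractors reference

The link extractor classes bundled with Scrapy are provided in the [`scrapy.linkextractors`](#module-scrapy.linkextractors "scrapy.linkextractors: Link extractors classes") module.

The default link extractor is `LinkExtractor`, which is the same as [`LxmlLinkExtractor`](#scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor "scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor"):

```py
from scrapy.linkextractors import LinkExtractor
```

Previous Scrapy versions shipped other link extractor classes, but they are now deprecated.

### LxmlLinkExtractor

```py
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href', ), canonicalize=False, unique=True, process_value=None, strip=True)
```

LxmlLinkExtractor is the recommended link extractor, with handy filtering options. It is implemented using lxml's robust HTMLParser.

Parameters:

* **allow** (_a regular expression (or list of)_) -- a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it matches all links.
* **deny** (_a regular expression (or list of)_) -- a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It takes precedence over the `allow` parameter. If not given (or empty), no links are excluded.
* **allow_domains** (_str or list_) -- a single value or a list of strings containing the domains that will be considered when extracting links.
* **deny_domains** (_str or list_) -- a single value or a list of strings containing the domains that will not be considered when extracting links.
* **deny_extensions** (_list_) -- a single value or a list of strings containing the extensions that should be ignored when extracting links. If not given, it defaults to the `IGNORED_EXTENSIONS` list defined in the [scrapy.linkextractors](https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/__init__.py) package.
* **restrict_xpaths** (_str or list_) -- an XPath (or list of XPaths) that defines the regions inside the response from which links should be extracted. If given, only the text selected by those XPaths is scanned for links. See the examples below.
* **restrict_css** (_str or list_) -- a CSS selector (or list of selectors) that defines the regions inside the response from which links should be extracted. It behaves the same as `restrict_xpaths`.
* **restrict_text** (_a regular expression (or list of)_) -- a single regular expression (or list of regular expressions) that the link's text must match in order to be extracted. If not given (or empty), it matches all links. If a list of regular expressions is given, the link is extracted if it matches at least one of them.
* **tags** (_str or list_) -- a tag or a list of tags to consider when extracting links. Defaults to `('a', 'area')`.
* **attrs** (_list_) -- an attribute or list of attributes to consider when looking for links to extract (only for the tags specified in the `tags` parameter). Defaults to `('href',)`.
* **canonicalize** (_boolean_) -- canonicalize each extracted URL (using `w3lib.url.canonicalize_url`). Defaults to `False`. Note that `canonicalize_url` is meant for duplicate checking; it can change the URL visible on the server side, so the response may differ between requests made with the canonicalized URL and with the raw URL. If you are using LinkExtractor to follow links, it is more robust to keep the default `canonicalize=False`.
* **unique** (_boolean_) -- whether duplicate filtering should be applied to the extracted links.
* **process_value** (_callable_) -- a function that receives each value extracted from the scanned tags and attributes and can modify the value and return a new one, or return `None` to ignore the link altogether. If not given, `process_value` defaults to `lambda x: x`.

  For example, to extract links from this code:

  ```html
  <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
  ```

  you can use the following function in `process_value`:

  ```py
  import re

  def process_value(value):
      # Capture the URL passed to goToPage(); any other value yields None
      # and the link is ignored.
      m = re.search(r"javascript:goToPage\('(.*?)'", value)
      if m:
          return m.group(1)
  ```

* **strip** (_boolean_) -- whether to strip whitespace from extracted attributes. According to the HTML5 standard, leading and trailing whitespace must be stripped from the `href` attribute of `<a>`, `<area>` and many other elements, the `src` attribute of `<img>` and `<iframe>` elements, and so on, so LinkExtractor strips space characters by default. Set `strip=False` to turn this off (e.g. if you are extracting URLs from elements or attributes that allow leading/trailing whitespace).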
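The sketch below shows one way several of the parameters above might be combined. It is an illustration under stated assumptions, not the canonical usage: the names `unwrap_goto_page` and `build_extractor`, the `example.com` domain, the URL patterns and the `div.content` selector are all hypothetical. Unlike the `process_value` function shown in the parameter list, this variant passes values it does not recognise through unchanged, so ordinary links are not dropped.

```python
import re

# Captures the quoted URL inside javascript:goToPage('<url>').
GOTO_PAGE_RE = re.compile(r"javascript:goToPage\('(.*?)'")


def unwrap_goto_page(value):
    """Hypothetical process_value callable.

    Returns the URL wrapped in goToPage(...) when the value matches,
    otherwise returns the value unchanged so ordinary links survive.
    """
    match = GOTO_PAGE_RE.search(value)
    return match.group(1) if match else value


def build_extractor():
    """Combine a few filtering options; requires Scrapy to be installed."""
    # Imported here so the helper above can be tried without Scrapy.
    from scrapy.linkextractors import LinkExtractor

    return LinkExtractor(
        allow=r"/other/",             # keep only URLs matching this regex
        deny=r"/private/",            # deny takes precedence over allow
        allow_domains="example.com",  # placeholder domain
        restrict_css="div.content",   # assumed page structure
        process_value=unwrap_goto_page,
    )
```

Inside a spider callback, `build_extractor().extract_links(response)` would then return the `Link` objects that pass these filters. Because a link extractor is meant to be instantiated once, it is usually better to create it a single time (for example as a spider attribute) than to rebuild it for every response.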