Selectors · Scrapy 1.6 Chinese documentation

# Selectors

> Translator: [OSGeo 中國](https://www.osgeo.cn/)

When you're scraping web pages, the most common task you need to perform is extracting data from the HTML source. Several libraries are available for this, such as:

* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is a very popular web scraping library among Python programmers. It constructs a Python object based on the structure of the HTML code and also deals reasonably well with bad markup, but it has one drawback: it is slow.
* [lxml](http://lxml.de/) is an XML parsing library (which also parses HTML) with a pythonic API based on [ElementTree](https://docs.python.org/2/library/xml.etree.elementtree.html). (lxml is not part of the Python standard library.)

Scrapy comes with its own mechanism for extracting data. These are called selectors because they "select" certain parts of the HTML document, specified either by [XPath](https://www.w3.org/TR/xpath) or [CSS](https://www.w3.org/TR/selectors) expressions.

[XPath](https://www.w3.org/TR/xpath) is a language for selecting nodes in XML documents, which can also be used with HTML. [CSS](https://www.w3.org/TR/selectors) is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.

Note: Scrapy selectors are a thin wrapper around the [parsel](https://parsel.readthedocs.io/) library; the purpose of this wrapper is to provide better integration with Scrapy response objects.

[parsel](https://parsel.readthedocs.io/) is a stand-alone web scraping library which can be used without Scrapy. It uses the [lxml](http://lxml.de/) library under the hood and implements an easy API on top of the lxml API. This means Scrapy selectors are very similar in speed and parsing accuracy to lxml.

## Using selectors

### Constructing selectors

Response objects expose a [`Selector`](#scrapy.selector.Selector "scrapy.selector.Selector") instance on the `.selector` attribute:

```py
>>> response.selector.xpath('//span/text()').get()
'good'
```

Querying responses using XPath and CSS is so common that responses include two more shortcuts, `response.xpath()` and `response.css()`:

```py
>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'
```

Scrapy selectors are instances of the [`Selector`](#scrapy.selector.Selector "scrapy.selector.Selector") class, constructed by passing either a [`TextResponse`](request-response.html#scrapy.http.TextResponse "scrapy.http.TextResponse") object or markup as a unicode string (in the `text` argument). Usually there is no need to construct Scrapy selectors manually: the `response` object is available in spider callbacks, so in most cases it is more convenient to use the `response.css()` and `response.xpath()` shortcuts. By using `response.selector` or one of these shortcuts you also ensure the response body is parsed only once.

If required, it is still possible to use `Selector` directly. Constructing from text:

```py
>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'
```

Constructing from a response ([`HtmlResponse`](request-response.html#scrapy.http.HtmlResponse "scrapy.http.HtmlResponse") is one of the [`TextResponse`](request-response.html#scrapy.http.TextResponse "scrapy.http.TextResponse") subclasses):

```py
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').get()
'good'
```

`Selector` automatically chooses the best parsing rules (XML vs HTML) based on the input type.

### Using selectors

To explain how to use the selectors we'll use the `Scrapy shell` (which provides interactive testing) and an example page located on the Scrapy documentation server:

> [https://docs.scrapy.org/en/latest/_static/selectors-sample1.html](https://docs.scrapy.org/en/latest/_static/selectors-sample1.html)

For the sake of completeness, here is its full HTML code:

```html
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
```

First, let's open the shell:

```sh
scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
```

Then, after the shell loads, you'll have the response available as the `response` shell variable, and its attached selector in the `response.selector` attribute.

Since we're dealing with HTML, the selector will automatically use an HTML parser.

So, by looking at the [HTML code](#topics-selectors-htmlcode) of that page, let's construct an XPath for selecting the text inside the title tag:

```py
>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
```

To actually extract the textual data, you must call the selector `.get()` or `.getall()` methods, as follows:

```py
>>> response.xpath('//title/text()').getall()
['Example website']
>>> response.xpath('//title/text()').get()
'Example website'
```

`.get()` always returns a single result; if there are several matches, the content of the first match is returned; if there are no matches, `None` is returned. `.getall()` returns a list with all results.

Notice that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements:

```py
>>> response.css('title::text').get()
'Example website'
```

As you can see, the `.xpath()` and `.css()` methods return a [`SelectorList`](#scrapy.selector.SelectorList "scrapy.selector.SelectorList") instance, which is a list of new selectors. This API can be used for quickly selecting nested data:

```py
>>> response.css('img').xpath('@src').getall()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']
```

If you want to extract only the first matched element, you can call the selector `.get()` (or its alias `.extract_first()`, commonly used in previous Scrapy versions):

```py
>>> response.xpath('//div[@id="images"]/a/text()').get()
'Name: My image 1 '
```

It returns `None` if no element was found:

```py
>>> response.xpath('//div[@id="not-exists"]/text()').get() is None
True
```

A default return value can be provided as an argument, to be used instead of `None`:

```py
>>> response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
'not-found'
```

Instead of using e.g. the `'@src'` XPath, it is possible to query for attributes using the `.attrib` property of a [`Selector`](#scrapy.selector.Selector "scrapy.selector.Selector"):

```py
>>> [img.attrib['src'] for img in response.css('img')]
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']
```

As a shortcut, `.attrib` is also available on a SelectorList directly; it returns the attributes of the first matching element:

```py
>>> response.css('img').attrib['src']
'image1_thumb.jpg'
```

This is most useful when only a single result is expected, e.g. when selecting by id, or when selecting unique elements on a web page:

```py
>>> response.css('base').attrib['href']
'http://example.com/'
```

Now we're going to get the base URL and some image links:

```py
>>> response.xpath('//base/@href').get()
'http://example.com/'

>>> response.css('base::attr(href)').get()
'http://example.com/'

>>> response.css('base').attrib['href']
'http://example.com/'

>>> response.xpath('//a[contains(@href, "image")]/@href').getall()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']

>>> response.css('a[href*=image]::attr(href)').getall()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').getall()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').getall()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']
```

### Extensions to CSS Selectors

Per W3C standards, [CSS selectors](https://www.w3.org/TR/css3-selectors/#selectors) do not support selecting text nodes or attribute values. But selecting these is so essential in a web scraping context that Scrapy (parsel) implements a couple of non-standard pseudo-elements:

* to select text nodes, use `::text`
* to select attribute values, use `::attr(name)` where _name_ is the name of the attribute whose value you want

Warning: These pseudo-elements are Scrapy-/Parsel-specific. They will most probably not work with other libraries like [lxml](http://lxml.de/) or [PyQuery](https://pypi.python.org/pypi/pyquery).

Examples:

* `title::text` selects children text nodes of a descendant `<title>` element:

  ```py
  >>> response.css('title::text').get()
  'Example website'
  ```

* `*::text` selects all descendant text nodes of the current selector context:

  ```py
  >>> response.css('#images *::text').getall()
  ['\n ',
   'Name: My image 1 ',
   '\n ',
   'Name: My image 2 ',
   '\n ',
   'Name: My image 3 ',
   '\n ',
   'Name: My image 4 ',
   '\n ',
   'Name: My image 5 ',
   '\n ']
  ```

* `foo::text` returns no results if the `foo` element exists but contains no text (i.e. the text is empty):

  ```py
  >>> response.css('img::text').getall()
  []
  ```

  This means `.css('foo::text').get()` can return `None` even if the element exists. Use `default=''` if you always want a string:

  ```py
  >>> response.css('img::text').get()
  >>> response.css('img::text').get(default='')
  ''
  ```

* `a::attr(href)` selects the _href_ attribute value of descendant links:

  ```py
  >>> response.css('a::attr(href)').getall()
  ['image1.html',
   'image2.html',
   'image3.html',
   'image4.html',
   'image5.html']
  ```

Note: See also [Selecting element attributes](#selecting-attributes).

Note: You cannot chain these pseudo-elements. In practice it would not make much sense anyway: text nodes do not have attributes, and attribute values are already string values and do not have children nodes.

### Nesting selectors

The selection methods (`.xpath()` or `.css()`) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here's an example:

```py
>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').get(), link.xpath('img/@src').get())
...     print('Link number %d points to url %r and image %r' % args)
Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'
```

### Selecting element attributes

There are several ways to get the value of an attribute. First, one can use XPath syntax:

```py
>>> response.xpath("//a/@href").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
```

XPath syntax has a few advantages: it is a standard XPath feature, and `@attributes` can be used in other parts of an XPath expression, e.g. it is possible to filter by attribute value.

Scrapy also provides an extension to CSS selectors (`::attr(...)`) which allows getting attribute values:

```py
>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
```

In addition to that, there is the `.attrib` property of Selector. You can use it if you prefer to look up attributes in Python code, without using XPaths or CSS extensions:

```py
>>> [a.attrib['href'] for a in response.css('a')]
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
```

This property is also available on SelectorList; it returns a dictionary with the attributes of the first matching element. It is convenient when a selector is expected to give a single result (e.g. when selecting by element ID, or when selecting a unique element on a page):

```py
>>> response.css('base').attrib
{'href': 'http://example.com/'}
>>> response.css('base').attrib['href']
'http://example.com/'
```

The `.attrib` property of an empty SelectorList is empty:

```py
>>> response.css('foo').attrib
{}
```

### Using selectors with regular expressions

[`Selector`](#scrapy.selector.Selector "scrapy.selector.Selector") also has a `.re()` method for extracting data using regular expressions. However, unlike the `.xpath()` or `.css()` methods, `.re()` returns a list of unicode strings, so you can't construct nested `.re()` calls.

Here's an example used to extract image names from the [HTML code](#topics-selectors-htmlcode) above:

```py
>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
 'My image 2',
 'My image 3',
 'My image 4',
 'My image 5']
```

There is an additional helper for `.re()`, named `.re_first()`, which mirrors `.get()` (and its alias `.extract_first()`).
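Use `.re_first()` to extract just the first matching string:

```py
>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1'
```

### extract() and extract_first()

If you're a long-time Scrapy user, you're probably familiar with the `.extract()` and `.extract_first()` selector methods. Many blog posts and tutorials use them as well. These methods are still supported by Scrapy, and there are **no plans** to deprecate them.

However, the Scrapy usage docs are now written using the `.get()` and `.getall()` methods. We feel that these new methods result in more concise and readable code.

The following examples show how these methods map to each other.

1. `SelectorList.get()` is the same as `SelectorList.extract_first()`:

   ```py
   >>> response.css('a::attr(href)').get()
   'image1.html'
   >>> response.css('a::attr(href)').extract_first()
   'image1.html'
   ```

2. `SelectorList.getall()` is the same as `SelectorList.extract()`:

   ```py
   >>> response.css('a::attr(href)').getall()
   ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
   >>> response.css('a::attr(href)').extract()
   ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
   ```

3. `Selector.get()` is the same as `Selector.extract()`:

   ```py
   >>> response.css('a::attr(href)')[0].get()
   'image1.html'
   >>> response.css('a::attr(href)')[0].extract()
   'image1.html'
   ```

4. For consistency, there is also `Selector.getall()`, which returns a list:

   ```py
   >>> response.css('a::attr(href)')[0].getall()
   ['image1.html']
   ```

So, the main difference is that the output of the `.get()` and `.getall()` methods is more predictable: `.get()` always returns a single result, and `.getall()` always returns a list of all extracted results. With the `.extract()` method it was not always obvious whether a result is a list or not; to get a single result, either `.extract()` or `.extract_first()` had to be called, depending on the object.

## Working with XPaths

Here are some tips which may help you use XPath with Scrapy selectors effectively. If you are not very familiar with XPath yet, you may want to take a look first at this [XPath tutorial](http://www.zvon.org/comp/r/tut-XPath_1.html).

Note: Some of the tips are based on [this post from ScrapingHub's blog](https://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/).

### Working with relative XPaths

Keep in mind that if you are nesting selectors and use an XPath that starts with `/`, that XPath is absolute to the document and not relative to the `Selector` you're calling it from.

For example, suppose you want to extract all `<p>` elements inside `<div>` elements. First, you would get all `<div>` elements:

```py
>>> divs = response.xpath('//div')
```

At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all `<p>` elements from the document, not only those inside `<div>` elements:

```py
>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.get())
```

This is the proper way to do it (note the dot prefixing the `.//p` XPath):

```py
>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print(p.get())
```

Another common case is to extract all direct `<p>` children:

```py
>>> for p in divs.xpath('p'):
...     print(p.get())
```

For more details about relative XPaths see the [Location Paths](https://www.w3.org/TR/xpath#location-paths) section in the XPath specification.

### When querying by class, consider using CSS

Because an element can contain multiple CSS classes, the XPath way to select elements by class is rather verbose:

```xpath
*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]
```

If you use `@class='someclass'` you may end up missing elements that have other classes as well, and if you just use `contains(@class, 'someclass')` to make up for that you may end up with more elements than you want, if they have a different class name that shares the string `someclass`.

As it turns out, Scrapy selectors allow you to chain selectors, so most of the time you can just select by class using CSS and then switch to XPath when needed:

```py
>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').getall()
['2014-07-23 19:00']
```

This is cleaner than using the verbose XPath trick shown above. Just remember to use the `.` in the XPath expressions that follow.

### Beware of the difference between //node[1] and (//node)[1]

`//node[1]` selects all the nodes occurring first under their respective parents.

`(//node)[1]` selects all the nodes in the document, and then gets only the first of them.

Example:

```py
>>> from scrapy import Selector
>>> sel = Selector(text="""
...     <ul class="list">
...         <li>1</li>
...         <li>2</li>
...         <li>3</li>
...     </ul>
...     <ul class="list">
...         <li>4</li>
...         <li>5</li>
...         <li>6</li>
...     </ul>""")
>>> xp = lambda x: sel.xpath(x).getall()
```

This gets every `<li>` element that is the first `<li>` under its parent, whatever that parent is:

```py
>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']
```

And this gets the first `<li>` element in the whole document:

```py
>>> xp("(//li)[1]")
['<li>1</li>']
```

This gets every `<li>` element that is the first `<li>` under a `<ul>` parent:

```py
>>> xp("//ul/li[1]")
['<li>1</li>', '<li>4</li>']
```

And this gets the first `<li>` element under a `<ul>` parent in the whole document:

```py
>>> xp("(//ul/li)[1]")
['<li>1</li>']
```

### Using text nodes in a condition

When you need to use the text content as an argument to an [XPath string function](https://www.w3.org/TR/xpath/#section-String-Functions), avoid using `.//text()` and use just `.` instead.

This is because the expression `.//text()` yields a collection of text elements, a _node-set_.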
When a node-set is converted to a string, which happens when it is passed as an argument to a string function like `contains()` or `starts-with()`, the result is the text of the first element only.

Example:

```py
>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
```

Converting a _node-set_ to a string:

```py
>>> sel.xpath('//a//text()').getall()  # take a peek at the node-set
['Click here to go to the ', 'Next Page']
>>> sel.xpath("string(//a[1]//text())").getall()  # convert it to string
['Click here to go to the ']
```

A _node_ converted to a string, however, puts together its own text plus the text of all its descendants:

```py
>>> sel.xpath("//a[1]").getall()  # select the first node
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").getall()  # convert it to string
['Click here to go to the Next Page']
```

So, using the `.//text()` node-set won't select anything in this case:

```py
>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
[]
```

But using `.` to mean the node works:

```py
>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
```

### Variables in XPath expressions

XPath allows you to reference variables in your XPath expressions, using the `$somevariable` syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world, where you replace some arguments in your queries with placeholders like `?`, which are then substituted with values passed along with the query.

Here's an example that matches an element based on its "id" attribute value, without hard-coding it (as was done previously):

```py
>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').get()
'Name: My image 1 '
```

Here's another example, which finds the "id" attribute of a `<div>` tag containing five `<a>` children (here we pass the value `5` as an integer):

```py
>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
'images'
```

All variable references must have a binding value when calling `.xpath()` (otherwise you'll get a `ValueError: XPath error:` exception). This is done by passing as many named arguments as necessary.

[parsel](https://parsel.readthedocs.io/), the library powering Scrapy selectors, has more details and examples on [XPath variables](https://parsel.readthedocs.io/en/latest/usage.html#variables-in-xpath-expressions).
### Removing namespaces

When dealing with scraping projects, it is often quite convenient to get rid of namespaces altogether and just work with element names, in order to write simpler and more convenient XPaths. You can use the `Selector.remove_namespaces()` method for that.

Let's show an example that illustrates this with the Python Insider blog Atom feed.

First, we open the shell with the URL we want to scrape:

```sh
$ scrapy shell https://feeds.feedburner.com/PythonInsider
```

This is how the file starts:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet ...
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
      xmlns:blogger="http://schemas.google.com/blogger/2008"
      xmlns:georss="http://www.georss.org/georss"
      xmlns:gd="http://schemas.google.com/g/2005"
      xmlns:thr="http://purl.org/syndication/thread/1.0"
      xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  ...
```

You can see several namespace declarations, including a default one, `http://www.w3.org/2005/Atom`, and another one using the `gd:` prefix for `http://schemas.google.com/g/2005`.

Once in the shell we can try selecting all `<link>` objects and see that it doesn't work (because the Atom XML namespace is obfuscating those nodes):

```py
>>> response.xpath("//link")
[]
```

But once we call the `Selector.remove_namespaces()` method, all nodes can be accessed directly by their names:

```py
>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data='<link rel="alternate" type="text/html" h'>,
 <Selector xpath='//link' data='<link rel="next" type="application/atom+'>,
 ...
```

If you wonder why the namespace removal procedure isn't always called by default, instead of having to call it manually, there are two reasons, which, in order of relevance, are:

1. Removing namespaces requires iterating over and modifying all nodes in the document, which is a reasonably expensive operation to perform by default for all documents crawled by Scrapy.
2. There could be some cases where using namespaces is actually required, in case some element names clash between namespaces. These cases are very rare though.

### Using EXSLT extensions

Being built atop [lxml](http://lxml.de/), Scrapy selectors support some [EXSLT](http://exslt.org/) extensions and come with these pre-registered namespaces to use in XPath expressions:

| prefix | namespace | usage |
| --- | --- | --- |
| re | http://exslt.org/regular-expressions | [regular expressions](http://exslt.org/regexp/index.html) |
| set | http://exslt.org/sets | [set manipulation](http://exslt.org/set/index.html) |

#### Regular expressions

The `test()` function, for example, can prove quite useful when XPath's `starts-with()` or `contains()` are not sufficient.

Example selecting links in list items with a "class" attribute ending with a digit:

```py
>>> from scrapy import Selector
>>> doc = u"""
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').getall()
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
>>> # raw string, so that `\d` reaches the regex engine unchanged
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').getall()
['link1.html', 'link2.html', 'link4.html', 'link5.html']
>>>
```

Warning: The C library `libxslt` doesn't natively support EXSLT regular expressions, so [lxml](http://lxml.de/)'s implementation uses hooks to Python's `re` module. Thus, using regexp functions in your XPath expressions may add a small performance penalty.

#### Set operations

These can be handy for excluding parts of a document tree before extracting text elements, for example.

Example extracting microdata (sample content taken from http://schema.org/Product) with groups of itemscopes and corresponding itemprops:

```py
>>> from scrapy import Selector
>>> doc = u"""
... <div itemscope itemtype="http://schema.org/Product">
...   <span itemprop="name">Kenmore White 17" Microwave</span>
...   <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
...   <div itemprop="aggregateRating"
...     itemscope itemtype="http://schema.org/AggregateRating">
...    Rated <span itemprop="ratingValue">3.5</span>/5
...    based on <span itemprop="reviewCount">11</span> customer reviews
...   </div>
...
...   <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
...     <span itemprop="price">$55.00</span>
...     <link itemprop="availability" href="http://schema.org/InStock" />In stock
...   </div>
...
...   Product description:
...   <span itemprop="description">0.7 cubic feet countertop microwave.
...   Has six preset cooking categories and convenience features like
...   Add-A-Minute and Child Lock.</span>
...
...   Customer reviews:
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Not a happy camper</span> -
...     by <span itemprop="author">Ellie</span>,
...     <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1">
...       <span itemprop="ratingValue">1</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">The lamp burned out and now I have to replace
...     it. </span>
...   </div>
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Value purchase</span> -
...     by <span itemprop="author">Lucas</span>,
...     <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1"/>
...       <span itemprop="ratingValue">4</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">Great microwave for the price. It is small and
...     fits in my apartment.</span>
...   </div>
...   ...
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> for scope in sel.xpath('//div[@itemscope]'):
...     print("current scope:", scope.xpath('@itemtype').getall())
...     props = scope.xpath('''
...                 set:difference(./descendant::*/@itemprop,
...                                .//*[@itemscope]/*/@itemprop)''')
...     print("    properties: %s" % (props.getall()))
...     print("")

current scope: ['http://schema.org/Product']
    properties: ['name', 'aggregateRating', 'offers', 'description', 'review', 'review']

current scope: ['http://schema.org/AggregateRating']
    properties: ['ratingValue', 'reviewCount']

current scope: ['http://schema.org/Offer']
    properties: ['price', 'availability']

current scope: ['http://schema.org/Review']
    properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']

current scope: ['http://schema.org/Rating']
    properties: ['worstRating', 'ratingValue', 'bestRating']

current scope: ['http://schema.org/Review']
    properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']

current scope: ['http://schema.org/Rating']
    properties: ['worstRating', 'ratingValue', 'bestRating']

>>>
```

Here we first iterate over the `itemscope` elements, and for each one we look for all `itemprops` elements and exclude those that are themselves inside another `itemscope`.

### Other XPath extensions

Scrapy selectors also provide a sorely missed XPath extension function, `has-class`, that returns `True` for nodes that have all of the specified HTML classes.

For the following HTML:

```html
<p class="foo bar-baz">First</p>
<p class="foo">Second</p>
<p class="bar">Third</p>
<p>Fourth</p>
```

You can use it like this:

```py
>>> response.xpath('//p[has-class("foo")]')
[<Selector xpath='//p[has-class("foo")]' data='<p class="foo bar-baz">First</p>'>,
 <Selector xpath='//p[has-class("foo")]' data='<p class="foo">Second</p>'>]
>>> response.xpath('//p[has-class("foo", "bar-baz")]')
[<Selector xpath='//p[has-class("foo", "bar-baz")]' data='<p class="foo bar-baz">First</p>'>]
>>> response.xpath('//p[has-class("foo", "bar")]')
[]
```

So the XPath `//p[has-class("foo", "bar-baz")]` is roughly equivalent to the CSS `p.foo.bar-baz`. Please note that it is slower in most cases, because it is a pure-Python function that is invoked for every node in question, whereas the CSS lookup is translated into XPath and thus runs more efficiently. Performance-wise, its use is therefore best limited to situations that are not easily described with CSS selectors.

Parsel also simplifies adding your own XPath extensions.

```py
parsel.xpathfuncs.set_xpathfunc(fname, func)
```

Register a custom extension function to use in XPath expressions.

The function `func` registered under the `fname` identifier will be called for every matching node, being passed a `context` parameter as well as any parameters passed from the corresponding XPath expression.

If `func` is `None`, the extension function will be removed.

See more [in lxml documentation](http://lxml.de/extensions.html#xpath-extension-functions).
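## Built-in Selectors reference

### Selector objects

```py
class scrapy.selector.Selector(response=None, text=None, type=None, root=None, _root=None, **kwargs)
```

An instance of [`Selector`](#scrapy.selector.Selector "scrapy.selector.Selector") is a wrapper over a response, used to select certain parts of its content.

`response` is an [`HtmlResponse`](request-response.html#scrapy.http.HtmlResponse "scrapy.http.HtmlResponse") or an [`XmlResponse`](request-response.html#scrapy.http.XmlResponse "scrapy.http.XmlResponse") object that will be used for selecting and extracting data.

`text` is a unicode string or utf-8 encoded text for cases when a `response` isn't available. Using `text` and `response` together is undefined behavior.

`type` defines the selector type; it can be `"html"`, `"xml"` or `None` (default).

If `type` is `None`, the selector automatically chooses the best type based on the `response` type (see below), or defaults to `"html"` when it is used together with `text`.

If `type` is `None` and a `response` is passed, the selector type is inferred from the response type as follows:

* `"html"` for [`HtmlResponse`](request-response.html#scrapy.http.HtmlResponse "scrapy.http.HtmlResponse") type
* `"xml"` for [`XmlResponse`](request-response.html#scrapy.http.XmlResponse "scrapy.http.XmlResponse") type
* `"html"` for anything else

Otherwise, if `type` is set, the selector type will be forced and no detection will occur.

```py
xpath(query, namespaces=None, **kwargs)
```

Find nodes matching the XPath `query` and return the result as a [`SelectorList`](#scrapy.selector.SelectorList "scrapy.selector.SelectorList") instance with all elements flattened. List elements implement the [`Selector`](#scrapy.selector.Selector "scrapy.selector.Selector") interface too.

`query` is a string containing the XPath query to apply.

`namespaces` is an optional `prefix: namespace-uri` mapping (dict) for additional prefixes, on top of those registered with `register_namespace(prefix, uri)`. Contrary to `register_namespace()`, these prefixes are not saved for future calls.

Any additional named arguments can be used to pass values for XPath variables in the XPath expression, e.g.:

```py
selector.xpath('//a[@href=$url]', url="http://www.example.com")
```

Note: For convenience, this method can be called as `response.xpath()`.

```py
css(query)
```

Apply the given CSS selector and return a [`SelectorList`](#scrapy.selector.SelectorList "scrapy.selector.SelectorList") instance.

`query` is a string containing the CSS selector to apply.

In the background, CSS queries are translated into XPath queries using the [cssselect](https://pypi.python.org/pypi/cssselect/) library and run with the `.xpath()` method.

Note: For convenience, this method can be called as `response.css()`.

```py
get()
```

Serialize and return the matched nodes in a single unicode string. Percent-encoded content is unquoted.

See also: [extract() and extract_first()](#old-extraction-api)

```py
attrib
```

Return the attributes dictionary for the underlying element.

See also: [Selecting element attributes](#selecting-attributes).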
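```py
re(regex, replace_entities=True)
```

Apply the given regex and return a list of unicode strings with the matches.

`regex` can be either a compiled regular expression or a string which will be compiled to a regular expression using `re.compile(regex)`.

By default, character entity references are replaced by their corresponding character (except for `&amp;` and `&lt;`). Passing `replace_entities` as `False` switches off these replacements.

```py
re_first(regex, default=None, replace_entities=True)
```

Apply the given regex and return the first unicode string which matches. If there is no match, return the default value (`None` if the argument is not provided).

By default, character entity references are replaced by their corresponding character (except for `&amp;` and `&lt;`). Passing `replace_entities` as `False` switches off these replacements.

```py
register_namespace(prefix, uri)
```

Register the given namespace to be used in this [`Selector`](#scrapy.selector.Selector "scrapy.selector.Selector"). Without registering namespaces you can't select or extract data from non-standard namespaces. See [Selector examples on XML response](#selector-examples-xml).

```py
remove_namespaces()
```

Remove all namespaces, allowing the document to be traversed using namespace-less XPaths. See [Removing namespaces](#removing-namespaces).

```py
__bool__()
```

Return `True` if there is any real content selected, or `False` otherwise. In other words, the boolean value of a [`Selector`](#scrapy.selector.Selector "scrapy.selector.Selector") is given by the contents it selects.

```py
getall()
```

Serialize and return the matched node in a 1-element list of unicode strings.

This method is added to Selector for consistency; it is more useful with SelectorList. See also: [extract() and extract_first()](#old-extraction-api)

### SelectorList objects

```py
class scrapy.selector.SelectorList
```

The [`SelectorList`](#scrapy.selector.SelectorList "scrapy.selector.SelectorList") class is a subclass of the builtin `list` class, which provides a few additional methods.

```py
xpath(xpath, namespaces=None, **kwargs)
```

Call the `.xpath()` method for each element in this list and return their results flattened as another [`SelectorList`](#scrapy.selector.SelectorList "scrapy.selector.SelectorList").

`xpath` is the same argument as `query` in [`Selector.xpath()`](#scrapy.selector.Selector.xpath "scrapy.selector.Selector.xpath").

`namespaces` is an optional `prefix: namespace-uri` mapping (dict) for additional prefixes, on top of those registered with `register_namespace(prefix, uri)`. Contrary to `register_namespace()`, these prefixes are not saved for future calls.

Any additional named arguments can be used to pass values for XPath variables in the XPath expression, e.g.:

```py
selector.xpath('//a[@href=$url]', url="http://www.example.com")
```

```py
css(query)
```

Call the `.css()` method for each element in this list and return their results flattened as another [`SelectorList`](#scrapy.selector.SelectorList "scrapy.selector.SelectorList").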
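`query` is the same argument as the one in [`Selector.css()`](#scrapy.selector.Selector.css "scrapy.selector.Selector.css").

```py
getall()
```

Call the `.get()` method for each element in this list and return their results flattened, as a list of unicode strings.

See also: [extract() and extract_first()](#old-extraction-api)

```py
get(default=None)
```

Return the result of `.get()` for the first element in this list. If the list is empty, return the default value.

See also: [extract() and extract_first()](#old-extraction-api)

```py
re(regex, replace_entities=True)
```

Call the `.re()` method for each element in this list and return their results flattened, as a list of unicode strings.

By default, character entity references are replaced by their corresponding character (except for `&amp;` and `&lt;`). Passing `replace_entities` as `False` switches off these replacements.

```py
re_first(regex, default=None, replace_entities=True)
```

Call the `.re()` method for the first element in this list and return the result as a unicode string. If the list is empty or the regex doesn't match anything, return the default value (`None` if the argument is not provided).

By default, character entity references are replaced by their corresponding character (except for `&amp;` and `&lt;`). Passing `replace_entities` as `False` switches off these replacements.

```py
attrib
```

Return the attributes dictionary for the first element. If the list is empty, return an empty dict.

See also: [Selecting element attributes](#selecting-attributes).

## Examples

### Selector examples on HTML response

Here are some [`Selector`](#scrapy.selector.Selector "scrapy.selector.Selector") examples to illustrate several concepts. In all cases, we assume there is already a [`Selector`](#scrapy.selector.Selector "scrapy.selector.Selector") instantiated with an [`HtmlResponse`](request-response.html#scrapy.http.HtmlResponse "scrapy.http.HtmlResponse") object like this:

```py
sel = Selector(html_response)
```

1. Select all `<h1>` elements from an HTML response body, returning a list of [`Selector`](#scrapy.selector.Selector "scrapy.selector.Selector") objects (i.e. a [`SelectorList`](#scrapy.selector.SelectorList "scrapy.selector.SelectorList") object):

   ```py
   sel.xpath("//h1")
   ```

2. Extract the text of all `<h1>` elements from an HTML response body, returning a list of unicode strings:

   ```py
   sel.xpath("//h1").getall()         # this includes the h1 tag
   sel.xpath("//h1/text()").getall()  # this excludes the h1 tag
   ```

3. Iterate over all `<p>` tags and print their class attribute:

   ```py
   for node in sel.xpath("//p"):
       print(node.attrib['class'])
   ```

### Selector examples on XML response

Here are some examples to illustrate concepts for [`Selector`](#scrapy.selector.Selector "scrapy.selector.Selector") objects instantiated with an [`XmlResponse`](request-response.html#scrapy.http.XmlResponse "scrapy.http.XmlResponse") object:

```py
sel = Selector(xml_response)
```

1. Select all `<product>` elements from an XML response body, returning a list of [`Selector`](#scrapy.selector.Selector "scrapy.selector.Selector") objects (i.e. a [`SelectorList`](#scrapy.selector.SelectorList "scrapy.selector.SelectorList") object):

   ```py
   sel.xpath("//product")
   ```

2. Extract all prices from a [Google Base XML feed](https://support.google.com/merchants/answer/160589?hl=en&ref_topic=2473799), which requires registering a namespace:

   ```py
   sel.register_namespace("g", "http://base.google.com/ns/1.0")
   sel.xpath("//g:price").getall()
   ```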