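Selectors

When you scrape web pages, the most common task is extracting data from the HTML source. Several libraries can do this, such as:

BeautifulSoup is a very popular web scraping library among Python programmers. It builds Python objects based on the structure of the HTML code and copes reasonably well with bad markup, but it has one drawback: it is slow.
lxml is an XML parsing library (which also parses HTML) with a Pythonic API based on ElementTree. (lxml is not part of the Python standard library.)

Scrapy comes with its own mechanism for extracting data. These are called selectors because they "select" certain parts of the HTML document, specified by either XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents, and it can also be used with HTML. CSS is a language for applying styles to HTML documents. It defines selectors to associate those styles with specific HTML elements.

Note

Scrapy selectors are a thin wrapper around the parsel library; the purpose of this wrapper is to provide better integration with Scrapy Response objects.

parsel is a stand-alone web scraping library that can be used without Scrapy. It uses the lxml library under the hood and implements an easy API on top of the lxml API. This means Scrapy selectors are very similar to lxml in speed and parsing accuracy.

Using selectors

Constructing selectors

Response objects expose a Selector instance through the .selector attribute: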

>>> response.selector.xpath("//span/text()").get()
'good'

Querying responses with XPath and CSS is so common that responses include two shortcuts: response.xpath() and response.css():

>>> response.xpath("//span/text()").get()
'good'
>>> response.css("span::text").get()
'good'

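Scrapy selectors are instances of the Selector class, constructed by passing either a TextResponse object or markup as a string (in the text argument).

Usually there is no need to construct Scrapy selectors manually: the response object is available in Spider callbacks, so in most cases it is more convenient to use the response.css() and response.xpath() shortcuts. By using response.selector or one of these shortcuts you can also ensure the response body is parsed only once.

But if required, you can use Selector directly. Constructing from text: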

>>> from scrapy.selector import Selector
>>> body = "<html><body><span>good</span></body></html>"
>>> Selector(text=body).xpath("//span/text()").get()
'good'

Constructing from a response (HtmlResponse is one of the TextResponse subclasses):

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="http://example.com", body=body, encoding="utf-8")
>>> Selector(response=response).xpath("//span/text()").get()
'good'

Selector automatically chooses the best parsing rules (XML vs HTML) based on the input type.

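Using selectors

To explain how to use the selectors, we'll use the Scrapy shell (which provides interactive testing) and an example page located on the Scrapy documentation server:

https://docs.scrapy.net.cn/en/latest/_static/selectors-sample1.html

For the sake of completeness, here is its full HTML code: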

<!DOCTYPE html>

<html>
  <head>
    <base href='http://example.com/' />
    <title>Example website</title>
  </head>
  <body>
    <div id='images'>
      <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' alt='image1'/></a>
      <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' alt='image2'/></a>
      <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' alt='image3'/></a>
      <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' alt='image4'/></a>
      <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' alt='image5'/></a>
    </div>
  </body>
</html>

First, let's open the shell:

scrapy shell https://docs.scrapy.net.cn/en/latest/_static/selectors-sample1.html

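Then, once the shell loads, the response is available as the response shell variable, with its attached selector in the response.selector attribute.

Since we're dealing with HTML, the selector automatically uses an HTML parser.

So, by looking at the HTML code of that page, let's construct an XPath for selecting the text inside the title tag: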

>>> response.xpath("//title/text()")
[<Selector query='//title/text()' data='Example website'>]

To actually extract the textual data, you must call the selector's .get() or .getall() methods, as follows:

>>> response.xpath("//title/text()").getall()
['Example website']
>>> response.xpath("//title/text()").get()
'Example website'

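.get() always returns a single result; if there are several matches, the content of the first match is returned; if there are no matches, None is returned. .getall() returns a list with all results.

Note that CSS selectors can select text or attribute nodes using CSS3 pseudo-elements: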

>>> response.css("title::text").get()
'Example website'

As you can see, the .xpath() and .css() methods return a SelectorList instance, which is a list of new selectors. This API can be used to quickly select nested data:

>>> response.css("img").xpath("@src").getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

If you want to extract only the first matched element, you can call the selector's .get() (or its alias .extract_first(), commonly used in previous Scrapy versions):

>>> response.xpath('//div[@id="images"]/a/text()').get()
'Name: My image 1 '

It returns None if no element was found:

>>> response.xpath('//div[@id="not-exists"]/text()').get() is None
True

A default return value can be provided as an argument, to be used instead of None:

>>> response.xpath('//div[@id="not-exists"]/text()').get(default="not-found")
'not-found'

Instead of using e.g. the '@src' XPath, you can query for attributes using the .attrib property of a Selector:

>>> [img.attrib["src"] for img in response.css("img")]
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

As a shortcut, .attrib is also available on SelectorList directly; it returns attributes for the first matching element:

>>> response.css("img").attrib["src"]
'image1_thumb.jpg'

This is most useful when only a single result is expected, e.g. when selecting by id, or when selecting unique elements on a web page:

>>> response.css("base").attrib["href"]
'http://example.com/'

Now we're going to get the base URL and some image links:

>>> response.xpath("//base/@href").get()
'http://example.com/'

>>> response.css("base::attr(href)").get()
'http://example.com/'

>>> response.css("base").attrib["href"]
'http://example.com/'

>>> response.xpath('//a[contains(@href, "image")]/@href').getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']

>>> response.css("a[href*=image]::attr(href)").getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

>>> response.css("a[href*=image] img::attr(src)").getall()
['image1_thumb.jpg',
'image2_thumb.jpg',
'image3_thumb.jpg',
'image4_thumb.jpg',
'image5_thumb.jpg']

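Extensions to CSS selectors

Per W3C standards, CSS selectors do not support selecting text nodes or attribute values. But selecting these is so essential in a web scraping context that Scrapy (parsel) implements a couple of non-standard pseudo-elements:

to select text nodes, use ::text
to select attribute values, use ::attr(name), where name is the name of the attribute whose value you want

Warning

These pseudo-elements are specific to Scrapy/Parsel. They will most probably not work with other libraries such as lxml or PyQuery.

Examples:

title::text selects the child text nodes of a descendant <title> element: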

>>> response.css("title::text").get()
'Example website'

*::text selects all descendant text nodes of the current selector context:

>>> response.css("#images *::text").getall()
['\n   ',
'Name: My image 1 ',
'\n   ',
'Name: My image 2 ',
'\n   ',
'Name: My image 3 ',
'\n   ',
'Name: My image 4 ',
'\n   ',
'Name: My image 5 ',
'\n  ']

foo::text returns no results if the foo element exists but contains no text (i.e. the text is empty):

>>> response.css("img::text").getall()
[]

This means .css('foo::text').get() could return None even if an element exists. Use default='' if you always want a string:

>>> response.css("img::text").get()
>>> response.css("img::text").get(default="")
''

a::attr(href) selects the href attribute value of descendant links:

>>> response.css("a::attr(href)").getall()
['image1.html',
'image2.html',
'image3.html',
'image4.html',
'image5.html']

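Note

See also: Selecting element attributes.

Note

You cannot chain these pseudo-elements. But in practice it would not make much sense: text nodes do not have attributes, and attribute values are already string values and do not have child nodes.

Nested selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here is an example: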

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg" alt="image1"></a>',
'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg" alt="image2"></a>',
'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg" alt="image3"></a>',
'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg" alt="image4"></a>',
'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg" alt="image5"></a>']

>>> for index, link in enumerate(links):
...     href_xpath = link.xpath("@href").get()
...     img_xpath = link.xpath("img/@src").get()
...     print(f"Link number {index} points to url {href_xpath!r} and image {img_xpath!r}")
...
Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'

Selecting element attributes

There are several ways to get the value of an attribute. First, you can use XPath syntax:

>>> response.xpath("//a/@href").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

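XPath syntax has a few advantages: it is a standard XPath feature, and @attributes can be used in other parts of an XPath expression, e.g. to filter by attribute value.

Scrapy also provides an extension to CSS selectors (::attr(...)) that allows getting attribute values: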

>>> response.css("a::attr(href)").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

In addition, Selector has an .attrib property. You can use it if you prefer to look up attributes in Python code, without using XPath or CSS extensions:

>>> [a.attrib["href"] for a in response.css("a")]
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

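This property is also available on SelectorList; it returns a dictionary with the attributes of the first matching element. It is convenient when a selector is expected to give a single result (e.g. when selecting by element ID, or when selecting a unique element on a page):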

>>> response.css("base").attrib
{'href': 'http://example.com/'}
>>> response.css("base").attrib["href"]
'http://example.com/'

The .attrib property of an empty SelectorList is empty:

>>> response.css("foo").attrib
{}

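Using selectors with regular expressions

Selector also has a .re() method for extracting data using regular expressions. However, unlike the .xpath() or .css() methods, .re() returns a list of strings, so you cannot construct nested .re() calls.

Here is an example that extracts image names from the HTML code above: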

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r"Name:\s*(.*)")
['My image 1 ',
'My image 2 ',
'My image 3 ',
'My image 4 ',
'My image 5 ']

There is an additional helper for .re(), named .re_first(), which works like .get() (and its alias .extract_first()). Use it to extract just the first matching string:

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r"Name:\s*(.*)")
'My image 1 '

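extract() and extract_first()

If you are a long-time Scrapy user, you are probably familiar with the .extract() and .extract_first() selector methods. Many blog posts and tutorials use them as well. These methods are still supported by Scrapy, and there are no plans to deprecate them.

However, the Scrapy usage docs are now written using the .get() and .getall() methods. We feel that these new methods result in more concise and readable code.

The following examples show how these methods map to each other.

SelectorList.get() is the same as SelectorList.extract_first():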

>>> response.css("a::attr(href)").get()
'image1.html'
>>> response.css("a::attr(href)").extract_first()
'image1.html'

SelectorList.getall() is the same as SelectorList.extract():

>>> response.css("a::attr(href)").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.css("a::attr(href)").extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

Selector.get() is the same as Selector.extract():

>>> response.css("a::attr(href)")[0].get()
'image1.html'
>>> response.css("a::attr(href)")[0].extract()
'image1.html'

For consistency, there is also a Selector.getall(), which returns a list:

>>> response.css("a::attr(href)")[0].getall()
['image1.html']

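So, the main difference is that the output of the .get() and .getall() methods is more predictable: .get() always returns a single result, and .getall() always returns a list of all extracted results. With the .extract() method it was not always obvious whether a result was a list or not; to get a single result, either .extract() or .extract_first() had to be called, depending on whether it was invoked on a Selector or on a SelectorList.

Working with XPaths

Here are some tips that may help you use XPath with Scrapy selectors effectively. If you are not very familiar with XPath yet, you may want to take a look at this XPath tutorial first.

Note

Some of these tips are based on this post from the Zyte blog.

Working with relative XPaths

Keep in mind that if you nest selectors and use an XPath that starts with /, that XPath is absolute to the document and not relative to the Selector you call it from.

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements: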

>>> divs = response.xpath("//div")

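At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements: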

>>> for p in divs.xpath("//p"):  # this is wrong - gets all <p> from the whole document
...     print(p.get())
...

This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.xpath(".//p"):  # extracts all <p> inside
...     print(p.get())
...

Another common case is extracting all direct <p> children:

>>> for p in divs.xpath("p"):
...     print(p.get())
...

For more details about relative XPaths, see the Location Paths section in the XPath specification.

When querying by class, consider using CSS

Because an element can contain multiple CSS classes, the XPath way to select elements by class is rather verbose:

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

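If you use @class='someclass' you may end up missing elements that have other classes, and if you just use contains(@class, 'someclass') to make up for that you may end up with more elements than you want, if they have a different class name that shares the string someclass.

As it turns out, Scrapy selectors allow you to chain selectors, so most of the time you can just select by class using CSS and then switch to XPath when needed: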

>>> from scrapy import Selector
>>> sel = Selector(
...     text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>'
... )
>>> sel.css(".shout").xpath("./time/@datetime").getall()
['2014-07-23 19:00']

This is cleaner than using the verbose XPath trick shown above. Just remember to use the . in the XPath expressions that follow.

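Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.

(//node)[1] selects all the nodes in the document, and then gets only the first of them.

Example: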

>>> from scrapy import Selector
>>> sel = Selector(
...     text="""
...     <ul class="list">
...         <li>1</li>
...         <li>2</li>
...         <li>3</li>
...     </ul>
...     <ul class="list">
...         <li>4</li>
...         <li>5</li>
...         <li>6</li>
...     </ul>"""
... )
>>> xp = lambda x: sel.xpath(x).getall()

This gets all first <li> elements under whatever is their parent:

>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']

And this gets the first <li> element in the whole document:

>>> xp("(//li)[1]")
['<li>1</li>']

This gets all first <li> elements under a <ul> parent:

>>> xp("//ul/li[1]")
['<li>1</li>', '<li>4</li>']

And this gets the first <li> element under a <ul> parent in the whole document:

>>> xp("(//ul/li)[1]")
['<li>1</li>']

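Using text nodes in a condition

When you need to use the text content as an argument to an XPath string function, avoid using .//text() and use just . instead.

This is because the expression .//text() yields a collection of text elements, a node-set. When a node-set is converted to a string, which happens when it is passed as an argument to a string function like contains() or starts-with(), the result is the text of the first element only.

Example: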

>>> from scrapy import Selector
>>> sel = Selector(
...     text='<a href="#">Click here to go to the <strong>Next Page</strong></a>'
... )

Converting a node-set to string:

>>> sel.xpath("//a//text()").getall()  # take a peek at the node-set
['Click here to go to the ', 'Next Page']
>>> sel.xpath("string(//a[1]//text())").getall()  # convert it to string
['Click here to go to the ']

A node converted to a string, however, puts together the text of itself plus all of its descendants:

>>> sel.xpath("//a[1]").getall()  # select the first node
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").getall()  # convert it to string
['Click here to go to the Next Page']

So, using the .//text() node-set won't select anything in this case:

>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
[]

But using the . to mean the node works:

>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']

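Variables in XPath expressions

XPath allows you to reference variables in your XPath expressions, using the $somevariable syntax. This is somewhat similar to parameterized queries or prepared statements in the SQL world, where you replace some arguments in your queries with placeholders like ?, which are then substituted with values passed with the query.

Here is an example that matches an element based on its "id" attribute value, without hard-coding it (that was shown previously):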

>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath("//div[@id=$val]/a/text()", val="images").get()
'Name: My image 1 '

Here is another example, to find the "id" attribute of a <div> tag containing five <a> children (here we pass the value 5 as an integer):

>>> response.xpath("//div[count(a)=$cnt]/@id", cnt=5).get()
'images'

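All variable references must have a binding value when calling .xpath() (otherwise you'll get a ValueError: XPath error: exception). This is done by passing as many named arguments as necessary.

parsel, the library powering Scrapy selectors, has more details and examples on XPath variables.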

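Removing namespaces

When dealing with scraping projects, it is often quite convenient to get rid of namespaces altogether and just work with element names, to write simpler and more convenient XPaths. You can use the Selector.remove_namespaces() method for that.

Let's show an example that illustrates this with the Python Insider blog Atom feed.

First, we open the shell with the URL we want to scrape: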

$ scrapy shell https://feeds.feedburner.com/PythonInsider

This is how the file starts:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet ...
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
      xmlns:blogger="http://schemas.google.com/blogger/2008"
      xmlns:georss="http://www.georss.org/georss"
      xmlns:gd="http://schemas.google.com/g/2005"
      xmlns:thr="http://purl.org/syndication/thread/1.0"
      xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  ...

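You can see several namespace declarations, including a default "http://www.w3.org/2005/Atom" and another one using the gd: prefix for "http://schemas.google.com/g/2005".

Once in the shell we can try selecting all <link> objects and see that it doesn't work (because the Atom XML namespace is obfuscating those nodes):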

>>> response.xpath("//link")
[]

But once we call the Selector.remove_namespaces() method, all nodes can be accessed directly by their names:

>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector query='//link' data='<link rel="alternate" type="text/html" h'>,
    <Selector query='//link' data='<link rel="next" type="application/atom+'>,
    ...

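If you wonder why the namespace removal procedure isn't always called by default instead of having to call it manually, this is because of two reasons which, in order of relevance, are:

Removing namespaces requires iterating over and modifying all nodes in the document, which is a reasonably expensive operation to perform by default for all documents crawled by Scrapy.
There could be some cases where using namespaces is actually required, in case some element names clash between namespaces. These cases are very rare, though.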

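Using EXSLT extensions

Being built atop lxml, Scrapy selectors support some EXSLT extensions and come with these pre-registered namespaces to use in XPath expressions:

prefix	namespace	usage
re	http://exslt.org/regular-expressions	regular expressions
set	http://exslt.org/sets	set manipulation

Regular expressions

The test() function, for example, can prove quite useful when XPath's starts-with() or contains() are not sufficient.

Example selecting links in list items with a "class" attribute ending with a digit: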

>>> from scrapy import Selector
>>> doc = """
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath("//li//@href").getall()
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').getall()
['link1.html', 'link2.html', 'link4.html', 'link5.html']

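Warning

The C library libxslt doesn't natively support EXSLT regular expressions, so lxml's implementation uses hooks to Python's re module. Thus, using regexp functions in your XPath expressions may add a small performance penalty.

Set operations

These can be handy for excluding parts of a document tree before extracting text elements, for example.

Example extracting microdata (sample content taken from https://schema.org/Product) with groups of itemscopes and corresponding itemprops: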

>>> doc = """
... <div itemscope itemtype="http://schema.org/Product">
...   <span itemprop="name">Kenmore White 17" Microwave</span>
...   <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
...   <div itemprop="aggregateRating"
...     itemscope itemtype="http://schema.org/AggregateRating">
...    Rated <span itemprop="ratingValue">3.5</span>/5
...    based on <span itemprop="reviewCount">11</span> customer reviews
...   </div>
...   <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
...     <span itemprop="price">$55.00</span>
...     <link itemprop="availability" href="http://schema.org/InStock" />In stock
...   </div>
...   Product description:
...   <span itemprop="description">0.7 cubic feet countertop microwave.
...   Has six preset cooking categories and convenience features like
...   Add-A-Minute and Child Lock.</span>
...   Customer reviews:
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Not a happy camper</span> -
...     by <span itemprop="author">Ellie</span>,
...     <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1">
...       <span itemprop="ratingValue">1</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">The lamp burned out and now I have to replace
...     it. </span>
...   </div>
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Value purchase</span> -
...     by <span itemprop="author">Lucas</span>,
...     <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1"/>
...       <span itemprop="ratingValue">4</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">Great microwave for the price. It is small and
...     fits in my apartment.</span>
...   </div>
...   ...
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> for scope in sel.xpath("//div[@itemscope]"):
...     print("current scope:", scope.xpath("@itemtype").getall())
...     props = scope.xpath(
...         """
...                 set:difference(./descendant::*/@itemprop,
...                                .//*[@itemscope]/*/@itemprop)"""
...     )
...     print(f"    properties: {props.getall()}")
...     print("")
...

current scope: ['http://schema.org/Product']
    properties: ['name', 'aggregateRating', 'offers', 'description', 'review', 'review']

current scope: ['http://schema.org/AggregateRating']
    properties: ['ratingValue', 'reviewCount']

current scope: ['http://schema.org/Offer']
    properties: ['price', 'availability']

current scope: ['http://schema.org/Review']
    properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']

current scope: ['http://schema.org/Rating']
    properties: ['worstRating', 'ratingValue', 'bestRating']

current scope: ['http://schema.org/Review']
    properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']

current scope: ['http://schema.org/Rating']
    properties: ['worstRating', 'ratingValue', 'bestRating']

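Here we first iterate over itemscope elements, and for each one, we look for all itemprops elements and exclude those that are themselves inside another itemscope.

Other XPath extensions

Scrapy selectors also provide a sorely missed XPath extension function has-class that returns True for nodes that have all of the specified HTML classes.

For the following HTML: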

>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(
...     url="http://example.com",
...     body="""
... <html>
...     <body>
...         <p class="foo bar-baz">First</p>
...         <p class="foo">Second</p>
...         <p class="bar">Third</p>
...         <p>Fourth</p>
...     </body>
... </html>
... """,
...     encoding="utf-8",
... )

You can use it like this:

>>> response.xpath('//p[has-class("foo")]')
[<Selector query='//p[has-class("foo")]' data='<p class="foo bar-baz">First</p>'>,
<Selector query='//p[has-class("foo")]' data='<p class="foo">Second</p>'>]
>>> response.xpath('//p[has-class("foo", "bar-baz")]')
[<Selector query='//p[has-class("foo", "bar-baz")]' data='<p class="foo bar-baz">First</p>'>]
>>> response.xpath('//p[has-class("foo", "bar")]')
[]

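So the XPath //p[has-class("foo", "bar-baz")] is roughly equivalent to the CSS p.foo.bar-baz. Note that it is slower in most cases, because it is a pure-Python function invoked for every node in question, whereas the CSS lookup is translated into XPath and thus runs more efficiently. Performance-wise, its use is therefore limited to situations that are not easily described with CSS selectors.

Parsel also simplifies adding your own XPath extensions with set_xpathfunc().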

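Built-in Selectors reference

Selector objects

class scrapy.selector.Selector(*args: Any, **kwargs: Any)

An instance of Selector is a wrapper over a response to select certain parts of its content.

response is an HtmlResponse or an XmlResponse object that will be used for selecting and extracting data.

text is a unicode string or utf-8 encoded text for cases when a response isn't available. Using text and response together is undefined behavior.

type defines the selector type; it can be "html", "xml", "json" or None (default).

If type is None, the selector automatically chooses the best type based on the response type (see below), or defaults to "html" in case it is used together with text.

If type is None and a response is passed, the selector type is inferred from the response type as follows:

"html" for HtmlResponse type
"xml" for XmlResponse type
"json" for TextResponse type
"html" for anything else

Otherwise, if type is set, the selector type will be forced and no detection will occur.

xpath(query: str, namespaces: Mapping[str, str] | None = None, **kwargs: Any) → SelectorList[_SelectorType]

Find nodes matching the xpath query and return the result as a SelectorList instance with all elements flattened. List elements implement the Selector interface too.

query is a string containing the XPath query to apply.

namespaces is an optional prefix: namespace-uri mapping (dict) for additional prefixes to those registered with register_namespace(prefix, uri). Contrary to register_namespace(), these prefixes are not saved for future calls.

Any additional named arguments can be used to pass values for XPath variables in the XPath expression, e.g.: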

selector.xpath('//a[@href=$url]', url="http://www.example.com")

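Note

For convenience, this method can be called as response.xpath()

css(query: str) → SelectorList[_SelectorType]

Apply the given CSS selector and return a SelectorList instance.

query is a string containing the CSS selector to apply.

In the background, CSS queries are translated into XPath queries using the cssselect library and run through the .xpath() method.

Note

For convenience, this method can be called as response.css()

jmespath(query: str, **kwargs: Any) → SelectorList[_SelectorType]

Find objects matching the JMESPath query and return the result as a SelectorList instance with all elements flattened. List elements implement the Selector interface too.

query is a string containing the JMESPath query to apply.

Any additional named arguments are passed to the underlying jmespath.search call, e.g.: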

selector.jmespath('author.name', options=jmespath.Options(dict_cls=collections.OrderedDict))

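Note

For convenience, this method can be called as response.jmespath()

get() → Any

Serialize and return the matched nodes.

For HTML and XML, the result is always a string, and percent-encoded content is unquoted.

See also: extract() and extract_first()

attrib

Return the attributes dictionary for the underlying element.

See also: Selecting element attributes.

re(regex: str | Pattern[str], replace_entities: bool = True) → List[str]

Apply the given regex and return a list of strings with the matches.

regex can be either a compiled regular expression or a string which will be compiled to a regular expression using re.compile(regex).

By default, character entity references are replaced by their corresponding character (except for &amp; and &lt;). Passing replace_entities as False switches off these replacements.

re_first(regex: str | Pattern[str], default: None = None, replace_entities: bool = True) → str | None

re_first(regex: str | Pattern[str], default: str, replace_entities: bool = True) → str

Apply the given regex and return the first string which matches. If there is no match, return the default value (None if the argument is not provided).

By default, character entity references are replaced by their corresponding character (except for &amp; and &lt;). Passing replace_entities as False switches off these replacements.

register_namespace(prefix: str, uri: str) → None: Register the given namespace to be used in this Selector. Without registering namespaces you can't select or extract data from non-standard namespaces. See Selector examples on XML response.

remove_namespaces() → None: Remove all namespaces, allowing the document to be traversed using namespace-less xpaths. See Removing namespaces.

__bool__() → bool: Return True if there is any real content selected, or False otherwise. In other words, the boolean value of a Selector is given by the contents it selects.

getall() → List[str]

Serialize and return the matched node in a 1-element list of strings.

This method is added to Selector for consistency; it is more useful with SelectorList. See also: extract() and extract_first()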

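SelectorList objects

class scrapy.selector.SelectorList(iterable=(), /)

The SelectorList class is a subclass of the builtin list class, which provides a few additional methods.

xpath(xpath: str, namespaces: Mapping[str, str] | None = None, **kwargs: Any) → SelectorList[_SelectorType]

Call the .xpath() method for each element in this list and return their results flattened as another SelectorList.

xpath is the same argument as the one in Selector.xpath().

namespaces is an optional prefix: namespace-uri mapping (dict) for additional prefixes to those registered with register_namespace(prefix, uri). Contrary to register_namespace(), these prefixes are not saved for future calls.

Any additional named arguments can be used to pass values for XPath variables in the XPath expression, e.g.: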

selector.xpath('//a[@href=$url]', url="http://www.example.com")

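css(query: str) → SelectorList[_SelectorType]

Call the .css() method for each element in this list and return their results flattened as another SelectorList.

query is the same argument as the one in Selector.css().

jmespath(query: str, **kwargs: Any) → SelectorList[_SelectorType]

Call the .jmespath() method for each element in this list and return their results flattened as another SelectorList.

query is the same argument as the one in Selector.jmespath().

Any additional named arguments are passed to the underlying jmespath.search call, e.g.: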

selector.jmespath('author.name', options=jmespath.Options(dict_cls=collections.OrderedDict))

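getall() → List[str]

Call the .get() method for each element in this list and return their results flattened, as a list of strings.

See also: extract() and extract_first()

get(default: None = None) → str | None

get(default: str) → str

Return the result of .get() for the first element in this list. If the list is empty, return the default value.

See also: extract() and extract_first()

re(regex: str | Pattern[str], replace_entities: bool = True) → List[str]

Call the .re() method for each element in this list and return their results flattened, as a list of strings.

By default, character entity references are replaced by their corresponding character (except for &amp; and &lt;). Passing replace_entities as False switches off these replacements.

re_first(regex: str | Pattern[str], default: None = None, replace_entities: bool = True) → str | None

re_first(regex: str | Pattern[str], default: str, replace_entities: bool = True) → str

Call the .re() method for the first element in this list and return the result as a string. If the list is empty or the regex doesn't match anything, return the default value (None if the argument is not provided).

By default, character entity references are replaced by their corresponding character (except for &amp; and &lt;). Passing replace_entities as False switches off these replacements.

attrib

Return the attributes dictionary for the first element. If the list is empty, return an empty dict.

See also: Selecting element attributes.

Examples

Selector examples on HTML response

Here are some Selector examples to illustrate several concepts. In all cases, we assume there is already a Selector instantiated with an HtmlResponse object like this: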

sel = Selector(html_response)

Select all <h1> elements from an HTML response body, returning a list of Selector objects (i.e. a SelectorList object):
```
sel.xpath("//h1")
```

Extract the text of all <h1> elements from an HTML response body, returning a list of strings:

sel.xpath("//h1").getall()  # this includes the h1 tag
sel.xpath("//h1/text()").getall()  # this excludes the h1 tag

Iterate over all <p> tags and print their class attribute:

for node in sel.xpath("//p"):
    print(node.attrib["class"])

Selector examples on XML response

Here are some examples to illustrate concepts for Selector objects instantiated with an XmlResponse object:

sel = Selector(xml_response)

Select all <product> elements from an XML response body, returning a list of Selector objects (i.e. a SelectorList object):
```
sel.xpath("//product")
```

Extract all prices from a Google Base XML feed which requires registering a namespace:

sel.register_namespace("g", "http://base.google.com/ns/1.0")
sel.xpath("//g:price").getall()