Link Extractors

A link extractor is an object that extracts links from responses.

The __init__ method of LxmlLinkExtractor takes settings that determine which links may be extracted. LxmlLinkExtractor.extract_links returns a list of matching Link objects from a Response object.

Link extractors are used in CrawlSpider spiders through a set of Rule objects.

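You can also use link extractors in regular spiders. For example, you can instantiate LinkExtractor as a class variable in your spider and use it from your spider callbacks: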

from scrapy import Request

def parse(self, response):
    for link in self.link_extractor.extract_links(response):
        yield Request(link.url, callback=self.parse)

Link extractor reference

The link extractor class is scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor. For convenience it can also be imported as scrapy.linkextractors.LinkExtractor:

from scrapy.linkextractors import LinkExtractor

LxmlLinkExtractor

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=False, unique=True, process_value=None, strip=True)

LxmlLinkExtractor is the recommended link extractor, with handy filtering options. It is implemented using lxml's robust HTMLParser.

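Parameters:

allow (str or list) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be extracted. If not given (or empty), it will match all links.
deny (str or list) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It takes precedence over the allow parameter. If not given (or empty), it won't exclude any links.
allow_domains (str or list) – a single value or a list of strings containing the domains that will be considered for extracting links.
deny_domains (str or list) – a single value or a list of strings containing the domains that won't be considered for extracting links.
deny_extensions (list) –
A single value or list of strings containing extensions that should be ignored when extracting links. If not given, it defaults to scrapy.linkextractors.IGNORED_EXTENSIONS.

Changed in version 2.0: IGNORED_EXTENSIONS now includes 7z, 7zip, apk, bz2, cdr, dmg, ico, iso, tar, tar.gz, webm, and xz.
restrict_xpaths (str or list) – an XPath (or list of XPaths) that defines the regions inside the response from which links should be extracted. If given, only the text selected by those XPaths will be scanned for links.
restrict_css (str or list) – a CSS selector (or list of selectors) that defines the regions inside the response from which links should be extracted. It behaves the same way as restrict_xpaths.
restrict_text (str or list) – a single regular expression (or list of regular expressions) that the link's text must match in order to be extracted. If not given (or empty), it will match all links. If a list of regular expressions is given, the link is extracted if its text matches at least one of them.
tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').
attrs (list) – an attribute or list of attributes to consider when looking for links to extract (only for the tags specified in the tags parameter). Defaults to ('href',).
canonicalize (bool) – canonicalize each extracted URL (using `w3lib.url.canonicalize_url`). Defaults to False. Note that `canonicalize_url` is meant for duplicate checking; it can change the URL visible on the server side, so the response may differ between requests made with the canonicalized URL and with the raw URL. If you are using LinkExtractor to follow links, it is more robust to keep the default canonicalize=False.
unique (bool) – whether duplicate filtering should be applied to extracted links.
process_value (collections.abc.Callable) –
A function that receives each value extracted from the scanned tags and attributes and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.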

For example, to extract links from this code:
```
<a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
```
You can use the following function in process_value:
```
import re

def process_value(value):
    # Pull the quoted path out of the javascript: pseudo-URL.
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)
```
strip (bool) – whether to strip whitespace from extracted attributes. According to the HTML5 standard, leading and trailing whitespace must be stripped from the href attributes of <a>, <area> and many other elements, the src attribute of <img> and <iframe> elements, etc., so LinkExtractor strips whitespace by default. Set strip=False to turn this off (e.g. if you are extracting URLs from elements or attributes that allow leading/trailing whitespace).

extract_links(response: TextResponse) → list[Link]

Returns a list of Link objects from the specified response.

Only links that match the settings passed to the __init__ method of the link extractor are returned.

Duplicate links are omitted if the unique attribute is set to True; otherwise they are returned.

Link

class scrapy.link.Link(url: str, text: str = '', fragment: str = '', nofollow: bool = False)

Link objects represent a link extracted by the LinkExtractor.

The anchor tag sample below is used to illustrate the parameters:

<a href="https://example.com/nofollow.html#foo" rel="nofollow">Dont follow this one</a>

Parameters:

url – the absolute URL being linked to in the anchor tag. From the sample, this is https://example.com/nofollow.html.
text – the text in the anchor tag. From the sample, this is Dont follow this one.
fragment – the part of the URL after the hash symbol. From the sample, this is foo.
nofollow – an indication of the presence or absence of a nofollow value in the rel attribute of the anchor tag.

© Copyright Scrapy developers. Last updated on May 8, 2025.