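Decode HTML entities in Python string?

Question

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me: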

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

Best solution

Python 3.4+

HTMLParser.unescape is deprecated. It was supposed to be removed in 3.5, but was left in by mistake and was only removed later, in Python 3.9. Use html.unescape() instead:

import html
print(html.unescape('&pound;682m'))

See https://docs.python.org/3/library/html.html#html.unescape
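
As a slightly fuller sketch of what html.unescape() handles: besides named entities such as &pound;, it also decodes decimal and hexadecimal numeric character references. The sample strings below are illustrative only and do not come from the original question.

```python
import html

# The pound sign written three ways: named entity, decimal reference,
# and hexadecimal reference.
samples = ["&pound;682m", "&#163;682m", "&#xA3;682m"]
decoded = [html.unescape(s) for s in samples]
print(decoded)

# Text without any entities passes through unchanged.
plain = html.unescape("no entities here")
print(plain)

# Escaped markup characters are decoded as well.
markup = html.unescape("&lt;p&gt;Fish &amp; chips&lt;/p&gt;")
print(markup)
```

All three spellings should decode to the same string, so the function can be applied to text regardless of which entity form the source page happened to use.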

Python 2.6-3.3

You can use the HTML parser from the standard library:

>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

See http://docs.python.org/2/library/htmlparser.html

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
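
If the same code has to run on both old and new interpreters without depending on six, one possible approach is to prefer html.unescape and fall back to the parser method only when it is unavailable. This is a minimal sketch; decode_entities is a hypothetical helper name chosen for illustration, not part of any library.

```python
try:
    # Python 3.4+: module-level function.
    from html import unescape
except ImportError:
    try:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    except ImportError:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    # Fall back to the (deprecated) bound method on a parser instance.
    unescape = HTMLParser().unescape


def decode_entities(text):
    """Hypothetical helper: return text with HTML entities decoded."""
    return unescape(text)


print(decode_entities("&pound;682m"))
```

Trying html.unescape first matters because the HTMLParser.unescape method no longer exists on recent Python 3 releases, so the fallback order in the snippets above would fail there.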

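Second-best solution

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you need to pass the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities are decoded automatically.

Beautiful Soup 3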

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>",
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>

Beautiful Soup 4

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>

Third solution

You can use replace_entities from the w3lib.html library:

In [202]: from w3lib.html import replace_entities

In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'

In [204]: print replace_entities("&pound;682m")
£682m

References

Decode HTML entities in Python string?
