The lxml toolkit is a library for working with XML
and HTML from Python. It includes a utility called Cleaner, which scrubs
HTML of dangerous and unwanted elements and attributes. Although since early 2022 it has
been marked as not suitable for security purposes,
it is still used that way by many projects.
A coworker and I were recently exploring its capabilities. At one point he made a simple mistake that is extremely common in Python, and stumbled onto what I recognized as a vulnerability. Exploring the code more, I found another vulnerability, this one not dependent upon misconfiguration.
(As of this posting, the issues are not fixed and released, but patches are available.)
What Cleaner does
To provide background, let's first take a look at a typical use case. In this code block, we instruct Cleaner not to strip frames but to only allow embeds with hosts on a given allowlist, in this case a couple of popular video sites:
```python
from lxml.html.clean import Cleaner

cl = Cleaner(frames=False, host_whitelist=['www.youtube.com', 'vimeo.com'])
print(cl.clean_html('Embedded video: <iframe src="https://www.youtube.com/video"></iframe>'))
# ↪ <p>Embedded video: <iframe src="https://www.youtube.com/video"></iframe></p>
```
In that input, the iframe's src was acceptable, so it was left alone.
However, if I set the iframe src to
https://evil.com/, the output is
<p>Embedded video: </p> -- the frame element is gone entirely.
For simplicity, in this article I'll be giving host_whitelist values
and iframe src URLs and indicating whether the URL is accepted or
rejected. And if you're following along with the source code, the
method of interest is
allow_embedded_url, in the lxml version current at the time of writing.
Type confusion + slash confusion
When my coworker was playing around with Cleaner in the Python REPL, I
mentioned the risk of parser mismatch vulnerabilities,
and he started feeding in a couple of examples from Claroty and
Snyk's joint report on URL
parsers. In the
course of playing with this, he ended up with the combination of
host_whitelist=('youtube.com') and the malformed URL
https:///youtube.com/video (note the extra slash before the
hostname). This URL was allowed. But it's even worse: This
configuration also allows the URL
https:///evil.com/. (And yes,
browsers will accept those.) What's going on here?
Why this happens
The first problem here partly lies with Python's tuple syntax, which has a rather unfortunate requirement for how to write single-item tuples:
- ('foo', 'bar') evaluates to a 2-tuple
- ('foo') evaluates to a string
- ('foo',) evaluates to a 1-tuple
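A quick REPL check makes the difference between the three forms obvious:

```python
# The parentheses alone don't make a tuple; the trailing comma does.
print(type(('youtube.com', 'vimeo.com')))  # <class 'tuple'>
print(type(('youtube.com')))               # <class 'str'>
print(type(('youtube.com',)))              # <class 'tuple'>
```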
If you start with
host_whitelist=('youtube.com', 'vimeo.com') and
strip it down to
host_whitelist=('youtube.com'), you've effectively written
host_whitelist='youtube.com'. The allowlist is now a string,
not a collection. This mistake is extremely common in Python code.
A second problem is that lxml does not have type checking or coercion
for host_whitelist. Some other configuration properties do accept
either a string or a collection of strings,
but not this one.
Third is that Python's urlsplit, rather than rejecting the malformed URL, just tries its best to return something:
```python
>>> from urllib.parse import urlsplit
>>> urlsplit('https:///evil.com/more/path')
SplitResult(scheme='https', netloc='', path='/evil.com/more/path', query='', fragment='')
```
netloc is its name for the URL authority section (userinfo,
hostname, port), and here it is returned as an empty string. RFC 3986
makes it clear that when there is a
//, the authority section is
required, and within that the host is a required, non-empty
component; this should have thrown an exception or returned
None rather than an empty string.
And finally, the most popular browsers use a different URL parser,
conforming to WHATWG's alternative view of how URLs should
work. They'll treat https:///youtube.com the same as
https://youtube.com. So this triple-slash
URL is treated as a double-slash URL, but with a "validation error" --
that is then ignored.
Putting this all together, when lxml checks
netloc in self.host_whitelist in the above example, what it's actually
evaluating is '' in 'youtube.com', which is true -- the empty string
is indeed a substring of
'youtube.com', or of any string. (The
in operator is overloaded for a variety of types.)
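The whole failure mode can be reproduced with nothing but the standard library; this sketch mirrors the membership test described above:

```python
from urllib.parse import urlsplit

host_whitelist = ('youtube.com')  # oops: parentheses without a comma make a str

# urlsplit accepts the malformed triple-slash URL and returns an empty netloc
netloc = urlsplit('https:///evil.com/').netloc

print(repr(netloc))              # ''
print(netloc in host_whitelist)  # True -- '' is a substring of any string
```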
First off, this is of course low severity; it relies on a type of misconfiguration which, while easy to make, is still a misconfiguration. However, in such a situation, an attacker can simply inject an additional slash in any http/https URL they want to use and it will be accepted; the triple-slash URL will then be interpreted by the browser as having a non-empty hostname.
But in fact, Cleaner's own defaults use an empty tuple, so anyone starting from that is at risk. And using a tuple here is actually pretty common. In a few minutes I was able to find a number of projects using an empty tuple for their whitelist (1 2 3 4 5) and another using an 8-tuple. None of these is currently vulnerable, but an incautious edit to use a single host would make them vulnerable.
Additionally, the configuration is not always local to the call to
Cleaner. I found several projects
using a configurable
host_whitelist, which is theoretically fine
except someone might write the setting as
"host1, host2, host3" instead of
["host1", "host2", "host3"] and then become
vulnerable. This method of providing a list of hosts in a
comma-delimited string is common in software, and I was even able to
find a project using it with Cleaner.
Since host_whitelist is never supposed to be a string in the first
place, the easiest thing to do is to validate it in Cleaner's
constructor to always be a list.
It's tempting to instead wrap it in a list if it's a string, but
this is still dangerous. Remember, the configured value might be an
empty string; turning it into
[""] would allow URLs with an empty hostname. Best
to just deny invalid configs in the first place.
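A minimal sketch of what such a constructor check could look like (the helper name and error message here are my own invention, not lxml's API):

```python
def check_host_whitelist(host_whitelist):
    # Reject strings outright instead of coercing them: a string here is
    # almost certainly a mistyped 1-tuple, not an intentional config.
    if isinstance(host_whitelist, str):
        raise TypeError(
            "host_whitelist must be a collection of hostnames, not a string; "
            "did you mean ('example.com',)?")
    return list(host_whitelist)
```

Failing loudly at construction time surfaces the mistake immediately, instead of silently allowing attacker-controlled embeds later.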
(In Python, you can also use a linter to check for misspelled
1-tuples. This is not robust against all typos, though;
("youtube.com" "vimeo.com") would also be a single
string. However, linters are still a great idea in general.)
That covers the type confusion. What about the other two problems I identified, relating to the actual parser mismatch?
The two obvious approaches I can see are reserialization
and strict parsing. But reserialization doesn't help here because
urlsplit not only incorrectly allows an empty authority on parse,
but also recomposes it with an empty authority:
```python
>>> from urllib.parse import urlsplit, urlunsplit
>>> parts = urlsplit('https:///evil.com/path')
>>> parts.netloc == ''
True
>>> urlunsplit(parts)
'https:///evil.com/path'
```
Browsers are then happy to "fix up"
https:///... URLs in the spirit
of "garbage in, something nice out". Both deviate from the spec, and
in different ways, but in a combination that leads to a vulnerability.
The remaining option is strictness; urlsplit must raise an exception when encountering invalid URLs. However, it is unlikely the Python maintainers will do this; they've been unwilling to reject malformed URLs in the past, and urlsplit only raises exceptions in the case of certain malformed IP addresses. So there's one obvious answer left: lxml should switch to using an RFC 3986 compliant URL parser, and drop any iframes and other embedded objects with invalid URLs.
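Until then, a caller can approximate that strictness on top of urlsplit. This is my own defensive sketch, not lxml's code; it rejects any http(s) URL that parses without a real hostname, so the triple-slash URLs fall out instead of slipping through:

```python
from urllib.parse import urlsplit

def embed_url_is_allowed(url, allowed_hosts):
    parts = urlsplit(url)
    # urlsplit reports hostname=None when the authority is empty,
    # so 'https:///evil.com/' is rejected here.
    if parts.scheme not in ('http', 'https') or not parts.hostname:
        return False
    return parts.hostname in allowed_hosts

print(embed_url_is_allowed('https://youtube.com/video', ['youtube.com']))  # True
print(embed_url_is_allowed('https:///evil.com/', ['youtube.com']))         # False
```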
I made a minimal pull request to fix this vulnerability by ensuring
host_whitelist is never a string.
Incorrect authority parsing
With host_whitelist=['youtube.com'], lxml allows the URL
https://youtube.com:@evil.com/, and the browser would load a page
from evil.com.
Why this happens
In order to determine the scheme and hostname of the URL,
allow_embedded_url performs the following:
```python
scheme, netloc, path, query, fragment = urlsplit(url)
netloc = netloc.lower().split(':', 1)[0]
```
That is, it asks for the authority component and then manually parses
it. But the authority has up to four pieces and three or four
different places where colons can be found, and is not totally trivial
to parse. Let's take as an example the ugly but valid URL
https://example.com:some:stuff@[2620:0:861:ed1a::1]/. Here
are the parts:

- Userinfo: example.com:some:stuff. The structure within this isn't really part of the URL specification, and userinfo in URLs is deprecated in browser-land, but nevertheless would currently be interpreted by a browser as username example.com (split at the first colon) and password some:stuff.
- Host: [2620:0:861:ed1a::1] -- an IPv6 address with its required delimiters. Always contains colons. Let's not even get into zone-IDs.
Needless to say, many pieces of code that try to parse an authority component come to different conclusions about the boundaries, since they use handwritten parsers that often don't comply with the spec. In the case of lxml, the first colon is used as a delimiter. Not only would this break for any IPv6 address literal, it also introduces a vulnerability by getting confused by colons in the userinfo component.
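The mismatch is easy to demonstrate on the URL from this vulnerability. The "naive" line mirrors the first-colon split described above, while the hostname property reflects what a browser-compatible parser extracts:

```python
from urllib.parse import urlsplit

parts = urlsplit('https://youtube.com:@evil.com/')

# Splitting the authority at the first colon, as the vulnerable code does:
naive_host = parts.netloc.lower().split(':', 1)[0]
print(naive_host)      # youtube.com -- passes the whitelist check

# The host a browser would actually connect to:
print(parts.hostname)  # evil.com
```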
This is, as usual when it comes to URL-related vulnerabilities, an example of parser mismatch.
Because of the usefulness in phishing attacks and other naughtiness,
browser vendors have been trying for years to get people to stop using
userinfo in web URLs. They already require people to click through a
warning popup asking them if they really meant to log into
evil.com with username
youtube.com. Chromium has switched to
discarding the userinfo entirely on top-level navigation and since
version 59 blocks iframe URLs that contain it. Firefox 91 warns at the
top level, but still permits it in the iframe. (I'm guessing they'll
make this change as well at some point.)
However, the opposite seems to hold for ajax requests: Firefox blocks a userinfo-bearing XHR, but Chromium allows it (though it does at least strip out the userinfo).
So in either of these popular browsers there's the possibility (for now) of tricking the page into loading a resource from the wrong origin.
Fixing this might be as simple as switching to asking for the
hostname property of the SplitResult that
urlsplit returns. It
appears to have been available as far back as Python
2.7. And that's what
my pull request for fixing
this one uses.
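For what it's worth, the hostname property also copes with the trickier authority shapes discussed above: it strips userinfo and port, lowercases the host, and removes IPv6 brackets:

```python
from urllib.parse import urlsplit

print(urlsplit('https://user:pw@Example.COM:8080/x').hostname)  # example.com
print(urlsplit('https://[2620:0:861:ed1a::1]:8080/').hostname)  # 2620:0:861:ed1a::1
```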
- 2022-08-24: Discovered tuple pitfall
- 2022-08-25: Discovered userinfo/triple-slash vulnerability
- 2022-08-25: Asked about disclosure process on lxml mailing list
- 2022-08-26: Disclosed vulnerabilities to Stefan Behnel by email, offering to make public PRs
- 2022-08-28: Received confirmation that I can just make public PRs
- 2022-08-29: Posted PRs for review
- 2022-10-16: Published blog post