What is a parser mismatch vulnerability?
There's a class of security vulnerabilities that has gotten very little attention until recently but shows up everywhere. In the past I called these dueling parser vulnerabilities, but recently there has been more recognition of this vulnerability class, and the terms parser confusion and parser mismatch have come into use. In this post I'll be using "parser mismatch" because it is the clearest and most descriptive.
Broadly defined
A parser mismatch occurs when you have:
- Two code locations
- ...each of which tries to parse the same thing
- ...but where the parsers disagree on what some inputs mean.
In general, you'll see two kinds of behavior:
- For "normal" inputs they'll almost always agree
- For malformed inputs, they'll often disagree, creating the possibility of a vulnerability
Kinda abstract. Let's get more concrete.
Example: Almost nobody agrees on URLs
I'll start with an apparently low-stakes scenario that turns out to be quite serious.
Imagine you have a support forum for a website and you want to allow people to post comments. Before the site accepts a comment, it checks any links it contains and rejects the comment if any of the URLs aren't on an allowlist (perhaps out of concern about spam or off-topic discussions). The comment is later displayed in a browser.
Someone tries to post a link to http://example.net\@github.com. To our eyes this is clearly a malformed URL; \ isn't allowed in URLs according to RFC 3986.
Ignoring that, you would expect the parser to break off the scheme at the :, recognize // as starting a "hosted authority" and the next / as terminating it, and then split the authority at the @ to find the userinfo and host. So, you'd expect a host of github.com and a userinfo section of example.net\.
But what actually happens?
Maybe you're using Java's oldest URL parser on your backend, java.net.URL, and its getHost method. It returns github.com as you might expect. GitHub happens to be on your allowlist, so you let it through. But watch what happens in your browser: http://example.net\@github.com. If you use Firefox, Chrome, or certain other browsers, you'll end up on example.net instead of github.com.
The result is that an "attacker" can post a link that looks OK to the server (domain == "github.com") but that will go to a disallowed domain (example.net) in some major browsers.
Why?
Well, these browsers "helpfully" fix the URL by changing backslashes into regular forward slashes, I suppose because people sometimes type in URLs and get their forward and back slashes confused. With the resulting value of http://example.net/@github.com, the converted slash acts as a path separator and the host is example.net.
(You might wonder how such a blatant difference can exist. You can read up on the background of WHATWG and its alternative browser-centric URL spec, but for the purposes of understanding this vulnerability it doesn't matter; essentially all URL libraries have some kind of spec non-compliance. java.net.URL has some truly bizarre behaviors of its own. The newer java.net.URI is better and will just throw an exception if given this backslash URL, but still has some issues. And the same kinds of issues are present in all standard libraries I'm aware of, not just in Java. Go read Claroty's report on URL parser mismatches if you want to see more. The state of the industry here is an absolute train wreck.)
So: We have two code locations (one in the server, one in the browser) that use two different parsers, and which will understand some inputs differently. This is a classic parser mismatch.
Now, I mentioned that this turned out to be quite serious in practice. In fact, this exact parser mismatch made David Schütz over $12,000 in bug bounties with Google [1, 2] when one end of the mismatch was in a gatekeeper that had the power to grant access to all sorts of Google internal systems.
Note, also, that Google's solution was to try to make their server behave more like a browser, even though browsers vary in their behavior and could change at any time. Their solution was also incomplete, and David managed to bypass it repeatedly, which added up to a rather large bounty in total. (In fact, after his second post, someone else bypassed the latest fix.) In an upcoming post, I'll give some safer, more robust solutions.
Example: Sloppy header splitting
In a previous post I described a vulnerability in Imzy's image proxy, an endpoint which was supposed to allow users to embed external images in their posts without leaking their IP address to the hosting server. The proxy would fetch the indicated URL and relay the body and headers back to the browser. There were a number of vulnerabilities here, but at one point an Imzy developer tried to secure the endpoint by failing the response if the Content-Type response header did not match image/.*.
My approach to evasion was to have my proof-of-concept exploit page send a response header of Content-Type: image/foo, text/html.
Now, in HTTP, there are a number of headers that are considered multivalued, meaning they can be present multiple times in the response and can be squashed together into a single header field by comma-separating their values. Content-Type is not supposed to be multivalued, and is only supposed to contain a single media-type; splitting on commas is not only a violation of the spec but would mangle any quoted-string parameter that happened to contain a comma. (And an unquoted comma would never be present in a well-formed Content-Type.)
- The first code location, Imzy's proxy server, understood the header value of image/foo, text/html to be a single value and checked that it started with image/. Since it did, the server allowed the response through.
- The second code location, my browser, apparently treated the comma as a value separator for a multivalued header, and it took the last "value". This would be as if the server had first sent Content-Type: image/foo and then Content-Type: text/html. My proxied response was therefore treated as Content-Type: text/html, as the later value took priority, and the exploit ran.
Again, we have an input that failed to meet the spec, and two different parsers that handled it differently. The first piece OK'd the input, the second acted on it, but with a different understanding.
(It's also possible that Imzy's server instead split on commas and took the first value, image/foo. Some server software mishandles multivalued headers in this way. Same effect, though.)
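To make the two readings concrete, here's a small Java sketch. The startsWith check and the comma-splitting are simplified stand-ins for what the proxy and the browser appeared to be doing, not their actual code:

```java
public class ContentTypeMismatchDemo {
    public static void main(String[] args) {
        // The header value my exploit page's server sent back.
        String headerValue = "image/foo, text/html";

        // Guard (stand-in for Imzy's proxy): treat the header as one value
        // and check its prefix.
        boolean allowed = headerValue.startsWith("image/");
        System.out.println("Proxy allows the response: " + allowed);    // true

        // Actor (stand-in for the browser): treat the header as multivalued,
        // split on commas, and honor the last value.
        String[] values = headerValue.split(",");
        String effectiveType = values[values.length - 1].trim();
        System.out.println("Browser renders it as: " + effectiveType);  // text/html
    }
}
```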
Also note that just as with the URL examples, the first code location was acting as a guard to prevent abuse of the second code location. By sending in a malformed input, the attacker can slip an invalid or malicious input past the guard. This guard/actor pattern is very common with parser mismatch.
Example: HTTP Request Smuggling
HTTP requests can either be sent with a Content-Length header indicating up-front the exact number of bytes to read after the headers, or can specify Transfer-Encoding in order to indicate that there will be a stream of pieces, each preceded by a length. Specifying both in a single request is invalid, and it can cause a proxy and its origin server to disagree on the boundaries within a stream of HTTP requests that are all sent over the same kept-alive connection. Because of this disagreement, a chunk of one request can end up becoming part of another unrelated one, allowing hijacking of credentials if that other request was authenticated. Caches can be poisoned. Malicious requests can bypass WAFs. In short, arbitrarily bad things can happen in handling those requests.
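Just to make the framing ambiguity concrete, here's a sketch of such a request (the header values and the SMUGGLED placeholder are purely illustrative; this shows only the disagreement over where the request ends, not a working exploit):

```java
public class RequestFramingDemo {
    public static void main(String[] args) {
        // A request carrying both framing headers. Content-Length covers the
        // 13 body bytes "0\r\n\r\nSMUGGLED", so a length-based parser treats all
        // of them as this request's body. A chunked-encoding parser instead
        // stops at the terminating zero-length chunk, leaving "SMUGGLED" to be
        // read as the start of the *next* request on the same connection.
        String request =
                "POST / HTTP/1.1\r\n"
                + "Host: example.com\r\n"
                + "Content-Length: 13\r\n"
                + "Transfer-Encoding: chunked\r\n"
                + "\r\n"
                + "0\r\n"
                + "\r\n"
                + "SMUGGLED";
        System.out.print(request);
    }
}
```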
Giving a worked example or a thorough explanation is beyond the scope of this blog post, so I'll refer you to PortSwigger's article on the subject as well as the original 2005 report from Watchfire.
What I find interesting here is that the cache poisoning and auth hijacking fall outside of what I think of as the usual "guard/actor pair" pattern of parser mismatch exploits. But the cause is still the same: A disagreement on how to parse an unusual input leads to a vulnerability.
This also highlights that HTTP is something that is parsed, despite being what people would call a "protocol" rather than a "format". Don't be distracted by these categorizations. Handling HTTP requires parsing, both of the header format in general and the values of specific headers. And even once the headers are all processed into data structures (e.g. a multi-valued map), anything that consumes those parsed headers still needs to understand them in relation to each other.
Example: JSON is not a subset of Javascript
Originally, JSON was conceived of as a subset of Javascript that did not allow code execution, just data construction: strings, numbers, arrays, maps, etc. It's in the name "JavaScript Object Notation". Before JSON.parse was a part of the Javascript language, this meant that developers would sometimes "parse" JSON by checking if it was well-formed and then passing it to eval.
Except... it wasn't entirely a subset of Javascript. There were some odd quirks around Unicode characters U+2028 'LINE SEPARATOR' and U+2029 'PARAGRAPH SEPARATOR' that I ran up against, resulting in exceptions. (Those are allowed in JSON, but not JS, or at least that was the case in 2010.) But I only recently learned about a major security issue stemming from this mismatch, fixed quietly in 2008.
That post is well worth a read, but it comes down to a situation where a certain whitespace character is in the middle of what would otherwise be an escape sequence like \". In by-the-book JSON parsing the character is preserved; in Javascript, the character is stripped, and the quote now actually terminates a string. Code execution is then trivial.
Where is the guard/actor pair, here? You have to squint a bit, as it's actually "inside the parser"; json2.js presented itself as a parser, but it delegated most of the work to a different parser entirely: It ran a regular expression first to check if the JSON looked valid (the guard), and then eval'd the JSON as JS (the actor). The mismatch, here, was between the regex's assumptions about JS and what JS actually does.
Conclusions
Drawing on these examples, what can we say about the general properties of parser mismatch?
- The vulnerability requires two parsers that are intended to follow the same spec, or at least two very similar specs, but differ in practice. The "spec" may or may not be well-defined in practice, but if the two parsers are identical then it does not matter how badly they deviate from the spec, or how they fill in the gaps. Conversely, even with generally well-written parsers, a single deviation from spec in one that is not shared by the other may be enough to produce a vulnerability.
- The definition of "parser" here is looser than one might normally use. If two systems understand two data structures differently, it's still a parser mismatch in my book, just as much as if the inputs were strings.
- Exploitation is more likely with an input that violates the spec, but is accepted by both parsers anyway. Or, it may be an input that the spec does not sufficiently define the behavior for—some kind of edge case. (None of the examples provided here involve spec-conformant inputs, and though an exploit using one might be possible in some situation, it doesn't seem to be the common case.) In a sense, these are ambiguous inputs—optical illusions for software.
- The two code locations are often a guard/actor pair, where the first location controls whether the second location is executed, or more generally is "upstream" in the control flow.
Some implications:
- Following the spec, while quite important, is no guarantee.
- A parser mismatch vulnerability may sometimes be more properly described as a vulnerability in an integration rather than any particular well-defined package of code. The two locations may be in the same software package, or may be in different programming languages and maintained by different people.
- Consequently, looking at any one piece of code will not definitively reveal a parser mismatch vulnerability, and can at best suggest the possibility of one.
- Exploiting a parser mismatch is a bit like programming a weird machine, as the attacker needs to reverse engineer the parsers to find out what they "implement" in practice, rather than what they were intended to implement.
- Following the "be liberal in what you accept" part of Postel's Law dramatically increases the likelihood of a parser mismatch vulnerability, since it encourages parsers to widen what they accept in an uncontrolled and uncoordinated fashion.
- Any standard that introduces a grammar should be combed for underspecified edge cases. Grammars should be made as simple and composable as possible in order to reduce the chances of disagreement between implementations. I've seen a few RFCs with a "Security Considerations" section that simply says "no security implications" after having introduced a grammar, and I feel like at this point we should regard those as being in conflict.
Relationship to other vulnerability classes
A quick note on two other vulnerability classes that I feel have some kinship to parser mismatches, despite being very different in other ways:
- Time-of-check to time-of-use (TOC/TOU) is another "two locations" class of vulnerability where the relationship of a guard and an actor is poorly controlled. But instead of being exploitable by an ambiguous input, the issue is that something in an external context is mutable and actually changes between the times the two code locations run. I don't have a lot of familiarity with these, but I believe the code locations in TOC/TOU are usually quite close together, which is another important distinction.
- Injection attacks of various sorts (SQLi, XSS, etc.) are only related insofar as they center on parsing; they're possible when the shape of the parse tree can be controlled or altered by the attacker. Sometimes there is synergy, though, such as when parser mismatch allows bypass of the XSS auditor or some other attempted mitigation.
You might also see some subclasses of parser mismatch with their own name, just like injection vulnerabilities are broken out into XSS, SQLi, HTTP header splitting, and so on. The only named subclass of parser mismatch I've seen so far is "URL parser confusion", but we'll likely see others soon. I predict that sooner or later we'll have a name for bad handling of multi-valued HTTP headers, too, but I haven't seen one yet.
Next up: What to do about it
I hope by now you have a solid understanding of what a parser mismatch is and will be more prepared to recognize one while reviewing code or trying to decide how to patch a vulnerability.
But recognition is only half the battle—how will you prevent them, or even fix ones that you discover? These mismatches are slippery and involve multiple systems, and digging into parser internals and spec edge cases can be brutal work.
That topic deserves a post of its own, so continue on to part 2: Preventing (and fixing) parser mismatch vulnerabilities.
Updates
- 2022-04-16: Added optical illusion comparison, and discussion of protocols vs formats.
- 2022-04-29: Link to followup post.
- 2023-07-29: Add example about JSON.parse.