What is a parser mismatch vulnerability?

There's a class of security vulnerabilities that has gotten very little attention until recently but shows up everywhere. In the past I called these dueling parser vulnerabilities, but recently there has been more recognition of this vulnerability class, and the terms parser confusion and parser mismatch have come into use. In this post I'll be using "parser mismatch" because it is the clearest and most descriptive.

[Image: an optical illusion, a drawing that can be seen as either a rabbit or a duck]
Parser mismatches: Optical illusions for software

Broadly defined

A parser mismatch occurs when you have:

  • Two code locations
  • ...each of which tries to parse the same thing
  • ...but where the parsers disagree on what some inputs mean.

In general, you'll see two kinds of behavior:

  • For "normal" inputs they'll almost always agree
  • For malformed inputs, they'll often disagree, creating the possibility of a vulnerability

Kinda abstract. Let's get more concrete.

Example: Almost nobody agrees on URLs

I'll start with an apparently low-stakes scenario that turns out to be quite serious.

Imagine you have a support forum for a website and you want to allow people to post comments. Before the site accepts a comment, it checks any links it contains and rejects the comment if any of the URLs aren't on an allowlist (perhaps out of concern about spam or off-topic discussions). The comment is then later displayed in a browser.

Someone tries to post a link to http://example.net\@github.com. To our eyes this is clearly a malformed URL; \ isn't allowed in URLs according to RFC 3986. Ignoring that, you would expect the parser to break off the scheme at the :, recognize // as starting a "hosted authority" and the next / as terminating it, and then split the authority at the @ to find the userinfo and host. So, you'd expect a host of github.com and a userinfo section of example.net\.

But what actually happens?

Maybe you're using Java's oldest URL parser on your backend, java.net.URL, and its getHost method. It returns github.com as you might expect. GitHub happens to be on your allowlist, so you let it through. But watch what happens in your browser: http://example.net\@github.com. If you use Firefox, Chrome, or certain other browsers, you'll end up on example.net instead of github.com. The result is that an "attacker" can post a link that looks OK to the server (domain == "github.com") but that will go to a disallowed domain (example.net) in some major browsers.


Well, these browsers "helpfully" fix the URL by changing backslashes into regular forward slashes, I suppose because people sometimes type in URLs and get their forward and back slashes confused. With the resulting value of http://example.net/@github.com, the converted slash acts as a path separator, cutting off the authority early, and the host is example.net.
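The same kind of split is easy to reproduce outside of Java. Here's a sketch using Python's urllib.parse as a stand-in for java.net.URL (I'm assuming current CPython behavior, which, like most standard libraries, accepts this malformed URL rather than rejecting it):

```python
from urllib.parse import urlsplit

# A malformed URL: backslash is not a legal URL character per RFC 3986
url = "http://example.net\\@github.com"

parts = urlsplit(url)
# urlsplit treats everything up to the first /, ?, or # as the authority,
# then takes the host from after the last @ -- so it reports github.com
print(parts.hostname)  # github.com

# A WHATWG-following browser instead rewrites \ to /, ending the authority
# early, and navigates to example.net -- the other half of the mismatch.
```

An allowlist check built on this parser would approve the link even though the browser sends the user somewhere else entirely.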

(You might wonder how such a blatant difference can exist. You can read up on the background of WHATWG and its alternative browser-centric URL spec, but for the purposes of understanding this vulnerability it doesn't matter; essentially all URL libraries have some kind of spec non-compliance. java.net.URL has some truly bizarre behaviors of its own. The newer java.net.URI is better and will just throw an exception if given this backslash URL, but still has some issues. And the same kinds of issues are present in all standard libraries I'm aware of, not just in Java. Go read Claroty's report on URL parser mismatches if you want to see more. The state of the industry here is an absolute train wreck.)

So: We have two code locations (one in the server, one in the browser) that use two different parsers, and which will understand some inputs differently. This is a classic parser mismatch.

Now, I mentioned that this turned out to be quite serious in practice. In fact, this exact parser mismatch made David Schütz over $12,000 in bug bounties with Google [1, 2] when one end of the mismatch was in a gatekeeper that had the power to grant access to all sorts of Google internal systems.

Note, also, that Google's solution was to try to make their server behave more like a browser, even though browsers vary in their behavior and could change at any time. Their solution was also incomplete and David managed to bypass it repeatedly, which added up to a rather large bounty in total. (In fact, after his second post, someone else bypassed the latest fix.) In an upcoming post, I'll give some safer, more robust solutions.

Example: Sloppy header splitting

In a previous post I described a vulnerability in Imzy's image proxy, an endpoint which was supposed to allow users to embed external images in their posts without leaking their IP address to the hosting server. The proxy would fetch the indicated URL and relay the body and headers back to the browser. There were a number of vulnerabilities here, but at one point an Imzy developer tried to secure the endpoint by failing the response if the Content-Type response header did not match image/.*.

My approach to evasion was to have my proof-of-concept exploit page send a response header of Content-Type: image/foo, text/html.

Now, in HTTP, a number of headers are considered multivalued, meaning they can be present multiple times in the response and can be squashed together into a single header field by comma-separating their values. Content-Type is not one of these; it is only supposed to contain a single media-type. Splitting it on commas is not only a violation of the spec but would mangle any quoted-string parameter that happened to contain a comma. (And an unquoted comma would never be present in a well-formed Content-Type.)

  • The first code location, Imzy's proxy server, understood the header value of image/foo, text/html to be a single value and checked that it started with image/. Since it did, the server allowed the response through.
  • The second code location, my browser, apparently treated the comma as a value separator for a multivalued header, and it took the last "value". This would be as if the server had first sent Content-Type: image/foo and then Content-Type: text/html. My proxied response was therefore treated as Content-Type: text/html, as the later value took priority, and the exploit ran.

Again, we have an input that failed to meet the spec, and two different parsers that handled it differently. The first piece OK'd the input, the second acted on it, but with a different understanding.

(It's also possible that Imzy's server instead split on commas and took the first value, image/foo. Some server software mishandles multivalued headers in this way. Same effect, though.)
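The two interpretations can be sketched in a few lines. (The function names here are hypothetical; the real logic lived in Imzy's proxy and in the browser's header handling.)

```python
def proxy_guard(content_type: str) -> bool:
    # Imzy's check, roughly: treat the header as a single media type
    # and verify it starts with image/
    return content_type.startswith("image/")

def browser_effective_type(content_type: str) -> str:
    # The browser's apparent behavior: treat the comma as a
    # multivalued-header separator and let the last "value" win
    return content_type.split(",")[-1].strip()

ct = "image/foo, text/html"
assert proxy_guard(ct)                            # guard: "it's an image"
assert browser_effective_type(ct) == "text/html"  # actor: "it's HTML"
```

One string, two parsers, two conclusions: exactly the shape of a parser mismatch.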

Also note that just as with the URL examples, the first code location was acting as a guard to prevent abuse of the second code location. By sending a malformed input, the attacker can slip something invalid or malicious past the guard. This guard/actor pattern is very common with parser mismatch.

Example: HTTP Request Smuggling

HTTP requests can either be sent with a Content-Length header indicating up-front the exact number of bytes to read after the headers, or can specify Transfer-Encoding in order to indicate that there will be a stream of pieces, each preceded by a length. Specifying both in a single request is invalid, and it can cause a proxy and its origin server to disagree on the boundaries within a stream of HTTP requests that are all sent over the same kept-alive connection. Because of this disagreement, a chunk of one request can end up becoming part of another, unrelated one, allowing hijacking of credentials if that other request was authenticated. Caches can be poisoned. Malicious requests can bypass WAFs. In short, arbitrarily bad things can happen in handling those requests.
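To make the disagreement concrete, here's a toy sketch of the classic CL.TE case, with deliberately naive framing logic (real servers are far more careful): one parser frames the body by Content-Length, the other by chunked encoding, and they disagree on where the next request begins.

```python
def split_headers(raw: bytes):
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    return headers, body

def frame_by_content_length(raw: bytes):
    headers, rest = split_headers(raw)
    n = int(headers[b"content-length"])
    return rest[:n], rest[n:]          # (body, leftover = "next request")

def frame_by_chunked(raw: bytes):
    _, rest = split_headers(raw)
    body = b""
    while True:
        size_line, _, rest = rest.partition(b"\r\n")
        size = int(size_line, 16)
        if size == 0:                  # zero-length chunk ends the body
            return body, rest.partition(b"\r\n")[2]
        body += rest[:size]
        rest = rest[size + 2:]         # skip chunk data plus trailing CRLF

raw = (b"POST / HTTP/1.1\r\n"
       b"Content-Length: 13\r\n"
       b"Transfer-Encoding: chunked\r\n"
       b"\r\n"
       b"0\r\n\r\n"
       b"SMUGGLED")

# A Content-Length parser consumes all 13 body bytes, leaving nothing...
assert frame_by_content_length(raw) == (b"0\r\n\r\nSMUGGLED", b"")
# ...but a chunked parser stops at the zero chunk, leaving bytes behind
# that it will treat as the start of the *next* request on the connection.
assert frame_by_chunked(raw) == (b"", b"SMUGGLED")
```

If a front-end proxy frames by Content-Length and the origin server frames by chunked encoding, the attacker-controlled leftover bytes get prepended to whatever request arrives next.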

Giving a worked example or a thorough explanation is beyond the scope of this blog post, so I'll refer you to PortSwigger's article on the subject as well as the original 2005 report from Watchfire.

What I find interesting here is that the cache poisoning and auth hijacking fall outside of what I think of as the usual "guard/actor pair" pattern of parser mismatch exploits. But the cause is still the same: A disagreement on how to parse an unusual input leads to a vulnerability.

This also highlights that HTTP is something that is parsed, despite being what people would call a "protocol" rather than a "format". Don't be distracted by these categorizations. Handling HTTP requires parsing, both of the header format in general and the values of specific headers. And even once the headers are all processed into data structures (e.g. a multi-valued map), anything that consumes those parsed headers still needs to understand them in relation to each other.

More recently, SMTP smuggling was described as well.

Example: JSON is not a subset of Javascript

Originally, JSON was conceived of as a subset of Javascript that did not allow code execution, just data construction: Strings, numbers, arrays, maps, etc. It's in the name "JavaScript Object Notation". Before JSON.parse was a part of the Javascript language, this meant that developers would sometimes "parse" JSON by checking if it was well-formed and then passing it to eval.

Except... it wasn't entirely a subset of Javascript. There were some odd quirks around Unicode characters U+2028 'LINE SEPARATOR' and U+2029 'PARAGRAPH SEPARATOR' that I ran up against, resulting in exceptions. (Those are allowed in JSON, but not JS, or at least that was the case in 2010.) But I only recently learned about a major security issue stemming from this mismatch, fixed quietly in 2008.
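Here's a small demonstration of the U+2028 quirk, using Python's json module as a stand-in for a by-the-book JSON parser (modern JavaScript engines have since fixed this on their side):

```python
import json

# A raw U+2028 LINE SEPARATOR inside a JSON string
doc = '["before\u2028after"]'

# Per the JSON spec, an unescaped U+2028 is perfectly legal in a string
parsed = json.loads(doc)
assert parsed == ["before\u2028after"]

# But in pre-ES2019 JavaScript, U+2028 was a *line terminator*, so
# eval'ing the same document as JS source threw a SyntaxError --
# valid JSON, invalid JavaScript.
```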

That post is well worth a read, but it comes down to a situation where a certain whitespace character is in the middle of what would otherwise be an escape sequence like \". In by-the-book JSON parsing the character is preserved; in Javascript, the character is stripped, and the quote now actually terminates a string. Code execution is then trivial.

Where is the guard/actor pair, here? You have to squint a bit, as it's actually "inside the parser"; json2.js presented itself as a parser, but it delegated most of the work to a different parser entirely: It ran a regular expression first to check if the JSON looked valid (the guard), and then eval'd the JSON as JS (the actor). The mismatch, here, was between the regex's assumptions about JS and what JS actually does.
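The pattern looks something like this sketch (Python standing in for JavaScript, and with a much cruder guard regex than json2.js actually used; the point is only the shape: a regex guard in front of an eval actor):

```python
import re

def naive_parse(text: str):
    # Guard: strip out string literals, then reject anything left over
    # that doesn't look like pure data syntax.  (Crude; json2.js's real
    # regexes were subtler, but the idea is the same.)
    stripped = re.sub(r'"(\\.|[^"\\])*"', '', text)
    if re.search(r'[^\s\[\]{},:0-9.eE+\-]', stripped):
        raise ValueError("doesn't look like JSON")
    # Actor: a completely different parser with its own ideas
    return eval(text)

assert naive_parse('[1, 2, 3]') == [1, 2, 3]

try:
    naive_parse('__import__("os")')  # guard catches the obvious attack...
except ValueError:
    pass
# ...but any input the guard's regex misjudges goes straight to eval.
```

The vulnerability lives in the gap between what the regex believes about the eval'd language and what that language actually does.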


Drawing on these examples, what can we say about the general properties of parser mismatch?

  • The vulnerability requires two parsers that are intended to follow the same spec, or at least two very similar specs, but that differ in practice. The "spec" may or may not be well-defined, but if the two parsers are identical then it does not matter how badly they deviate from it, or how they fill in the gaps. Even with generally well-written parsers, though, a single deviation from spec in one that is not shared by the other may be enough to produce a vulnerability.
  • The definition of "parser" here is looser than one might normally use. If two systems understand two data structures differently, it's still a parser mismatch in my book, just as much as if the inputs were strings.
  • Exploitation is more likely with an input that violates the spec, but is accepted by both parsers anyway. Or, it may be an input that the spec does not sufficiently define the behavior for—some kind of edge case. (None of the examples provided here involve spec-conformant inputs, and though an exploit using one might be possible in some situation, it doesn't seem to be the common case.) In a sense, these are ambiguous inputs—optical illusions for software.
  • The two code locations are often a guard/actor pair, where the first location controls whether the second location is executed, or more generally is "upstream" in the control flow.

Some implications:

  • Following the spec, while quite important, is no guarantee.
  • A parser mismatch vulnerability may sometimes be more properly described as a vulnerability in an integration rather than any particular well-defined package of code. The two locations may be in the same software package, or may be in different programming languages and maintained by different people.
  • Consequently, looking at any one piece of code will not definitively reveal a parser mismatch vulnerability, and can at best suggest the possibility of one.
  • Exploiting a parser mismatch is a bit like programming a weird machine, as the attacker needs to reverse engineer the parsers to find out what they "implement" in practice, rather than what they were intended to implement.
  • Following the "be liberal in what you accept" part of Postel's Law dramatically increases the likelihood of a parser mismatch vulnerability, since it encourages parsers to widen what they accept in an uncontrolled and uncoordinated fashion.
  • Any standard that introduces a grammar should be combed for underspecified edge cases. Grammars should be made as simple and composable as possible in order to reduce the chances of disagreement between implementations. I've seen a few RFCs with a "Security Considerations" section that simply says "no security implications" after having introduced a grammar, and I feel like at this point we should regard those as being in conflict.

Relationship to other vulnerability classes

A quick note on two other vulnerability classes that I feel have some kinship to parser mismatches, despite being very different in other ways:

  • Time-of-check to time-of-use (TOCTOU) is another "two locations" class of vulnerability where the relationship of a guard and an actor is poorly controlled. But instead of being exploitable by an ambiguous input, the issue is that something in an external context is mutable and actually changes between the times the two code locations run. I don't have a lot of familiarity with these, but I believe the code locations in TOCTOU are usually quite close together, which is another important distinction.
  • Injection attacks of various sorts (SQLi, XSS, etc.) are only related insofar as they center on parsing; they're possible when the shape of the parse tree can be controlled or altered by the attacker. Sometimes there is synergy, though, such as when parser mismatch allows bypass of the XSS auditor or some other attempted mitigation.

You might also see some subclasses of parser mismatch with their own name, just like injection vulnerabilities are broken out into XSS, SQLi, HTTP header splitting, and so on. The only named subclass of parser mismatch I've seen so far is "URL parser confusion", but we'll likely see others soon. I predict that sooner or later we'll have a name for bad handling of multivalued HTTP headers as well, but I haven't seen one yet.

Next up: What to do about it

I hope by now you have a solid understanding of what a parser mismatch is and will be more prepared to recognize one while reviewing code or trying to decide how to patch a vulnerability.

But recognition is only half the battle—how will you prevent them, or even fix ones that you discover? These mismatches are slippery and involve multiple systems, and digging into parser internals and spec edge cases can be brutal work.

That topic deserves a post of its own, so continue on to part 2: Preventing (and fixing) parser mismatch vulnerabilities.


  • 2022-04-16: Added optical illusion comparison, and discussion of protocols vs formats.
  • 2022-04-29: Linked to followup post.
  • 2023-07-29: Added example about JSON.parse.
  • 2023-12-21: Added link to SMTP smuggling article.

Responses: 3 so far

  1. Verisimilitude says:

    Firstly, I believe this and other classes of flaw to be caused by poor design and lack of foresight. The URL, as explained, isn't at all appropriate as a data structure, which means it fails in its one purpose. The proper solution is to have some part of the program understand it, to then pass around some much more specific data structure and, while this is a variant of using only one parser, I feel this obvious consequence almost unworthy of mention. This method should be preferred in most cases. JSON and these other silly, little, poorly-defined ``standards'' fall into that same garbage bucket.

    BitTorrent is a poor example, because its Bencoding is very strictly defined to avoid such problems.

    I found it poor how the efficiency of the reserializing approach wasn't mentioned, as a downside. I also agree with use of the strict approach for handling this. With apologies to Theodore Kaczynski:

    Postel's Law and its consequences have been a disaster for software development.

    A stronger subset of the strict approach is to accept but a very small amount of some such standard. An example with URLs would be avoiding the whole nonsense with trying to parse them and using string matching instead, since usernames and whatnot in them are just a bad idea; then, to handle thwarting this, by using usernames and the like to defeat simple string matching, simply reject all such URLs. Further, this is the only sane way to handle particularly complicated and poorly-designed standards.

    I believe the guard and actor pair to be fine like anything else, when proven correct, and the issue is in how this is rarely done. It requires a total understanding of the system, which will never be available to someone dabbling in WWW nonsense. This pair probably shouldn't be used in any company.

    In general, parsing is stupid, and all software should be defined without it. A program should pass a real data structure to another program, and nowhere near the same issues occur when done properly.

  2. Tim McCormack says:

    Parsing is unfortunately unavoidable -- any time there is a process boundary, data has to be serialized and deserialized. We're basically stuck with it.

    Agreed that URLs are not a great way of passing data around. I suspect it's another case of a format where the needs of developers have grown beyond what was originally imagined for the format. (Even with that, I'm still a little mystified as to why the spec does not define a canonical way of handling repeated query parameters -- a reliable source of vulnerabilities.)

    Good point about reserialization having performance implications. I can add a mention of that to the other piece.

  3. Verisimilitude says:

    I'd disagree on the necessity of parsing but concede it depends on how we define the word, which has an etymology in PARS PARTIS, the Latin word from which we get ``part'', and to parse is to divide an item into parts. We can regardless narrow down on different types of parsing, and when passing what I call numerical formats (commonly called binary formats), we generally needn't allow flexibility in form and the like. As a very simple example, I've written about this numerical format for integers: http://verisimilitudes.net/2023-10-10

    My purpose in defining the format was to abandon all convention and to think about something I could see in use indefinitely for binary automatic computers. It's clear how every bit pattern can have a meaning, which is fundamentally different from how most parsers operate; it eliminates a category of flaw, even though something that can be called parsing is present; most parsing should be like this.

Self-service commenting is not yet reimplemented after the Wordpress migration, sorry! For now, you can respond by email; please indicate whether you're OK with having your response posted publicly (and if so, under what name).