Vulnerability Development mailing list archives
Re: Canonicalization and apache/PHP-attacks
From: "Sverre H. Huseby" <shh () thathost com>
Date: Tue, 27 Sep 2005 02:14:56 +0200
Everything is about data and logic: data passing through some program code. Data typically passes through several layers of logic, written by different programmers, often working for different companies or organizations. These programmers do not always agree on how the data should be parsed. Or, to be more correct, they do not think about how the other programmers may have chosen to parse the data. When data means one thing to one part of the code, and another thing to another part of the code, the problems you are concerned about start showing up.

I don't use "canonicalization" as a description of the problem, because canonicalization is supposed to be the (or one possible) solution to the problem of "Different Interpretation of Characters or Byte Sequences". Or something like that. Let's just call the problem DIoCoBS for the rest of this text. (A related problem is "Incompatible Parameter Parsing" [0]. I'm sure some clever person will be able to unify those problems.)

There are two main players in the typical DIoCoBS scenario:

  A: A filter trying to make sure some data will be interpreted in a
     reasonable way.

  B: Some piece of logic operating on the filtered data. This logic
     will fail if the data are not properly filtered.

The DIoCoBS problem occurs when A tries to make sure that B will not screw things up, but neglects to realize exactly how B will interpret the data.

Simple example: Let's say that A wants to prevent directory traversal (much like your example) by detecting the presence of the path separator character. The programmer of A is a typical Unix guy, so he creates a filter that looks for the character '/' and refuses to hand the data over to B if any such character is found, because B would then certainly have to deal with directories (screw things up). Now, let's say that the code for B is actually running on Windows, on which both '/' and '\' work as path separators.
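A minimal Python sketch of this mismatch (the function names filter_a and logic_b are made up for illustration; the standard ntpath module applies Windows path rules on any platform):

```python
import ntpath  # Windows path semantics, usable on any platform

def filter_a(filename):
    # The filter (A): a Unix-minded traversal check.
    # It only knows about '/', the Unix path separator.
    if "/" in filename:
        raise ValueError("path separator detected")
    return filename

def logic_b(filename):
    # The target logic (B), running under Windows path rules,
    # where both '/' and '\' separate path components.
    return ntpath.normpath("C:\\webroot\\" + filename)

# The filter rejects the obvious attack...
try:
    filter_a("../secret.txt")
except ValueError:
    print("slash variant blocked")

# ...but passes the backslash variant, which B treats the same way.
bypass = filter_a("..\\secret.txt")
print(logic_b(bypass))  # C:\secret.txt -- outside the web root
```

The point of ntpath here is just to get Windows semantics reproducibly; the real B could be any file-handling code on the target system.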
An attacker would clearly be able to bypass the filter (A) by using '\' rather than '/' as a path separator, because B (the target code) would treat both characters as the same thing, while A (the filter) would not.

[ Canonicalization means (in my world) finding a representation that all parts of the system agree upon (like: "there's only one path separator, and it is 'foo'"). Canonicalization should happen before any part of the system starts interpreting the data. It's often difficult, and sometimes impossible, to perform canonicalization. But it would solve the problem. ]

The above, simple example is often not exploitable in practice, because most programmers realize how the directly following code will deal with the data. In practice, however, there are often several layers of data-modifying code between A (the filter) and B (the target logic), and then the programmer of A tends to fail in his logic. (In practice there are typically layers before A as well, making it even harder to realize what data actually hit A in the first place.)

You bring up a couple of examples yourself. In a modern web setting, there will typically be some web server component before A. The web server component is typically responsible for doing the URL decoding stuff, i.e. decoding all occurrences of %xx into single bytes. And there may be some logic between A and B as well.

In one of your examples you mention "%252F". This reminds me of the "MS IIS Double Decode Bug" [1,2,3] from 2001. MS IIS didn't want people to "../.." or "..\.." out of the "scripts" directory, so it contained tests looking for "../" and "..\" (I don't know the details). Now, the infamous worms known as Code Red II and Nimda spread happily using stuff like "..%252f" to do directory traversal.
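In Python terms, this kind of double decoding looks roughly like the following sketch (filter_a is a hypothetical stand-in for whatever check IIS actually performed; I haven't seen the code):

```python
from urllib.parse import unquote

def filter_a(path):
    # Hypothetical filter (A): reject "../" and "..\" in the path.
    if "../" in path or "..\\" in path:
        raise ValueError("traversal attempt detected")
    return path

raw = "..%252fsecret"

once = unquote(raw)        # the web server decodes once: '..%2fsecret'
filtered = filter_a(once)  # the filter sees no '../' and passes it on
twice = unquote(filtered)  # a buggy second decode: '../secret'

print(twice)  # '../secret' -- a real path separator reaches B
```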
Let's see one possible explanation of how they could do that (and as I said, I don't know the details, but I understand the principles, and the following four points are one possible explanation of the cause of the problem in this closed-source software):

1. The web server decodes "..%252f" into "..%2f", as "%25" is a URL
   encoded '%' (25 hex is the ASCII code for a percentage sign).

2. The filter (A) kicks in, and looks for "../" and "..\", of which
   it finds none.

3. Some intermediate, buggy logic runs: An unfortunate programmer
   once scratched his head and wondered about this URL decoding
   stuff, and added another round of it "just in case". Unlucky
   bastard. Now "..%2f" turns into "../", _after_ the filter (A).

4. Part of the URL is passed on to the OS (B in this case) for
   execution. The OS deals with the '/' as a path separator.

In this case there was one step before A, and one (misplaced, buggy, stupid, mayhem-causing) step between A and B. The one writing A couldn't know that some poor fellow messed it all up in step 3.

You also mention "..%C0%AF". This reminds me of the "MS IIS Unicode Bug" [4] from 2000. Let's look at my interpretation of what happened here (and remember my standard disclaimer: I haven't seen the code, but my speculations give a plausible explanation, IMHO):

1. The web server decodes "%C0%AF" into the matching two-byte
   sequence.

2. The filter (A) kicks in, and looks for "../" and "..\", of which
   it finds none. The filter probably works in the western European
   character set ISO-8859-1, or some Microsoft derivative thereof
   (let's just say "one byte per character" in order not to offend
   anyone living outside Europe), and treats the two bytes as two
   distinct characters.

3. Part of the URL is passed on to the OS (B in this case) for
   execution. The file system code of the OS doesn't work with
   ISO-8859-1, but rather with UTF-8 (one of many representations of
   Unicode), in which several bytes may be combined into single
   characters.
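A toy decoder makes the problem in step 3 concrete. In UTF-8, a lead byte of the form 110xxxxx takes bits from one following byte; a decoder that skips the minimal-length check will happily turn the two bytes C0 AF into a slash. This is my sketch of that kind of lenient decoding, not the actual OS code:

```python
def lenient_utf8_decode(data):
    # A deliberately lenient decoder (two-byte sequences at most),
    # sketching what the vulnerable file system code may have done.
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                        # plain ASCII byte
            out.append(chr(b))
            i += 1
        elif b & 0xE0 == 0xC0:              # lead byte 110xxxxx
            cp = ((b & 0x1F) << 6) | (data[i + 1] & 0x3F)
            out.append(chr(cp))             # no overlong check here!
            i += 2
        else:
            raise ValueError("longer sequences left out of this sketch")
    return "".join(out)

# The sequence C0 AF decodes to '/' (code point 2F hex)...
print(lenient_utf8_decode(b"..\xc0\xaf"))   # '../'

# ...while a strict decoder rejects that form outright.
try:
    b"..\xc0\xaf".decode("utf-8")
except UnicodeDecodeError:
    print("strict decoder refuses it")
```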
In particular, bytes with the high bit (bit 7) set are combined with one or more following bytes to produce single characters. Following this UTF-8 system [5], those two bytes (C0 AF), combined, decode into a slash. This is an "overlong" representation of the slash, because the slash could just as well have been represented by a single byte with the plain, one-byte hex value 2F. Mayhem again, and this time because the file system code of the OS allowed overlong character sequences, thus opening up the possibility of representing the same character in multiple ways. The opposite of canonicalization, sort of.

Now, our first example worked because some programmer added a faulty second URL decoding inside MS IIS. There's no reason to believe that this attack should work on other web servers, or even more recent versions of IIS, although it might. After all, I found stupid and compatible URL decoding bugs in both Tomcat [6] and BEA WebLogic [7] back in 2001. What I mean to say is: Don't expect the "%252f" thing to work elsewhere. If it does, you have probably spotted a new bug, or you run on outdated Microsoftware.

The second example worked because the file system code of the target OS accepted path components in which overlong UTF-8 sequences were allowed, despite those overlong sequences being discouraged by best practice documents. You shouldn't expect to find that behavior elsewhere either. By all means, check it out (maybe you're lucky), but do not expect it to work unless the target code is the same as the target code of the exploit published back in 2000.

[ Due to my own experiments back then, I have reason to believe that Microsoft fixed the bug by patching IIS (A) rather than the OS (B). If I'm correct, the bug may show up again if you use another web server on the same target OS. (Disclaimer again: My tests may have been wrong (I don't remember the details), I may remember wrongly, and so on.)
]

The point is: All those DIoCoBS (I made that word up, remember) problems depend heavily on multiple parts of the setup. One cannot read a list of "canonicalization problem byte sequences" and expect them to work everywhere, although some books may give you that impression. Most of the time none of the sequences will work, and there's no reason to be surprised about that. After all, admins update their systems, and programmers (slowly) learn from past mistakes.

To exploit DIoCoBS problems, one will need to know how data pass through the logic, and how different parts of the logic will interpret the data differently. When you find that A will pass through something that will make B misbehave, you have an exploit. To do that, you need to know (or make an educated guess at) what happens before A, what happens between A and B, and even what happens inside B.

Since you have the source code of both Apache and PHP (both before A in your case), your remaining problem (after reading the source) is figuring out what Windows 95 (B) does with whatever it receives from the layers above.

Sverre.

0: http://shh.thathost.com/text/incompatible-parameter-parsing.txt
1: http://www.nsfocus.com/english/homepage/sa01-02.htm
2: http://www.cert.org/advisories/CA-2001-12.html
3: http://www.microsoft.com/technet/security/bulletin/MS01-026.mspx
4: http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx
5: http://www.cl.cam.ac.uk/%7emgk25/unicode.html
6: http://shh.thathost.com/secadv/2001-03-29-tomcat.txt
7: http://shh.thathost.com/secadv/2001-03-28-weblogic.txt

--
shh () thathost com
My web security book: Innocent Code
http://shh.thathost.com/
http://innocentcode.thathost.com/
Current thread:
- Canonicalization and apache/PHP-attacks tapio_niemela1 (Sep 26)
- Re: Canonicalization and apache/PHP-attacks Sverre H. Huseby (Sep 27)