Vulnerability Development mailing list archives

Re: Canonicalization and apache/PHP-attacks


From: "Sverre H. Huseby" <shh () thathost com>
Date: Tue, 27 Sep 2005 02:14:56 +0200

Everything is about data and logic: Data passing through some program
code.  Data typically passes through several layers of logic, written
by different programmers, often working for different companies or
organizations.  These programmers do not always agree on how the data
should be parsed.  Or to be more correct, they do not think about how
the other programmers may have chosen to parse the data.  When data
means one thing to one part of the code, and another thing to another
part of the code, the problems you are concerned about start showing
up.

I don't use "canonicalization" as a description of the problem,
because canonicalization is supposed to be the (or one possible)
solution to the problem of "Different Interpretation of Characters or
Byte Sequences".  Or something like that.  Let's just call the problem
DIoCoBS for the rest of this text.  (A related problem is
"Incompatible Parameter Parsing" [0].  I'm sure some clever person
will be able to unify those problems.)

There are two main players to the typical DIoCoBS scenario:

  A: A filter trying to make sure some data will be interpreted in a
     reasonable way.

  B: Some piece of logic operating on the filtered data.  This logic
     will fail if the data are not properly filtered.

The DIoCoBS problem occurs when A tries to make sure that B will not
screw things up, but neglects to realize exactly how B will interpret
the data.

Simple example: Let's say that A wants to prevent directory traversal
(much like your example) by detecting the presence of the path
separator character.  The programmer of A is a typical Unix guy, so he
creates a filter that looks for the character '/' and refuses to hand
the data over to B if any such character is found, because B would
then certainly have to deal with directories (screw things up).  Now,
let's say that the code for B is actually running on Windows, on which
both '/' and '\' work as path separators.  An attacker would clearly
be able to bypass the filter (A) by using '\' rather than '/' as a
path separator, because B (the target code) would treat both
characters as the same thing, while A (the filter) would not.
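
To make that concrete, here's a minimal Python sketch of such a broken
pair.  The function names and the base directory are made up for
illustration; the only point is that A and B disagree on what a path
separator is:

    def filter_a(filename):
        # A: written by a Unix guy, so only '/' counts as a separator.
        if "/" in filename:
            raise ValueError("path separator found, refusing")
        return filename

    def target_b(filename):
        # B: runs on Windows, where open() treats both '/' and '\'
        # as path separators.
        with open("C:/data/" + filename, "rb") as f:
            return f.read()

    # filter_a("../secret.txt") is rejected, but
    # filter_a("..\\secret.txt") slips through, and target_b() then
    # reads C:/secret.txt -- outside the intended directory.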

[ Canonicalization means (in my world) to find a representation that all
  parts of the system agree upon (like: "there's only one path
  separator, and it is 'foo'").  Canonicalization should happen before
  any part of the system starts interpreting the data.  It's often
  difficult, and sometimes impossible, to perform canonicalization.
  But it would solve the problem.  ]
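
Just to illustrate the idea (and assuming the system has agreed that
'/' is the one separator), a canonicalization step might look like the
following Python sketch.  It rewrites every incoming path to a single,
agreed-upon form before anything else gets to interpret it:

    def canonicalize(path):
        # Map to the one agreed-upon separator first, then resolve
        # "." and ".." so each path has exactly one representation.
        parts = []
        for comp in path.replace("\\", "/").split("/"):
            if comp in ("", "."):
                continue
            if comp == "..":
                if parts:        # refuse to climb above the root
                    parts.pop()
                continue
            parts.append(comp)
        return "/".join(parts)

    # canonicalize("a\\b/../c") == canonicalize("a/./c") == "a/c"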

The above, simple example is often not exploitable in practice,
because most programmers realize how the directly following code will
deal with the data.  In practice, however, there are often several
layers of data-modifying code between the A (filter) and the B (target
logic), and then the programmer of A tends to fail in his logic.  (In
practice there are typically layers before A as well, making it even
harder to realize what data actually hit A in the first place.)

You bring up a couple of examples yourself.  In a modern web setting,
there will typically be some web server component before A.  The web
server component is typically responsible for doing the URL decoding
stuff, i.e. decoding all occurrences of %xx into single bytes.  And
there may be some logic between A and B as well.  In one of your
examples you mention "%252F".  This reminds me of the "MS IIS Double
Decode Bug" [1,2,3] from 2001.  MS IIS didn't want people to "../.."
or "..\.." out of the "scripts" directory, so it contained tests
looking for "../" and "..\" (I don't know the details).  Now, the
infamous worms known as Code Red II and Nimda spread happily using
stuff like "..%252f" to do directory traversal.  Let's see one
possible explanation of how they could do that (and as I said, I don't
know the details, but I understand the principles, and the following
four points are one possible explanation of the cause of the problem in
this closed-source software):

  1.  The web server decodes "..%252f" into "..%2f", as "%25" is a
      URL encoded '%' (25 hex is the ASCII code for a percentage
      sign).

  2.  The filter (A) kicks in, and looks for "../" and "..\", of which
      it finds none.

  3.  Some intermediate, buggy logic runs: An unfortunate programmer
      once scratched his head and wondered about this URL decoding
      stuff, and added another round of it "just in case".  Unlucky
      bastard.  Now "..%2f" turns into "../", _after_ the filter (A).

  4.  Part of the URL is passed on to the OS (B in this case) for
      execution.  The OS deals with the '/' as a path separator.

In this case there was one step before A, and one (misplaced, buggy,
stupid, mayhem-causing) step between A and B.  The one writing A
couldn't know that some poor fellow messed it all up in step 3.
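
The double decode itself is easy to reproduce with any URL decoder.
Here's a tiny Python illustration of steps 1 through 3 (my own sketch,
of course, not IIS code):

    from urllib.parse import unquote

    raw = "..%252f"          # what the attacker sends
    once = unquote(raw)      # step 1, the web server: "..%2f"
    # step 2: the filter (A) looks for "../" and "..\" in "..%2f",
    # and finds neither.
    twice = unquote(once)    # step 3, the "just in case" decode: "../"
    print(once, twice)       # prints: ..%2f ../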

You also mention "..%C0%AF".  This reminds me of the "MS IIS Unicode
Bug" [4] from 2000.  Let's look at my interpretation of what happened
here (and remember my standard disclaimer: I haven't seen the code,
but my speculations give a plausible explanation, IMHO):

  1.  The web server decodes "%C0%AF" into the matching two-byte
      sequence.

  2.  The filter (A) kicks in, and looks for "../" and "..\", of which
      it finds none.  The filter probably works in the western
      European character set ISO-8859-1, or some Microsoft derivative
      thereof (let's just say "one byte per character" in order not to
      offend anyone living outside Europe), and treats the two bytes
      as two distinct characters.

  3.  Part of the URL is passed on to the OS (B in this case) for
      execution.  The file system code of the OS doesn't work with
      ISO-8859-1, but rather with UTF-8 (one of many representations
      of Unicode), in which several bytes may be combined into single
      characters.  In particular, bytes with the high bit (bit 7) set
      are combined with the next byte to produce characters, in a
      semi-iterative manner.  Following this UTF-8 system [5], those
      two bytes (c0 af), combined, decode into a slash.  This is an
      "overlong" representation of the slash, because the slash could
      just as well have been represented by a single byte with the
      plain, one-byte hex value "2f".

Mayhem again, and this time because the file system code of the OS
allowed overlong character sequences, thus opening up for representing
the same character in multiple ways.  The opposite of
canonicalization, sort of.
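
The overlong decoding is simple to demonstrate as well.  The following
Python sketch mimics a careless two-byte UTF-8 decoder that skips the
overlong check; a strict decoder (like Python's own) rejects the
sequence outright:

    def naive_utf8_pair(b1, b2):
        # Combine a 110xxxxx 10yyyyyy pair without checking that the
        # result actually needed two bytes (the overlong check).
        return chr(((b1 & 0x1F) << 6) | (b2 & 0x3F))

    print(naive_utf8_pair(0xC0, 0xAF))      # prints: /
    # bytes([0xC0, 0xAF]).decode("utf-8")   # raises UnicodeDecodeError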

Now, our first example worked because some programmer added a faulty
second URL decoding inside MS IIS.  There's no reason to believe that
this attack should work on other web servers, or even more recent
versions of IIS, although it might.  After all, I found stupid and
compatible URL decoding bugs in both Tomcat [6] and BEA WebLogic [7]
back in 2001.  What I mean to say is: Don't expect the "%252f" thing
to work elsewhere.  If it does, you have probably spotted a new bug,
or you run on outdated Microsoftware.

The second example worked because the file system code of the target
OS accepted path components in which overlong UTF-8 sequences were
allowed, despite those overlong sequences being discouraged by best
practice documents.  You shouldn't expect to find that behavior
elsewhere either.  By all means, check it out (maybe you're lucky),
but do not expect it to work unless the target code is the same as the
target code of the exploit published back in 2000.

[ Due to my own experiments back then, I have reason to believe that
  Microsoft fixed the bug by patching IIS (A) rather than the OS (B).
  If I'm correct, the bug may show up again if you use another web
  server on the same target OS.  (Disclaimer again: My tests may have
  been wrong (don't remember the details), I may remember wrongly, and
  so on.) ]

The point is: All those DIoCoBS (I made that word up, remember)
problems depend heavily on multiple parts of the setup.  One cannot
read a list of "canonicalization problem byte sequences" and expect
them to work everywhere, although some books may give you that
impression.  Most of the time none of the sequences will work, and
there's no reason to be surprised about that.  After all, admins
update their systems, and programmers (slowly) learn from past
mistakes.

To exploit DIoCoBS problems, one will need to know how data pass
through the logic, and how different parts of the logic will interpret
the data differently.  When you find that A will pass through
something that will make B misbehave, you have an exploit.  To do
that, you need to know (or make an educated guess about) what happens
before A,
and what happens between A and B, and even what happens inside B.

Since you have the source code of both Apache and PHP (both before A
in your case), your remaining problem (after reading the source) is
figuring out what Windows 95 (B) does with whatever it receives from
the layers above.


Sverre.

0: http://shh.thathost.com/text/incompatible-parameter-parsing.txt
1: http://www.nsfocus.com/english/homepage/sa01-02.htm
2: http://www.cert.org/advisories/CA-2001-12.html
3: http://www.microsoft.com/technet/security/bulletin/MS01-026.mspx
4: http://www.microsoft.com/technet/security/bulletin/MS00-078.mspx
5: http://www.cl.cam.ac.uk/%7emgk25/unicode.html
6: http://shh.thathost.com/secadv/2001-03-29-tomcat.txt
7: http://shh.thathost.com/secadv/2001-03-28-weblogic.txt

-- 
shh () thathost com               My web security book: Innocent Code
http://shh.thathost.com/       http://innocentcode.thathost.com/

