Vulnerability Development mailing list archives

Re: [imp] sanitizing html


From: marcs () ZNEP COM (Marc Slemko)
Date: Wed, 23 Feb 2000 09:39:11 -0700


On Wed, 23 Feb 2000, Mikael Olsson wrote:

> Stuart Henderson wrote:
>
> > Not sufficiently global, since an attacker can still use,
> > for example hrEf=script:foo -- however, this is tricky to
> > filter without hitting some legitimate addresses, for example
> > http://foo.bar.com/womble.cgi?user=someone&page=something.
>
> Correct. And you can also use UTF-7 (Unicode) chars to make
> script tags and everything look like something else altogether.
>
> This means that
>
> ## $data = preg_replace('|<([^>]*)[Ee][Mm][Bb][Ee][Dd]|', '<horde_cleaned_embed', $data);
>
> wouldn't protect you at all.

Well, no, it won't, but that is for other reasons.  To avoid charset
issues, as long as you don't actually have to worry about pages really
using that charset (yeah, which is a problem for a fair chunk of people),
you simply have to specify the charset explicitly.  If you don't know what
charset the client will use, you have no way to know what has to be
encoded.  This is the reason that the patches released for Apache allow
you to force a charset on all pages that don't have an explicit charset in
the HTTP headers.
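
As a rough illustration of specifying the charset explicitly, the Python
sketch below builds a response whose Content-Type header names the charset
and whose body is encoded to match.  The helper name build_html_response
and the ISO-8859-1 default are assumptions made for this example; this is
not what the Apache patches do internally.

```python
def build_html_response(body_text, charset="iso-8859-1"):
    """Return (headers, body_bytes) with the charset stated in the header.

    Hypothetical helper: the name and the default charset are assumptions.
    """
    # Characters the charset cannot represent become numeric character
    # references instead of raising an error or being dropped.
    body_bytes = body_text.encode(charset, errors="xmlcharrefreplace")
    headers = [
        ("Content-Type", "text/html; charset=" + charset),
        ("Content-Length", str(len(body_bytes))),
    ]
    return headers, body_bytes


headers, body = build_html_response("<p>caf\u00e9 +ADw-script+AD4-</p>")
```

With the charset pinned this way, a client that honours the header should
treat a UTF-7 looking sequence such as +ADw-script+AD4- as the literal
characters it appears to be.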

Some of the other things to worry about, some of which work only in IE or
only in Navigator:

&{alert('foo')};        (in Navigator, inside an attribute value)

<b onmouseover="alert('foo')">foo</b>

<a href="javascript:alert('foo')">foo</a>

<a href="livescript:alert('foo')">foo</a>

<a href="mocha:alert('foo')">foo</a>

<a zoodles="animal noodles>" href="javascript:alert('foo')">foo</a>

<img src=userentered.gif> can be used (note no quotes around
userentered.gif) if userentered.gif is entered as something like:

        xxx.gif onmouseover="alert('foo')
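
The unquoted-attribute case can be sketched in Python.  The helper names
img_tag_unquoted and img_tag_quoted are hypothetical; html.escape is the
standard-library function.  This only shows that quoting plus escaping
keeps the value inside the attribute; it does not check what the URL
actually points at.

```python
import html


def img_tag_unquoted(src):
    # Mirrors the vulnerable pattern: the value is dropped in with no quotes,
    # so a space ends the src value and starts a new attribute.
    return "<img src=" + src + ">"


def img_tag_quoted(src):
    # Quote the attribute and escape &, <, >, " and ' in the value so it
    # cannot close the quotes or the tag.
    return '<img src="' + html.escape(src, quote=True) + '">'


user_value = "xxx.gif onmouseover=\"alert('foo')"
vulnerable = img_tag_unquoted(user_value)  # onmouseover becomes an attribute
safer = img_tag_quoted(user_value)         # whole value stays inside src
```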

If you are putting user data inside javascript, then there are other
characters to be wary of.
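
A minimal Python sketch of what that can mean for a value placed inside a
JavaScript string literal follows.  The helper name js_string_literal is
hypothetical, and the set of extra characters escaped here (<, >, & and ')
is an assumption about what matters when the script sits inside an HTML
page, not a complete list.

```python
import json


def js_string_literal(value):
    """Return value as a double-quoted JavaScript string literal.

    Hypothetical helper.  json.dumps escapes double quotes, backslashes,
    control characters and (by default) all non-ASCII characters; the extra
    replacements keep "</script>" and HTML entities out of the output.
    """
    literal = json.dumps(value)
    for ch in "<>&'":
        literal = literal.replace(ch, "\\u%04x" % ord(ch))
    return literal


snippet = "var name = " + js_string_literal("</script><script>alert('foo')//\n") + ";"
```

The point is that HTML escaping alone is the wrong tool here: quotes,
backslashes, newlines and a literal </script> all need handling in this
context.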

If you are outputting a text/plain page with user content embedded, then
you can't make it safe, because IE has a major hole (yeah, yeah, MS calls
it a "feature", but it should be more and more obvious why it isn't): it
will try to guess the MIME type.  If you send a text/plain page, you can't
encode any characters, since text/plain defines no encoding for them.  Yet
if IE feels like it, it will go ahead and interpret the page as HTML
anyway.  Perhaps having this brought up as a security bug in IE will make
MS fix it.  Probably not.  It isn't like this horribly broken behaviour is
anything new.

The list goes on and on.  And that doesn't even include all the
random HTML tags that are obviously dangerous.  The only thing that
I can almost guarantee is that any list you make won't be complete.
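
To make the incompleteness concrete, here is a small Python sketch that
runs a rendering of the quoted embed pattern over some of the examples
above.  The Python regex is assumed to be equivalent to the PHP one with
the "|" delimiters removed; clean_embed is a hypothetical name.

```python
import re

# Python rendering of the quoted PHP pattern (assumption: same regex,
# minus the PHP "|" delimiters).
EMBED_PATTERN = re.compile(r"<([^>]*)[Ee][Mm][Bb][Ee][Dd]")


def clean_embed(data):
    return EMBED_PATTERN.sub("<horde_cleaned_embed", data)


# Payloads taken from the examples earlier in this message.
payloads = [
    "<b onmouseover=\"alert('foo')\">foo</b>",
    "<a href=\"javascript:alert('foo')\">foo</a>",
    "<a href=\"livescript:alert('foo')\">foo</a>",
    "<a href=\"mocha:alert('foo')\">foo</a>",
    "<a zoodles=\"animal noodles>\" href=\"javascript:alert('foo')\">foo</a>",
]

# Every payload passes through the filter unchanged.
untouched = [p for p in payloads if clean_embed(p) == p]
```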

There is no way to safely filter HTML by specifying what not to
allow.  Even if you somehow did create a filter that magically
worked 100% with one or two or three browsers today, your filter
will break tomorrow or the next day.  You need to be explicit about
what you do allow, and make sure that it is in a very restricted
form.  Things like PHP's strip_tags function, which lets only certain
tags through, are not stringent enough, because they allow arbitrary
attributes on those tags.
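
A minimal sketch of the "say what you do allow" approach, in Python, might
look like the following.  The tag list, the class name AllowlistFilter and
the function sanitize are all assumptions for illustration.  It re-emits a
handful of tags in a fixed form with no attributes at all and escapes all
text.  It is not a vetted sanitizer.

```python
import html
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br"}  # assumed allowlist


class AllowlistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Re-emit an allowed tag in canonical form; every attribute is dropped.
        if tag in ALLOWED_TAGS:
            self.out.append("<" + tag + ">")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag != "br":
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        # Text is always re-escaped, never passed through raw.
        self.out.append(html.escape(data, quote=True))


def sanitize(fragment):
    parser = AllowlistFilter()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

Because nothing from the input tag is copied to the output, an unknown
attribute or a form tag has no path through: sanitize applied to
<b onmouseover="alert('foo')">foo</b> yields <b>foo</b>.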

In addition, remember that this problem isn't just about scripting.
Say you have an auction site that lets people bid by viewing an
item, then entering their username and password at the bottom of
the page.  All the attacker needs is the ability to insert a form
tag and associated stuff to exploit this.  No scripts involved;
this is not a scripting problem.  The name is unfortunate.  As a
real-life example of this problem, take eBay.  They are wide open
to almost this exact example (they do allow scripting languages,
and do have the enter username/password bit on a second page, which
changes little), have known about it for a long time, and just
don't give a damn.

