Bugtraq mailing list archives

Re: PHP security (or the lack thereof)


From: Glynn Clements <glynn () gclements plus com>
Date: Sat, 24 Jun 2006 13:43:57 +0100


Crispin Cowan wrote:

Trying to make the language 'safe' won't fix it because the
language is not the problem. The real problem is the way PHP is
presented to most new developers.

PHP has been introduced as a tool for the web developer. As a
language its goal is "to allow web developers to write dynamically
generated pages quickly." (
http://www.php.net/manual/en/faq.general.php ). The focus then is to
enable the web developer by giving him the tools he needs to create
dynamic content, with as little hassle as possible. The web
developer need only read a short tutorial (
http://www.php.net/manual/en/tutorial.php ) and he is ready to read,
understand and implement the ideas presented in the various example
scripts on PHP.net. Unfortunately this situation leaves the web
developer uninformed and unprepared to face the hostile environment
that is the net.

That is a fascinating perspective.

Web development with static content (HTML, images, etc.) is
pretty secure: the security threat amounts to Apache configuration
(directory browsing and htpasswd stuff) and it is pretty difficult for
an attacker to corrupt static content by way of the content.

Dynamic content, while not inherently dangerous, becomes dangerous when
you hand the web developer a Turing-complete language. Suddenly the
exact behavior of the web site under arbitrary input becomes
undecidable. Programmers (mostly) know this. Security developers
(should) know this. Web artists may have just been introduced to
programming to get their web site to be dynamic.

There are two possible approaches to fixing this. One, as nabiy
suggests, is to change how PHP is presented to web developers. Label it
as a chain saw, and point out that chain saws don't know the difference
between "log" and "leg" :)

The other is to contrive a language that is both sufficient for dynamic
web content development, and also *not* Turing-complete. I have no idea
what such a language might look like, or even whether the intersection
of these two requirements is the null set.

Eliminating Turing-completeness would be fairly straightforward:
prohibit unbounded recursion and iteration (i.e. no "while" loops). It
probably wouldn't have much impact upon the usability of the language
either; the kind of processing performed by most web applications
doesn't require anything beyond simple iteration over finite
lists/strings/arrays.

Unfortunately, it wouldn't have much impact upon the security of the
language either; you don't need anything beyond string concatenation
to fall vulnerable to XSS, SQL-injection or shell-injection attacks. 
And you don't need unbounded iteration to make exhaustive analysis
impractical. Just because you can /theoretically/ determine something,
that doesn't mean that you can make the determination using existing
hardware in a reasonable time-frame.
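
As an illustration of the concatenation point, here is a minimal sketch
in Python (chosen for brevity; the same applies to PHP). The table, the
function names and the hostile input are all made up for the example.
It contains no loops or recursion in the query-building path, yet the
first function is injectable:

```python
import sqlite3

def find_user_concat(conn, name):
    # Vulnerable: the value is pasted into the SQL text, so any quote
    # characters in it become part of the statement's syntax.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]

def find_user_param(conn, name):
    # The statement and the value travel separately; the value is
    # never parsed as SQL.
    cursor = conn.execute("SELECT name FROM users WHERE name = ?", (name,))
    return [row[0] for row in cursor]

# Hypothetical in-memory table used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

hostile = "x' OR '1'='1"
leaked = find_user_concat(conn, hostile)  # condition is always true
safe = find_user_param(conn, hostile)     # no user has this literal name
```

With the concatenated query the WHERE clause becomes
name = 'x' OR '1'='1', which matches every row; the parameterised
version compares against the literal string and matches none.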

So far as writing secure web applications is concerned, it's likely to
be more fruitful to stop using a common "string" type for raw text,
HTML, URLs, URL-encoded form data, SQL statements, shell commands,
regexps, printf-style format strings, HTTP headers and so on. IOW,
stop using "in-band signalling".
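
A rough sketch of what separate types could look like, in Python. The
class name Html and the helper text() are hypothetical, invented for
this example; the only library call is html.escape from the standard
library. The idea is that raw text cannot be joined to markup without
passing through the one function that escapes it:

```python
import html

class Html:
    """A fragment that is already valid HTML. Deliberately not a str."""

    def __init__(self, markup):
        # The caller asserts that this is trusted markup.
        self.markup = markup

    def __add__(self, other):
        if not isinstance(other, Html):
            # Refuse to mix with plain strings; Python then raises TypeError.
            return NotImplemented
        return Html(self.markup + other.markup)

def text(raw):
    """The only route from raw text to Html; escaping happens here."""
    return Html(html.escape(raw))

page = Html("<p>") + text("1 < 2 & 3") + Html("</p>")
# page.markup is "<p>1 &lt; 2 &amp; 3</p>"
```

Writing Html("<p>") + "raw" fails with a TypeError instead of silently
producing a page with unescaped input in it.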

The problems aren't limited to web applications; I wouldn't be able to
count the number of times I've seen shell scripts (or C programs using
the printf/system idiom) which fail on filenames or other strings
which contain shell metacharacters (or begin with a leading hyphen).
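
To make the filename problem concrete, here is a small Python sketch.
The command ("ls -l") and the function names are placeholders chosen
for the example. shlex.split only approximates how a POSIX shell splits
words (it does not treat ";" as a command separator, whereas a real
shell would go on to run the second command), but it is enough to show
the filename falling apart:

```python
import shlex

def naive_command(filename):
    # The printf/system idiom: the filename is pasted into shell syntax.
    return "ls -l %s" % filename

def argv_command(filename):
    # Structured form: one list element per argument. The "--" tells
    # most utilities that what follows is not an option, which covers
    # names beginning with a hyphen.
    return ["ls", "-l", "--", filename]

name = "my file; echo pwned"

# The single filename is split into several words.
broken = shlex.split(naive_command(name))

# The argv list keeps the filename as exactly one argument.
intact = argv_command(name)

# If a command string is unavoidable, shlex.quote restores the boundary.
quoted = shlex.split("ls -l -- " + shlex.quote(name))
```

The list form is what subprocess.run accepts when no shell is involved,
so the filename never has to be valid shell syntax in the first place.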

Web applications are just a more extreme case, due to a combination
of:

a) relatively inexperienced programmers
b) having a whole bunch of extra syntaxes thrown in
c) the fact that the very nature of a web application means that
anyone, anywhere can throw malicious data at it.

So far as designing a language which accounts for these issues is
concerned: IMHO, the most feasible solution is to stop passing
structured data around as "formatted" strings and to use data
structures (e.g. a parse tree) instead.

If you want to construct HTML, you construct the parse tree by
creating leaf nodes from strings and higher-level nodes from a tag
name, a list of attribute name/value nodes, and a list of child nodes.

Each constructor would validate its input according to the allowed
syntax. The process of generating HTML from the parse tree would
perform any necessary conversions (e.g. "<" -> "&lt;" within leaf
nodes).
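
A minimal sketch of that scheme in Python. The class names Text and
Element and the simplified name pattern are assumptions made for the
example; it ignores void elements (e.g. "br") and raw-text elements
(e.g. "script"), which a real implementation would need to handle:

```python
import html
import re

# Simplified syntax for tag and attribute names (an assumption, not
# the full HTML grammar).
NAME = re.compile(r"[A-Za-z][A-Za-z0-9-]*\Z")

class Text:
    """Leaf node holding a string in its natural, unescaped form."""

    def __init__(self, value):
        self.value = value

    def render(self):
        # Conversion happens only at generation time.
        return html.escape(self.value, quote=False)

class Element:
    """Node built from a tag name, attribute pairs and child nodes."""

    def __init__(self, tag, attrs=(), children=()):
        if not NAME.match(tag):
            raise ValueError("invalid tag name: %r" % (tag,))
        attrs = list(attrs)
        for attr_name, _value in attrs:
            if not NAME.match(attr_name):
                raise ValueError("invalid attribute name: %r" % (attr_name,))
        children = list(children)
        for child in children:
            if not isinstance(child, (Text, Element)):
                raise TypeError("children must be Text or Element nodes")
        self.tag = tag
        self.attrs = attrs
        self.children = children

    def render(self):
        attrs = "".join(
            ' %s="%s"' % (n, html.escape(v, quote=True)) for n, v in self.attrs
        )
        body = "".join(child.render() for child in self.children)
        return "<%s%s>%s</%s>" % (self.tag, attrs, body, self.tag)

link = Element(
    "a",
    [("href", "/search?q=a&b")],
    [Text("<script>alert(1)</script>")],
)
rendered = link.render()
```

Here rendered is
<a href="/search?q=a&amp;b">&lt;script&gt;alert(1)&lt;/script&gt;</a>,
and something like Element("a onclick=x") is rejected by the
constructor rather than ending up in the output.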

For every formal language which is likely to be useful, the
development language would provide a parser, generator, and a library
of useful operations on the structured representation (find, add,
delete, modify nodes, etc). Any data entering or leaving the
structured representation as a string would be represented in its
"natural" form, i.e. it wouldn't contain any language "syntax".

Apart from being more robust, such a language should also make life
easier for the application developer, as they don't have to implement
their own equivalents. Even programmers who don't understand the
security issues will typically have to deal with many of these issues
in order to get their code to simply work.
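
Some libraries already work this way for individual syntaxes. As one
example, Python's standard urllib.parse provides a generator
(urlencode) and a parser (parse_qs) for URL-encoded form data; the
parameter names and values below are made up. The application only
ever handles values in their natural form:

```python
from urllib.parse import parse_qs, urlencode

# Values in natural form, including characters that are syntax in a
# query string ("&", "=", "?", "/").
params = {"q": "fish & chips", "next": "/a?b=c"}

# Generator: structured data in, correctly encoded string out.
encoded = urlencode(params)

# Parser: encoded string in, natural-form values out. parse_qs returns
# a list per key because a key may legitimately occur more than once.
decoded = parse_qs(encoded)
```

The "&" inside the first value is encoded, so the only literal "&" left
in the string is the separator between the two parameters, and
decoding returns the original values unchanged.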

-- 
Glynn Clements <glynn () gclements plus com>

