Vulnerability Development mailing list archives

RE: Possible DOS against search engines?


From: "Rob Shein" <shoten () starpower net>
Date: Mon, 3 Feb 2003 18:45:00 -0500

I see a few problems here.  Problems are listed below each concept, for
clarity, and assume a decent webcrawler.


1. You create a generator for fake web pages, whose purpose 
is to spit out HTML containing a huge number of (pseudo-) 
random _non-existent_ words, as well as links to other pages 
within the generator;

I doubt this would make even a slight dent in things.  Seeing as how
webcrawlers already walk the entire internet, with its various languages,
enormous expanse, and endless misspellings, I think anything you could
create would end up being a drop in the bucket.
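
For concreteness, the generator described in point 1 could be sketched in a
few lines of Python (every name and parameter below is made up for the
example):

# Minimal sketch of a fake-page generator: it emits an HTML page full
# of random letter sequences plus links back into itself, so a crawler
# keeps finding "new" pages.  All names here are hypothetical.
import random
import string

def fake_word(length=8):
    """Return a pseudo-random letter sequence unlikely to be a real word."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

def fake_page(num_words=500, num_links=10):
    words = " ".join(fake_word() for _ in range(num_words))
    links = "".join('<a href="/page/%s.html">%s</a> ' % (fake_word(), fake_word())
                    for _ in range(num_links))
    return "<html><body><p>%s</p><p>%s</p></body></html>" % (words, links)

if __name__ == "__main__":
    print(fake_page(num_words=20, num_links=3))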


2. You place that generator somewhere and submit the URL to 
search engines for crawling;

3. The search engines then crawl the site, possibly reaching 
their pre-defined maximum crawling depth (or, if badly 
broken, crawl the site indefinitely, jumping from one freshly 
generated page to another);
 
But they don't crawl indefinitely.  What do they do if they hit two sites
that link to each other?  They notice this, and move on.
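
That loop detection is nothing exotic; conceptually it boils down to a
visited-URL set plus a depth cap, something like this toy sketch
(fetch_links is a stand-in for real fetching and parsing):

# Toy illustration of why a crawler does not loop forever: a visited
# set stops revisits, and a depth cap bounds how far links are followed.
from collections import deque

def crawl(start_url, fetch_links, max_depth=5):
    visited = set()
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue                      # already seen, or too deep: move on
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                queue.append((link, depth + 1))
    return visited

# Two pages that only link to each other get crawled exactly once each:
pages = {"a": ["b"], "b": ["a"]}
print(sorted(crawl("a", lambda u: pages.get(u, []))))   # ['a', 'b']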

4. Upon adding the gathered words to the search engine's 
index, the index becomes heavily overloaded with the newly 
added words, as they fall outside the set of real-language 
words already present in the index. The following should be 
theoretically possible:
 
But who would search on them?

    - craft fake words so that they attack a specific hash 
function. Make a bunch of fakes that hash to the same value 
as a legitimate word in the English language. This will 
possibly impact the performance of search engines using that 
particular hash function when they try to look up the 
legitimate words that are being targeted.

This would be noticed by the search engine long before it became a real
problem, and it would be addressed.  This is how they deal with many things,
including people who try to influence their ranking using various means.
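
For what it's worth, the collision idea is only cheap against a weak hash
function.  Here is a toy Python sketch against a deliberately bad one (a
plain sum of character codes; no serious engine would use anything like it,
and the target word is just an example):

# Toy demonstration of crafting junk words that land in the same hash
# bucket as a legitimate word, using a deliberately weak hash.
import random
import string

def weak_hash(word):
    return sum(ord(c) for c in word)

def collisions_for(target, count=5):
    """Brute-force random letter sequences that hash like `target`."""
    wanted = weak_hash(target)
    found = []
    while len(found) < count:
        w = "".join(random.choice(string.ascii_lowercase)
                    for _ in range(len(target)))
        if w != target and weak_hash(w) == wanted:
            found.append(w)
    return found

print(collisions_for("linux"))   # five junk words in the same bucket as "linux"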
 
    - craft fake words so that they unbalance a b-tree 
index, if one is used. I am not entirely sure, but it 
appears to me that it is possible to craft words in such a 
way as to alter the shape of the b-tree and thus impact the 
performance of lookups where it is used.
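
As a side note, a proper b-tree rebalances itself on every insert, so
crafted keys mostly cost extra node splits rather than a lopsided tree; the
shape concern applies more to a naive, unbalanced tree index.  A toy sketch
of how insertion order shapes such a tree (plain binary search tree,
made-up keys):

# Insertion order decides the shape of a naive (unbalanced) binary
# search tree; a real b-tree rebalances itself and avoids this.
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def depth(root):
    return 0 if root is None else 1 + max(depth(root.left), depth(root.right))

keys = ["qzt", "abl", "mno", "xcd", "pqr", "bcd", "zzz", "klm"]

tree = None
for k in keys:                  # arbitrary order: a bushier, shallower tree
    tree = insert(tree, k)
print("arbitrary order depth:", depth(tree))

tree = None
for k in sorted(keys):          # adversarial (sorted) order: depth == len(keys)
    tree = insert(tree, k)
print("sorted order depth:", depth(tree))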

    - craft fake words randomly so that the index just grows. 
To the best of my understanding, most search engines will 
index and retain keywords that are only seen on one web page 
in the entire Internet. However, I think the capacity of the 
search engines to keep track of such one-time non-English 
letter sequences is limited and can eventually be exhausted.

It is my belief that, again, they will notice the impact on their database
and quickly address the issue.  What about a bit of code that states that if
more than 5% of the words on a page are unique in the database, that page is
dropped?
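
A rough sketch of that 5% heuristic, just to show how little code it would
take (known_words stands in for a lookup against the engine's existing
index, and the threshold is arbitrary):

# Hypothetical filter: drop a page if too large a fraction of its
# words have never been seen in the index before.
def should_drop(page_words, known_words, threshold=0.05):
    if not page_words:
        return False
    unseen = sum(1 for w in page_words if w not in known_words)
    return unseen / len(page_words) > threshold

index = {"the", "search", "engine", "crawls", "pages"}
normal_page = ["the", "search", "engine", "crawls", "pages", "the"]
bogus_page = ["xqzpt", "lrmvk", "the", "wqjzn", "hxplo"]
print(should_drop(normal_page, index))   # False
print(should_drop(bogus_page, index))    # True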

If the above-mentioned things are feasible, then one can even 
construct a worm of some sort that will auto-install such 
fake page generators on valid sites, thus increasing the 
traffic to the crawler even more. Writing a short Apache 
handler meant to be silently installed in httpd.conf at 
root-kit installation should not be that difficult. When was 
the last time you reviewed the module list of your Apache? 
Would you spot a malicious module if it were called 
mod_ip_vhost_alias, loaded in between two other modules that 
you were never sure were vital or not?

No, but I'd notice an abrupt lack of space on my web server.  And the sudden
oddly-named URLs in my logs.  And the corresponding oddly-named pages on my
site.  And if I didn't notice, my hosting provider would.
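
Spotting those URLs would not even need a human eye; a crude pass over the
access log would do, for example flagging long path segments with almost no
vowels (an entirely hypothetical heuristic, just to illustrate):

# Flag request paths whose segments are long alphabetic strings with
# almost no vowels, which real words rarely are.
import re

def looks_random(segment, min_len=8, max_vowel_ratio=0.2):
    if len(segment) < min_len or not segment.isalpha():
        return False
    vowels = sum(1 for c in segment.lower() if c in "aeiou")
    return vowels / len(segment) < max_vowel_ratio

def suspicious_paths(log_lines):
    for line in log_lines:
        m = re.search(r'"(?:GET|POST) (\S+)', line)   # request line, common log format
        if not m:
            continue
        path = m.group(1)
        if any(looks_random(seg) for seg in re.split(r"[/.]", path)):
            yield path

log = ['1.2.3.4 - - [...] "GET /page/xqzptlrmvkw.html HTTP/1.0" 200 512',
       '1.2.3.4 - - [...] "GET /index.html HTTP/1.0" 200 1024']
print(list(suspicious_paths(log)))   # ['/page/xqzptlrmvkw.html']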

Please note that the setup described differs from the 
practice of generating fake pages containing a lot of real 
(mostly adult) keywords. After all, such real-language words 
already exist in the index, whereas I suggest bombing the 
index with a huge number of not-previously-existing 
freshly-generated random letter sequences. Also, please note 
that the purpose of the attack is to damage the index, and 
not to make the crawler consume bandwidth by going in an 
endless loop or something like that (though the crawler has 
to scan the pages first so that the generated keywords are 
ultimately delivered to the index).

I would appreciate any and all thoughts on the issue.

Philip Stoev


