nanog mailing list archives

Re: yahoo crawlers hammering us


From: Matthew Petach <mpetach () netflight com>
Date: Wed, 8 Sep 2010 00:04:07 -0700

On Tue, Sep 7, 2010 at 1:19 PM, Ken Chase <ken () sizone org> wrote:
So I guess I'm new at internets, as my colleagues told me, because I haven't gone
around to the 30-40 systems I control (minus customer self-managed gear) and
installed a restrictive robots.txt everywhere to make the web less useful to
everyone.

Does that really mean that a big outfit like Yahoo should be expected to
download stuff at high speed off my customers' servers? For varying values of
'high speed', ~500K/s (4Mbps+) for a 3 gig file is kinda... a bit harsh.
Especially for an exe a user left exposed in a webdir, that's possibly (C)
software and shouldn't have been there (now removed by customer, some kinda OS boot
cd/toolset thingy).

The large search engines like Google, Bing, and Yahoo do try to be good
netizens, and not have multiple crawlers hitting a given machine at the same
time, and they put delays between each request, to be nice to the CPU load
and bandwidth of the machines; but I don't think any of the crawlers explicitly
make efforts to slow down single-file-fetches.  Ordinarily, the transfer speed
doesn't matter as much for a single URL fetch, as it lasts a very short period
of time, and then the crawler waits before doing another fetch from the same
machine/same site, reducing the load on the machine being crawled.  I doubt
any of them rate-limit down individual fetches, though, so you're likely to see
more of an impact when serving up large single files like that.
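
To make the distinction above concrete, here is a minimal Python sketch of a
per-host politeness delay. It is an illustration only, not how any particular
search engine's crawler is implemented; the names PolitenessGate and
transfer_seconds, and the 10-second delay in the usage note, are hypothetical.
The gate spaces out requests to the same host but does nothing to slow a
transfer once it has started, so a single large file is fetched as fast as
the link allows.

```python
import time
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks when each host may next be fetched (hypothetical helper)."""

    def __init__(self, delay_seconds, clock=time.monotonic):
        self.delay_seconds = delay_seconds
        self.clock = clock
        self.next_allowed = {}  # host -> earliest time of the next fetch

    def wait_time(self, url):
        """Seconds to wait before this URL's host may be fetched again."""
        host = urlsplit(url).hostname
        now = self.clock()
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record_fetch_finished(self, url):
        """Start the delay window once a fetch from this host completes."""
        host = urlsplit(url).hostname
        self.next_allowed[host] = self.clock() + self.delay_seconds


def transfer_seconds(size_bytes, rate_bytes_per_second):
    """Duration of one unthrottled transfer at a steady rate."""
    return size_bytes / rate_bytes_per_second
```

With a 10-second delay, a second request to the same host waits 10 seconds
after the first one finishes, while a request to a different host does not
wait at all. Taking "3 gig" as 3 * 10**9 bytes and "~500K/s" as 500 * 10**3
bytes per second, transfer_seconds gives 6000 seconds, or about 100 minutes
of sustained transfer for that one file.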

I *am* curious--what makes it any worse for a search engine like Google
to fetch the file than any other random user on the Internet?  In either case,
the machine doing the fetch isn't going to rate-limit the fetch, so you're
likely to see the same impact on the machine, and on the bandwidth.

Is this expected/my own fault or what?

Well...if you put a 3GB file out on the web, unprotected, you've got to figure
at some point someone's going to stumble across it and download it to see
what it is.  If you don't want to be serving it, it's probably best to not
put it up on an unprotected web server where people can get to it.  ^_^;

Speaking purely for myself in this matter, as a random user who sometimes
sucks down random files left in unprotected directories, just to see what they
are.

Matt
(now where did I put that antivirus software again...?)

