nanog mailing list archives

Re: Yahoo! clue (Slightly OT: Spiders)


From: "Matthew Petach" <mpetach () netflight com>
Date: Tue, 5 Jun 2007 13:57:23 -0700


On 3/30/07, Zach White <zwhite () darkstar frop org> wrote:
On Thu, Mar 29, 2007 at 10:17:50AM -0400, Kradorex Xeron wrote:
> Another problem is that the Yahoo/Inktomi search robots do not stop if no site
> is present at that address. Thus, someone could register a DNS name and host
> a site on it temporarily, just long enough for Yahoo/Inktomi's bots to
> notice it, then repoint the name to any Internet host's address. The
> bots would then proceed to that host and access it over and over in succession,
> wasting bandwidth at the user end (which in most cases is
> monitored and limited, sometimes severely, by the ISP) and wasting time at the
> bot's end that could have been used spidering other sites.

It's not limited to that. I bought this domain which had previously been
in use. I've owned the domain for over 5 years, but I still get requests
for pages that I've never had up.

<zwhite@leet:/var/www/logs:8>$ grep ' 404 ' access_log | grep darkstar.frop.org | awk '/Yahoo/ { print $8 }' | wc -l
     830
<zwhite@leet:/var/www/logs:9>$ grep ' 404 ' access_log | grep darkstar.frop.org | awk '/Yahoo/ { print $8 }' | sort -u | wc -l
      82

That's 82 unique URLs that have been returning a 404 for over 5 years.
That log file was last rotated 2006 Sep 26, roughly six months ago, so
830 requests works out to about 138 per month for pages that don't
exist on that one domain alone.
How many bogus requests are they sending each month, and what can
we do to stop them? (The first person to say something involving
robots.txt gets a cookie made with pickle juice.)
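
The two pipelines above can be folded into a single awk pass. This is only a sketch: it assumes a vhost-prefixed combined log format, in which the request path is field 8 (as in the commands above) and the status code is field 10. The function name count_404s and its arguments are made up for illustration.

```shell
#!/bin/sh
# Sketch: count total and distinct 404'd URLs for one crawler in a single pass.
# ASSUMPTION: vhost-prefixed combined log format, so the request path is
# field 8 and the status code is field 10. Adjust both for your LogFormat.
count_404s() {
    # $1 = log file, $2 = virtual host name, $3 = user-agent substring
    awk -v host="$2" -v agent="$3" '
        index($0, host) && $10 == 404 && index($0, agent) {
            total++
            # Count each distinct path only the first time it is seen.
            if (!($8 in seen)) { seen[$8] = 1; unique++ }
        }
        END { printf "%d total, %d unique\n", total, unique }
    ' "$1"
}
```

Under those assumptions, running count_404s access_log darkstar.frop.org Yahoo should report the same two counts as the separate wc -l runs.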

Sure, on my domain alone that's not a big deal. It hasn't cost me any
money that I'm aware of, and it hasn't caused any trouble. However, it
is annoying, and at some point it becomes a little ridiculous.

Can anyone who runs a large web server farm weigh in on these sorts of
requests? Has this annoyance, multiplied over thousands of domains and
IPs, caused you problems? Increased bandwidth costs?

-Zach


Speaking purely for myself, and not for any other organization, I would
wonder what level of response you had gotten from the abuse address
listed in the requesting netblock:

mpetach@netops:/home/mrtg/archive> whois -h whois.ra.net 74.6.0.0/16
route:      74.6.0.0/16
descr:      YST
origin:     AS14778
remarks:    Send abuse mail to slurp () inktomi com
mnt-by:     MAINT-AS7280
source:     RADB
mpetach@netops:/home/mrtg/archive>

First line of inquiry in my mind would be to use the slurp@
email, and work my way along from there.
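
For anyone repeating the lookup across several netblocks, the abuse remark can be filtered out of the whois output. A minimal sketch, assuming the registry object uses "remarks:" lines the way the RADB record above does; the helper name abuse_remarks is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the text of any "remarks:" line that mentions abuse.
# Reads whois output on stdin. ASSUMPTION: RADB-style "key: value" lines.
abuse_remarks() {
    awk 'tolower($1) == "remarks:" && tolower($0) ~ /abuse/ {
        sub(/^[^:]*:[ \t]*/, "")   # drop the "remarks:" key and its padding
        print
    }'
}

# Usage (needs network access):
#   whois -h whois.ra.net 74.6.0.0/16 | abuse_remarks
```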

Matt

