nanog mailing list archives

STILL Paging Google...


From: Matthew Elvey <matthew () elvey com>
Date: Tue, 15 Nov 2005 16:56:12 -0800


Still no word from google, and no indication that there's anything wrong with the robots.txt. Google's estimated hit count is creeping slightly up, instead of dropping way down. Why am I bugging NANOG with this? Well, I'm sure that if Googlebot keeps ignoring my robots.txt file, thereby hammering the server and facilitating spam, it's doing the same to a google of other sites. (Well, ok, not a google, but you get my point.)


On 11/14/05 2:18 PM, Coyle, Brian sent forth electrons to convey:
Just thinking out loud...

Have you confirmed the IP addresses of the Googlebot entries in your log
actually belong to Google?
/paranoia  :)
The google search URL I posted shows that google is still hitting the site: the results include pages that postdate the robots.txt rules that should have blocked 'em. (http://www.google.com/search?q=site%3Awiki.fastmail.fm)
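
FWIW, here's a quick Python sketch of one way to rule that out anyway: reverse-resolve each client IP from the access log, then forward-resolve the name and make sure it maps back to the same IP (a reverse lookup alone is trivially spoofable). The log parsing is left out, and the hostname suffixes are my assumption about what Google's crawlers resolve to:

    import socket

    def looks_like_google(ip):
        # Reverse-resolve the IP to a hostname.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        # Check it sits under a Google crawler domain (my assumption).
        if not host.endswith(('.googlebot.com', '.google.com')):
            return False
        # Forward-resolve the name and confirm it maps back to the IP.
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False

    # e.g., for each client IP pulled from the access log:
    # print(ip, looks_like_google(ip))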


On 11/14/05 2:09 PM, Jeff Rosowski sent forth electrons to convey:
Are you trying to block everything except the main page? I know to block everything ...
No; me too. See
http://www.google.com/webmasters/remove.html
The above page says that
User-agent: Googlebot
Disallow: /*?
will block all standard-looking dynamic content, i.e. URLs with "?" in them.
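
Here's my reading of that matching rule as a small Python sketch (not Google's code; just assuming "*" matches any run of characters and the pattern matches against a prefix of the URL path):

    import re

    def googlebot_match(pattern, path):
        # '*' matches any character sequence; everything else is
        # literal, and the pattern need only match a path prefix.
        regex = '.*'.join(re.escape(part) for part in pattern.split('*'))
        return re.match(regex, path) is not None

    # "Disallow: /*?" should catch query-string URLs and nothing else:
    assert googlebot_match('/*?', '/wiki?action=edit')
    assert not googlebot_match('/*?', '/wiki/HomePage')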


On Mon, 14 Nov 2005, Matthew Elvey wrote:


Doh! I had no idea my thread would require login/be hidden from general view! (A robots.txt info site had directed me there...) It seems I fell for an SEO scam... how ironic. I guess that's why I haven't heard from google...

Anyway, here's the page content (with some editing and paraphrasing):

Subject: paging google! robots.txt being ignored!

Hi. My robots.txt was put in place in August!
But google still has tons of results that violate the file.

http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
doesn't complain (other than about the use of google's nonstandard extensions described at
http://www.google.com/webmasters/remove.html )

The above page says that it's OK that

#per [[AdminRequests]]
User-agent: Googlebot
Disallow: /*?*

comes last in the file (after the User-agent: * section),

and seems to confirm that the syntax is OK.
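
That matches my understanding of how a crawler picks its section: the most specific User-agent match wins, wherever it appears in the file, and * is only the fallback. A toy Python sketch of that selection rule (my reading of the convention, not any real parser):

    def select_group(groups, agent):
        # The group naming the crawler wins, regardless of file
        # order; the '*' group is only a fallback.
        for name, rules in groups:
            if name.lower() == agent.lower():
                return rules
        for name, rules in groups:
            if name == '*':
                return rules
        return []

    groups = [('*', ['/private/']), ('Googlebot', ['/*?'])]
    # Googlebot gets its own rules even though its section is last:
    assert select_group(groups, 'Googlebot') == ['/*?']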

I also tried

User-agent: Googlebot
Disallow: /*?
but it hasn't helped.



I asked google to review it via the automatic URL removal system (http://services.google.com/urlconsole/controller).
Result:
URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card:
DISALLOW: /*?

How insane is that?

Oh, and while /*?* wasn't taken straight from their example, it was legal per their syntax, and it matches the same URLs as /*? (matching is prefix-based, so the trailing * is redundant).
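
Quick sanity check, reusing the prefix-match reading from the sketch above (again, my interpretation, not Google's implementation):

    import re

    def m(pattern, path):
        # prefix match, with '*' as an any-sequence wildcard
        regex = '.*'.join(map(re.escape, pattern.split('*')))
        return re.match(regex, path) is not None

    # matching is prefix-based, so the trailing '*' changes nothing:
    for u in ['/wiki?action=edit', '/index?x=1', '/wiki/HomePage', '/?']:
        assert m('/*?', u) == m('/*?*', u)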

The site has around 35,000 pages, and I don't think a small robots.txt that does what I want is possible without using the wildcard extension.
