nanog mailing list archives

Re: Anyone have contacts at the Amazon or OpenAI web spiders?


From: "John Levine" <johnl () iecc com>
Date: 14 Feb 2024 11:48:27 -0500

It appears that Patrick Clochesy <patrick () mach net> said:
Both robots respect robots.txt, of course they’re not going to answer.

The content farm is not one site with six billion pages, it's six billion
sites each with one page.  They check the robots.txt for each site they
visit but by then it's too late.

Most spiders can take the hint that they're all on the same IP.  But not
these two.
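
For anyone writing a crawler, here is a minimal sketch of what "taking the
hint" might look like: key the politeness delay on the resolved IP address
rather than on the hostname, so that any number of one-page sites on a single
server share one budget. The class name PerIPThrottle, the 10 second default,
and the example hostnames are all made up for illustration; this is not how
any particular spider is known to work.

```python
import socket
import time
from urllib.parse import urlsplit


class PerIPThrottle:
    """Hypothetical helper: space out fetches per IP address, not per hostname."""

    def __init__(self, min_delay=10.0, resolve=socket.gethostbyname,
                 clock=time.monotonic):
        self.min_delay = min_delay   # assumed minimum gap in seconds between hits on one IP
        self.resolve = resolve       # hostname -> IP lookup (injectable for testing)
        self.clock = clock           # time source (injectable for testing)
        self.next_allowed = {}       # IP address -> earliest time the next fetch may start

    def wait_time(self, url):
        """Return how many seconds to wait before fetching url, reserving that slot."""
        host = urlsplit(url).hostname
        ip = self.resolve(host)
        now = self.clock()
        # Distinct hostnames that resolve to the same IP queue behind one another.
        start = max(now, self.next_allowed.get(ip, now))
        self.next_allowed[ip] = start + self.min_delay
        return start - now


if __name__ == "__main__":
    # Stubbed resolver and clock so the demo needs no network access.
    throttle = PerIPThrottle(resolve=lambda host: "192.0.2.1", clock=lambda: 0.0)
    for name in ("a.example", "b.example", "c.example"):
        print(name, throttle.wait_time("https://" + name + "/"))
```

A crawler would sleep for the returned interval before each fetch. The
robots.txt check still happens per site; the throttle only limits how fast
those sites are visited when they all live on one address.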

R's,
John


On Feb 13, 2024, at 8:35 PM, John Levine <johnl () iecc com> wrote:

One day I set up the world's lamest content farm. You can see it here:

https://www.web.sp.am/

While humans tend not to find its six billion pages very interesting,
some web spiders are entranced. In the past week or so, Amazon's
amazonbot has visited it 6 million times, and OpenAI's gptbot 2.6
million. (If you were wondering what they use to train ChatGPT, now
you know.) I don't care that googlebot comes by every 5 or 10 minutes,
but gptbot is every few seconds and amazon as fast as the server will
respond.

They both come from predictable IPs so I can set packet filters but
they're still hammering pretty hard. Each has a URL in the user agent
string. Amazon's page has an address to write to but OpenAI's doesn't.
I wrote to the Amazon address but got no response.

If anyone has contacts at either I would appreciate it. A few years
ago the bingbot got trapped but fortunately I knew someone at
Microsoft who could pass the word. He reported back that while he
could not go into detail, there was a great deal of animated
conversation at the other end of the hall, and shortly after that it
stopped.

R's,
John



