Interesting People mailing list archives

more on What if Google and Yahoo got into a fight?


From: David Farber <dave () farber net>
Date: Tue, 16 Aug 2005 08:57:58 -0400



Begin forwarded message:

From: Eric Glover <eric () ericglover com>
Date: August 15, 2005 9:46:54 PM EDT
To: "David Farber (by way of Bernard A. Galler)" <dave () farber net>
Cc: i-p () umich edu
Subject: Re: [IP] What if Google and Yahoo got into a fight?


There are several flaws in the NCSA study (I looked at their Perl code). For the record, I do *not* work for Google or Yahoo.

There are a few other assumptions which were not explicit:

#1: That both Google and Yahoo use the same relevance function to decide which results to include, or that there is some way to post-process the results so they can be compared on equal terms.

#2: That any bias in the Yahoo crawler is equally probable for results which are returned for the keywords in the study, or at least close.

Regarding #1: It is clear that Google does not require keywords to appear on a particular web page for that page to be returned, so it is necessary to actually verify that the pages returned by both Yahoo and Google meet some identical constraints. It is also possible that Yahoo has multiple partitions and searches a later partition only when there are no results in the first, which might make Yahoo falsely appear to have lower coverage.
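
As a rough illustration of what checking an "identical constraint" could look like, here is a minimal Python sketch. It is not taken from the NCSA script. The function names, the URLs, and the page text are all made up, and it assumes the text of each result has already been fetched and stripped of markup.

```python
import re

def contains_all_keywords(page_text, keywords):
    """Return True if every keyword occurs as a whole word in page_text."""
    words = set(re.findall(r"[a-z0-9']+", page_text.lower()))
    return all(k.lower() in words for k in keywords)

def filter_results(results, keywords):
    """Keep only the results whose text satisfies the shared constraint.

    `results` maps a URL to its page text (a hypothetical structure).
    """
    return {url: text for url, text in results.items()
            if contains_all_keywords(text, keywords)}

# Made-up results for the two-word query "walrus ledger".
engine_a = {
    "http://example.com/1": "The walrus kept a ledger of fish.",
    "http://example.com/2": "A page that only links to walrus photos.",
}
engine_b = {
    "http://example.com/1": "The walrus kept a ledger of fish.",
}

query = ["walrus", "ledger"]
count_a = len(filter_results(engine_a, query))
count_b = len(filter_results(engine_b, query))
print(len(engine_a), len(engine_b))  # raw counts differ
print(count_a, count_b)              # counts after the shared filter
```

In this toy case the raw counts favor the first engine, but once both result sets are held to the same rule the counts are equal.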

Regarding #2: It is entirely possible that there is a bias in the Yahoo crawler. Suppose that it crawled 15 billion 'calendar pages', or Spanish pages, or pages with some other bias. In that case the NCSA study fails for two reasons. First, what if the 'excluded queries' (those with more than 1000 results) all come from Yahoo, and the extra coverage was biased towards those queries? Second, what if the extra content in Yahoo does not contain keywords from the NCSA study? Maybe Yahoo found 15 billion Spanish pages, or image archives with no keywords.
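
The second failure mode can be shown with a toy model in Python. Everything here is invented for illustration: pages are modeled as sets of words, and the "extra" pages stand in for hypothetical Spanish pages and keyword-free image archives. It says nothing about what either engine actually indexes.

```python
def hits(index, query):
    """Count pages in `index` that contain every word of `query`."""
    return sum(1 for page in index if all(w in page for w in query))

# Each page is modeled as the set of words it contains (toy data).
shared = [{"apple", "river"}, {"apple", "stone"}, {"river", "stone"}]

# Hypothetical extra pages: Spanish text and a keyword-free image archive.
extra = [{"manzana", "rio"}, {"piedra", "rio"}, set(), set()]

index_small = shared
index_large = shared + extra

# Two-word queries drawn from an English-only word list.
queries = [("apple", "river"), ("apple", "stone"), ("river", "stone")]

total_small = sum(hits(index_small, q) for q in queries)
total_large = sum(hits(index_large, q) for q in queries)

print(len(index_small), len(index_large))  # index sizes differ
print(total_small, total_large)            # hit totals do not
```

The larger index has more than twice as many pages, yet the query-based comparison cannot see the difference because none of the extra pages match the word list.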

A few improvements to the study would involve:

#1: Post-processing with a different relevance function, such as requiring that pages *must* contain each keyword somewhere in the HTML.

#2: Examining the actual intersection of the remaining results and using this to test a probability model based on some predicted ratio of sizes. Again, this assumes that the crawlers are biased similarly.
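
One possible form of the intersection test is sketched below in Python. This is not the NCSA method. It is a generic overlap-ratio argument, similar in spirit to capture-recapture estimates, and it leans entirely on the "biased similarly" assumption: if results are sampled uniformly from each index, the fraction of A's sample found in B approximates |A and B| / |A|, the fraction of B's sample found in A approximates |A and B| / |B|, and dividing the second by the first approximates |A| / |B|. The function names and URLs are hypothetical.

```python
def overlap_fraction(sample, other_index):
    """Fraction of `sample` that also appears in `other_index`."""
    if not sample:
        raise ValueError("sample must not be empty")
    return sum(1 for url in sample if url in other_index) / len(sample)

def estimate_size_ratio(sample_a, sample_b, in_a, in_b):
    """Estimate |A| / |B| from two samples and two membership sets.

    frac_a_in_b ~ |A and B| / |A| and frac_b_in_a ~ |A and B| / |B|,
    so frac_b_in_a / frac_a_in_b ~ |A| / |B|.
    """
    frac_a_in_b = overlap_fraction(sample_a, in_b)
    frac_b_in_a = overlap_fraction(sample_b, in_a)
    if frac_a_in_b == 0:
        raise ValueError("no overlap observed; ratio is undefined")
    return frac_b_in_a / frac_a_in_b

# Toy indexes: A holds 8 made-up URLs, B holds 4 of them.
index_a = {f"u{i}" for i in range(8)}
index_b = {f"u{i}" for i in range(4, 8)}

# For the toy case the "samples" are simply the full indexes.
ratio = estimate_size_ratio(sorted(index_a), sorted(index_b),
                            index_a, index_b)
print(ratio)  # estimated |A| / |B|
```

A predicted ratio of index sizes could then be compared against an estimate of this kind. If the crawlers are biased differently, the samples are no longer comparable and the estimate says little.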

I am not saying I believe Yahoo has added these pages, but I am saying that this study does not rule out the possibility that Yahoo has in fact indexed nearly 19 billion pages.

-Eric

David Farber (by way of Bernard A. Galler) wrote:

Begin forwarded message:
From: Tim Finin <finin () cs umbc edu>
Date: August 15, 2005 5:29:41 PM EDT
To: Dave Farber <dave () farber net>
Subject: What if Google and Yahoo got into a fight?
Yahoo and Google have been arguing about whose index is bigger.
The disagreement was touched off when Yahoo claimed [1] that its
index provided access to "over 20 billion items".  Google
demurred [2].  Researchers at the University of Illinois NCSA
just announced the results [5] of an experiment designed to settle
the question.  Their scheme used a Perl program [3] to generate
over 10K random two-word queries drawn from words in the
ispell dictionary [4]. Comparing the number of results found by
each engine for these search queries identified a clear winner --
Google [5].
[1] http://www.ysearchblog.com/archives/000172.html
[2] http://battellemedia.com/archives/001790.php
[3] http://vburton.ncsa.uiuc.edu/compare.txt
[4] http://vburton.ncsa.uiuc.edu/wordlist.txt
[5] http://vburton.ncsa.uiuc.edu/indexsize.html
-- Tim Finin, Prof Computer Science & Electrical Engineering, Univ of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore MD 21250. finin () umbc edu +1-410-455-3522 fax:-3969 http://umbc.edu/~finin/ http://ebiquity.umbc.edu/
-------------------------------------
You are subscribed as galler () umich edu
To manage your subscription, go to
 http://v2.listbox.com/member/?listname=ip
Archives at: http://www.interesting-people.org/archives/interesting-people/



