Interesting People mailing list archives

How big is the web


From: David Farber <farber () central cis upenn edu>
Date: Wed, 15 Dec 1993 14:17:03 -0500

The Uniform Resource Locator for this document is:
http://www.mit.edu:8001/afs/sipb/user/mkgray/ht/web-growth.html


Measuring the Size and Growth of the Web


The World Wide Web Wanderer


The Web is BIG


The World Wide Web started to grow in 1990. In the three years since
it was begun, the volume of information available has increased
enormously. In addition to the ability to access preexisting network
resources such as WAIS, FTP, and Gopher, the mass of material
available via HTTP has become a vast resource of its own. Simply
wandering the web with one of the many browsers available makes it
clear that the web is huge, but it is very difficult to actually
estimate the size of the web in this fashion. Due to the structure of
the web, many documents are obscure and only reachable through a thin
trail of documents with few other links to them. This leaves the
question, what is a realistic, quantitative measure of the size of the
World Wide Web?


How does one find out how BIG?


In an attempt to answer this question I wrote an automaton that would
do a limited parsing of HTML and would attempt to do a depth-first
search of the Web. The automaton became known as the World Wide Web
Wanderer, or W4. The Wanderer, a script written in Perl, wandered the
Web for many hours, accumulating URLs and new site names. The
Wanderer did not follow any links that were not HTTP sites, and had
special cases to exclude certain vast gatewayed structures
such as Techinfo and AFS. Though these are certainly part of the Web,
it is not desirable to have them included in a scale estimate of the
Web for two reasons. First, I am not including ftp or wais or any
other 'large' structures available over the web, so including AFS or
Techinfo would be a similar misrepresentation. The second concern is
duration of the wandering. Wandering the entire Techinfo tree would be
a project in itself. Further complicating the search, a number of WWW
servers provide complete access to local directory trees, which are
large and not representative of the number of documents available on
the Web. In an attempt to avoid these medium-scale structures, I
incorporated a 'boredom' factor into the Wanderer that would
attempt to recognize directory-tree-like structures and skip parts of
them, to expedite the search and to avoid misrepresenting the number
of documents on the Web.
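
The approach described above can be sketched roughly as follows. This
is not the Wanderer's code (its Perl source is not distributed); it is
a minimal Python illustration under stated assumptions. The names
TOY_WEB, fetch, wander and BOREDOM_LIMIT are hypothetical, the toy
in-memory "web" stands in for real HTTP requests, and the 'boredom'
factor is approximated here as a simple cap on the number of documents
visited per directory on a host, which may differ from the heuristic
actually used.

```python
import re
from urllib.parse import urljoin, urldefrag, urlparse

# Hypothetical toy "web": URL -> HTML body. Stands in for real HTTP fetches.
TOY_WEB = {
    "http://a.example/": (
        '<a href="/one.html">one</a> '
        '<a href="http://b.example/">b</a> '
        '<a href="ftp://files.example/pub/">ftp</a>'
    ),
    "http://a.example/one.html": '<A HREF="http://a.example/">home</A>',
    "http://b.example/": '<a href="/dir/">listing</a>',
    # A directory-tree-like page: fifty links into the same directory.
    "http://b.example/dir/": "".join(
        f'<a href="/dir/f{i}.html">f{i}</a>' for i in range(50)
    ),
}

# "Limited parsing of HTML": only pull out double-quoted href values.
HREF = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)

BOREDOM_LIMIT = 5  # assumed cap on documents visited per (host, directory)


def fetch(url):
    """Return the body for a URL, or an empty string if it is unknown."""
    return TOY_WEB.get(url, "")


def wander(start, boredom_limit=BOREDOM_LIMIT):
    """Depth-first walk over HTTP links; returns (documents, sites)."""
    documents, sites, per_directory = set(), set(), {}
    stack = [start]
    while stack:
        url = stack.pop()
        if url in documents:
            continue
        parts = urlparse(url)
        if parts.scheme != "http":
            continue  # do not follow FTP, WAIS, Gopher, etc.
        directory = (parts.netloc, parts.path.rsplit("/", 1)[0])
        if per_directory.get(directory, 0) >= boredom_limit:
            continue  # "bored": skip the rest of this directory
        per_directory[directory] = per_directory.get(directory, 0) + 1
        documents.add(url)
        sites.add(parts.netloc)
        links = [urldefrag(urljoin(url, h)).url for h in HREF.findall(fetch(url))]
        # Push in reverse so the first link on the page is explored first.
        stack.extend(reversed(links))
    return documents, sites


documents, sites = wander("http://a.example/")
print(f"{len(documents)} documents on {len(sites)} sites")
```

On this toy data the walk counts only five documents under
http://b.example/dir/ before losing interest, and it never follows the
ftp link, so the totals deliberately undercount what is reachable.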


How BIG is the Web?


Finally, the Wanderer was complete enough and intelligent enough to
wander the Web and produce a reasonable estimate of the size of the
Web. In total, when run in June of 1993, the Wanderer found more than
100 HTTP sites and a total of well over two hundred thousand
documents. This count of course reflects all of the reductions mentioned
above, so the actual size of the Web is far greater.
Additionally, new sites are being added to the Web every week, with many
thousands of new documents becoming available every month.


How fast is it growing?


The Web is growing fast. Further runs of the Wanderer found out how
fast. (Additionally, I changed it to do a breadth-first search.) I
reran the Wanderer in September 1993, and it found just over 200
sites. I ran it again in November 1993 and it found over 270
sites. (available soon) This represents a growth rate of roughly one
new web site every day since June 1993. Lately, it looks like it may be
growing even faster than that.
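
As a rough check of that rate, the arithmetic can be worked through
with the June 5 and November 18 entries from the table of Wanderer
results further down. This is only an illustrative calculation; the
variable names are made up for the example.

```python
from datetime import date

# Two data points taken from the table of Wanderer results.
first_run = (date(1993, 6, 5), 130)
latest_run = (date(1993, 11, 18), 272)

days = (latest_run[0] - first_run[0]).days
new_sites = latest_run[1] - first_run[1]
rate = new_sites / days

print(f"{new_sites} new sites in {days} days: {rate:.2f} sites per day")
# prints "142 new sites in 166 days: 0.86 sites per day"
```

That is a little under one new site per day over the period.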


I have recently made some refinements to the Wanderer. The most recent
run, for December 1993, has collected 623 sites. Wow. I'm impressed.


Really though, it's even bigger than that


Additionally, the Wanderer could not accurately measure the volume of
information available via the many searchable indexes available on the
Web. When one then considers all the services that have been gatewayed
to the Web, the ability to access the massive information resources in
gopher, ftp and wais, and that the Web is only three years old, the
incredible size and growth rate of the Web becomes apparent.


Sites not on my list


If your site is not on my list, feel free to send me mail, but I can
make no guarantee it will be added promptly. If your site is new, I
may not have run the Wanderer since your site came up, so please be
patient while the Wanderer finds you. If your site has been up for
many months, feel free to send me mail.


Source to the Wanderer


I'm not giving it out because it's a little messy and I'm embarrassed
to distribute such sloppy code. Plus, it probably has some stupid bugs
in it that I'd be mortified to let the world see.


Table of Wanderer Results


Date            Number of sites
Jun  5, 1993            130
Sep  ?, 1993            204
Oct 15, 1993            212
Oct 25, 1993            228
Nov 18, 1993            272 (available soon)
Dec 13, 1993            623


The Future of the Wanderer


Once I get a little bit of free time (hopefully January) I will be
doing substantially more hacking with the Wanderer. Watch this space.
Suggestions are welcome. I only have so much time to work on this,
unfortunately. If you want to give me a job doing this, let me know.
:-)




Matthew Gray, mkgray () mit edu

