Interesting People mailing list archives
How big is the web
From: David Farber <farber () central cis upenn edu>
Date: Wed, 15 Dec 1993 14:17:03 -0500
The Uniform Resource Locator for this document is:
http://www.mit.edu:8001/afs/sipb/user/mkgray/ht/web-growth.html

Measuring the Size and Growth of the Web

The World Wide Web Wanderer

The Web is BIG

The World Wide Web started to grow in 1990. In the three years since it began, the volume of information available has increased enormously. In addition to the ability to access preexisting network resources such as WAIS, FTP, and Gopher, the mass of material available via HTTP has become a vast resource of its own.

Simply wandering the Web with one of the many browsers available makes it clear that the Web is huge, but it is very difficult to actually estimate its size in this fashion. Due to the structure of the Web, many documents are obscure and only reachable through a thin trail of documents with few other links to them. This leaves the question: what is a realistic, quantitative measure of the size of the World Wide Web?

How does one find out how BIG?

In an attempt to answer this question I wrote an automaton that would do a limited parsing of HTML and would attempt a depth-first search of the Web. The automaton became known as the World Wide Web Wanderer, or W4. The Wanderer, a script written in Perl, wandered the Web for many hours, accumulating URLs and new site names.

The Wanderer did not follow any links that were not HTTP sites, and had special cases to exclude certain vast gatewayed structures such as Techinfo and AFS. Though these are certainly part of the Web, it is not desirable to include them in a scale estimate of the Web, for two reasons. First, I am not including FTP or WAIS or any other 'large' structures available over the Web, so including AFS or Techinfo would be a similar misrepresentation. The second concern is the duration of the wandering: wandering the entire Techinfo tree would be a project in itself.
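The Wanderer's source is not distributed, so the following is only an illustrative sketch in Python (not the original Perl) of the approach described above: a depth-first walk that does a limited parse of HTML for links, follows only HTTP URLs, and skips a list of excluded gateway prefixes. The toy in-memory web, the `example` host names, `SKIP_PREFIXES`, and the `wander` function are all assumptions made for the example; a real crawler would fetch pages over the network.

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical miniature "web": URL -> HTML body (stands in for real HTTP fetches).
TOY_WEB = {
    "http://a.example/": '<a href="/one.html">1</a> <A HREF="http://b.example/">b</A>',
    "http://a.example/one.html": '<a href="ftp://files.example/pub/">ftp</a> <a href="/">home</a>',
    "http://b.example/": '<a href="http://gateway.example/afs/">afs</a> <a href="two.html">2</a>',
    "http://b.example/two.html": "no links here",
    "http://gateway.example/afs/": '<a href="http://c.example/">hidden</a>',
}

# Assumed exclusion list, standing in for the special cases for gatewayed structures.
SKIP_PREFIXES = ("http://gateway.example/afs/",)

# "Limited parsing of HTML": only double-quoted href attributes are recognized.
HREF = re.compile(r'href\s*=\s*"([^"]+)"', re.IGNORECASE)


def wander(start, fetch=TOY_WEB.get):
    """Depth-first walk from `start`; returns (documents seen, site names seen)."""
    seen, sites = set(), set()
    stack = [start]
    while stack:
        url = stack.pop()
        if url in seen or url.startswith(SKIP_PREFIXES):
            continue
        if urlparse(url).scheme != "http":
            continue  # do not follow FTP, WAIS, Gopher, etc.
        html = fetch(url)
        if html is None:
            continue  # unreachable document
        seen.add(url)
        sites.add(urlparse(url).netloc)
        links = [urljoin(url, href) for href in HREF.findall(html)]
        # Reversed so that the first link on a page is explored first.
        stack.extend(reversed(links))
    return seen, sites


docs, sites = wander("http://a.example/")
print(len(docs), "documents on", len(sites), "sites")  # prints: 4 documents on 2 sites
```

In this toy run the FTP link and everything behind the excluded gateway are left out of the count, which is the same kind of deliberate undercount the text describes.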
Further complicating the search, a number of WWW servers provide complete access to local directory trees, which are large and not representative of the number of documents available on the Web. In an attempt to avoid these medium-scale structures, I incorporated a 'boredom' factor into the Wanderer that would try to recognize directory-tree-like structures and skip parts of them, both to expedite the search and to avoid misrepresenting the number of documents on the Web.

How BIG is the Web?

Finally, the Wanderer was complete enough and intelligent enough to wander the Web and produce a reasonable estimate of its size. In total, when run in June of 1993, the Wanderer found more than 100 HTTP sites and a total of well over two hundred thousand documents. This of course includes all of the reductions mentioned above, indicating that the actual size of the Web is far greater. Additionally, new sites are being added to the Web every week, with many thousands of new documents becoming available every month.

How fast is it growing?

The Web is growing fast. Further runs of the Wanderer found out how fast. (Additionally, I changed it to do a breadth-first search.) I reran the Wanderer in September 1993, and it found just over 200 sites. I ran it again in November 1993 and it found over 270 sites (available soon). This represents a growth rate of roughly one new web site every day since June 1993. Lately, it looks like it may be growing even faster than that. I have recently made some refinements to the Wanderer. The most recent run, for December 1993, has collected 623 sites. Wow. I'm impressed.

Really though, it's even bigger than that

Additionally, the Wanderer could not accurately measure the volume of information available via the many searchable indices available on the Web.
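As a rough check on the growth figures, the arithmetic can be sketched in Python using the dated counts from the Table of Wanderer Results below. The September run is left out because its day is not given; the `RUNS` list and the `sites_per_day` helper are names invented for this example.

```python
from datetime import date

# Dated runs copied from the Table of Wanderer Results.
# The September run is omitted because its day is unknown ("Sep ?").
RUNS = [
    (date(1993, 6, 5), 130),
    (date(1993, 10, 15), 212),
    (date(1993, 10, 25), 228),
    (date(1993, 11, 18), 272),
    (date(1993, 12, 13), 623),
]


def sites_per_day(earlier, later):
    """Average number of new sites per day between two (date, count) runs."""
    (d0, n0), (d1, n1) = earlier, later
    return (n1 - n0) / (d1 - d0).days


june_to_nov = sites_per_day(RUNS[0], RUNS[3])  # 142 sites over 166 days
nov_to_dec = sites_per_day(RUNS[3], RUNS[4])   # 351 sites over 25 days
print(f"Jun 5 -> Nov 18: {june_to_nov:.2f} sites/day")   # 0.86
print(f"Nov 18 -> Dec 13: {nov_to_dec:.2f} sites/day")   # 14.04
```

The June-to-November figure matches the "one new site every day" estimate. The December figure should be read with care: the Wanderer was refined before that run, so some of the jump may reflect better discovery and not only new sites.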
When one then considers all the services that have been gatewayed to the Web, the ability to access the massive information resources in Gopher, FTP, and WAIS, and that the Web is only three years old, the incredible size and growth rate of the Web become apparent.

Sites not on my list

If your site is not on my list, feel free to send me mail, but I can make no guarantee it will be added promptly. If your site is new, I may not have run the Wanderer since your site came up, so please be patient while the Wanderer finds you. If your site has been up for many months, feel free to send me mail.

Source to the Wanderer

I'm not giving it out because it's a little messy and I'm embarrassed to distribute such sloppy code. Plus, it probably has some stupid bugs in it that I'd be mortified to let the world see.

Table of Wanderer Results

Date            Number of sites
Jun 5, 1993     130
Sep ?, 1993     204
Oct 15, 1993    212
Oct 25, 1993    228
Nov 18, 1993    272 (available soon)
Dec 13, 1993    623

The Future of the Wanderer

Once I get a little bit of free time (hopefully January) I will be doing substantially more hacking on the Wanderer. Watch this space. Suggestions are welcome. I only have so much time to work on this, unfortunately. If you want to give me a job doing this, let me know. :-)

Matthew Gray, mkgray () mit edu