Interesting People mailing list archives

IP: Frequency of top 1,000 USENET words


From: Dave Farber <farber () cis upenn edu>
Date: Sun, 27 Dec 1998 03:23:21 -0500



From: Mike Radow <mradow () inx inx net>
-
It is hoped that this will be useful to others...

In building "word-to-token" compressed files of technical text, we've had
good experience with this file. 
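
As a rough illustration of the "word-to-token" idea (a minimal sketch, not the encoder we actually use): assume each listed word is replaced by its rank in the frequency list, and any other word passes through unchanged. The names TOP_WORDS, build_table, encode and decode are hypothetical, and only the ten most frequent words from the list are included here.

```python
from typing import Dict, List, Union

# Assumption: a ten-word excerpt of the list, in rank order.
# The real file has 1,000 entries and is case-sensitive ("I", "Has").
TOP_WORDS: List[str] = ["the", "to", "of", "a", "I",
                        "and", "is", "in", "that", "it"]

Token = Union[int, str]


def build_table(words: List[str]) -> Dict[str, int]:
    """Map each word to its rank, so frequent words get the smallest tokens."""
    return {word: rank for rank, word in enumerate(words)}


def encode(text: str, table: Dict[str, int]) -> List[Token]:
    """Replace listed words with integer tokens; keep other words literally."""
    return [table.get(word, word) for word in text.split()]


def decode(tokens: List[Token], words: List[str]) -> str:
    """Invert encode(): integer tokens index back into the word list."""
    return " ".join(words[t] if isinstance(t, int) else t for t in tokens)


table = build_table(TOP_WORDS)
tokens = encode("the size of a file", table)
restored = decode(tokens, TOP_WORDS)
```

Splitting on whitespace means the round trip is only exact for single-spaced text; a real encoder would also need to handle punctuation and spacing.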

We've used it for several years, and its word distribution is a good fit
for that of our text.

Unlike other "general text" frequencies, this list was generated from
USENET traffic.

My sincere thanks to Lee Maixner for locating this URL:

 Linkname: top1000.use

 URL: http://wiretap.spies.com/Gopher/Library/Article/Language/top1000.use

/\/\...snipped...
Date: Tue, 19 Jan 1993 20:43:44 GMT
Subject: Re: Top 1000 English words ...

Top 1000 English words

Culled from one year of USENET traffic, here is my list of the top 1000
words, along with percentage of occurrence (this is from a database of
343945617 total scanned words).

--
Rick Walker

4.01838 the
2.43805 to
2.05957 of
1.95582 a
1.70176 I
1.68549 and
1.32531 is
1.23345 in
1.14749 that
0.811128 it

..

0.0109892 science
0.0109852 interface
0.010977 Americans
0.0109578 action
0.0109552 entire
0.0109494 below
0.0109288 Has
\/\/
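
For anyone who wants to work with the listing, here is a minimal sketch, assuming each entry is a percentage, whitespace, and a word, as in the excerpt above. The function names parse_line and approx_count are hypothetical; the total is the scanned-word count quoted in the message, and the counts are only estimates because the percentages are rounded.

```python
from typing import List, Tuple

TOTAL_WORDS = 343945617  # total scanned words, as quoted in the message

# The ten most frequent entries, copied from the excerpt above.
EXCERPT = """\
4.01838 the
2.43805 to
2.05957 of
1.95582 a
1.70176 I
1.68549 and
1.32531 is
1.23345 in
1.14749 that
0.811128 it
"""


def parse_line(line: str) -> Tuple[float, str]:
    """Split one 'percentage word' entry into (percentage, word)."""
    percent, word = line.split()
    return float(percent), word


def approx_count(percent: float, total: int = TOTAL_WORDS) -> int:
    """Estimate how many times a word occurred from its percentage."""
    return round(percent / 100.0 * total)


entries: List[Tuple[float, str]] = [
    parse_line(line) for line in EXCERPT.splitlines() if line.strip()
]

# Share of all scanned words covered by these ten entries, in percent.
coverage = sum(percent for percent, _ in entries)

for percent, word in entries:
    print(f"{word:>5}  {percent:8.5f}%  ~{approx_count(percent):,} occurrences")
print(f"top {len(entries)} words cover {coverage:.2f}% of the text")
```

The coverage figure gives a quick sense of how much of a text a small token table can reach.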

Mike
-
Mike Radow  <--->  mradow () inx net


