Interesting People mailing list archives

American language standardized dictionary for text compression


From: David Farber <farber () central cis upenn edu>
Date: Fri, 25 Feb 1994 17:14:16 -0500

From: Sig () seuss vantage gte com
Subject: American language standardized dictionary for text compression
Sender: pgut1 () cs aukuni ac nz (Peter Gutmann)
Organization: GTE
Date: Fri, 25 Feb 1994 03:18:24 GMT

As an aid to those involved in natural language parsing, dictionary compression,
or textual encryption, I have been collecting and compiling a lengthy list of
words.  It is expected that a comprehensive standardized dictionary will
eventually result.  This dictionary should contain most common American words,
abbreviations, hyphenations, and even incorrect spellings.


An anonymous FTP server has been set up on wocket.vantage.gte.com.  It holds the
following files in the pub/standard_dictionary directory:


                    words         bytes


-r--r--r--                       8552448 Jan 28 12:00 dic-0194.tar
-r--r--r--                       4058075 Jan 28 12:02 dic-0194.tar.Z
-r--r--r--                       8880128 Feb 24 10:39 dic-0294.tar
-r--r--r--                       4220442 Feb 24 10:41 dic-0294.tar.Z
-r--r--r--                       1269760 Aug 16  1993 dic-0893.tar
-r--r--r--                        523393 Aug 16  1993 dic-0893.tar.Z
-r--r--r--                        421239 Aug 16  1993 dic-0893.zip
-r--r--r--                       3186688 Sep 17 08:26 dic-0993.tar
-r--r--r--                       1503561 Sep 17 09:27 dic-0993.tar.Z
-r--r--r--                       7479296 Oct 26 17:29 dic-1093.tar
-r--r--r--                       3516519 Oct 26 17:32 dic-1093.tar.Z
-r--r--r--                       8273920 Dec 17 11:58 dic-1293.tar
-r--r--r--                       3918385 Dec 17 11:59 dic-1293.tar.Z


-r--r--r--            1022          4088 Feb 24 10:37 length02.txt
-r--r--r--           21225        106125 Feb 24 10:37 length03.txt
-r--r--r--           52657        315940 Feb 24 10:37 length04.txt
-r--r--r--           83336        583349 Feb 24 10:37 length05.txt
-r--r--r--          113449        907655 Feb 24 10:37 length06.txt
-r--r--r--          123546       1111907 Feb 24 10:37 length07.txt
-r--r--r--          134549       1345480 Feb 24 10:37 length08.txt
-r--r--r--           94474       1039205 Feb 24 10:37 length09.txt
-r--r--r--           73793        885502 Feb 24 10:37 length10.txt
-r--r--r--           55147        716900 Feb 24 10:37 length11.txt
-r--r--r--           39799        557185 Feb 24 10:37 length12.txt
-r--r--r--           26870        403037 Feb 24 10:37 length13.txt
-r--r--r--           17801        284816 Feb 24 10:37 length14.txt
-r--r--r--           11525        195925 Feb 24 10:37 length15.txt
-r--r--r--            7228        130104 Feb 24 10:37 length16.txt
-r--r--r--            4559         86621 Feb 24 10:37 length17.txt
-r--r--r--            2894         57880 Feb 24 10:37 length18.txt
-r--r--r--            1871         39291 Feb 24 10:37 length19.txt
-r--r--r--            1196         26312 Feb 24 10:37 length20.txt
-r--r--r--             784         18032 Feb 24 10:37 length21.txt
-r--r--r--             562         13488 Feb 24 10:37 length22.txt
-r--r--r--             363          9075 Feb 24 10:37 length23.txt
-r--r--r--             240          6240 Feb 24 10:37 length24.txt
-r--r--r--             160          4320 Feb 24 10:37 length25.txt
-r--r--r--             106          2968 Feb 24 10:37 length26.txt
-r--r--r--              70          2030 Feb 24 10:37 length27.txt
-r--r--r--               1            30 Feb 24 10:37 length28.txt
-r--r--r--               0             0 Feb 24 10:37 length29.txt
-r--r--r--               0             0 Feb 24 10:37 length30.txt
-r--r--r--               0             0 Feb 24 10:37 length31.txt
-r--r--r--               1            34 Feb 24 10:37 length32.txt


                    869228       8853539 total


-r--r--r--                         11521 Aug 13  1993 tarread.com


The most recent compilation, dic-0294.tar, comprises the 31 text files listed
above (length02.txt through length32.txt) and may be unpacked on an MS-DOS
computer using the tarread.com utility program.
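
The per-length layout of the listing (one file per word length, length02.txt
through length32.txt) can be illustrated with a short Python sketch.  This is
only a guess at how such files might be produced, not a description of how the
dictionary is actually built: the function name, the one-word-per-line
assumption, and the de-duplication step are all hypothetical.

```python
from collections import defaultdict

def bucket_by_length(words, min_len=2, max_len=32):
    """Group unique words by character count.

    Returns a dict mapping a file name in the lengthNN.txt style to the
    sorted list of words of that length. The 2..32 range mirrors the file
    names in the listing; everything else here is an assumption.
    """
    buckets = defaultdict(set)
    for word in words:
        word = word.strip()          # drop surrounding whitespace/newlines
        if min_len <= len(word) <= max_len:
            buckets[len(word)].add(word)
    # Keys are unique integers, so sorting the items orders by length.
    return {f"length{n:02d}.txt": sorted(ws) for n, ws in sorted(buckets.items())}
```

Called on a small list such as ["at", "cat", "dog", "cat"], it returns one
entry for length02.txt and one for length03.txt, with the duplicate "cat"
stored once.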


Any words for inclusion in future dictionaries should be submitted to my E-Mail
address directly or placed in the /pub/incoming directory.  Please compare your
dictionaries with standard Unix 'words' and submit only the differences.  Many
thanks to those who have submitted the 32,000 words during the last month.
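
The request to submit only the differences from the standard Unix 'words' file
amounts to a set difference.  The sketch below is one way to compute it in
Python.  The path /usr/share/dict/words is a common location for that file but
varies by system, and the file name my_words.txt and both function names are
hypothetical.

```python
def load_words(path):
    """Read one word per line from a text file, ignoring blank lines."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return {line.strip() for line in f if line.strip()}

def new_words(candidates, reference):
    """Return the candidate words that are absent from the reference list.

    The comparison is exact and case-sensitive, so "Apple" and "apple"
    count as different words.
    """
    return sorted(set(candidates) - set(reference))

# Assumed usage (paths are examples only):
#   mine = load_words("my_words.txt")
#   ref = load_words("/usr/share/dict/words")
#   for word in new_words(mine, ref):
#       print(word)
```

Whether case should be folded before comparing is a choice left to the
submitter; the sketch keeps the words exactly as they appear in each file.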


Take care.


         - Sig


Sigurd P. Crossland
Advanced Technology Lab                   Telephone: (703) 818-8504
GTE                                       Facsimile: (703) 802-3110
15000 Conference Center Drive             Internet: sig () seuss vantage gte com
Chantilly, VA   22021                     Home: (703) 818-8942

