Educause Security Discussion mailing list archives

Re: Has anyone looked at digital archiving?


From: stanislav shalunov <shalunov () INTERNET2 EDU>
Date: Wed, 12 Apr 2006 18:14:00 -0400

Jim,

Just some random notes first:

* Digital signatures with today's security levels will likely be quite
  useless in 40 years.

* Error-correcting codes are a storage space optimization technique
  (you can always just have multiple copies), and storage space
  typically is not a problem, so they have little or nothing to
  contribute here.

* No electronic storage medium is known to physically retain the bits
  for 40 years.  CD-Rs last for 2--5 years after being burned.  DVD-Rs
  are shorter-lived.  Tapes vary widely and depend on the tape type.
  Regular hard disks might last about 10 years on the shelf.

* The only format that exists today that has a claim to having been
  around for 40 years is plain text.  Even there, the changes have
  been no less significant proportionally to the complexity of the
  format than in other formats; it's just that the format is so
  trivial (just a sequence of letters, each represented with a
  fixed-sized block of bits, with special values for space and newline
  and so forth) that recoding is just as trivial (character set
  changes, byte size changes, newline and other special character
  encoding, etc.).
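
To illustrate how trivial plain-text recoding is, here is a minimal
sketch in Python.  The function name recode_text and the choice of
UTF-8 with LF newlines as the target are assumptions made for the
example, not a recommendation from any standard; the source encoding
has to be known or guessed by whoever runs the migration.

```python
def recode_text(data: bytes, source_encoding: str) -> bytes:
    """Recode legacy plain text to UTF-8 with LF line endings.

    `source_encoding` is any codec name Python knows, for example
    "latin-1" or "cp037" (an EBCDIC variant).
    """
    # Character set change: interpret the old bytes as characters.
    text = data.decode(source_encoding)
    # Newline change: CRLF (DOS) and bare CR (old Mac) both become LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Write the characters back out in the target encoding.
    return text.encode("utf-8")


if __name__ == "__main__":
    old = b"caf\xe9\r\nsecond line\r\n"   # Latin-1 text with CRLF newlines
    print(recode_text(old, "latin-1"))
```

The whole migration is a decode, two string replacements, and an
encode, which is the sense in which the format is trivial.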

My best solution for being able to read a Word document in 40 years would
be to print it out in a few copies on a black-and-white laser printer
making sure the paper is alkaline (you can test that with a $5
device---I use something called Abbey pH pen, which is just a marker
with a solution of chlorophenol red instead of a dye) and the fusing
is done at high enough temperature and the finish is compatible
(that's easy to check with a regular eraser: if adhesion is poor,
you'll be able to make the characters fainter or even erase parts of
them; with good adhesion, an eraser will have no effect until it
starts ripping paper).  Then store it in your library and in another
library.  A simple and cost-effective way to store a collection of
documents in another good library (but only assuming the documents are
not meant to be highly proprietary) is to register your copyright and
submit a bound copy of the collection to the Library of Congress (the
fee for indefinite storage is about $30---a real bargain).

If for some strange and irrational reason it is desirable to keep
archival documents exclusively electronic, then one would have to keep
multiple copies, keep them in constant rotation, updating media and
formats and never letting anything sit for more than a few years.
Keeping all the versions (e.g., Word 6, Word 7, Word 8, ... Word 55,
etc.) is probably advisable.  Using the simplest possible formats is
also good.  Plain text beats anything else for simplicity.  The worst
would be proprietary, frequently and incompatibly changing formats
such as DOC or PDF.  Things like DVI, HTML, and XML would be in the
middle.  I would not trust PDF/A at all (too young, far too complex,
and not at all implemented).  For integrity checking, one might
compute one-way cryptographically secure hashes using the strongest
technology today (so, SHA512 for now and probably something else,
perhaps substantially algorithmically different, 20 years from now)
and keep files of these hashes around for each directory.  I'd really,
really want to print all these out, but conceptually, one might
compute a SHA512 on those files and keep higher level files around and
so forth and only keep secure copies (written down, printed out,
memorized, whatever) of a few topmost levels of the hierarchy.
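
The hierarchy of hash files could be sketched roughly as follows in
Python.  This is only an illustration under stated assumptions: the
manifest file name SHA512SUMS, the "hash, two spaces, name" line
layout, and the function names are all made up for the example.  Each
directory gets a manifest listing the SHA-512 of its files and of the
manifests of its immediate subdirectories, so the hash of the topmost
manifest covers everything beneath it.

```python
import hashlib
import os

MANIFEST = "SHA512SUMS"  # hypothetical manifest file name


def sha512_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 of a file, reading it in chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifests(root):
    """Write one manifest per directory and return the top-level hash.

    Directories are visited bottom-up, so every subdirectory already
    has its manifest by the time its parent is processed.
    """
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        lines = []
        for name in sorted(filenames):
            if name == MANIFEST:
                continue  # never hash a stale manifest into itself
            digest = sha512_file(os.path.join(dirpath, name))
            lines.append(f"{digest}  {name}")
        for sub in sorted(dirnames):
            sub_path = os.path.join(dirpath, sub)
            if os.path.islink(sub_path):
                continue  # os.walk does not descend into symlinks
            digest = sha512_file(os.path.join(sub_path, MANIFEST))
            lines.append(f"{digest}  {sub}/{MANIFEST}")
        out_path = os.path.join(dirpath, MANIFEST)
        with open(out_path, "w", encoding="utf-8", newline="\n") as out:
            out.write("\n".join(lines) + "\n")
    # This single value is the one worth printing out or writing down.
    return sha512_file(os.path.join(root, MANIFEST))
```

Calling write_manifests on the archive root returns one 128-character
hex string.  If any file anywhere below changes, rerunning it yields a
different string, so only that top value (or the top few levels of
manifests) needs to be kept in a form that cannot silently change.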

--
Stanislav Shalunov              http://www.internet2.edu/~shalunov/

Just my 0.086g of Ag.
