Educause Security Discussion mailing list archives

Re: Has anyone looked at digital archiving?


From: Brad Judy <Brad.Judy () COLORADO EDU>
Date: Thu, 13 Apr 2006 12:52:53 -0600

It's funny that you mention the Rosetta Stone since the Rosetta Project
(http://www.rosettaproject.org/) was faced with the challenge of a
near-permanent (i.e. thousands of years) archive of all of the languages
of the world.  They selected micro-engraving in a metal disk since
optical magnification will likely be a technology that any society would
have or create.  It would be interesting to hear what they are doing
with their electronic language database.  Of course, I doubt any of us
have that kind of retention goal.  :)

Brad Judy 

-----Original Message-----
From: Stewart, Ian [mailto:istewart () UMASSP EDU] 
Sent: Thursday, April 13, 2006 8:06 AM
To: SECURITY () LISTSERV EDUCAUSE EDU
Subject: Re: [SECURITY] Has anyone looked at digital archiving?

Don't forget the Rosetta Stone and film.

-----Original Message-----
From: stanislav shalunov [mailto:shalunov () INTERNET2 EDU]
Sent: Wednesday, April 12, 2006 6:14 PM
To: SECURITY () LISTSERV EDUCAUSE EDU
Subject: Re: [SECURITY] Has anyone looked at digital archiving?

Jim,

Just some random notes first:

* Digital signatures with today's security levels will likely be quite
  useless in 40 years.

* Error-correcting codes are a storage space optimization technique
  (you can always just have multiple copies), and storage space
  typically is not a problem, so they have little or nothing to
  contribute here.

* No electronic storage medium is known to physically retain the bits
  for 40 years.  CD-Rs last for 2--5 years after being burned.  DVD-Rs
  are shorter-lived.  Tapes vary widely and depend on the tape type.
  Regular hard disks might last about 10 years on the shelf.

* The only format that exists today that has a claim to having been
  around for 40 years is plain text.  Even there, the changes have
  been no less significant proportionally to the complexity of the
  format than in other formats; it's just that the format is so
  trivial (just a sequence of letters, each represented with a
  fixed-sized block of bits, with special values for space and newline
  and so forth) that recoding is just as trivial (character set
  changes, byte size changes, newline and other special character
  encoding, etc.).
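
The point about plain-text recoding being trivial can be illustrated
with a minimal sketch.  The function name recode_text and the choice
of EBCDIC (Python's cp037 codec) as the legacy character set are
assumptions made for illustration, not anything named in the message
above; the sketch only shows that a character set change plus newline
normalization fits in a few lines.

```python
def recode_text(raw: bytes, source_encoding: str = "cp037") -> bytes:
    """Decode bytes from a legacy character set and re-encode as UTF-8.

    Line endings are normalized to a single LF along the way.
    The default source encoding (cp037, an EBCDIC variant) is only an
    illustrative assumption; pass whatever the archive actually used.
    """
    text = raw.decode(source_encoding)
    # Normalize CRLF pairs first, then any remaining bare CR, to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


if __name__ == "__main__":
    legacy = "Hello\r\nWorld".encode("cp037")  # pretend this came off an old tape
    print(recode_text(legacy))
```

The same call with source_encoding="latin-1" or "utf-16" covers the
other character set and byte size changes mentioned above; anything
more complex than plain text would need a real format converter.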

My best solution for being able to read a Word document in 40 
years would be to print it out in a few copies on a 
black-and-white laser printer making sure the paper is 
alkaline (you can test that with a $5 device---I use 
something called Abbey pH pen, which is just a marker with a 
solution of chlorophenol red instead of a dye) and the fusing 
is done at high enough temperature and the finish is 
compatible (that's easy to check with a regular eraser: if 
adhesion is poor, you'll be able to make the characters 
fainter or even erase parts of them; with good adhesion, an 
eraser will have no effect until it starts ripping paper).  
Then store it in your library and in another library.  A 
simple and cost-effective way to store a collection of 
documents in another good library (but only assuming the 
documents are not meant to be highly proprietary) is to 
register your copyright and submit a bound copy of the 
collection to the Library of Congress (the fee for indefinite 
storage is about $30---a real bargain).

If for some strange and irrational reason it is desirable to 
keep archival documents exclusively electronic, then one 
would have to keep multiple copies, keep them in constant 
rotation, updating media and formats and never letting 
anything sit for more than a few years.  Keeping all the versions 
(e.g., Word 6, Word 7, Word 8, ... Word 55, etc.) is probably 
advisable.  Using the simplest possible 
formats is also good.  Plain text beats anything else for 
simplicity.  The worst would be proprietary, frequently and 
incompatibly changing formats such as DOC or PDF.  Things 
like DVI, HTML, and XML would be in the middle.  I would not 
trust PDF/A at all (too young, far too complex, and not at 
all implemented).  For integrity checking, one might compute 
one-way cryptographically secure hashes using the strongest 
technology today (so, SHA512 for now and probably something 
else, perhaps substantially algorithmically different, 20 
years from now) and keep files of these hashes around for 
each directory.  I'd really, really want to print all these 
out, but conceptually, one might compute a SHA512 on those 
files and keep higher level files around and so forth and 
only keep secure copies (written down, printed out, 
memorized, whatever) of a few topmost levels of the hierarchy.
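
The hash hierarchy described above might be sketched as follows.  This
is only one possible reading of the scheme: the manifest file name
SHA512SUMS, the function names, and the "hash, two spaces, name" line
layout are all assumptions chosen for illustration.  Each directory
gets a manifest listing the SHA-512 of its files and of the manifests
of its subdirectories, so the hash of the topmost manifest covers the
whole tree and is the only value that needs a secure offline copy.

```python
import hashlib
import os

MANIFEST_NAME = "SHA512SUMS"  # hypothetical file name, not from the message


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifests(root):
    """Write a manifest into every directory under root, bottom-up.

    Returns the SHA-512 of the top-level manifest, which transitively
    depends on every file in the tree.
    """
    root = os.fspath(root)
    manifest_hash = {}  # directory path -> hash of that directory's manifest
    # topdown=False visits subdirectories before their parents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        lines = []
        for name in sorted(filenames):
            if name == MANIFEST_NAME:
                continue  # a manifest does not list itself
            file_hash = sha512_of_file(os.path.join(dirpath, name))
            lines.append(f"{file_hash}  {name}")
        for name in sorted(dirnames):
            sub = os.path.join(dirpath, name)
            if sub in manifest_hash:  # skips symlinked dirs that were not walked
                lines.append(f"{manifest_hash[sub]}  {name}/{MANIFEST_NAME}")
        manifest_path = os.path.join(dirpath, MANIFEST_NAME)
        with open(manifest_path, "w", encoding="utf-8", newline="\n") as out:
            out.write("\n".join(lines) + "\n")
        manifest_hash[dirpath] = sha512_of_file(manifest_path)
    return manifest_hash[root]
```

Calling write_manifests("/path/to/archive") (a placeholder path)
returns one 128-character hex string; that string, and perhaps the top
level or two of manifests, is what one would print out and store
separately.  When a stronger hash replaces SHA-512, the same walk can
be rerun with the new algorithm alongside the old manifests.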

-- 
Stanislav Shalunov            http://www.internet2.edu/~shalunov/

Just my 0.086g of Ag.

