nanog mailing list archives

Re: Had an idea - looking for a math buff to tell me if it's possible with today's technology.


From: Valdis.Kletnieks () vt edu
Date: Wed, 18 May 2011 19:42:43 -0400

On Thu, 19 May 2011 00:26:26 BST, Heath Jones said:
I wonder if this is possible:

- Take a hash of the original file. Keep a counter.
- Generate data in some sequential method on sender side (for example simply
starting at 0 and iterating until you generate the same as the original
data)
- Each time you iterate, take the hash of the generated data. If it matches
the hash of the original file, increment counter.
- Send the hash and the counter value to recipient.
- Recipient performs same sequential generation method, stopping when
counter reached.

MD5 is a 128 bit hash.

2^128 is 340,282,366,920,938,463,463,374,607,431,768,211,456 - you're welcome
to iterate that many times to find a duplicate. You may get lucky and get a hit
in the first trillion or so attempts - but you may get unlucky and not get a
hit until the *last* few trillion attempts. On average you'll have to iterate
about half that huge number before you get a hit.

And it's lossy - if you hash all the possible 4K blocks with MD5, you'll find
that each of those 2^128 hashes has been hit about 256 times - and no
indication in the hash of *which* of the 256 colliding 4K blocks you have on
this iteration.  (The only reason that companies can do block-level
de-duplication by saving a hash as an index to one copy shared by all blocks
with the same hash value is because you have a *very small* fraction of the
possibilities covered, so if you saved a 4K block of data from somebody's
system32 folder under a given MD5 hash, it's *far* more likely that another
block with that same hash is from another copy of another identical system32
folder, than it is an actual accidental collision.)

Protip:  A good hash function is by definition one-way - given the data, it's
easy to generate the hash - but reversing it to find the "pre-image" (the data
that *generated* the hash) is massively difficult.

Attachment: _bin
Description:


Current thread: