NANOG mailing list archives

Re: Had an idea - looking for a math buff to tell me if it's possible


From: Heath Jones <hj1980 () gmail com>
Date: Thu, 19 May 2011 01:01:43 +0100

My point here is that it IS possible to transfer just a hash and a counter value and effectively regenerate identical data at the remote end. The limit that will be hit is the difficulty of generating and comparing that many hash values with current processing power.

I'm proposing iterating through generated data up until reaching the actual data. It's not even a storage issue: once you have incremented past a candidate, you don't need to store old data or hash values - just the counter. No massive hash tables.

It's a CPU issue.
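For what it's worth, here is a minimal sketch of the scheme in Python. The function names and the toy 2-byte message are mine, purely for illustration - and note the data length would also have to be transmitted, since the enumeration needs a fixed width:

    import hashlib

    def send(data: bytes) -> tuple[bytes, int]:
        """Sender: enumerate candidate blocks, counting how many
        hash to the target digest before the real data shows up."""
        target = hashlib.md5(data).digest()
        counter = 0
        width = len(data)
        for n in range(2 ** (8 * width)):
            block = n.to_bytes(width, "big")
            if hashlib.md5(block).digest() == target:
                if block == data:
                    return target, counter  # collisions seen so far
                counter += 1
        raise AssertionError("unreachable: data is in the enumeration")

    def receive(target: bytes, counter: int, width: int) -> bytes:
        """Recipient: replay the same enumeration and return the
        (counter+1)-th block whose hash matches the digest."""
        seen = 0
        for n in range(2 ** (8 * width)):
            block = n.to_bytes(width, "big")
            if hashlib.md5(block).digest() == target:
                if seen == counter:
                    return block
                seen += 1
        raise AssertionError("unreachable if counter came from send()")

    # Fine for a 2-byte toy message; a 4 KB file means ~2^32768 iterations.
    digest, n = send(b"hi")
    assert receive(digest, n, width=2) == b"hi"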

Heath

On 19 May 2011 00:42, <Valdis.Kletnieks () vt edu> wrote:

On Thu, 19 May 2011 00:26:26 BST, Heath Jones said:
I wonder if this is possible:

- Take a hash of the original file. Keep a counter.
- Generate data in some sequential method on sender side (for example simply starting at 0 and iterating until you generate the same as the original data).
- Each time you iterate, take the hash of the generated data. If it matches the hash of the original file, increment counter.
- Send the hash and the counter value to recipient.
- Recipient performs same sequential generation method, stopping when counter reached.

MD5 is a 128-bit hash.

2^128 is 340,282,366,920,938,463,463,374,607,431,768,211,456 - you're welcome to iterate that many times to find a duplicate. You may get lucky and get a hit in the first trillion or so attempts - but you may get unlucky and not get a hit until the *last* few trillion attempts. On average, you'll have to iterate about half that huge number before you get a hit.
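For scale, a back-of-the-envelope estimate in Python (the 10^9 hashes per second rate is an assumption, and a generous one):

    # Expected brute-force cost: on average, half the 2^128 space.
    expected_tries = 2**128 // 2
    HASHES_PER_SEC = 10**9          # assumed rate, purely illustrative

    seconds = expected_tries / HASHES_PER_SEC
    years = seconds / (365 * 24 * 3600)
    print(f"{expected_tries:.3e} tries  ~  {years:.3e} years")
    # -> 1.701e+38 tries  ~  5.395e+21 years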

And it's lossy - if you could hash all the possible 4K blocks with MD5, you'd find that each of those 2^128 hash values is hit about 2^32640 times (there are 2^32768 possible 4K blocks) - and there's no indication in the hash of *which* of the colliding 4K blocks you have on this iteration.  (The only reason that companies can do block-level de-duplication by saving a hash as an index to one copy shared by all blocks with the same hash value is that you have a *very small* fraction of the possibilities covered, so if you saved a 4K block of data from somebody's system32 folder under a given MD5 hash, it's *far* more likely that another block with that same hash is from another copy of an identical system32 folder than that it's an actual accidental collision.)
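The pigeonhole arithmetic behind that, as a quick Python check:

    # How many 4K blocks share each 128-bit hash value, on average?
    block_bits = 4096 * 8                  # 32768 bits per 4K block
    hash_bits = 128                        # MD5 digest size

    preimages_per_hash = 2**block_bits // 2**hash_bits
    print(f"2^{block_bits - hash_bits} blocks per hash value")
    # -> 2^32640 blocks per hash value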

Protip:  A good hash function is by definition one-way - given the data, it's easy to generate the hash - but reversing it to find the "pre-image" (the data that *generated* the hash) is massively difficult.
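To make that asymmetry concrete, a toy contrast (with a deliberately tiny 3-byte search space so the brute force actually finishes):

    import hashlib

    # Forward direction: one cheap call.
    digest = hashlib.md5(b"foo").hexdigest()

    # Reverse direction: brute force over all 2^24 (~16.7 million)
    # possible 3-byte inputs.
    def brute_force_preimage(target_hex: str) -> bytes | None:
        for n in range(2**24):
            candidate = n.to_bytes(3, "big")
            if hashlib.md5(candidate).hexdigest() == target_hex:
                return candidate
        return None

    print(brute_force_preimage(digest))  # b'foo', after ~6.7 million tries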



