Educause Security Discussion mailing list archives

Re: Password entropy


From: Valdis Kletnieks <Valdis.Kletnieks () VT EDU>
Date: Sun, 23 Jul 2006 16:52:06 -0400

On Fri, 21 Jul 2006 08:26:59 CDT, Graham Toal said:
I'm not real clear on the "entropy" concept but it has
something to do with the pattern?

I'm not sure it's the right word in this context, but I believe
this is what they're talking about:

Actually, it *is* the right word, and you're basically correct but managed to
avoid saying *why* you're correct...

if you have an 8 character password and the characters are
chosen randomly, and each character is only lower case alphabetic,
then the number of possible passwords available is 26^8

Now remember, this is 8 *randomly chosen* characters.  Well-chosen
random characters have *high* entropy, which is a measure of how
unpredictable the next one is (mathematically, it's actually very
similar to measuring the entropy of (for instance) the molecules
in a gas - in both cases, the entropy measures the amount of "disorder").

But what is worse is that there is a pattern involved: to make
it easier to remember, you use a grammatically correct phrase,
such as "subject verb object".  Let's say our vocabulary has
9000 nouns and 1000 verbs, then our password space is only
9000*1000*9000.
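
To put the two counts on a common scale, here is a minimal Python sketch that converts each search-space size to bits (log base 2). The vocabulary sizes are the ones assumed in the text above; the variable names are only for illustration.

```python
import math

# Search space for 8 characters drawn uniformly at random from a-z.
random_space = 26 ** 8

# Search space for a "subject verb object" phrase, using the vocabulary
# sizes assumed in the text (9000 nouns, 1000 verbs).
phrase_space = 9000 * 1000 * 9000

random_bits = math.log2(random_space)
phrase_bits = math.log2(phrase_space)

print(f"random 8 lowercase: {random_space:,} possibilities, {random_bits:.1f} bits")
print(f"noun-verb-noun:     {phrase_space:,} possibilities, {phrase_bits:.1f} bits")
```

Under these assumptions the random password works out to about 37.6 bits and the three-word phrase to about 36.2 bits, even though the phrase is far longer when typed out.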

What chews up more entropy is the patterns *inside* each word.  In English, the
"next letter" is usually easy to predict (and therefore has little *effective*
randomness or entropy).  If you're looking at a 'q', the next letter is
*almost* guaranteed to be a 'u', so there's very little "uncertainty" there, so
that 'u' has a very low entropy.  If the two letters you're looking at are
'io', the next one is probably an 'n', less likely to be a 'u' or 'l', and
highly unlikely to be a 'z' (looking at some 490K words here):

[/usr/share/dict]2 grep 'io' linux.words | sed 's/.*\(io.\).*/\1/' | sort | uniq -c | sort -nr | head -15
  15681 ion
   2279 iou
   1124 iol
    711 iot
    693 ios
    571 iop
    450 ior
    438 iom
    406 ioc
    382 iog
    360 iod
    183 ioi
    102 io-
     92 iob
     74 ioe

The *actual* chances are even more biased, since here I treated all words
equally. The 92 words that have 'iob' include things like thiobacillus,
plesiobiosis, and dithiobenzoic. (Of the 92, 17 also contain 'blast',
indicating a medical term like 'angioblast')...
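
As a rough illustration of how predictable that next letter is, the sketch below computes the Shannon entropy of the distribution implied by the counts in the 'io' listing. It assumes only the 15 rows shown (the rarer continuations cut off by `head -15` are missing) and, like the listing, treats all words equally, so the result is an approximation. The function name is illustrative.

```python
import math

# Counts copied from the 'io?' listing above (top 15 rows only, so the
# rarer continuations are missing and the result is approximate).
io_counts = {
    "n": 15681, "u": 2279, "l": 1124, "t": 711, "s": 693,
    "p": 571, "r": 450, "m": 438, "c": 406, "g": 382,
    "d": 360, "i": 183, "-": 102, "b": 92, "e": 74,
}

def shannon_entropy_bits(counts):
    """Shannon entropy, in bits, of the distribution implied by the counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

next_letter_bits = shannon_entropy_bits(io_counts)
uniform_bits = math.log2(26)  # a letter chosen uniformly at random from a-z

print(f"after 'io': {next_letter_bits:.2f} bits; uniform a-z: {uniform_bits:.2f} bits")
```

With these counts the letter after 'io' carries roughly 2 bits, against about 4.7 bits for a letter picked uniformly at random.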

Incidentally, it goes even further - if that next letter is an 'n' as
expected, guess what the letter before the 'i' almost always is?

[/usr/share/dict]2 grep 'ion' linux.words | sed 's/.*\(.ion\).*/\1/' | sort | uniq -c | sort -nr | head
  11889 tion
   1715 sion
    451 lion
    304 nion
    262 hion
    239 rion
    122 pion
    104 cion
    101 gion
     97 xion

Yep, a 't'.  By the time you've seen a 'tio', you may as well reserve just one
bit to store the next letter, because it's almost always going to be an 'n'
(some 11K times), so 95% of the time, you can store 'yes, it's the expected N'
in one bit, and the other 5% store a 'no, it wasn't' as one bit, follow the 'no'
with a 5-bit code indicating what it actually was, and *still* save space,
as you'll average about 1.25 bits.  This is why English text compresses so
well (and in fact, the entropy of data is *directly* related to the maximum
possible compression of the data).
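
The 1.25-bit average can be checked with a few lines of Python. This is just the arithmetic of the scheme described above, taking the 95% figure from the text as given; the variable names are made up for the example.

```python
p_expected = 0.95          # fraction of the time 'tio' is followed by 'n' (figure from the text)
bits_if_expected = 1       # a single "yes" bit
bits_if_other = 1 + 5      # a "no" bit plus a 5-bit code for the actual letter

# Expected cost per letter: weight each case by how often it occurs.
average_bits = p_expected * bits_if_expected + (1 - p_expected) * bits_if_other
fixed_width_bits = 5       # cost of always storing a 5-bit letter code

print(f"average: {average_bits:.2f} bits vs fixed: {fixed_width_bits} bits")
```

The unexpected case costs 6 bits instead of 5, but it is rare enough that the average still lands well below the fixed-width cost.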

A bit of thought will reveal a lot of other 2 and 3 character combinations
that are a lot more common ('ing', etc...).  The end result is that running
English text averages about 2.5 to 3 bits of entropy per character, and
even skript kiddie 'l33t sp33k' and that obfuscated spam stuff is probably
still under 4 bits/character (I'll go out on a limb and hypothesize that
if it's trying to pass itself off as English, and has over 3.5 bits/char
of entropy, it's been too obfuscated to be easily readable....)
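
One way to get a feel for bits per character is a zeroth-order estimate that looks only at single-character frequencies and ignores context. The sketch below is that simpler estimate, not the context-aware measurement behind the 2.5 to 3 bit figure, so on ordinary English it will generally come out higher. The function name and the sample sentence are illustrative only, and a sample this short gives a noisy number.

```python
import math
from collections import Counter

def bits_per_char(text):
    """Zeroth-order entropy estimate: bits per character from single-character frequencies."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog and then goes back home again"
print(f"{bits_per_char(sample):.2f} bits/char (context ignored)")
```

The gap between this estimate and the lower context-aware figure is the entropy eaten up by patterns like 'tion' and 'ing'.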

By the way, this is why pass phrases have to be quite long to have
equivalent strength to a password.

If we had keyboards and brains and systems that accepted Chinese characters
that represent words as single characters, an 8-word passphrase would be
as long and nearly as strong as an 8-character random password.  The reason
the passphrase has to be longer is because you get much less randomness
and entropy *per character* in a Latin-charset passphrase...
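
To make "quite long" concrete, here is a small sketch that asks how many characters of running English it takes to match 8 random lowercase letters. It assumes the passphrase behaves like ordinary English text at the 2.5 to 3 bits per character quoted earlier, which is a simplification; the helper name is hypothetical.

```python
import math

target_bits = 8 * math.log2(26)   # 8 random lowercase letters, about 37.6 bits

def chars_needed(target_bits, bits_per_char):
    """Smallest whole number of characters whose total entropy reaches target_bits."""
    return math.ceil(target_bits / bits_per_char)

for rate in (2.5, 3.0):           # bits/char range quoted for running English text
    print(f"{rate} bits/char -> {chars_needed(target_bits, rate)} characters")
```

Under those assumptions the passphrase needs somewhere between 13 and 16 characters to match the 8-character random password, and more if the phrase is a predictable one.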

And actually, the high redundancy (the inverse of entropy) of most human
languages is a Good Thing - it's what our brains use to figure out what
was really meant when we hit the the inevitable typo, or can't hear somebody
very well in a bar or other noisy environment.  There's even at least one
example in this paragraph that you probably didn't even notice (two, if
I made an intentional typo ;)

(Now, the entropy of a random number source is something quite
different, and I think in that case entropy is the right word
to use.

It's correct in this context as well - and in fact, a good theoretical
way to look at passphrases is as the result of a "not very" random
source, and what you want to compute is how much data you have to gather
before you have gathered a given level of total randomness.
