Educause Security Discussion mailing list archives

Re: Password entropy

From: Valdis Kletnieks <Valdis.Kletnieks () VT EDU>
Date: Sun, 23 Jul 2006 16:52:06 -0400

On Fri, 21 Jul 2006 08:26:59 CDT, Graham Toal said:
I'm not real clear on the "entropy" concept but it has
something to do with the pattern?

I'm not sure it's the right word in this context, but I believe
this is what they're talking about:

Actually, it *is* the right word, and you're basically correct but managed to
avoid saying *why* you're correct...

if you have an 8 character password and the characters are
chosen randomly, and each character is only lower case alphabetic,
then the number of possible passwords available is 26^8

Now remember, this is 8 *randomly chosen* characters.  Well-chosen
random characters have *high* entropy, which is a measure of how
unpredictable the next one is (mathematically, it's actually very
similar to measuring the entropy of (for instance) the molecules
in a gas - in both cases, the entropy measures the amount of "disorder").
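To put a number on the random case, here is a minimal Python sketch. It assumes each character is drawn uniformly and independently, which is the idealized case described above; the function name `random_password_bits` is made up for illustration.

```python
import math

def random_password_bits(alphabet_size: int, length: int) -> float:
    """Entropy in bits of a password whose characters are drawn
    uniformly and independently from an alphabet of the given size."""
    return length * math.log2(alphabet_size)

count = 26 ** 8                      # number of 8-letter lowercase passwords
bits = random_password_bits(26, 8)   # log2 of that count
print(f"{count:,} possible passwords, about {bits:.1f} bits")
# prints: 208,827,064,576 possible passwords, about 37.6 bits
```

Each random lowercase letter is worth log2(26), roughly 4.7 bits, and that per-character figure is the one to compare against the numbers for English text further down.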

But what is worse is that there is a pattern involved: to make
it easier to remember, you use a grammatically correct phrase,
such as "subject verb object".  Let's say our vocabulary has
9000 nouns and 1000 verbs, then our password space is only

What chews up more entropy is the patterns *inside* each word.  In English, the
"next letter" is usually easy to predict (and therefore has little *effective*
randomness or entropy).  If you're looking at a 'q', the next letter is
*almost* guaranteed to be a 'u', so there's very little "uncertainty" there, so
that 'u' has a very low entropy.  If the two letters you're looking at are
'io', the next one is probably an 'n', less likely to be a 'u' or 'l', and
highly unlikely to be a 'z' (looking at some 490K words here):

[/usr/share/dict]2 grep 'io' linux.words | sed 's/.*\(io.\).*/\1/' | sort | uniq -c | sort -nr | head -15
  15681 ion
   2279 iou
   1124 iol
    711 iot
    693 ios
    571 iop
    450 ior
    438 iom
    406 ioc
    382 iog
    360 iod
    183 ioi
    102 io-
     92 iob
     74 ioe

The *actual* chances are even more biased, since here I treated all words
equally. The 92 words that have 'iob' include things like thiobacillus,
plesiobiosis, and dithiobenzoic. (Of the 92, 17 also contain 'blast',
indicating a medical term like 'angioblast')...
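As a rough illustration of how little uncertainty is left after 'io', the counts from the listing above can be fed into the Shannon entropy formula. This is only a sketch: it uses just the 15 continuations that `head -15` kept, it weights all dictionary words equally, and the function name `shannon_entropy_bits` is invented for the example.

```python
import math

# Counts of the character following 'io', copied from the listing above.
# Rarer continuations were cut off by `head -15` and are ignored here.
counts_after_io = {
    "n": 15681, "u": 2279, "l": 1124, "t": 711, "s": 693,
    "p": 571, "r": 450, "m": 438, "c": 406, "g": 382,
    "d": 360, "i": 183, "-": 102, "b": 92, "e": 74,
}

def shannon_entropy_bits(counts: dict) -> float:
    """Shannon entropy, in bits, of the distribution implied by raw counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

print(f"{shannon_entropy_bits(counts_after_io):.2f} bits after 'io'")
print(f"{math.log2(26):.2f} bits for a uniformly random lowercase letter")
```

On these counts the letter after 'io' carries roughly 2 bits, against about 4.7 for a truly random lowercase letter.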

(Incidentally, it goes even further - if that next letter is an 'n' as
expected, guess what the letter before the 'i' almost always is?)

[/usr/share/dict]2 grep 'ion' linux.words | sed 's/.*\(.ion\).*/\1/' | sort | uniq -c | sort -nr | head
  11889 tion
   1715 sion
    451 lion
    304 nion
    262 hion
    239 rion
    122 pion
    104 cion
    101 gion
     97 xion

Yep, a 't'.  By the time you've seen a 'tio', you may as well reserve just one
bit to store the next letter, because it's almost always going to be an 'n'
(some 11K times), so 95% of the time, you can store 'yes, it's the expected N'
in one bit, and the other 5% store a 'no, it wasn't' as one bit, follow the 'no'
with a 5-bit code indicating what it actually was, and *still* save space,
as you'll average about 1.25 bits.  This is why English text compresses so
well (and in fact, the entropy of data is *directly* related to the maximum
possible compression of the data).
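The 1.25-bit figure is just a weighted average, and a few lines of Python can check it. The sketch below assumes the scheme exactly as described (a one-bit yes/no flag, plus a 5-bit code only on a 'no'); the function name `expected_bits` is hypothetical.

```python
def expected_bits(p_expected: float, escape_bits: int = 5) -> float:
    """Average bits per letter for a 1-bit 'was it the expected letter?' flag,
    followed by an escape_bits-wide code only when the answer is 'no'."""
    hit_cost = 1                   # just the 'yes' flag
    miss_cost = 1 + escape_bits    # the 'no' flag plus the explicit letter code
    return p_expected * hit_cost + (1 - p_expected) * miss_cost

print(f"{expected_bits(0.95):.2f} bits on average")   # prints: 1.25 bits on average
```

Compared with spending a flat 5 bits on every letter, that is a saving of about 3.75 bits per letter in this one context.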

A bit of thought will reveal a lot of other 2 and 3 character combinations
that are a lot more common ('ing', etc...).  The end result is that running
English text averages about 2.5 to 3 bits of entropy per character, and
even skript kiddie 'l33t sp33k' and that obfuscated spam stuff is probably
still under 4 bits/character (I'll go out on a limb and hypothesize that
if it's trying to pass itself off as English, and has over 3.5 bits/char
of entropy, it's been too obfuscated to be easily readable....)
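For anyone who wants to measure this on their own text, one simple approach is to estimate the entropy of each character given the few characters before it. The sketch below is an assumption-laden illustration, not the method behind the 2.5 to 3 bit figure: the name `conditional_entropy_bits` is made up, and on a small sample the higher-order estimates come out too low, because most contexts are seen only once and so look perfectly predictable.

```python
import math
from collections import Counter

def conditional_entropy_bits(text: str, order: int = 2) -> float:
    """Estimate bits per character, given the previous `order` characters,
    from the empirical frequencies in `text`."""
    context_counts = Counter()   # how often each context occurs
    pair_counts = Counter()      # how often each (context, next char) occurs
    for i in range(order, len(text)):
        context = text[i - order:i]
        context_counts[context] += 1
        pair_counts[(context, text[i])] += 1

    total = sum(pair_counts.values())
    if total == 0:
        return 0.0

    entropy = 0.0
    for (context, _char), n in pair_counts.items():
        # Weight by how often the pair occurs; score by how surprising the
        # character is within its context.
        entropy -= (n / total) * math.log2(n / context_counts[context])
    return entropy

sample = "the quick brown fox jumps over the lazy dog " * 3
for k in (0, 1, 2):
    print(k, round(conditional_entropy_bits(sample, order=k), 2))
```

With `order=0` this is plain letter-frequency entropy, which ignores exactly the patterns discussed above; raising the order on a large body of text is what pulls the estimate down.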

By the way, this is why pass phrases have to be quite long to have
equivalent strength to a password.

If we had keyboards and brains and systems that accepted Chinese characters
that represent words as single characters, an 8-word passphrase would be
as long and nearly as strong as an 8-character random password.  The reason
the passphrase has to be longer is because you get much less randomness
and entropy *per character* in a Latin-charset passphrase...
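The length penalty can be sketched with the figures already given: about 37.6 bits for 8 random lowercase letters, and 2.5 to 3 bits per character for running English. This treats those per-character averages as if they applied evenly to the passphrase, which is a simplification, and `chars_needed` is a hypothetical helper name.

```python
import math

def chars_needed(target_bits: float, bits_per_char: float) -> int:
    """Smallest number of characters whose total entropy reaches target_bits."""
    return math.ceil(target_bits / bits_per_char)

target = 8 * math.log2(26)   # 8 random lowercase letters, about 37.6 bits

for rate in (2.5, 3.0):      # bits per character of running English text
    print(rate, chars_needed(target, rate))
# prints:
# 2.5 16
# 3.0 13
```

So even to match a lowercase-only random password, an English passphrase needs roughly 13 to 16 characters, and more if the target password draws on a larger character set.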

And actually, the high redundancy (the inverse of entropy) of most human
languages is a Good Thing - it's what our brains use to figure out what
was really meant when we hit the the inevitable typo, or can't hear somebody
very well in a bar or other noisy environment.  There's even at least one
example in this paragraph that you probably didn't even notice (two, if
I made an intentional typo ;)

(Now, the entropy of a random number source is something quite
different, and I think in that case entropy is the right word
to use.

It's correct in this context as well - and in fact, a good theoretical
way to look at passphrases is as the result of a "not very" random
source, and what you want to compute is how much data you have to gather
before you have gathered a given level of total randomness.
