Dailydave mailing list archives

Unmask vs Internet Superheroes


From: Dave Aitel <dave@immunityinc.com>
Date: Fri, 17 Aug 2007 13:46:48 -0400

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I quite liked this little e-zine. They even have a whole feature on
Unmask, my very first Python program ever! Rightfully, he complains
about code quality. But then he goes on to talk a lot about how it
works, which is quite useful! Based on his comments, you could easily
improve Unmask to avoid small words or conjunctions. Worth a try someday.

http://milw0rm.com/papers/175



sub unmask {
    my $self = shift;
    say($self, <<'EOUNMASK'

unmask.py blows.

The code is shit. Considering that it is only one of seven sources listed
on http://www.immunitysec.com/resources-freesoftware.shtml, and that the
others are mostly Python too, we can reasonably conclude that Dave Aitel
codes like shit. Maybe if you pay for non-free software from Dave Aitel
you will get something better.

"By releasing tools, such as these, we hope to demonstrate our knowledge
leadership, and give back to the security community as a whole."

It is horribly incomplete and doesn't do all the things it says it does.
You are better off entirely ignoring the comments Aitel wrote, because
they are LIES. How could he give this a 1.0 version number? It isn't even
0.1 unless you're living in the 90s.

Here's how it actually gets a score from two stores:

---
Take the 100 most common words from both texts and compare the lists.
Add a point for each word that matches. So that's a possible 100 points.

Take the 100 most common doubles (contiguous or not) within sentences
(where the phrase "I like dogs. I am." would form the doubles "I like",
"I dogs", "like dogs" and "I am"). Add a point for each of those that
matches.

Do the same for triples.
---

So you have a highest possible score of 300. Don't be fooled into thinking
scores are a percentage.
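
To make that concrete, here's the whole scheme in a few lines of Python.
This is a sketch reconstructed from the description above, not Aitel's
actual code, and the sentence splitting and tokenization are simplified:

    from collections import Counter
    from itertools import combinations

    def top100(counter):
        # Only the 100 most common keys of each list are ever compared.
        return {key for key, count in counter.most_common(100)}

    def features(text):
        words, doubles, triples = Counter(), Counter(), Counter()
        for sentence in text.lower().split('.'):
            tokens = sentence.split()
            words.update(tokens)
            # Doubles and triples are combinations of words within a
            # sentence, contiguous or not: "i like dogs" gives the doubles
            # ("i", "like"), ("i", "dogs") and ("like", "dogs").
            doubles.update(combinations(tokens, 2))
            triples.update(combinations(tokens, 3))
        return words, doubles, triples

    def score(text_a, text_b):
        # One point per match across the three top-100 lists, max 300.
        return sum(len(top100(a) & top100(b))
                   for a, b in zip(features(text_a), features(text_b)))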

It entirely ignores punctuation, sentence length, and other things he said
were used.

Now, take me for example. I try to write short sentences using simple
words. So my singles list is packed with words that are less than five
chars long. These, from two texts, match at a ridiculous rate. Look at
this paragraph itself: most of the 20+ words in "take me for I try to
write words so my list is with that are less than from two at a" will
be in my most common word lists in both texts.

Then take the doubles and triples. It would make a bit of sense if it
were only contiguous words, but it isn't. So in anything except very
short sentences, all the doubles list suggests is, again, your most
common words. My doubles list would be full of combinations of the
short example words above.

It's basically taking how similar your basic vocabulary is and multiplying
that by three.

----


Here are some ideas for having fun with unmask:

In one text, always write with "I", and in the other, always write with
"we". Use very short sentences. That will leave your singles almost the
same but will destroy your doubles and triples values, so you should be
able to drop your score 10 points in a bad situation, and much more if
your sentences are short enough.
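
Using the sketch from earlier, you can watch this happen (toy input, so
the numbers are small, but the effect is the same):

    base = "I like dogs. I walk a lot. I eat well."
    swapped = base.replace("I ", "we ")
    print(score(base, base))     # identical text: full overlap
    print(score(base, swapped))  # singles barely move, doubles and triples drop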

Misspell common words and they will be cut out of the results, lowering
your scores in general.

See, Aitel was a dumbass and decided not to match words that aren't
words. I can understand wanting to remove non-text data, but if he had
just held onto text that matched something like /\b[a-z]{5,}\b/ he could
catch a lot of words that are spelt wrong, which would increase the
accuracy of his script. Unfortunately, regex is probably a bit too much
for Aitel in general.
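
For illustration, here is roughly what that could look like (a
hypothetical filter, not anything unmask.py actually has):

    import re

    # Keep anything that merely looks like a word, so misspellings still
    # count toward the comparison instead of being thrown away.
    WORDISH = re.compile(r'\b[a-z]{5,}\b')

    def wordish_tokens(text):
        return WORDISH.findall(text.lower())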

Ever wonder why it takes so long to build such massive stores? Ask
yourself, did Dave Aitel think to just store the most common 100 keys of
his lists (the data he actually uses), or did he decide to store
everything?
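
The obvious fix, sketched here with a made-up function name, is to trim
each store to the keys that are actually compared before saving it:

    from collections import Counter

    def trim_store(counts):
        # counts is a Counter; persist only its 100 most common keys
        # instead of every key ever seen. The comparison never looks
        # past the top 100 anyway.
        return dict(counts.most_common(100))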

Ask yourself: if I add one word to a sentence, that creates just one
more single item, but how many more doubles and triples does it create?
Could I make unmask.py hang for half an hour and take up 50mb with just
a short essay that lacks punctuation?
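
The counting answers itself: a single sentence of n words holds
n(n-1)/2 doubles and n(n-1)(n-2)/6 triples, and an essay with no
punctuation is one giant sentence:

    from math import comb

    # Doubles and triples are taken within a sentence, so one long
    # unpunctuated "sentence" of n words yields comb(n, 2) doubles
    # and comb(n, 3) triples.
    for n in (20, 100, 1000):
        print(n, comb(n, 2), comb(n, 3))
    # 20      190       1140
    # 100     4950      161700
    # 1000    499500    166167000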

----

It is interesting that the script works pretty well. The reason it does
is that vocabulary does mean a lot. By compounding it he expands the
differences between different people. The more you match, the more your
score will increase, in a non-linear way. So even though X writes good
English fairly similar to mine, the small differences may account for
25-40 points. On the other hand, Y writes with entirely different
English than I do, and might sit just 50-60 points behind me. For
example, with a certain small text of mine as a baseline, another text
of my own got about 85, one by X got 62, and one by Y got about 45. So
even though X writes much more like I do than Y does, his score is
closer to Y's than to mine.

Something to note is that there are some really obvious words, like "a",
"the", "I", etc., that everybody will use, and thus common doubles (not
so much triples, but still some), so any two people compared against
each other should get a score of about 20 basically by default. So,
subtracting that from the above, it's as if I had 65, X had 42, and Y
had 23. The small differences between X and me still lead to such a big
variation because of the non-linear function that makes comparisons: I
don't just have more of the same words, I thus have more of the same
doubles and triples. The fact that X got that close to me (64% of my
score), even after subtracting an arbitrary default value of obvious
words, is a testament to just how comparable the rest of our basic
vocabulary is.

People with English as a second language, even if writing technically
proper English, may rely on specific words a lot and completely avoid
other obvious ones (like "got" for "have"). So they could match
themselves extremely well and match others very poorly.

So here are some good and accurate excuses if some random guy matches you:

- unmask.py is crap
- Two people with a strong command of English who use a lot of
  conjunctions and common verbs with ease could have a very strong
  correlation.
- People who have a very limited vocabulary will have their top-100
  lists padded by less common words, and will match less in general.
- By knowing and using a mass of small words at the expense of long
  words, you increase your match potential in general.
- Most of the long/odd words used won't make it into the top-100 lists
  that are used for comparison.
- Vocabulary testing is a good idea for people of different
  nationalities and education levels, but for people of the same ones,
  it's very cheap.
- This is entirely vocabulary, and it doesn't even test that; it just
  compares the most popular words.

EOUNMASK
);}
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFGxd8GB8JNm+PA+iURAnNtAJ9wQHq5DkPb1DrX7PsiuZKMuuKZrgCggUTq
J/+EvgdvAZtWORY0cTJX824=
=3Y9f
-----END PGP SIGNATURE-----

_______________________________________________
Dailydave mailing list
Dailydave@lists.immunitysec.com
http://lists.immunitysec.com/mailman/listinfo/dailydave

