Politech mailing list archives

FC: Hexamail's Finn Johansen on how to filter naughty words


From: Declan McCullagh <declan () well com>
Date: Fri, 13 Jun 2003 02:12:40 -0400

Previous Politech message:
http://www.politechbot.com/p-04831.html

---

From: "Finn Johansen" <finnj () hexamail com>
To: <declan () well com>
References: <1055311145.2fe7328.finna.net@[216.110.36.217]>
Subject: Re: Interscan blocks musician's email due to use of "whore"
Date: Thu, 12 Jun 2003 11:39:01 +0200

Declan,

I usually don't write this type of email, as readers may consider it spam.
However, the problem described is very interesting and shows the lack of
intelligence in various spam filtering solutions.

Blocking emails on the basis of single terms in the email content is rather
pointless. It may sound amusing in the situation below, but it is certainly
not amusing to Linda or her contacts. It is, as Thomas also says, a bit
scary. Leaving critical business correspondence to this type of evaluation
is a bit like gambling. If you're lucky, the information may pass through
to the recipient, or it may just as well "disappear" somewhere without
anyone knowing where it is.

New spam filtering solutions are emerging almost every day. But only a
minority of these are able to use a contextual approach in evaluating
emails. Even though reports show that the global ratio of spam reached the
50% mark in May 2003, there are still millions of legitimate emails
passing among servers every day. Relying on solutions that analyze emails
by single terms will certainly block a large number of these legitimate
emails and leave behind frustrated people like Linda, whose business
information is not delivered correctly.

The only way to overcome the limitations of keyword-based inspection of
emails is to analyze the content of the email in context. Words like f*ck
have a pattern that is understandable to humans, but not to keyword
searches unless they are explicitly told so. Statistical pattern matching
technology is able to 'understand' such a pattern as either good or bad
given the patterns surrounding it. Using this technique, new patterns from
spammers can be caught, as they are usually found together with other
patterns that are already known by the system. The statistical approach
cannot catch 100% of spam emails without also producing some false
positives. However, our tests show that by accepting a block ratio of 96%,
you end up with 0.01% false positives. Pretty good figures. And best of
all - it doesn't block emails like this one, which contain single 'bad'
terms scattered around the document.
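
To make the contrast with keyword blocking concrete, here is a minimal
Python sketch of the statistical idea described above. It is an
illustration only, not Hexamail's implementation. It assumes the smoothing
and combining formulas from Gary Robinson's article (a per-word estimate
f(w) = (s*x + n*p(w)) / (s + n), combined through geometric means), and the
word counts, the corpus sizes, and the function names are all made up for
the example.

```python
import math
import re

# Hypothetical training counts: word -> (spam messages, ham messages).
# These numbers are invented for illustration only.
COUNTS = {
    "whore": (30, 2),
    "free": (50, 10),
    "click": (45, 5),
    "concert": (1, 40),
    "album": (2, 35),
    "rehearsal": (0, 25),
}
N_SPAM = 100  # assumed number of spam messages in the training set
N_HAM = 100   # assumed number of legitimate messages in the training set


def keyword_filter(text, banned=("whore",)):
    """Naive approach: block if any single banned term appears."""
    words = re.findall(r"[a-z']+", text.lower())
    return any(w in banned for w in words)


def word_score(word, s=1.0, x=0.5):
    """Smoothed spam estimate for one word: (s*x + n*p) / (s + n)."""
    spam_count, ham_count = COUNTS.get(word, (0, 0))
    n = spam_count + ham_count
    if n == 0:
        return x  # unknown word: fall back to the neutral prior
    spam_freq = spam_count / N_SPAM
    ham_freq = ham_count / N_HAM
    p = spam_freq / (spam_freq + ham_freq)
    return (s * x + n * p) / (s + n)


def message_score(text):
    """Combine per-word estimates into one score in (0, 1)."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    scores = [word_score(w) for w in words]
    if not scores:
        return 0.5
    n = len(scores)
    # Geometric means computed in log space to avoid underflow.
    p_spam = 1.0 - math.exp(sum(math.log(1.0 - f) for f in scores) / n)
    p_ham = 1.0 - math.exp(sum(math.log(f) for f in scores) / n)
    s_index = (p_spam - p_ham) / (p_spam + p_ham)
    return (1.0 + s_index) / 2.0


ham_text = "whore concert album rehearsal"
spam_text = "whore free click"

ham_score = message_score(ham_text)
spam_score = message_score(spam_text)
print(keyword_filter(ham_text), keyword_filter(spam_text))
print(round(ham_score, 3), round(spam_score, 3))
```

With these made-up counts, the keyword filter blocks both messages, while
the combined score stays below 0.5 for the message whose other words look
legitimate and rises above 0.5 for the one surrounded by spam-like words.
Where to put the blocking threshold is the trade-off between block ratio
and false positives mentioned above.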

More reading on the method we use can be found in Gary Robinson's
excellent article on spam filtering:
http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html


Regards,

Finn Johansen
CEO
Hexamail Ltd.

Email: finnj () hexamail com
http://www.hexamail.com/




-------------------------------------------------------------------------
POLITECH -- Declan McCullagh's politics and technology mailing list
You may redistribute this message freely if you include this notice.
-------------------------------------------------------------------------
To subscribe to Politech: http://www.politechbot.com/info/subscribe.html
This message is archived at http://www.politechbot.com/
Declan McCullagh's photographs are at http://www.mccullagh.org/
Like Politech? Make a donation here: http://www.politechbot.com/donate/
-------------------------------------------------------------------------

