Interesting People mailing list archives

Re: content filters


From: David Farber <dave () farber net>
Date: Mon, 1 Dec 2008 20:42:47 -0500



Begin forwarded message:

From: Steven Champeon <schampeo () hesketh com>
Date: December 1, 2008 6:00:05 PM EST
To: David Farber <dave () farber net>
Cc: gep2 () terabites com
Subject: content filters


Dave, for IP if you think appropriate.

on Sun, Nov 30, 2008 at 12:31:41PM -0500, Gordon Peterson wrote:
I however STRONGLY disagree with Rich's claim about content
inspection as an antispam technique.

The trick, however, is to COMBINE content detection SPECIFIC TO THE SENDER.

There are several different sorts of "content detection" in the antispam
world:

1) keyword filters (often found in corporate gateways, to implement
  policies related to sexual harassment or "the hostile workplace");
  these are often so poorly written that even confirmed opt-in list
  mail (like IP, or my longest running list, webdesign-l) falls victim
  to their inanities. I often simply unsub members who are behind these
  things, because it's a waste of my time to even try to educate them
  about the limitations of this approach.

2) signature-driven filters; these come in several types, including
  hash-driven products like Cloudmark SpamNet or DCC; these measure
  bulkiness (DCC) or spot content common to messages marked as spam by
  some process or another (SpamNet); these are equally prone to trust
  issues spread among those who are allowed to submit samples to the
  corpus(es), as well as to malfunctions in the submission process
  and badly vetted "trap" addresses, etc. I stopped using them in
  2003, because there were too many issues. This class also includes
  all antivirus engines, which operate primarily by signatures.

3) another type of signature-driven filter, which looks for constructs
  that increase the likelihood that the message is spam, and uses a
  scoring system to reject or quarantine such messages (e.g.,
  SpamAssassin) - these are useful and can catch stuff the others miss,
  but they must be properly tuned and kept up to date in order to work
  properly. I primarily use these when the signature can be isolated
  to the headers or the SMTP conversation itself.

4) Naive Bayesian filters, which use word frequencies to spot garbage
  but which must also be trained in the sense Gordon is talking about
  (different people have different mail flows, and what works for some
  may not work for others, or as your own mail flow changes). They're
  also trivial to game, by appending known "good" text such as news
  or literature to a minimal spam payload.

5) URIBLs, which look for domains in message bodies (or, to a lesser
  extent, in the message headers) and compare them to databases full
  of domains that are known to be registered to spammers or otherwise
  bad actors; these are useful against a wide variety of spam, and
  especially so as spammers collaborate with evil registrars (see
  Estdomains, which ICANN recently shut down, or directi, which took
  over where Estdomains left off) and purchase and burn through large
  quantities of throwaway domains.

6) another type of signature filter looks for common mailing addresses,
  phone numbers, and the like in order to take advantage of the fact
  that some spammers actually do include their addresses in order to
  "comply" with CAN-SPAM. IIRC, Catherine Hampton's Spambouncer does
  a bit of this, among others.
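
To make the weakness noted in item 4 concrete, here is a minimal sketch
of a word-frequency Bayesian score. The training sentences, the
`spam_probability` helper, and the equal priors are all assumptions made
up for this illustration; no real filter is this simple. It shows how
padding a short spam payload with ordinary "good" text pulls the score
down.

```python
import math
from collections import Counter

# Toy training data, invented for illustration only.
SPAM = ["cheap pills buy now", "buy cheap watches now", "cheap pills cheap watches"]
HAM = ["meeting notes for the design list", "the digest for the list is late",
       "notes on the news of the day"]

def train(docs):
    """Count word occurrences across a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

SPAM_COUNTS, HAM_COUNTS = train(SPAM), train(HAM)
VOCAB = set(SPAM_COUNTS) | set(HAM_COUNTS)

def log_likelihood(words, counts):
    """Sum of log P(word | class) with add-one (Laplace) smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + len(VOCAB))) for w in words)

def spam_probability(text):
    """Posterior P(spam | text), assuming equal priors for spam and ham."""
    words = [w for w in text.split() if w in VOCAB]  # ignore unseen words
    log_spam = log_likelihood(words, SPAM_COUNTS)
    log_ham = log_likelihood(words, HAM_COUNTS)
    # Turn the log-odds into a probability.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

payload = "buy cheap pills"
# The same payload with harmless, ham-like text appended.
padded = payload + " notes on the news of the day for the list"
```

With this toy data, `spam_probability(payload)` is close to 1, while
`spam_probability(padded)` drops close to 0, even though the spam
payload is unchanged.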

The biggest problem with content-driven filtering is that it can be
gamed (inserting a link to a spammy URL in an editorial or news item,
for example), worked around (always use new domains, updated templates)
and evaded (hashbusting, t|-|05e weird Subject lines).
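
A small sketch of why hashbusting defeats exact-match signatures. It
assumes a filter that fingerprints the whole body with SHA-256, which is
a simplification (real hash-driven products tend to use fuzzier
checksums). The `fingerprint` and `hashbust` helpers and the message
text are invented for the example.

```python
import hashlib
import random
import string

def fingerprint(body: str) -> str:
    """Exact-match signature: a SHA-256 digest of the message body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def hashbust(body: str, rng: random.Random) -> str:
    """Append a few random characters, as a spammer's template might."""
    junk = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    return f"{body}\n{junk}"

rng = random.Random(2008)  # seeded so the example is repeatable
template = "Buy cheap pills at http://example.invalid/"
copy_a = hashbust(template, rng)
copy_b = hashbust(template, rng)

# Both copies carry the same payload, but their signatures differ,
# so a signature learned from one copy never matches the other.
same_signature = fingerprint(copy_a) == fingerprint(copy_b)
```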

In sum, the strengths of content-driven filtering are few, though they
can be combined in scoring systems (if Bayes doesn't detect, maybe the
URI lookup will) and do tend to be good with spammy content that doesn't
change very often (419 scams, for example).
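
A hedged sketch of combining independent content checks in one score,
in the spirit of scoring filters but not using SpamAssassin's rules or
API. The rule names, weights, threshold, and the listed domain are all
invented for the example.

```python
import re

# Hypothetical set of spammer-registered domains (a stand-in for a URIBL).
BAD_DOMAINS = {"pills.example"}
URL_RE = re.compile(r"https?://([^/\s]+)", re.IGNORECASE)

def uri_rule(body: str) -> bool:
    """True if any linked host is, or is a subdomain of, a listed domain."""
    for host in URL_RE.findall(body):
        host = host.lower()
        if any(host == d or host.endswith("." + d) for d in BAD_DOMAINS):
            return True
    return False

def advance_fee_rule(body: str) -> bool:
    """Crude check for slow-changing 419-style wording."""
    return bool(re.search(r"\b(next of kin|transfer of funds)\b", body, re.I))

# (rule, weight) pairs; the weights are arbitrary illustration values.
RULES = [(uri_rule, 3.0), (advance_fee_rule, 2.5)]
THRESHOLD = 3.0

def score(body: str) -> float:
    """Add up the weights of every rule that fires on this body."""
    return sum(weight for rule, weight in RULES if rule(body))

def is_spam(body: str) -> bool:
    return score(body) >= THRESHOLD
```

Here a message that only trips the wording rule scores 2.5 and passes,
while one that links to a listed domain scores 3.0 and is caught, which
is the "if one check doesn't detect, maybe another will" idea.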

Prof. Ezor has already pointed out that expecting anyone to even know
they're sending HTML mail or multipart/alternative is and has been futile
for years, as end users migrate to new mail clients that default to
HTML when sending. I can confirm that few know they are even sending
HTML, as we have to educate them whenever they join webdesign-l, which
doesn't permit HTML email (it screws up the digests, and is wasteful
in any case).
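
For readers curious what a list that refuses HTML has to check, here is
a minimal sketch using Python's standard `email` package. The
`contains_html` helper and the sample addresses are assumptions for
illustration; this is not the webdesign-l configuration.

```python
from email.message import EmailMessage

def contains_html(msg: EmailMessage) -> bool:
    """True if any MIME part of the message is text/html."""
    return any(part.get_content_type() == "text/html" for part in msg.walk())

# A plain text message, as a console mail client would send it.
plain = EmailMessage()
plain["From"] = "member@example.org"
plain["To"] = "list@example.org"
plain["Subject"] = "plain text post"
plain.set_content("Hello, list.")

# What many graphical clients send by default: multipart/alternative
# with a text/plain part and a text/html part.
rich = EmailMessage()
rich["From"] = "member@example.org"
rich["To"] = "list@example.org"
rich["Subject"] = "HTML post"
rich.set_content("Hello, list.")
rich.add_alternative("<p>Hello, list.</p>", subtype="html")
```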

By eliminating HTML, you also eliminate malicious ActiveX, malicious
images, hidden/misrepresented links, and a lot more.

Agreed. And yet, it's the default in nearly every non-console mail
program in existence. I've even seen Exchange servers configured to
*translate* outbound plain text mail to multipart/alternative, in order
to append lawyer-mandated disclaimers and the like.

Most of the non-content-inspection schemes (like SPF, which is stupid)
unreasonably (and unnecessarily) limit senders (who might, for
example, be sending from an inhabitual location, such as a cruise ship
Internet cafe), and do nothing to stop mail from zombie spambot armies
which have commandeered friends' machines, and are sending their
infected or objectionable mail under that legitimate user's
qualifications or reputation.

I'd beg to differ. My project, enemieslist, is aimed squarely at helping
antispam companies and services do exactly that: stop mail from hosts
that have a high likelihood of being an end user with a bot (or multiple
bots) on their computer. We use very few content-driven checks in the
sendmail package I wrote and which we use here; the vast, vast bulk of
mail we reject here is rejected on the basis of its /origin/ and its
membership of a class of IP space (dynamic end user, university resnet,
generic, provider-assigned reverse DNS, etc.).

By using a fine-grained "permissions" list, based on the sender of the
mail (AND SET BY THE RECIPIENT!!), one can achieve FAR better
antispam/antivirus/antiworm defenses than are possible using either
non-content-based, or only-content-based, antispam techniques by
themselves. PLUS, this returns control of their Inbox to the owner of
that Inbox, who ultimately is the only person whose opinion matters
when deciding what kind of mail they want to receive, and from who.

Erm, no. While whitelisting is a perfect solution if your correspondents
never change, it's practically useless in the context of anyone with
more than "friends and family" type mail flows. It's also unworkable in
practice, because what do you whitelist? The domain your friend sends
from (get ready for all the spam that leaks from every big ISP and ESP)?
The IP (forget about big webmail farms, then)? The sender address (all
too often forged by the very bots you're trying to avoid)? What happens
when it's okay for me to email you from this address, but I have an
issue with my mail and have to resort to using my gmail account? What
if it's okay for me to email you, but not okay for my wife to email you
to tell you I've got a cold and won't make dinner?

The only way to kill spam was demonstrated a few days ago: disconnect
those who offer services to spammers. In the meantime, while there are
still those who will host command-and-control hosts for botnets, we
have to stop mail coming from bots and behind insecure NATs and the
like; the best way I've seen for doing this is to know how end user
hosts are named, treat mail from them with extreme suspicion (as they
should be using their ISP's outbounds) and *never forget* those names,
by storing them as patterns instead of individual hosts or IPs.
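
As a sketch of what storing naming patterns rather than individual
hosts could look like: the regular expressions and host names below are
invented examples, not entries from the enemieslist data, and real
provider naming conventions vary widely.

```python
import re

# Hypothetical patterns for provider-assigned, end-user style reverse DNS.
# Each pattern covers a whole naming scheme rather than one host or IP.
END_USER_PATTERNS = [
    # e.g. 192-0-2-17.dyn.isp.example
    re.compile(r"^\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3}\.(dyn|dynamic|dsl|cable)\."),
    # e.g. pool-198-51-100-7.res.isp.example
    re.compile(r"^(pool|dhcp|ppp|host)-?\d+[-.]\d+[-.]\d+[-.]\d+\."),
    # e.g. resnet-12.dorm.university.example
    re.compile(r"^resnet-?\d+\."),
]

def looks_like_end_user(rdns_name: str) -> bool:
    """True if the reverse DNS name matches a known end-user naming scheme."""
    name = rdns_name.strip().lower().rstrip(".")
    return any(p.search(name) for p in END_USER_PATTERNS)
```

One pattern keeps matching every host a provider names that way,
whichever address a given machine has picked up since it was last seen.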

DNSBLs can't do it, because they mostly work on IP addresses, not names,
and because the volumes they have to deal with require them to expire
listings rapidly. URIBLs are useful, but it's small comfort to a box
that's already running hundreds of concurrent processes just to deal
with the onslaught, to have to accept mail just so it can analyze the
domains in links found in the bodies.
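
For context on why DNSBLs are keyed by address: the conventional IPv4
lookup reverses the octets, appends the list's zone, and asks for an A
record. A sketch of just the name construction follows; `dnsbl.example`
is a placeholder zone, `dnsbl_query_name` is a hypothetical helper, and
no DNS query is made.

```python
import ipaddress

def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name used to look up an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

# 192.0.2.99 becomes 99.2.0.192.dnsbl.example
name = dnsbl_query_name("192.0.2.99", "dnsbl.example")
```

Nothing in that query carries the host's name, which is why a name-based
pattern has to be checked separately.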

Steve

--
hesketh.com/inc. v: +1(919)834-2552 f: +1(919)834-2553 w: http://hesketh.com/
antispam news, solutions for sendmail, exim, postfix: http://enemieslist.com/



