Interesting People mailing list archives
Re: content filters
From: David Farber <dave () farber net>
Date: Mon, 1 Dec 2008 20:42:47 -0500
Begin forwarded message:

From: Steven Champeon <schampeo () hesketh com>
Date: December 1, 2008 6:00:05 PM EST
To: David Farber <dave () farber net>
Cc: gep2 () terabites com
Subject: content filters

Dave, for IP if you think appropriate.

on Sun, Nov 30, 2008 at 12:31:41PM -0500, Gordon Peterson wrote:
> I however STRONGLY disagree with Rich's claim about content inspection as an antispam technique. The trick, however, is to COMBINE content detection SPECIFIC TO THE SENDER.
There are several different sorts of "content detection" in the antispam world:

1) keyword filters (often found in corporate gateways, to implement policies related to sexual harassment or "the hostile workplace"); these are often so poorly written that even confirmed opt-in list mail (like IP, or my longest running list, webdesign-l) falls victim to their inanities. I often simply unsub members who are behind these things, because it's a waste of my time to even try to educate them about the limitations of this approach.

2) signature-driven filters; these come in several types, including hash-driven products like Cloudmark SpamNet or DCC; these measure bulkiness (DCC) or spot content common to messages marked as spam by some process or another (SpamNet); these are equally prone to trust issues spread among those who are allowed to submit samples to the corpus(es), as well as to malfunctions in the submission process and badly vetted "trap" addresses, etc. I stopped using them in 2003, because there were too many issues. This class also includes all antivirus engines, which operate primarily by signatures.

3) another type of signature-driven filter, which looks for constructs that increase the likelihood that the message is spam, and uses a scoring system to reject or quarantine such messages (e.g., SpamAssassin) - these are useful and can catch stuff the others miss, but they must be properly tuned and kept up to date in order to work properly. I primarily use these when the signature can be isolated to the headers or the SMTP conversation itself.

4) Naive Bayesian filters, which use word frequencies to spot garbage but which must also be trained in the sense Gordon is talking about (different people have different mail flows, and what works for some may not work for others, or may stop working as your own mail flow changes). They're also trivial to game, by appending known "good" text such as news or literature to a minimal spam payload.

5) URIBLs, which look for domains in message bodies (or, to a lesser extent, in the message headers) and compare them to databases full of domains that are known to be registered to spammers or otherwise bad actors; these are useful against a wide variety of spam, and especially so as spammers collaborate with evil registrars (see Estdomains, which ICANN recently shut down, or directi, which took over where Estdomains left off) and purchase and burn through large quantities of throwaway domains.

6) another type of signature filter looks for common mailing addresses, phone numbers, and the like in order to take advantage of the fact that some spammers actually do include their addresses in order to "comply" with CAN-SPAM. IIRC, Catherine Hampton's Spambouncer does a bit of this, among others.

The biggest problem with content-driven filtering is that it can be gamed (inserting a link to a spammy URL in an editorial or news item, for example), worked around (always use new domains, updated templates) and evaded (hashbusting, t|-|05e weird Subject lines).

In sum, the strengths of content-driven filtering are few, though they can be combined in scoring systems (if Bayes doesn't detect, maybe the URI lookup will) and do tend to be good with spammy content that doesn't change very often (419 scams, for example).

Prof. Ezor has already pointed out that expecting anyone to even know they're sending HTML mail or multipart/alternative is and has been futile for years, as end users migrate to new mail clients that default to HTML when sending. I can confirm that few know they are even sending HTML, as we have to educate them whenever they join webdesign-l, which doesn't permit HTML email (it screws up the digests, and is wasteful in any case).
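To make two of the points above concrete (Bayesian filters can be gamed by padding a spam payload with "good" text, and a scoring system lets a URI lookup catch what Bayes misses), here is a minimal sketch. Everything in it is an assumption for illustration: the word probabilities, the blocked domain, the point values and the threshold are made up, and a real filter would use a trained model and a live URIBL rather than in-memory tables.

```python
import math
import re

# Toy per-word spam probabilities standing in for a trained Bayesian model.
# All words, weights, and domains below are hypothetical.
WORD_SPAM_PROB = {
    "viagra": 0.99, "cheap": 0.90, "pills": 0.95,
    "whale": 0.05, "ship": 0.10, "captain": 0.05, "sea": 0.10,
}
# Stand-in for a URIBL: a local set instead of a DNS-based lookup.
BLOCKED_DOMAINS = {"pills.example"}
THRESHOLD = 5.0  # hypothetical reject threshold


def bayes_score(text):
    """Naively combine per-word probabilities; unknown words are ignored."""
    words = re.findall(r"[a-z]+", text.lower())
    probs = [WORD_SPAM_PROB[w] for w in words if w in WORD_SPAM_PROB]
    if not probs:
        return 0.5  # no evidence either way
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1.0 - p) for p in probs)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


def uri_hits(text):
    """Return domains from http(s) links that appear in the blocklist."""
    domains = re.findall(r"https?://([a-z0-9.-]+)", text.lower())
    return [d for d in domains if d in BLOCKED_DOMAINS]


def total_score(text):
    """Add points per check, in the spirit of a scoring filter."""
    score = 0.0
    if bayes_score(text) > 0.9:
        score += 5.0
    if uri_hits(text):
        score += 5.0
    return score


def is_spam(text):
    return total_score(text) >= THRESHOLD


spam = "cheap viagra pills http://pills.example/buy"
padding = "the captain of the ship saw a whale at sea. " * 3
padded = spam + " " + padding

print(bayes_score(spam) > 0.9)    # the bare payload looks spammy to Bayes
print(bayes_score(padded) < 0.5)  # the padded payload slips under Bayes
print(is_spam(padded))            # but the URI check still flags it
```

The same sketch also shows the weakness noted above: a sender who switches to a domain that is not yet listed evades the URI check as well.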
> By eliminating HTML, you also eliminate malicious ActiveX, malicious images, hidden/misrepresented links, and a lot more.
Agreed. And yet, it's the default in nearly every non-console mail program in existence. I've even seen Exchange servers configured to *translate* outbound plain text mail to multipart/alternative, in order to append lawyer-mandated disclaimers and the like.
> Most of the non-content-inspection schemes (like SPF, which is stupid) unreasonably (and unnecessarily) limit senders (who might, for example, be sending from an inhabitual location, such as a cruise ship Internet cafe), and do nothing to stop mail from zombie spambot armies which have commandeered friends' machines, and are sending their infected or objectionable mail under that legitimate user's qualifications or reputation.
I'd beg to differ. My project, enemieslist, is aimed squarely at helping antispam companies and services do exactly that: stop mail from hosts that have a high likelihood of being an end user with a bot (or multiple bots) on their computer. We use very few content-driven checks in the sendmail package I wrote and which we use here; the vast, vast bulk of mail we reject here is rejected on the basis of its /origin/ and its membership in a class of IP space (dynamic end user, university resnet, generic, provider-assigned reverse DNS, etc.).
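As an illustration of classifying a connecting host by the way its reverse DNS name is built, here is a minimal sketch. It is not the enemieslist implementation, whose patterns are not shown in this message; the regular expressions, class labels and example host names below are all hypothetical, and real provider naming conventions vary far more widely.

```python
import re

# Hypothetical naming patterns, checked in order. One pattern covers every
# host a provider names this way, so nothing has to be listed per IP.
PATTERNS = [
    ("dynamic", re.compile(r"^(dsl|dyn|dhcp|ppp|cable)[-.]?\d+([-.]\d+){3}\.")),
    ("resnet", re.compile(r"\.resnet\.[a-z0-9-]+\.edu$")),
    ("generic", re.compile(r"^(ip|host|static)?[-.]?\d+([-.]\d+){3}\.")),
]


def classify(rdns_name):
    """Return the first matching class label, or "unknown"."""
    name = rdns_name.lower().rstrip(".")
    for label, pattern in PATTERNS:
        if pattern.search(name):
            return label
    return "unknown"


for host in [
    "dsl-192-0-2-10.isp.example",     # looks like a dynamic end-user line
    "room12.resnet.example.edu",      # looks like a university resnet
    "192-0-2-44.static.isp.example",  # generic provider-assigned name
    "mail.example.com",               # looks like a named mail server
]:
    print(host, classify(host))
```

A receiving server could then apply policy per class (for example, treating mail from "dynamic" hosts with extreme suspicion) before accepting any message body.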
> By using a fine-grained "permissions" list, based on the sender of the mail (AND SET BY THE RECIPIENT!!), one can achieve FAR better antispam/antivirus/antiworm defenses than are possible using either non-content-based, or only-content-based, antispam techniques by themselves. PLUS, this returns control of their Inbox to the owner of that Inbox, who ultimately is the only person whose opinion matters when deciding what kind of mail they want to receive, and from who.
Erm, no. While whitelisting is a perfect solution if your correspondents never change, it's practically useless in the context of anyone with more than "friends and family" type mail flows. It's also unworkable in practice, because what do you whitelist? The domain your friend sends from (get ready for all the spam that leaks from every big ISP and ESP)? The IP (forget about big webmail farms, then)? The sender address (all too often forged by the very bots you're trying to avoid)? What happens when it's okay for me to email you from this address, but I have an issue with my mail and have to resort to using my gmail account? What if it's okay for me to email you, but not okay for my wife to email you to tell you I've got a cold and won't make dinner?

The only way to kill spam was demonstrated a few days ago: disconnect those who offer services to spammers. In the meantime, while there are still those who will host command-and-control hosts for botnets, we have to stop mail coming from bots and behind insecure NATs and the like; the best way I've seen for doing this is to know how end user hosts are named, treat mail from them with extreme suspicion (as they should be using their ISP's outbounds) and *never forget* those names, by storing them as patterns instead of individual hosts or IPs. DNSBLs can't do it, because they mostly work on IP addresses, not names, and because the volumes they have to deal with require them to expire listings rapidly. URIBLs are useful, but it's small comfort to a box that's already running hundreds of concurrent processes just to deal with the onslaught, to have to accept mail just so it can analyze the domains in links found in the bodies.

Steve

--
hesketh.com/inc. v: +1(919)834-2552 f: +1(919)834-2553 w: http://hesketh.com/
antispam news, solutions for sendmail, exim, postfix: http://enemieslist.com/

-------------------------------------------
Archives: https://www.listbox.com/member/archive/247/=now
RSS Feed: https://www.listbox.com/member/archive/rss/247/
Powered by Listbox: http://www.listbox.com