Interesting People mailing list archives

Re: content filters


From: David Farber <dave () farber net>
Date: Mon, 1 Dec 2008 20:40:13 -0500



Begin forwarded message:

From: Gordon Peterson <gep2 () terabites com>
Date: December 1, 2008 8:00:01 PM EST
To: Steven Champeon <schampeo () hesketh com>
Cc: David Farber <dave () farber net>
Subject: Re: content filters

Steven Champeon wrote:
Dave, for IP if you think appropriate.
on Sun, Nov 30, 2008 at 12:31:41PM -0500, Gordon Peterson wrote:
I however STRONGLY disagree with Rich's claim about content
inspection as an antispam technique.

The trick, however, is to COMBINE content detection with rules SPECIFIC TO THE SENDER.
There are several different sorts of "content detection" in the antispam
world:

To begin with, you shouldn't limit yourself to only what people are doing widely NOW. If their approaches had solved the problem, we wouldn't be having this discussion.

1) keyword filters (often found in corporate gateways, to implement
  policies related to sexual harassment or "the hostile workplace");
  these are often so poorly written that even confirmed opt-in list
  mail (like IP, or my longest running list, webdesign-l) falls victim
to their inanities. I often simply unsub members who are behind these
  things, because it's a waste of my time to even try to educate them
  about the limitations of this approach.

Simple keyword matching is too limited. An example frequently cited is a site to support breast cancer survivors being blocked because of its discussion about "breasts".

Again, it's nearly impossible to reliably consider content of e-mail messages UNLESS YOU ALSO condition that based upon a fine-grained permissions list, ESTABLISHED BY THE RECIPIENT, which specifies what kinds of content they expect AND WANT to receive from EACH GIVEN SENDER.

As in the example I mentioned, there are e-mails I will happily receive from some of my friends, where the IDENTICALLY SAME e-mail body would be spam if it came from someone I didn't know!

2) signature-driven filters; these come in several types, including
  hash-driven products like Cloudmark SpamNet or DCC; these measure
  bulkiness (DCC) or spot content common to messages marked as spam by
  some process or another (SpamNet); these are equally prone to trust
  issues spread among those who are allowed to submit samples to the
  corpus(es), as well as to malfunctions in the submission process
  and badly vetted "trap" addresses, etc. I stopped using them in
  2003, because there were too many issues. This class also includes
  all antivirus engines, which operate primarily by signatures.

Rather than an unending series of these "whack-a-mole" games between virus authors and antivirus companies, it's far simpler to forbid ANY executable content whatsoever by default; and then, to allow the recipient to selectively (and very guardedly) re-enable executable content in E-mail from known senders who have a legitimate reason to send such content.

By the same token, obscured URLs and IP addresses, or decryption scripting, are virtually never found in legitimate and innocent E-mail. So again, it makes sense to simply forbid them entirely UNLESS the recipient has explicitly permitted them, from specifically identified senders whom they know and trust.

3) another type of signature-driven filter, which looks for constructs
  that increase the likelihood that the message is spam, and uses a
  scoring system to reject or quarantine such messages (e.g.,
SpamAssassin) - these are useful and can catch stuff the others miss,
  but they must be properly tuned and kept up to date in order to work
  properly. I primarily use these when the signature can be isolated
  to the headers or the SMTP conversation itself.

Unfortunately, good differentiation requires access to both the content AND the header evidence.

AND it requires the judgement of the recipient, and THEIR trust issues, which is why my approach isn't much liked by ISPs... they would, understandably, rather truncate spam loading further back in the network.

Doing things RIGHT isn't always easy or cheap.

4) Naive Bayesian filters, which use word frequencies to spot garbage
  but which must also be trained in the sense Gordon is talking about
  (different people have different mail flows, and what works for some
  may not work for others, or as your own mail flow changes). They're
  also trivial to game, by appending known "good" text such as news
  or literature to a minimal spam payload.
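
The padding attack described in item 4 can be illustrated with a toy word-frequency scorer. This is only a sketch under stated assumptions: the training sentences are invented, the two classes are given equal priors, and the names train and spam_log_odds are hypothetical. It is not how any particular product is implemented.

```python
import math
from collections import Counter

# Invented toy corpora, for illustration only.
SPAM_DOCS = ["cheap pills buy now", "buy cheap watches now"]
HAM_DOCS = [
    "the meeting is moved to monday",
    "the quarterly report is attached",
    "lunch on monday with the team",
]


def train(docs):
    """Count how often each lowercased word appears across the documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def spam_log_odds(text, spam_counts, ham_counts):
    """Sum per-word log-likelihood ratios (add-one smoothing, equal priors).

    A positive result leans spam, a negative result leans legitimate.
    """
    vocab = len(set(spam_counts) | set(ham_counts))
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    score = 0.0
    for word in text.lower().split():
        p_spam = (spam_counts[word] + 1) / (spam_total + vocab)
        p_ham = (ham_counts[word] + 1) / (ham_total + vocab)
        score += math.log(p_spam / p_ham)
    return score


spam_counts = train(SPAM_DOCS)
ham_counts = train(HAM_DOCS)

payload = "buy cheap pills now"
# The attacker appends text that resembles the recipient's normal mail.
padding = "the meeting is moved to monday the quarterly report is attached"

plain_score = spam_log_odds(payload, spam_counts, ham_counts)
padded_score = spam_log_odds(payload + " " + padding, spam_counts, ham_counts)
```

With this toy data the bare payload scores as spam, while the same payload followed by the "good" padding scores as legitimate, because every appended word pulls the total toward the ham side.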

These approaches also don't do well when people communicate in multiple languages. For example, I get mail in at least three languages (English, French, and Spanish). I know other people who get significant amounts of mail also in Hebrew. And I'm sure most international users are very familiar with this sort of concern.

FWIW, my approach involves less training of Bayesian filters, and more simply an issue of having a fine-grained permissions list with details on a sender-by-sender basis, and saying what I am willing to accept (or not) from each of those specific senders. If I get nonconforming E-mails, I would be glad to have those stored in a "suspense" folder somewhere which I could peruse (or not) at my leisure.

I would also be glad to be able to squash various sorts of E-mails mercilessly... some of them, I don't expect to lament missing, even if they had been legitimate.

5) URIBLs, which look for domains in message bodies (or, to a lesser
  extent, in the message headers) and compare them to databases full
  of domains that are known to be registered to spammers or otherwise
  bad actors; these are useful against a wide variety of spam, and
  especially so as spammers collaborate with evil registrars (see
  Estdomains, which ICANN recently shut down, or directi, which took
  over where Estdomains left off) and purchase and burn through large
  quantities of throwaway domains.

Yes, and that can be useful too, although as you point out spammers have realized that they can burn through large numbers of disposable domains.... registering a domain for $4-10 on a one-shot basis isn't particularly expensive, based on what they presumably take in by sending out a million copies of some spam message. And of course, they don't even necessarily need a domain name... they could just use an IP address. And there are nearly infinitely many ways to obscure either a URL, or an IP address.... which will defeat the most simple kinds of domain blacklisting. Again, it's often simpler and better to simply state that "legitimate E-mail doesn't use obscured URLs or obscured IP addresses" and simply trash such attempts as a class.
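
As a rough illustration of rejecting obscured URLs "as a class", the sketch below flags a few well-known obfuscation tricks: a userinfo part before an "@", percent-encoding in the host, and an IP address written as a single integer or with hex or octal octets. The function names looks_obscured and find_obscured_urls are hypothetical, the list of tricks is far from complete, and a plain dotted-quad address is deliberately not flagged here; whether to reject those too is a policy choice.

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
NUMERIC_LABEL_RE = re.compile(r"0x[0-9a-f]+|[0-9]+")


def looks_obscured(url: str) -> bool:
    """Return True if the URL's host part uses a common obfuscation trick."""
    parts = urlsplit(url)
    if "@" in parts.netloc:  # e.g. http://www.bank.example@203.0.113.5/
        return True
    if "%" in parts.netloc:  # percent-encoded characters in the host
        return True
    host = parts.hostname or ""  # hostname is lowercased, port removed
    labels = host.split(".")
    if all(NUMERIC_LABEL_RE.fullmatch(label) for label in labels):
        # An all-numeric host is acceptable only as a plain dotted quad:
        # four decimal octets, no leading zeros, each at most 255.
        plain_quad = len(labels) == 4 and all(
            label.isdigit()
            and (label == "0" or not label.startswith("0"))
            and int(label) <= 255
            for label in labels
        )
        return not plain_quad
    return False


def find_obscured_urls(text: str) -> list:
    """Return every http(s) URL in the text that looks obscured."""
    return [url for url in URL_RE.findall(text) if looks_obscured(url)]
```

A message for which find_obscured_urls returns anything would then be trashed or set aside unless the recipient has explicitly allowed such content from that sender.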

6) another type of signature filter looks for common mailing addresses,
  phone numbers, and the like in order to take advantage of the fact
  that some spammers actually do include their addresses in order to
  "comply" with CAN-SPAM. IIRC, Catherine Hampton's Spambouncer does
  a bit of this, among others.

Doesn't hurt, I suppose, to add the capability... although again, I prefer to use blocks that spammers can't trivially evade.

The biggest problem with content-driven filtering is that it can be
gamed (inserting a link to a spammy URL in an editorial or news item,
for example), worked around (always use new domains, updated templates)
and evaded (hashbusting, t|-|05e weird Subject liness).

Again, the fact is that I simply would be FAR more selective about what I receive when it comes from someone I don't have an established e-mail relationship with. I don't EVER want executables, or even attachments, from anybody I don't know yet. If they have something important to send me, they can send me a plain text E-mail first, and then if I'm interested, I can open the gate... _just_ wide enough to allow what I expect AND WANT to receive from them.

It's also important to recognize the SpamAssassin-type scoring systems can work DRAMATICALLY better if they are provided "clean" e-mail which has already been screened to not contain obscured content, HTML, encryption, attachments, and things of that sort which are traditionally used by spammers specifically to evade content filtering.

In sum, the strengths of content-driven filtering are few, though they
can be combined in scoring systems (if Bayes doesn't detect, maybe the
URI lookup will) and do tend to be good with spammy content that doesn't
change very often (419 scams, for example).

Sure. But again, YOU MUST CONSIDER WHO THE SENDER IS!!! That, combined with a much-more-restrictive default rule applied to mail from unknown senders, creates a far twistier and narrower path which is much harder for the random blasts of typical spam to penetrate very far into.

Prof. Ezor has already pointed out that expecting anyone to even know
they're sending HTML mail or multipart/alternative is and has been futile
for years, as end users migrate to new mail clients that default to
HTML when sending.

Sure, and until we begin to ENFORCE reasonable etiquette, it's naive to presume that people will pick it up natively. There will be at least some sort of learning curve, but I think it's not a particularly steep one. Encouraging people to blithely send HTML-burdened mail is just stupid, not just because it's so much more conducive to spammer evasions, but also because it tends to be so wasteful of Internet bandwidth and storage... typically 3-5x bulkier, and (often) for NOTHING.

Once sending HTML-burdened e-mail to someone you don't know is widely recognized as equivalent to a guarantee of non-reception, I don't think people will take too long to figure out what they need to do.

I can confirm that few know they are even sending
HTML, as we have to educate them whenever they join webdesign-l, which
doesn't permit HTML email (it screws up the digests, and is wasteful
in any case).

Yes. Again, SOMEONE needs to educate people.... these aren't things that most of us learned back in school. The fact that they haven't YET been educated, I think, isn't a good enough reason to never undertake that.

By eliminating HTML, you also eliminate malicious ActiveX, malicious images, hidden/misrepresented links, and a lot more.
Agreed. And yet, it's the default in nearly every non-console mail
program in existence. I've even seen Exchange servers configured to
*translate* outbound plain text mail to multipart/alternative, in order
to append lawyer-mandated disclaimers and the like.

As I said, there are a number of "impolite"/wasteful senders who will find that they need to rethink their strategies. That is NOT necessarily a bad thing.

In particular, by forbidding HTML (and thus also, scripting) and attachments in E-mails from unknown/untrusted senders, we can put a MAJOR crimp in the kinds of e-mails that are used to recruit zombie spambot armies.... which are themselves one of the biggest problems in the antispam fight. You don't necessarily need "antispam signature checking" and the whack-a-mole fight; just simply FORBID executable content, unless the sender is known and trusted. True, some recipients are stupid and will turn on whatever an incoming message tells them to turn on; as long as the software posts adequately dire-sounding warnings when they do so, there are some people who WILL shoot themselves in the foot, and there's probably little that can be done about that kind of Darwinian herd-thinning...!

Most of the non-content-inspection schemes (like SPF, which is stupid)
unreasonably (and unnecessarily) limit senders (who might, for
example, be sending from an inhabitual location, such as a cruise ship Internet cafe), and do nothing to stop mail from zombie spambot armies
which have commandeered friends' machines, and are sending their
infected or objectionable mail under that legitimate user's
qualifications or reputation.
I'd beg to differ. My project, enemieslist, is aimed squarely at helping
antispam companies and services do exactly that: stop mail from hosts
that have a high likelihood of being an end user with a bot (or multiple
bots) on their computer.

_I_ didn't claim that your software approach was a good one. In fact, I would suggest that it's probably fatally flawed.

It's NOT reasonable to (say) block mail entirely from even a compromised IP address which is sending some percentage of spam. A gateway machine (or DHCP/NAT router) might have many machines behind it, and ONE of those might be compromised and pouring spam into the Net. It's not particularly helpful to crudely block all the others... that's a ham-fisted approach. Note that my concept, for example, would reliably and rapidly block all of Aunt Gertrude's outgoing mail which contains worms or executable malware, while continuing to reliably and efficiently deliver all her normal and correct mail.

Companies which (for example) use a vanity domain might pass their outgoing e-mail through an SMTP server shared with hundreds or thousands of other companies. Blocking anything you decide you don't like that transits through that gateway machine, even temporarily, just because SOME machine that's relaying through there is infected... that's just a very ill-conceived approach to the problem.

> We use very few content-driven checks in the
sendmail package I wrote and which we use here, the vast vast bulk of
mail we reject here is rejected on the basis of its /origin/ and its
membership of a class of IP space (dynamic end user, university resnet,
generic, provider-assigned reverse DNS, etc.)

That's fine, to the extent that it works. And I understand that it's nice, in principle, to only have to look at the header of the message before deciding whether you're going to grab the rest of the message. Certainly, that's efficient.

The problem is that it's simply not very good.

And a person can't always determine which servers their outgoing mail is being processed through. For example, I've sent E-mail using my "terabites.com" domain from storefront Internet cafes in Mexico, post office E-mail access cubicles in Beijing, and cruise ship Internet cafes. In none of these cases did I have ANY control whatsoever over what SMTP server(s) were going to process my outgoing E-mail... but it's VERY safe to say that none of them were the same servers that traditionally or habitually are used to handle my "terabites.com" outgoing E-mail messages.

And by the same token, those of us who work out of home offices ABSOLUTELY might have a VERY legitimate reason to have our own (outgoing, at least) SMTP servers running here, even (yes) on a dynamically allocated IP address pool.

Again, I understand the motivation to WANT to do a halfassed job of antispam control... since it is MUCH easier, and cheaper, and handles "the majority" of the cases. But it causes all kinds of grief for many VERY legitimate users and systems, and therefore is a poor choice, no matter how seductive it looks to the casual (or clueless) observer.

By using a fine-grained "permissions" list, based on the sender of the
mail (AND SET BY THE RECIPIENT!!), one can achieve FAR better
antispam/antivirus/antiworm defenses than are possible using either
non-content-based, or only-content-based, antispam techniques by
themselves. PLUS, this returns control of their Inbox to the owner of
that Inbox, who ultimately is the only person whose opinion matters
when deciding what kind of mail they want to receive, and from whom.
Erm, no. While whitelisting is a perfect solution if your correspondents
never change, it's practically useless in the context of anyone with
more than "friends and family" type mail flows.

What I am proposing is ABSOLUTELY NOT a simple, braindead, "whitelist".

Consider the case of (say) Procter and Gamble's customer service department. Or even, for that matter, someone who has handed out several hundred (or thousand) business cards at a trade show. They ABSOLUTELY want to receive E-mails from previously unknown correspondents.

ON THE OTHER HAND, it's perfectly reasonable to establish some ground rules as to what kinds of E-mails those are expected and wanted to be. It would be hard to argue, for example, that (at least in the 'introduction' stage) there should EVER be any kind of executable attachments in those messages. It's equally reasonable to, in most cases, state categorically that scripting in those messages is undesired.

A reasonable first-cut default for previously unknown senders is: no HTML, no attachments, and a total message size less than (say) 50K bytes. (Hey, if you want bigger, then set the default limit bigger. But I doubt you want some unknown spambot to send you thousands of 10MB junk e-mails that completely and rapidly fill up your inbox to where it blocks further reception, either.)
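
The default rule above (no HTML, no attachments, under roughly 50K bytes) could be checked along these lines. This is a minimal sketch using Python's standard email package; the function name passes_default_rule is hypothetical, and treating every non-text part as an attachment is an assumption of the sketch, not something spelled out in the thread.

```python
from email import message_from_bytes, policy

MAX_BYTES = 50_000  # the "(say) 50K bytes" default; adjust to taste


def passes_default_rule(raw: bytes, max_bytes: int = MAX_BYTES) -> bool:
    """Return True if a message from an unknown sender fits the default rule."""
    if len(raw) >= max_bytes:
        return False
    msg = message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.get_content_type() == "text/html":
            return False  # no HTML
        if part.is_attachment() or part.get_filename():
            return False  # no attachments
        if part.get_content_maintype() != "text":
            return False  # assumption: any non-text part counts as an attachment
    return True
```

Mail that fails the check need not be deleted; it could just as well go to the "suspense" folder mentioned earlier.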

> It's also unworkable in
practice, because what do you whitelist?

You whitelist INDIVIDUAL E-MAIL ADDRESSES, based upon the SPECIFIC characteristics you expect in mail from THAT PERSON. (But note that for MANY senders you probably won't have to enable additional non-default permissions at all.)

The domain your friend sends
from (get ready for all the spam that leaks from every big ISP and ESP)?

No, of course not. You whitelist individual senders, at least most of the time. You should be able to whitelist whole domains, or even IP addresses, for those (rare) cases where that's appropriate.

Again, with a good default rule, the GREAT majority of senders won't need additional permissions at all.
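
One way to picture such a permissions list is a lookup that tries the individual address first, then the whole domain, and otherwise falls back to the restrictive default. The sketch below is only an illustration: the Permissions fields, the sample entries, and the function permissions_for are all hypothetical, and lookup by IP address is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permissions:
    """What the recipient is willing to accept from one sender."""
    allow_html: bool = False
    allow_attachments: bool = False
    allow_executables: bool = False
    max_bytes: int = 50_000


# The restrictive rule applied to anyone not listed.
DEFAULT = Permissions()

# Hypothetical entries; a real list would be maintained by the recipient.
PERMISSIONS = {
    "friend@example.org": Permissions(
        allow_html=True, allow_attachments=True, max_bytes=5_000_000
    ),
    # A leading "@" marks a whole-domain entry (the rarer case).
    "@lists.example.net": Permissions(max_bytes=200_000),
}


def permissions_for(sender: str, table=PERMISSIONS) -> Permissions:
    """Look up by exact address, then by domain, then fall back to DEFAULT."""
    sender = sender.strip().lower()
    if sender in table:
        return table[sender]
    domain = "@" + sender.rpartition("@")[2]
    return table.get(domain, DEFAULT)
```

Most senders never appear in the table at all, which matches the point that the default rule should cover the great majority of legitimate mail.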

(And note that it's probably appropriate, again for most senders, to have their mail go through a SpamAssassin-type filter AFTER the first-level sender-based permissions acceptance criteria).

The IP (forget about big webmail farms, then)? The sender address (all
too often forged by the very bots you're trying to avoid)?

They CAN forge a sender address, but most spam also won't LOOK LIKE the mail you get from that person. Again, once you deny them HTML, attachments, scripting, obscured URLs and such, you've in one fell swoop almost completely eliminated (and in a very robust way) the great majority of the tricks used by spammers/phishers/worms to distribute their malware, and to evade antispam filters.

What happens
when it's okay for me to email you from this address, but I have an
issue with my mail and have to resort to using my gmail account?

You simply need to alert the recipient that you will also be using that address... IF you are sending mail of a sort which cannot fall within the range of "default" acceptance rules. I think it's fair to point out that the GREAT majority of legitimate mail can fit within the default rules, which still block the great majority of malware-containing messages.

What
if it's okay for me to email you, but not okay for my wife to email you
to tell you I've got a cold and won't make dinner?

Again, default mail rules should allow her simple and direct e-mails to pass through just fine.

The only way to kill spam was demonstrated a few days ago: disconnect
those who offer services to spammers.

NO.

The problem is that a LARGE percentage of spam is sent out by compromised zombie spambot armies, using the reputation of their (usually clueless) owner.

Even in the case you're referring to, the machines that were disconnected were NOT the ones that were sending the spam... they were the ones the zombies were contacting to get their orders.

In the meantime, while there are
still those who will host command-and-control hosts for botnets, we
have to stop mail coming from bots and behind insecure NATs and the
like;

The problem is when tens of thousands of legitimate users share the same NAT router or SMTP server(s). It is stupid and braindead to block the mail of all those other responsible users too.

> ...the best way I've seen for doing this is to know how end user
hosts are named, treat mail from them with extreme suspicion (as they
should be using their ISP's outbounds) and *never forget* those names,
by storing them as patterns instead of individual hosts or IPs.
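
Storing end-user host names "as patterns instead of individual hosts or IPs" might look roughly like the sketch below. The two naming patterns and the function looks_like_end_user_host are invented for illustration; they are not taken from enemieslist or from any real provider's naming scheme.

```python
import re

# Hypothetical reverse-DNS naming patterns, invented for illustration only.
END_USER_PATTERNS = [
    # e.g. 203-0-113-7.dsl.example.net (address embedded in the host name)
    re.compile(r"^\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3}\.dsl\.example\.net$"),
    # e.g. dhcp-0a1b2c.resnet.example.edu (university residential network)
    re.compile(r"^dhcp-[0-9a-f]+\.resnet\.example\.edu$"),
]


def looks_like_end_user_host(rdns_name: str) -> bool:
    """Return True if the reverse-DNS name matches a known end-user pattern."""
    name = rdns_name.strip().rstrip(".").lower()
    return any(pattern.match(name) for pattern in END_USER_PATTERNS)
```

Because each pattern covers a whole naming scheme, a match does not depend on having seen that particular address before; what to do with a match (reject, score, or merely flag) is a separate policy decision.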

It's often not practical to use the ISP's SMTP servers if you are using a personal/vanity/small-business domain. So I might agree that it's reasonable to treat such mail as "of interest" but it's certainly NOT reasonable to crudely block it as a class.

DNSBLs can't do it, because they mostly work on IP addresses, not names,
and because the volumes they have to deal with require them to expire
listings rapidly. URIBLs are useful, but it's small comfort to a box
that's already running hundreds of concurrent processes just to deal
with the onslaught, to have to accept mail just so it can analyze the
domains in links found in the bodies.

At some point, it makes sense to offload a lot of the spam-discrimination to the recipient-end-user's system... system-wide, there is in aggregate far more computing power there than there probably is in the mail-handling systems along the Net en route.

Agreed that involves more bandwidth, but the bandwidth is probably cheaper (for many cases at least) than the computing power is.

Besides, it keeps the recipient more in control (and the control SHOULD be theirs) when the final determination is done under THEIR control.

Steve

--

Gordon Peterson II
http://personal.terabites.com
1977-2007:  Thirty year anniversary of local area networking



