Interesting People mailing list archives

Re: content filters

From: David Farber <dave () farber net>
Date: Mon, 1 Dec 2008 20:40:13 -0500



Begin forwarded message:

From: Gordon Peterson <gep2 () terabites com>
Date: December 1, 2008 8:00:01 PM EST
To: Steven Champeon <schampeo () hesketh com>
Cc: David Farber <dave () farber net>
Subject: Re: content filters

Steven Champeon wrote:

Dave, for IP if you think appropriate.
on Sun, Nov 30, 2008 at 12:31:41PM -0500, Gordon Peterson wrote:
I however STRONGLY disagree with Rich's claim about content
inspection. as an antispam technique.
The trick, however, is to COMBINE content detection SPECIFIC TO THE SENDER.
There are several different sorts of "content detection" in the antispam
world:

To begin with, you shouldn't limit yourself to only what people are doing widely NOW. If their approaches had solved the problem, we wouldn't be having this discussion.

1) keyword filters (often found in corporate gateways, to implement
  policies related to sexual harassment or "the hostile workplace");
  these are often so poorly written that even confirmed opt-in list
  mail (like IP, or my longest running list, webdesign-l) falls victim
  to their inanities. I often simply unsub members who are behind these
  things, because it's a waste of my time to even try to educate them
  about the limitations of this approach.

Simple keyword matching is too limited. An example frequently cited is a site to support breast cancer survivors being blocked because of its discussion about "breasts".

Again, it's nearly impossible to reliably consider content of e-mail messages UNLESS YOU ALSO condition that based upon a fine-grained permissions list, ESTABLISHED BY THE RECIPIENT, which specifies what kinds of content they expect AND WANT to receive from EACH GIVEN SENDER.

As in the example I mentioned, there are e-mails I will happily receive from some of my friends, where the IDENTICAL e-mail body would be spam if it came from someone I didn't know!

2) signature-driven filters; these come in several types, including
  hash-driven products like Cloudmark SpamNet or DCC; these measure
  bulkiness (DCC) or spot content common to messages marked as spam by
  some process or another (SpamNet); these are equally prone to trust
  issues spread among those who are allowed to submit samples to the
  corpus(es), as well as to malfunctions in the submission process
  and badly vetted "trap" addresses, etc. I stopped using them in
  2003, because there were too many issues. This class also includes
  all antivirus engines, which operate primarily by signatures.

Rather than an unending series of these "whack-a-mole" games between virus authors and antivirus companies, it's far simpler to forbid ANY executable content whatsoever by default, and then to allow the recipient to selectively (and very guardedly) re-enable executable content in E-mail from known senders who have a legitimate reason to send such content.

By the same token, obscured URLs and IP addresses, or decryption scripting, are virtually never found in legitimate and innocent E-mail. So again, it makes sense to simply forbid them entirely UNLESS the recipient has explicitly permitted them, from specifically identified senders who they know and trust.

3) another type of signature-driven filter, which looks for constructs
  that increase the likelihood that the message is spam, and uses a
  scoring system to reject or quarantine such messages (e.g.,
  SpamAssassin) - these are useful and can catch stuff the others miss,
  but they must be properly tuned and kept up to date in order to work
  properly. I primarily use these when the signature can be isolated
  to the headers or the SMTP conversation itself.

Unfortunately, good differentiation requires access to both the content AND the header evidence.

AND it requires the judgement of the recipient, and THEIR trust issues, which is why my approach isn't much liked by ISPs... they would, understandably, rather truncate spam loading further back in the network.


Doing things RIGHT isn't always easy or cheap.

4) Naive Bayesian filters, which use word frequencies to spot garbage
  but which must also be trained in the sense Gordon is talking about
  (different people have different mail flows, and what works for some
  may not work for others, or as your own mail flow changes). They're
  also trivial to game, by appending known "good" text such as news
  or literature to a minimal spam payload.

These approaches also don't do well when people communicate in multiple languages. For example, I get mail in at least three languages (English, French, and Spanish). I know other people who get significant amounts of mail also in Hebrew. And I'm sure most international users are very familiar with this sort of concern.

FWIW, my approach relies less on training Bayesian filters and more on simply having a fine-grained permissions list, with details on a sender-by-sender basis, saying what I am willing to accept (or not) from each of those specific senders. If I get nonconforming E-mails, I would be glad to have those stored in a "suspense" folder somewhere which I could peruse (or not) at my leisure.

I would also be glad to be able to squash various sorts of E-mails mercilessly... some of them, I don't expect to lament missing, even if they had been legitimate.

5) URIBLs, which look for domains in message bodies (or, to a lesser
  extent, in the message headers) and compare them to databases full
  of domains that are known to be registered to spammers or otherwise
  bad actors; these are useful against a wide variety of spam, and
  especially so as spammers collaborate with evil registrars (see
  Estdomains, which ICANN recently shut down, or directi, which took
  over where Estdomains left off) and purchase and burn through large
  quantities of throwaway domains.

Yes, and that can be useful too, although as you point out spammers have realized that they can burn through large numbers of disposable domains... registering a domain for $4-10 on a one-shot basis isn't particularly expensive, based on what they presumably take in by sending out a million copies of some spam message. And of course, they don't even necessarily need a domain name... they could just use an IP address. And there are nearly infinitely many ways to obscure either a URL or an IP address, which will defeat the simplest kinds of domain blacklisting. Again, it's often simpler and better to state that "legitimate E-mail doesn't use obscured URLs or obscured IP addresses" and simply trash such attempts as a class.
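
As a rough illustration of trashing obscured addresses "as a class" (not code from either correspondent), the sketch below flags a few well-known ways of disguising the host in a URL: a userinfo part before the real host, percent-escapes in the host, and IP addresses written as a single integer or in hex or octal. The function names and reason strings are assumptions made for the example, and the checks cover only a small part of the obfuscation space.

```python
import re
from typing import List, Optional, Tuple
from urllib.parse import urlsplit

# Rough URL finder for plain-text bodies (an assumption, not a full URL grammar).
_URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
# One dot-separated host label written as a decimal, octal, or hex number.
_NUMERIC_LABEL = re.compile(r"^(?:0x[0-9a-f]+|[0-9]+)$", re.IGNORECASE)


def obscured_reason(url: str) -> Optional[str]:
    """Return why a URL's host looks disguised, or None if it looks ordinary."""
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").rstrip(".")
    except ValueError:
        return "unparseable URL"
    if "@" in parts.netloc:
        return "userinfo before host"  # e.g. http://www.example.com@10.0.0.1/
    if "%" in parts.netloc:
        return "percent-encoded host"
    labels = host.split(".")
    if host and all(_NUMERIC_LABEL.match(label) for label in labels):
        # Only a plain dotted quad (four decimal parts, no leading zeros)
        # counts as an undisguised IP address here.
        plain_quad = len(labels) == 4 and all(
            label.isdigit()
            and (label == "0" or not label.startswith("0"))
            and int(label) <= 255
            for label in labels
        )
        if not plain_quad:
            return "non-canonical numeric host"  # e.g. http://3232235777/
    return None


def find_obscured_urls(body: str) -> List[Tuple[str, str]]:
    """Collect (url, reason) pairs for every disguised URL found in a body."""
    hits = []
    for url in _URL_RE.findall(body):
        reason = obscured_reason(url)
        if reason:
            hits.append((url, reason))
    return hits
```

A filter built this way would reject or quarantine any message for which find_obscured_urls returns a non-empty list, unless the recipient has explicitly allowed such content from that sender.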

6) another type of signature filter looks for common mailing addresses,
  phone numbers, and the like in order to take advantage of the fact
  that some spammers actually do include their addresses in order to
  "comply" with CAN-SPAM. IIRC, Catherine Hampton's Spambouncer does
  a bit of this, among others.

Doesn't hurt, I suppose, to add the capability... although again, I prefer to use blocks that spammers can't trivially evade.

The biggest problem with content-driven filtering is that it can be
gamed (inserting a link to a spammy URL in an editorial or news item,
for example), worked around (always use new domains, updated templates)
and evaded (hashbusting, t|-|05e weird Subject lines).

Again, the fact is that I simply would be FAR more selective about what I receive when it comes from someone I don't have an established e-mail relationship with. I don't EVER want executables, or even attachments, from anybody I don't know yet. If they have something important to send me, they can send me a plain text E-mail first, and then if I'm interested, I can open the gate... _just_ wide enough to allow what I expect AND WANT to receive from them.

It's also important to recognize that SpamAssassin-type scoring systems can work DRAMATICALLY better if they are provided "clean" e-mail which has already been screened to not contain obscured content, HTML, encryption, attachments, and things of that sort which are traditionally used by spammers specifically to evade content filtering.

In sum, the strengths of content-driven filtering are few, though they
can be combined in scoring systems (if Bayes doesn't detect, maybe the
URI lookup will) and do tend to be good with spammy content that doesn't
change very often (419 scams, for example).

Sure. But again, YOU MUST CONSIDER WHO THE SENDER IS!!! That, combined with a much-more-restrictive default rule applied to mail from unknown senders, creates a far twistier and narrower path which is much harder for the random blasts of typical spam to penetrate very far into.

Prof. Ezor has already pointed out that expecting anyone to even know
they're sending HTML mail or multipart/alternative is and has been futile
for years, as end users migrate to new mail clients that default to
HTML when sending.

Sure, and until we begin to ENFORCE reasonable etiquette, it's naive to presume that people will pick it up natively. There will be at least some sort of learning curve, but I think it's not a particularly steep one. Encouraging people to blithely send HTML-burdened mail is just stupid, not just because it's so much more conducive to spammer evasions, but also because it tends to be so wasteful of Internet bandwidth and storage... typically 3-5x bulkier, and (often) for NOTHING.

Once sending HTML-burdened e-mail to someone you don't know is widely recognized as equivalent to a guarantee of non-reception, I don't think people will take too long to figure out what they need to do.

I can confirm that few know they are even sending
HTML, as we have to educate them whenever they join webdesign-l, which
doesn't permit HTML email (it screws up the digests, and is wasteful
in any case).

Yes. Again, SOMEONE needs to educate people.... these aren't things that most of us learned back in school. The fact that they haven't YET been educated, I think, isn't a good enough reason to never undertake that.

By eliminating HTML, you also eliminate malicious ActiveX, malicious images, hidden/misrepresented links, and a lot more.

Agreed. And yet, it's the default in nearly every non-console mail
program in existence. I've even seen Exchange servers configured to
*translate* outbound plain text mail to multipart/alternative, in order
to append lawyer-mandated disclaimers and the like.

As I said, there are a number of "impolite"/wasteful senders who will find that they need to rethink their strategies. That is NOT necessarily a bad thing.

In particular, by forbidding HTML (and thus also, scripting) and attachments in E-mails from unknown/untrusted senders, we can put a MAJOR crimp in the kinds of e-mails that are used to recruit zombie spambot armies.... which are themselves one of the biggest problems in the antispam fight. You don't necessarily need "antispam signature checking" and the whack-a-mole fight; just simply FORBID executable content, unless the sender is known and trusted. True, some recipients are stupid and will turn on whatever an incoming message tells them to turn on; as long as the software posts adequately dire-sounding warnings when they do so, there are some people who WILL shoot themselves in the foot, and there's probably little that can be done about that kind of Darwinian herd-thinning...!

Most of the non-content-inspection schemes (like SPF, which is stupid)
unreasonably (and unnecessarily) limit senders (who might, for
example, be sending from an inhabitual location, such as a cruise ship
Internet cafe), and do nothing to stop mail from zombie spambot armies
which have commandeered friends' machines, and are sending their
infected or objectionable mail under that legitimate user's
qualifications or reputation.

I'd beg to differ. My project, enemieslist, is aimed squarely at helping
antispam companies and services do exactly that: stop mail from hosts
that have a high likelihood of being an end user with a bot (or multiple
bots) on their computer.

_I_ didn't claim that your software approach was a good one. In fact, I would suggest that it's probably fatally flawed.

It's NOT reasonable to (say) block mail entirely from even a compromised IP address which is sending some percentage of spam. A gateway machine (or DHCP/NAT router) might have many machines behind it, and ONE of those might be compromised and pouring spam into the Net. It's not particularly helpful to crudely block all the others... that's a ham-fisted approach. Note that my concept, for example, would reliably and rapidly block all of Aunt Gertrude's outgoing mail which contains worms or executable malware, while continuing to reliably and efficiently deliver all her normal and correct mail.

Companies which (for example) use a vanity domain might pass their outgoing e-mail through an SMTP server shared with hundreds or thousands of other companies. Blocking anything you decide you don't like that transits through that gateway machine, even temporarily, just because SOME machine that's relaying through there is infected... that's just a very ill-conceived approach to the problem.


> We use very few content-driven checks in the
sendmail package I wrote and which we use here, the vast vast bulk of
mail we reject here is rejected on the basis of its /origin/ and its
membership of a class of IP space (dynamic end user, university resnet,
generic, provider-assigned reverse DNS, etc.)

That's fine, to the extent that it works. And I understand that it's nice, in principle, to only have to look at the header of the message before deciding whether you're going to grab the rest of the message. Certainly, that's efficient.


The problem is that it's simply not very good.

And a person can't always determine which servers their outgoing mail is being processed through. For example, I've sent E-mail using my "terabites.com" domain from storefront Internet cafes in Mexico, post office E-mail access cubicles in Beijing, and cruise ship Internet cafes. In none of these cases did I have ANY control whatsoever about what SMTP server(s) were going to process my outgoing E-mail... but it's a VERY safe bet that none of them were the same servers that traditionally or habitually are used to handle my "terabites.com" outgoing E-mail messages.

And by the same token, those of us who work out of home offices ABSOLUTELY might have a VERY legitimate reason to have our own (outgoing, at least) SMTP servers running here, even (yes) on a dynamically allocated IP address pool.

Again, I understand the motivation to WANT to do a half-assed job of antispam control... since it is MUCH easier, and cheaper, and handles "the majority" of the cases. But it causes all kinds of grief for many VERY legitimate users and systems, and therefore is a poor choice, no matter how seductive it looks to the casual (or clueless) observer.

By using a fine-grained "permissions" list, based on the sender of the
mail (AND SET BY THE RECIPIENT!!), one can achieve FAR better
antispam/antivirus/antiworm defenses than are possible using either
non-content-based, or only-content-based, antispam techniques by
themselves. PLUS, this returns control of their Inbox to the owner of
that Inbox, who ultimately is the only person whose opinion matters
when deciding what kind of mail they want to receive, and from who.

Erm, no. While whitelisting is a perfect solution if your correspondents
never change, it's practically useless in the context of anyone with
more than "friends and family" type mail flows.


What I am proposing is ABSOLUTELY NOT a simple, braindead, "whitelist".

Consider the case of (say) Procter and Gamble's customer service department. Or even, for that matter, someone who has handed out several hundred (or thousand) business cards at a trade show. They ABSOLUTELY want to receive E-mails from previously unknown correspondents.

ON THE OTHER HAND, it's perfectly reasonable to establish some ground rules as to what kinds of E-mails those are expected and wanted to be. It would be hard to argue, for example, that (at least in the 'introduction' stage) there should EVER be any kind of executable attachments in those messages. It's equally reasonable to, in most cases, state categorically that scripting in those messages is undesired.

A reasonable first-cut default for previously unknown senders is: no HTML, no attachments, and a total message size less than (say) 50 Kbytes. (Hey, if you want bigger, then set the default limit bigger. But I doubt you want some unknown spambot to send you thousands of 10 MB junk e-mails that completely and rapidly fill up your inbox to where it blocks further reception, either.)
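
As a rough illustration of that default rule (not code from either correspondent), the sketch below checks a raw message against a per-sender permissions table using Python's standard email package. The names Permissions, SENDER_RULES, violations and route, the example addresses, and the returned folder labels are assumptions made for the example; only the no-HTML, no-attachments, 50 Kbyte default and the idea of a "suspense" folder come from the discussion itself.

```python
from dataclasses import dataclass
from email import message_from_string
from email.utils import parseaddr
from typing import List


@dataclass(frozen=True)
class Permissions:
    """What the recipient will accept from one sender (hypothetical model)."""
    allow_html: bool = False
    allow_attachments: bool = False
    max_bytes: int = 50 * 1024  # the "(say) 50 Kbytes" default


# Default applied to previously unknown senders.
DEFAULT = Permissions()

# Per-sender exceptions set by the recipient; the address is made up.
SENDER_RULES = {
    "friend@example.org": Permissions(True, True, 10 * 1024 * 1024),
}


def violations(raw: str) -> List[str]:
    """List the ways a raw message exceeds what its sender is permitted."""
    msg = message_from_string(raw)
    sender = parseaddr(msg.get("From", ""))[1].lower()
    perms = SENDER_RULES.get(sender, DEFAULT)

    found: List[str] = []
    if len(raw.encode("utf-8")) > perms.max_bytes:
        found.append("too large")
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        is_attachment = (
            part.get_content_disposition() == "attachment"
            or part.get_filename() is not None
            or part.get_content_maintype() != "text"
        )
        if is_attachment:
            if not perms.allow_attachments and "attachment" not in found:
                found.append("attachment")
        elif part.get_content_type() == "text/html":
            if not perms.allow_html and "html" not in found:
                found.append("html")
    return found


def route(raw: str) -> str:
    """Deliver conforming mail; park everything else for optional review."""
    return "suspense" if violations(raw) else "inbox"
```

Note that the lookup keys on the From header, which can be forged; the sketch relies on the restrictive default, rather than on the sender address alone, to keep most spam out.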


> It's also unworkable in practice, because what do you whitelist?

You whitelist INDIVIDUAL E-MAIL ADDRESSES, based upon the SPECIFIC characteristics you expect in mail from THAT PERSON.... (but note that for MANY senders you probably won't have to enable additional non-default permissions at all.)

The domain your friend sends
from (get ready for all the spam that leaks from every big ISP and ESP)?

No, of course not. You whitelist individual senders, at least most of the time. You should be able to whitelist whole domains, or even IP addresses, for those (rare) cases where that's appropriate.

Again, with a good default rule, the GREAT majority of senders won't need additional permissions at all.

(And note that it's probably appropriate, again for most senders, to have their mail go through a SpamAssassin-type filter AFTER the first-level sender-based permissions acceptance criteria.)

The IP (forget about big webmail farms, then)? The sender address (all
too often forged by the very bots you're trying to avoid)?

They CAN forge a sender address, but most spam also won't LOOK LIKE the mail you get from that person. Again, once you deny them HTML, attachments, scripting, obscured URLs and such, you've in one fell swoop almost completely eliminated (and in a very robust way) the great majority of the tricks used by spammers/phishers/worms to distribute their malware, and to evade antispam filters.

What happens
when it's okay for me to email you from this address, but I have an
issue with my mail and have to resort to using my gmail account?

You simply need to alert the recipient that you will also be using that address... IF you are sending mail of a sort which cannot fall within the range of "default" acceptance rules. I think it's fair to point out that the GREAT majority of legitimate mail can fit within the default rules, which still block the great majority of malware-containing messages.

What
if it's okay for me to email you, but not okay for my wife to email you
to tell you I've got a cold and won't make dinner?

Again, default mail rules should allow her simple and direct e-mails to pass through just fine.

The only way to kill spam was demonstrated a few days ago: disconnect
those who offer services to spammers.

NO.

The problem is that a LARGE percentage of spam is sent out by compromised zombie spambot armies, using the reputation of their (usually clueless) owner.

Even in the case you're referring to, the machines that were disconnected were NOT the ones that were sending the spam... it was the ones the zombies were going to to get their orders.

In the meantime, while there are
still those who will host command-and-control hosts for botnets, we
have to stop mail coming from bots and behind insecure NATs and the
like;

The problem is when tens of thousands of legitimate users share the same NAT router or SMTP server(s). It is stupid and braindead to block the mail of all those other responsible users too.


> ...the best way I've seen for doing this is to know how end user
hosts are named, treat mail from them with extreme suspicion (as they
should be using their ISP's outbounds) and *never forget* those names,
by storing them as patterns instead of individual hosts or IPs.
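
To make the idea of storing names "as patterns instead of individual hosts or IPs" concrete, here is a minimal sketch that classifies a reverse-DNS name against a small pattern table. The three patterns, the class labels, and the example hostnames are invented for illustration; they are not taken from enemieslist or from any ISP's actual naming scheme.

```python
import re
from typing import Optional

# Hypothetical naming conventions; a real list would be collected per provider.
END_USER_PATTERNS = [
    ("dynamic", re.compile(
        r"^(?:\d{1,3}[-.]){3}\d{1,3}\.(?:dyn|dynamic|dhcp|dsl|cable)\.")),
    ("generic", re.compile(
        r"^(?:host|ip|pool)[-.]?(?:\d{1,3}[-.]){3}\d{1,3}\.")),
    ("resnet", re.compile(r"\.resnet\.[a-z0-9-]+\.edu$")),
]


def classify_rdns(hostname: str) -> Optional[str]:
    """Return the class of end-user naming a hostname matches, if any."""
    name = hostname.rstrip(".").lower()
    for label, pattern in END_USER_PATTERNS:
        if pattern.search(name):
            return label
    return None
```

Because one pattern covers every address a provider names the same way, the table stays small and does not need to expire entries the way a per-IP listing does. Whether a match should mean rejection or merely closer scrutiny is exactly the point under dispute in this thread.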

It's often not practical to use the ISP's SMTP servers if you are using a personal/vanity/small-business domain. So I might agree that it's reasonable to treat such mail as "of interest" but it's certainly NOT reasonable to crudely block it as a class.

DNSBLs can't do it, because they mostly work on IP addresses, not names,
and because the volumes they have to deal with require them to expire
listings rapidly. URIBLs are useful, but it's small comfort to a box
that's already running hundreds of concurrent processes just to deal
with the onslaught, to have to accept mail just so it can analyze the
domains in links found in the bodies.

At some point, it makes sense to offload a lot of the spam-discrimination to the recipient-end-user's system... system-wide, there is in aggregate far more computing power there than there probably is in the mail-handling systems along the Net en route.

Agreed that involves more bandwidth, but the bandwidth is probably cheaper (for many cases at least) than the computing power is.

Besides, it keeps the recipient more in control (and the control SHOULD be theirs) when the final determination is done under THEIR control.

Steve


--

Gordon Peterson II
http://personal.terabites.com
1977-2007:  Thirty year anniversary of local area networking




