Full Disclosure mailing list archives

Re: Spam with PGP


From: Bob Apthorpe <apthorpe+fd () cynistar net>
Date: Tue, 7 Oct 2003 15:18:28 -0500 (CDT)

Hi,

I suggest that, before you start explaining what SpamAssassin does and how
it does it, you visit http://www.spamassassin.org/, specifically the
README at http://www.spamassassin.org/full/2.7x/dist/README.

On Tue, 7 Oct 2003, Jonathan A. Zdziarski wrote:
> [missing attribution] wrote:
>> Of course, SpamAssassin does bayesian filtering as well.
>>
>> heuristic + bayesian is better than either alone, IMHO.
>
> Actually the way SA does it weakens filtering.  SA's bayesian filtering
> is only a very small piece of SA, and unfortunately not much attention
> has been given to it.  The filter's final calculation is only a small
> percentage of the actual final score.

Here are SA's Bayesian scores; the four columns of scores are:
1: no network tests (DNSBLs, Razor, DCC, Pyzor), no Bayes
2: network tests, no Bayes
3: no network test, Bayes
4: network tests, Bayes

score BAYES_00 0 0 -4.901 -4.900
score BAYES_01 0 0 -0.600 -1.524
score BAYES_10 0 0 -0.734 -0.908
score BAYES_20 0 0 -0.127 -1.428
score BAYES_30 0 0 -0.349 -0.904
score BAYES_40 0 0 -0.001 -0.001
score BAYES_44 0 0 -0.001 -0.001
score BAYES_50 0 0 0.001 0.001
score BAYES_56 0 0 0.001 0.001
score BAYES_60 0 0 1.789 1.592
score BAYES_70 0 0 2.142 2.255
score BAYES_80 0 0 2.442 1.657
score BAYES_90 0 0 2.454 2.101
score BAYES_99 0 0 5.400 5.400

The lowest non-negligible positive Bayesian score (BAYES_60 w/network
tests) is 1.592, providing ~32% of the (default) 5 points necessary for a
message to be flagged as spam. This would appear to counter your claim that
SA's Bayesian classifier provides only a small fraction of the total score.
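
For anyone who wants to check the arithmetic, here is a rough Python
sketch. It reads "score" lines in the four-column layout above, picks the
column that matches the enabled features, and expresses a score as a share
of the threshold. The function names are mine, not anything in SA, and
only three rows of the table are copied in to keep it short.

```python
# Sketch only: parses "score NAME c1 c2 c3 c4" lines as listed above.
THRESHOLD = 5.0  # default points needed to flag a message as spam

SCORE_LINES = """\
score BAYES_00 0 0 -4.901 -4.900
score BAYES_60 0 0 1.789 1.592
score BAYES_99 0 0 5.400 5.400
"""


def column_index(network_tests, bayes):
    """Map the enabled features to one of the four columns (0-based).

    0: no network, no Bayes   1: network, no Bayes
    2: no network, Bayes      3: network, Bayes
    """
    return (2 if bayes else 0) + (1 if network_tests else 0)


def parse_scores(text):
    """Return {rule_name: [four floats]} for every well-formed score line."""
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6 and parts[0] == "score":
            scores[parts[1]] = [float(p) for p in parts[2:]]
    return scores


def share_of_threshold(rule, network_tests, bayes, scores, threshold=THRESHOLD):
    """Fraction of the spam threshold contributed by a single rule."""
    return scores[rule][column_index(network_tests, bayes)] / threshold


SCORES = parse_scores(SCORE_LINES)
```

With network tests and Bayes both enabled, BAYES_60 works out to about
32% of the threshold, and BAYES_99 (5.400) exceeds the threshold by
itself if no negative-scoring rule fires.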

> Because true Bayesian filtering performs a huge majority of the same
> tests that SA performs, SA's own ruleset easily waters down any bayesian
> findings whenever there are opposing values between the two.

The Bayesian classifier does not perform the same rule-based heuristic
tests. If the end user has been vigilant in training the Bayesian
classifier, it's rare for the statistical and heuristic scores to be both
large and of opposite sign.
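
To make the relationship concrete: the Bayes result enters the total as
one more scored rule (BAYES_00 through BAYES_99) alongside the heuristic
rules, and the sum is compared with the threshold. A minimal Python
sketch of that additive model follows. BAYES_90 uses the network+Bayes
score from the table above; the two EXAMPLE_* rules and their scores are
invented purely for illustration and do not exist in SA.

```python
THRESHOLD = 5.0  # SA's default spam threshold, as cited above

# BAYES_90 is taken from the score table; the EXAMPLE_* entries are
# hypothetical heuristic rules with made-up scores.
SCORES = {
    "BAYES_90": 2.101,
    "EXAMPLE_HEURISTIC_A": 1.5,
    "EXAMPLE_HEURISTIC_B": 1.5,
}


def total_score(hits, scores=SCORES):
    """Sum the scores of every rule that fired on a message."""
    return sum(scores[name] for name in hits)


def is_spam(hits, scores=SCORES, threshold=THRESHOLD):
    """Flag the message once the summed score reaches the threshold."""
    return total_score(hits, scores) >= threshold
```

In this toy case the two heuristic hits alone total 3.0 and fall short,
while the same hits plus BAYES_90 total 5.101 and cross the threshold.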

> For example, a pine MUA...SA thinks a
> pine MUA suggests an innocent message, but a majority of the emails with
> a pine MUA my wife receives are spams.  In this case, the hard-coded MUA
> rule will unfortunately water down the score, even if Bayes thinks a
> pine MUA is spam.  Obviously the pine MUA is just a small rule, but if
> you apply this to the other rules, you get the same results.

SA 2.5x had a number of negative-scoring tests that were easily forged
(various MUA signatures, REFERENCES, IN_REP_TO, PGP signatures, etc.).
These rules have either been dropped from SA 2.60 or had their scores
greatly reduced to counter this known problem.

> What's worse is that last time I looked (this may have changed), SA's
> bayesian filter did not appear to have a mechanism for learning, but was
> just a static dictionary.  If users got spam there was no way for the
> user to forward their spams into the system for processing.  Again, this
> may have changed and if it has, that's great.

SA has included sa-learn for manual training ever since the Bayesian
classifier was incorporated into the code (v2.50). Additionally, SA
contains thresholds above/below which messages will be automatically
learned as spam/ham, so the system trains itself (albeit slowly) without
user intervention.
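
The self-training behaviour amounts to a pair of cutoffs on the message
score. Here is a minimal Python sketch of that decision. The function
name and the two threshold values are placeholders of mine for
illustration; consult your own SA configuration for the real settings.

```python
from typing import Optional

# Placeholder cutoffs for illustration only, not read from any SA config.
LEARN_AS_HAM_BELOW = 0.1
LEARN_AS_SPAM_ABOVE = 12.0


def auto_learn_decision(
    score: float,
    ham_below: float = LEARN_AS_HAM_BELOW,
    spam_above: float = LEARN_AS_SPAM_ABOVE,
) -> Optional[str]:
    """Return "spam" or "ham" if the score is extreme enough to train on.

    Scores between the two cutoffs are too ambiguous to learn from
    automatically, so None is returned and the message is left for
    manual training.
    """
    if score > spam_above:
        return "spam"
    if score < ham_below:
        return "ham"
    return None
```

Only clear-cut messages are learned this way, which is why the automatic
training is slow; anything in the middle band still needs sa-learn.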

> The product of Bayesian filtering includes all the heuristic tests as
> well, so having both _hurts_ you, and is not something you benefit
> from.

No, it does not, on all counts. You need to review the difference between
heuristic and statistical classifiers.

> It is much better to focus on creating a strong probability-based
> filter IMHO...and I think the statistics agree with me.

Then perhaps you should join forces with the people already performing
such statistical comparisons between SpamAssassin, CRM114, bogofilter, and
the like. The SA development list is at
http://lists.sourceforge.net/mailman/listinfo/spamassassin-devel

This problem (evading spam-filtering by including a bogus PGP sig) is a
recognized and dead issue. The solution is to keep your security tools
up-to-date. As SA filters more spam, spammers will find new ways around
the filters, heuristic, statistical, or otherwise.

-- 
Bob Apthorpe

_______________________________________________
Full-Disclosure - We believe in it.
Charter: http://lists.netsys.com/full-disclosure-charter.html

