Security Basics mailing list archives

Re: Where to get spam?


From: "Micheal Espinola Jr" <michealespinola () gmail com>
Date: Mon, 19 Feb 2007 22:33:17 -0500

That's where signatures and heuristics fail, and a properly tuned
(truly randomized) Bayesian database succeeds.

For instance:  The anti-spam product ASSP keeps two directories of
saved emails: ham and spam.  These collections are referred to as the
corpuses.  All messages, both ham and spam, are saved into the
corpuses.  On a scheduled basis, each corpus is trimmed to a set
maximum number of messages, which are then used to [re]build the
Bayesian database.

BUT to prevent intentional spammer abuse, the cleanup of the corpus is
randomized.  Deletion of corpus messages is not based on date, so the
corpus can retain messages that are years old.  This also helps
prevent newer waves of spam from intentionally skewing the Bayesian
ham/spam word tables.
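
To make the idea concrete, here is a minimal sketch of date-independent
trimming.  It is NOT ASSP's actual code; the function name trim_corpus
and its parameters are hypothetical, and a real corpus would hold
message files rather than strings.

```python
import random

def trim_corpus(messages, max_size, rng=None):
    """Return at most max_size messages, chosen uniformly at random.

    Hypothetical helper, not ASSP's implementation. Message age plays
    no part in the choice, so old and new messages are equally likely
    to survive a trim.
    """
    rng = rng or random.Random()
    messages = list(messages)
    if len(messages) <= max_size:
        # Nothing to trim: the corpus is already within the limit.
        return messages
    # random.sample picks max_size distinct items without replacement.
    return rng.sample(messages, max_size)
```

Because survival is random rather than oldest-first, a sender who floods
the corpus with a new wave of messages cannot count on pushing the older
material out.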

You might think that this poses a problem of the database becoming
"stale" to newer forms of spam, but this simply is not the case, as
demonstrated IRL.

I would never claim it's perfect, because I don't believe any anti-spam
product is, but anything that does make it through is quickly
corrected and compensated for once the user who received the
false-negative spam properly reports it back to ASSP.
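
As a rough illustration of why a reported false negative gets
compensated for, here is a toy naive Bayes sketch.  It is an assumption
for illustration only, not ASSP's algorithm: the tokenizer, the add-one
smoothing, the equal priors, and the names build_table and
spam_probability are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercased set of word-like tokens; each counts once per message.
    return set(re.findall(r"[a-z0-9$']+", text.lower()))

def build_table(ham, spam):
    """Count, per token, how many ham and spam messages contain it."""
    ham_counts, spam_counts = Counter(), Counter()
    for msg in ham:
        ham_counts.update(tokenize(msg))
    for msg in spam:
        spam_counts.update(tokenize(msg))
    return ham_counts, spam_counts, len(ham), len(spam)

def spam_probability(text, table):
    """Naive Bayes spam estimate with add-one smoothing, equal priors."""
    ham_counts, spam_counts, n_ham, n_spam = table
    log_odds = 0.0
    for tok in tokenize(text):
        p_spam = (spam_counts[tok] + 1) / (n_spam + 2)
        p_ham = (ham_counts[tok] + 1) / (n_ham + 2)
        log_odds += math.log(p_spam / p_ham)
    # Clamp so math.exp cannot overflow on very long messages.
    log_odds = max(-50.0, min(50.0, log_odds))
    return 1.0 / (1.0 + math.exp(-log_odds))

ham = ["meeting notes attached", "lunch at noon tomorrow"]
spam = ["cheap pills online", "win money now"]
missed = "hot stock tip buy now"

before = spam_probability(missed, build_table(ham, spam))
spam.append(missed)  # the user reports the false negative
after = spam_probability(missed, build_table(ham, spam))
```

In this toy data the score for the missed message rises after the
rebuild, because its tokens now appear in the spam corpus.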

ASSP is able to manage this appropriately by maintaining ~18,000
messages in each corpus.

I can't speak for other products, but this illustrates where the
Bayesian aspects of ASSP prevail over other products that rely more
heavily on signatures and heuristics.


On 2/19/07, Mark Teicher <mht3 () earthlink net> wrote:
This is a very interesting question.  Why do you need spam from
2006/2007?  SPAM TTL is <24, and most SPAM engines will not detect
SPAM > 30 days old.  I have researched this problem over a long period
of time; most anti-spam products out there will have issues detecting
any type of spam over 2 weeks old, since keeping signature/heuristic
bases that huge will slow down the performance of the product, which
is an interesting question in and of itself.  Why..


You are better off working with a university or local school that retains their mail for some period of time.

Mark

At 04:16 PM 2/17/2007, secbasics () dusty ece cmu edu wrote:
That was almost perfect. Unfortunately, since I am correlating spam
data against other traffic types, I need the spam to be from
2006/2007, and the most recent one there is 2005.

Thanks anyway though.

Aaron

On Sat, Feb 17, 2007 at 01:23:39PM +1100, David West wrote:
> Try the SpamAssassin public mail corpus..
> http://spamassassin.apache.org/publiccorpus/
>
> David West
>
> On 2/16/07, secbasics () dusty ece cmu edu <secbasics () dusty ece cmu edu> wrote:
> >Does anyone know organizations which give away spam captures? I mean,
> >obviously I will get lots of spam just from posting on this list (;)) but
> >I would like to get more to analyze. It seems like every couple months
> >some student does a project which requires spam but they always have to
> >start from ground zero. Isn't there anywhere which gives spam to
> >security researchers?
> >
> >Thanks
> >
> >Aaron
> >




--
ME2

