IDS mailing list archives

Re: Intrusion Detection Evaluation Datasets


From: "Stuart Staniford" <sstaniford () FireEye com>
Date: Thu, 12 Mar 2009 15:55:04 -0700


On Mar 12, 2009, at 8:40 AM, Zow Terry Brugger wrote:

I see a lot of people saying (correctly) that advanced (non-signature
based) NIDS can't be researched until we have good evaluation
datasets, and I see a lot of people ignoring them and doing it anyway.
Is anyone (else) actually working on fixing the data problem?

There are a number of things about the framing of this discussion that bug me (I come at this from the perspective of having spent quite a bit of time on both the research and the commercial sides of the field).

For one, the nature of the intrusion detection problem is very dynamic. Ten-plus years ago, the biggest problem was interactive attacks. Five years ago, the biggest headache for organizations was automated random-scanning worms. Today, RS worms have become much less of a big deal, and most of the action is attacks on clients, primarily via the web, and the resulting remote control of systems via bots. These are very different problems requiring pretty different approaches. And in another five years, I'm sure the main problem will be something else again. So the main nuisances on the wire keep changing, and any dataset is necessarily going to get stale very quickly. In particular, quite a lot of staleness will accumulate between a hypothetical graduate student starting and finishing a thesis.

Secondly, I think there's an assumption lurking implicitly in the search for datasets: that the appropriate focus for research is the inference algorithm, much like in the machine learning community - take a fixed data set, then try all kinds of inference algorithms to see what works best. For our problem, I don't think that's a great way of doing things. For us, the main questions are "What are the bad guys doing now?" and "What features do we need to detect what they are now doing?" Usually, if you have good features with high discrimination, most algorithms can be tweaked to do OK. If you don't have good features, no inference algorithm will save you. And if you have good features today, they'll be a lot less useful in a couple of years, and new ones will have to be invented.
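The features-over-algorithms point can be illustrated with a small, self-contained sketch. The data here is entirely hypothetical (synthetic Gaussian "connections", not anything from real traffic): with a feature that cleanly separates attack from benign, even a one-line threshold rule does well; with an overlapping feature, no choice of threshold helps much.

```python
import random

random.seed(0)

# Hypothetical synthetic "connections": label 1 = attack, 0 = benign.
# good_feature separates the classes well; bad_feature barely does.
def sample(label):
    good = random.gauss(10.0 if label else 2.0, 1.0)  # high discrimination
    bad = random.gauss(5.2 if label else 5.0, 2.0)    # low discrimination
    return good, bad, label

data = [sample(label) for label in [0, 1] * 500]

def best_threshold_accuracy(values_labels):
    """Try every observed value as a threshold; return the best accuracy."""
    best = 0.0
    for t, _ in values_labels:
        acc = sum((v > t) == bool(y) for v, y in values_labels) / len(values_labels)
        best = max(best, acc)
    return best

good_acc = best_threshold_accuracy([(g, y) for g, b, y in data])
bad_acc = best_threshold_accuracy([(b, y) for g, b, y in data])
print(f"good feature: {good_acc:.2f}, bad feature: {bad_acc:.2f}")
```

The simplest possible "algorithm" gets near-perfect accuracy on the discriminative feature and hovers near chance on the weak one; swapping in a fancier classifier changes neither outcome much.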

I think there's a lot of contribution that researchers can continue to make in this field. But you can't think of it as discovering timeless principles or something - this is much too applied a field. It's about figuring out what's happening on the wire *now*, and what can be done about it.

So forget looking for a dataset. Look for a wire. Do whatever it takes to get your institution to let you sniff the egress link - it's just about guaranteed to have plenty of attacks on it. Build, or adapt, some software to look at the packets with respect to some problem that interests you and that seems like a currently rising challenge. Spend a lot of time manually poring over the packets to figure out what is going on, and label your own data. You need to get your hands dirty. If you look at the most influential, highly cited researchers (Todd Heberlein, Vern Paxson, etc., etc.), their influential contributions were always driven off actually trying to detect attacks on real networks. In the end, intrusion detection is about detecting intrusions, just like the name says. Any amount of theoretical or algorithmic sophistication is a waste of time unless it directly contributes to that goal, and no amount of sophistication will be very exciting if it only improves the detection of five-year-old attacks (this is not to say that technical sophistication is not required for current problems - I believe it is).
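Getting started on a wire doesn't require much machinery: capture with tcpdump, then pick the records apart yourself. A minimal sketch of parsing a classic pcap capture, assuming the common little-endian, microsecond-timestamp variant (the function and variable names here are mine, for illustration only):

```python
import struct

PCAP_MAGIC_LE = 0xA1B2C3D4  # little-endian classic pcap, microsecond timestamps

def parse_pcap(data: bytes):
    """Split a classic pcap byte string into (global header fields, packet records)."""
    magic, vmaj, vmin, _tz, _sigfigs, snaplen, linktype = struct.unpack("<IHHiIII", data[:24])
    if magic != PCAP_MAGIC_LE:
        raise ValueError("not a little-endian classic pcap capture")
    offset, packets = 24, []
    while offset + 16 <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack("<IIII", data[offset:offset + 16])
        offset += 16
        packets.append((ts_sec, ts_usec, data[offset:offset + incl_len]))
        offset += incl_len
    return (vmaj, vmin, snaplen, linktype), packets

# Build a tiny synthetic capture (one empty Ethernet frame) to exercise the parser.
header = struct.pack("<IHHiIII", PCAP_MAGIC_LE, 2, 4, 0, 0, 65535, 1)
frame = bytes(14)  # placeholder for an Ethernet header
record = struct.pack("<IIII", 0, 0, len(frame), len(frame))
(_, _, snaplen, linktype), pkts = parse_pcap(header + record + frame)
```

From there, hand-labeling means walking `pkts` record by record and decoding the Ethernet/IP/TCP layers with further `struct.unpack` calls - which is exactly the "get your hands dirty" work described above.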

I think the problem of producing regular timely datasets that can be safely published is probably just about intractable, even if one of the funding agencies were to step up to try and fill the shoes DARPA long ago left behind. Synthetic datasets would not be that interesting, and since most attacks are now inside packet content, the challenge of reliably anonymizing the data while not affecting the traffic materially would be just about impossible (what algorithm is going to sanitize every single web developer's cookie format, for example? How could one be sure that obfuscated javascript didn't contain any personal information?).

Stuart Staniford.


