Nmap Development mailing list archives

Re: Idea for GSOC 2013

From: Muhammad Junaid Muzammil <mjunaidmuzammil () gmail com>
Date: Thu, 21 Mar 2013 00:27:14 +0500

Hi,

I found your comments quite useful. My responses are listed below.

1) I totally agree with you that real networks do not have a baseline, and
the idea is not to model the entire network. The model will be generated
only from the training data provided. So there will be only three classes:
Known Friendly, Known Hostile, and Unknown (treated as hostile). The
training data will vary between scenarios. Let's take the example of a
university campus network. There, P2P traffic would be detected as an
anomaly even if it is non-intrusive.

2) Yes, the choice of classifier will be an issue. We have run some
initial tests on several classifiers using the Weka tool suite. There is a
trade-off between the accuracy and the computational complexity of the
classifiers. The naive Bayes classifier performed well on both metrics.
Its accuracy can be further improved by using Bayesian Chain Classifiers.
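
The two points above can be illustrated with a minimal sketch: a Gaussian
naive Bayes classifier trained on two labeled classes, with a rejection
threshold that sends anything unlike the training data to the Unknown class.
This is an illustration only, not the proposed implementation. The two
features (mean packet size, flow duration), the toy training rows, and the
min_log_density threshold are all assumptions made up for the example.

```python
import math
from collections import defaultdict

# Toy training rows (hypothetical): ([mean_packet_size, flow_duration], label)
TOY_TRAINING = [
    ([500.0, 1.0], "friendly"), ([520.0, 1.2], "friendly"), ([480.0, 0.9], "friendly"),
    ([60.0, 0.10], "hostile"), ([64.0, 0.12], "hostile"), ([58.0, 0.09], "hostile"),
]

def train_gaussian_nb(samples):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The small constant keeps a zero variance from breaking the log below.
        variances = [sum((x - m) ** 2 for x in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[label] = (n / len(samples), means, variances)
    return model

def log_scores(model, features):
    """Log of prior times Gaussian likelihood, assuming independent features."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        scores[label] = score
    return scores

def classify(model, features, min_log_density=-50.0):
    """Return the best known class, or "unknown" if no class fits the sample.

    min_log_density is a hypothetical cut-off; it would need tuning on real data.
    """
    scores = log_scores(model, features)
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_log_density else "unknown"

def is_hostile(label):
    """Unknown traffic is treated as hostile, as described in point 1."""
    return label != "friendly"

if __name__ == "__main__":
    model = train_gaussian_nb(TOY_TRAINING)
    for sample in ([505.0, 1.05], [61.0, 0.1], [5000.0, 30.0]):
        label = classify(model, sample)
        print(sample, label, "hostile" if is_hostile(label) else "ok")
```

The threshold is what makes the campus example in point 1 concrete: traffic
that resembles neither training class falls below it and is flagged, whether
or not it is actually intrusive.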

I certainly agree with you that this is very much a long-term project
and shouldn't be considered a summer project in any way. Some parts of
the project relevant to Bayesian classifiers can be done, but they would
fall outside the scope of Nmap.

Regards,
Junaid


On Wed, Mar 20, 2013 at 12:05 PM, Brandon Enright <bmenrigh () ucsd edu> wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Junaid, my comments inline and below.

On Tue, 19 Mar 2013 20:11:03 +0500
Muhammad Junaid Muzammil <mjunaidmuzammil () gmail com> wrote:

Hello,

I am Junaid from the National University of Sciences and Technology,
Pakistan. I am currently an MS student in Electrical Engineering.

I have an idea for developing an anomaly detector that can be used
to detect zero-day attacks (where rule-based IDSs fail). It will apply
stochastic and machine learning concepts. A baseline model will be
generated by a classification algorithm from a training data set.


More work on malicious traffic detection is always welcome; however, the
problem is deceptively hard.  Networks, devices on networks, the
applications that make use of networks, and other things that generate
network traffic are highly varied.

In order to have any chance of classifying attacks (new or old) you
need a huge corpus of real network traffic and you need to classify
this traffic in some way so that it can be used as a training set.
Perhaps instead you're planning to use unsupervised machine learning?

It's my firm belief that real networks do not have a baseline.  Real,
legitimate network traffic contains a huge number of anomalies and
anomaly detection will tell you about some of them but anomaly
detection is not a good classifier for malicious versus legitimate
traffic.

The training data
set represents the statistical discriminators as indicated in [1].


I read this paper.  It's little more than a list of features that can
be used.  It's also entirely limited to features of layers 3 and 4.  It
also doesn't provide any information on identification of new features
or what constitutes a good feature.

A huge amount of malicious traffic uses off-the-shelf client libraries
(an HTTP library, for example) and common standard protocols (like HTTP)
to talk to common off-the-shelf servers (Apache, for example).

I find it very unlikely that features extracted from just layers 3 and
4 are going to detect this sort of traffic.

I certainly understand a desire to restrict features to just IP and TCP
because layer 7 features are much more diversified and are often
protocol specific.

Due to the multidimensional nature of the problem, the classification of
traffic (whether it falls into the friendly or hostile class) depends
upon multiple features. Hence, a Bayesian Chain Classifier appears to
be a suitable candidate (work is in progress) [2].

The implementation will include a Bayesian Chain Classifier, a machine
learning mechanism that trains from a training data set (the format of
the training data set will be ARFF, as in [1]), and testing over real
network traffic.  The programming languages will be C/C++. Test cases
will be generated using the Python Scapy library.
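
As a rough illustration of the ARFF training format mentioned above, the
following sketch serializes labeled flow records into an ARFF document that
Weka can load. It is not part of the proposal. The helper name to_arff, the
relation name, and the two attribute names are hypothetical and are not the
discriminators listed in [1].

```python
def to_arff(relation, attributes, class_values, rows):
    """Build an ARFF document with numeric attributes and one nominal class.

    rows is a list of (feature_list, label) pairs; every name passed in
    here is illustrative only.
    """
    lines = ["@relation " + relation, ""]
    for name in attributes:
        lines.append("@attribute " + name + " numeric")
    # The nominal class attribute enumerates the allowed labels in braces.
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    for features, label in rows:
        if len(features) != len(attributes):
            raise ValueError("row has the wrong number of features")
        if label not in class_values:
            raise ValueError("unknown class label: " + label)
        lines.append(",".join(str(v) for v in features) + "," + label)
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(to_arff(
        "flows",
        ["mean_packet_size", "flow_duration"],
        ["friendly", "hostile"],
        [([505, 1.05], "friendly"), ([61, 0.1], "hostile")],
    ))
```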


I read this paper too.  It's primarily an algorithm for implementing a
scalable multi-label classifier.  I don't know enough about machine
learning to comment much further; however, I'd like to see a persuasive
argument for why this would be a good classifier versus the multitude
of other machine learning models.

References:
[1] A. Moore, D. Zuev and M. Crogan, "Discriminators for use in
flow-based classification," Queen Mary, University of London, August
2005, ISSN 1470-5559.

[2] J. H. Zaragoza, L. E. Sucar, E. F. Morales, C. Bielza and P.
Larranaga, "Bayesian Chain Classifiers for Multidimensional
Classification," Proceedings of the 22nd International Joint Conference
on Artificial Intelligence.

I would like feedback on this idea. I don't think it can be completed
within the three-month GSoC timeline.

Regards,
Junaid



Alright, so first, I should be clear that I don't think Nmap is the
right fit for your idea.  Nmap is an active scanner.  Your idea is
about traffic classification (an inherently passive activity).  Almost
nothing about Nmap would enable you to do traffic classification or
even get you traffic to classify.

Also, I think traffic classification and novelty / anomaly detection are
extremely hard.  They're especially hard when you're limited to just flow
information (layer 3 and 4 features).  I've used a lot of various
netflow products over the years and they never deliver anything close
to what the marketing suggests they will.  I don't find this surprising,
though: network traffic is extraordinarily complex.

When it comes to machine learning for anomaly detection I think you'll
find a huge amount of skepticism no matter where you go.  Any proposal
along these lines must be very detailed about related work, prior work,
the scope of the proposal, the main hard problems and how you plan to
overcome them, etc.  To be honest, right now your proposal is little
more than "I plan to use machine learning magic to do magical things".
The problem is very complicated so unless your proposal reflects your
recognition of the intricacies of the problem it will continue to sound
like you're hoping for magic to solve all of the major issues.

Finally, if you do end up succeeding in your ideas in a meaningful way
the work would be worthy of a PhD thesis.  You should consider doing
this work as part of a long-term research project rather than a summer
or two in some other project.

Regards,

Brandon




-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)

iEYEARECAAYFAlFJX74ACgkQqaGPzAsl94J3QQCgnOb8dcB8i+VJAQNIaiQ2ch/3
RIcAoJhB+EdPfgn2GikvCRluYu1aKPlu
=6gO+
-----END PGP SIGNATURE-----
