Nmap Development mailing list archives
Re: Idea for GSOC 2013
From: Muhammad Junaid Muzammil <mjunaidmuzammil () gmail com>
Date: Thu, 21 Mar 2013 00:27:14 +0500
Hi,

I found your comments quite useful. My responses are listed below.

1) I totally agree with you that real networks do not have a baseline, and the idea is not to model the entire network. The model will be generated only from the training data provided. So there will be only three classes: Known Friendly, Known Hostile, and Unknown (treated as hostile). The training data will vary between scenarios. Let's take the example of a university campus network: there, P2P traffic will be detected as an anomaly even if it is non-intrusive.

2) Yes, the choice of classifier will be an issue. We have run some initial tests on several classifiers using the Weka tool suite. There is a trade-off between the accuracy and the computational complexity of the classifiers. The Naive Bayesian classifier was found to be a good one in terms of both metrics. Its accuracy can be further improved by using Bayesian Chain Classifiers.

I do agree with you that this is very much a long-term project and that it shouldn't be considered a summer project in any way. Some parts of the project relevant to Bayesian classifiers can be done, but they will fall outside the scope of Nmap.

Regards,
Junaid

On Wed, Mar 20, 2013 at 12:05 PM, Brandon Enright <bmenrigh () ucsd edu> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi Junaid, my comments inline and below.
>
> On Tue, 19 Mar 2013 20:11:03 +0500 Muhammad Junaid Muzammil <mjunaidmuzammil () gmail com> wrote:
>
> > Hello,
> >
> > I am Junaid from National University of Sciences and Technology, Pakistan. Currently I am a student of MS Electrical Engineering.
> >
> > I have an idea for developing an anomaly detector that can be used for the detection of zero-day attacks (where rule-based IDS fails). This will utilize the application of stochastic and machine learning concepts. A baseline model will be generated on the basis of a classification algorithm from a training data set.
>
> More work on malicious traffic detection is always welcome; however, the problem is deceptively hard. Networks, devices on networks, the applications that make use of networks, and other things that generate network traffic are highly varied. In order to have any chance of classifying attacks (new or old) you need a huge corpus of real network traffic and you need to classify this traffic in some way so that it can be used as a training set. Perhaps instead you're planning to use unsupervised machine learning?
>
> It's my firm belief that real networks do not have a baseline. Real, legitimate network traffic contains a huge number of anomalies, and anomaly detection will tell you about some of them, but anomaly detection is not a good classifier for malicious versus legitimate traffic.
>
> > The training data set represents the statistical discriminators as indicated in [1].
>
> I read this paper. It's little more than a list of features that can be used. It's also entirely limited to features of layers 3 and 4. It also doesn't provide any information on identification of new features or what constitutes a good feature.
>
> A huge amount of malicious traffic uses off-the-shelf client libraries (HTTP library for example), common standard protocols (like HTTP), to common off-the-shelf servers (Apache for example). I find it very unlikely that features extracted from just layers 3 and 4 are going to detect this sort of traffic. I certainly understand a desire to restrict features to just IP and TCP because layer 7 features are much more diversified and are often protocol specific.
>
> > Due to the multidimensional nature of the problem, the classification of traffic (whether it falls into the friendly or hostile class) depends upon multiple features. Hence, a Bayesian Chain Classifier appears to be a suitable candidate (work is in progress) [2].
> >
> > The implementation will include a Bayesian Chain Classifier, a machine learning mechanism from the training data set (the format of the training data set will be ARFF as in [1]), and testing over real network traffic. The programming languages will be C/C++. Test cases will be generated using the Python Scapy lib.
>
> I read this paper too. It's primarily an algorithm for implementing a scalable multi-label classifier. I don't know enough about machine learning to comment much further; however, I'd like to see a persuasive argument for why this would be a good classifier versus the multitude of other machine learning models.
>
> > References:
> >
> > [1] A. Moore, D. Zuev and M. Crogan, “Discriminators for use in flow based Classification,” Queen Mary University of London, August 2005, ISSN 1470-5559
> >
> > [2] J. H. Zaragoza, L. E. Sucar, E. F. Morales, C. Bielza and P. Larranaga, "Bayesian Chain Classifiers for Multidimensional Classification," Proceedings of 22nd International Joint Conference on Artificial Intelligence.
> >
> > I would like to have feedback on this idea. I think that this will not be able to meet the three-month GSoC timeline.
> >
> > Regards,
> > Junaid
>
> Alright, so first, I should be clear that I don't think Nmap is the right fit for your idea. Nmap is an active scanner. Your idea is about traffic classification (an inherently passive activity). Almost nothing about Nmap would enable you to do traffic classification or even get you traffic to classify.
>
> Also, I think traffic classification and novelty / anomaly detection is extremely hard. It's especially hard when you're limited to just flow information (layer 3 and 4 features). I've used a lot of various netflow products over the years and they never deliver anything close to what the marketing suggests they will. I don't find this surprising though; network traffic is extraordinarily complex.
>
> When it comes to machine learning for anomaly detection I think you'll find a huge amount of skepticism no matter where you go. Any proposal along these lines must be very detailed about related work, prior work, the scope of the proposal, the main hard problems and how you plan to overcome them, etc.
>
> To be honest, right now your proposal is little more than "I plan to use machine learning magic to do magical things". The problem is very complicated, so unless your proposal reflects your recognition of the intricacies of the problem it will continue to sound like you're hoping for magic to solve all of the major issues.
>
> Finally, if you do end up succeeding in your ideas in a meaningful way the work would be worthy of a PhD thesis. You should consider doing this work as part of a long-term research project rather than a summer or two in some other project.
>
> Regards,
> Brandon
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.19 (GNU/Linux)
>
> iEYEARECAAYFAlFJX74ACgkQqaGPzAsl94J3QQCgnOb8dcB8i+VJAQNIaiQ2ch/3
> RIcAoJhB+EdPfgn2GikvCRluYu1aKPlu
> =6gO+
> -----END PGP SIGNATURE-----
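
The three-class scheme in the reply at the top of this message (known friendly, known hostile, and unknown treated as hostile) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the classifier that was tested in Weka: the two flow features (mean packet size and flow duration), the toy training values, the Gaussian likelihood model, and the rejection threshold min_log_likelihood are all hypothetical. A flow whose best per-class likelihood falls below the threshold is labelled unknown, which is one possible way to get an "unknown" class out of a model trained on only two labelled classes.

```python
import math
from collections import defaultdict

FRIENDLY, HOSTILE, UNKNOWN = "friendly", "hostile", "unknown"


class GaussianNaiveBayes:
    """Tiny Gaussian naive Bayes over numeric flow features (illustrative only)."""

    def fit(self, rows, labels):
        grouped = defaultdict(list)
        for row, label in zip(rows, labels):
            grouped[label].append(row)
        self.priors = {}
        self.stats = {}
        for label, members in grouped.items():
            self.priors[label] = len(members) / len(rows)
            # Per-feature (mean, variance); the variance floor avoids division by zero.
            self.stats[label] = []
            for column in zip(*members):
                mean = sum(column) / len(column)
                var = sum((x - mean) ** 2 for x in column) / len(column)
                self.stats[label].append((mean, max(var, 1e-6)))
        return self

    def log_likelihood(self, row, label):
        # Naive independence assumption: sum the per-feature Gaussian log densities.
        total = 0.0
        for x, (mean, var) in zip(row, self.stats[label]):
            total += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return total

    def classify(self, row, min_log_likelihood):
        best_label, best_score, best_ll = None, -math.inf, -math.inf
        for label, prior in self.priors.items():
            ll = self.log_likelihood(row, label)
            score = math.log(prior) + ll
            if score > best_score:
                best_label, best_score, best_ll = label, score, ll
        # Reject flows that fit neither trained class (hypothetical threshold).
        if best_ll < min_log_likelihood:
            return UNKNOWN
        return best_label


def should_alert(label):
    # Unknown traffic is treated as hostile, as described in the reply.
    return label != FRIENDLY


# Hypothetical training flows: (mean packet size in bytes, duration in seconds).
TRAIN_ROWS = [(500, 2.0), (520, 2.5), (480, 1.8), (510, 2.2),
              (60, 0.1), (64, 0.2), (58, 0.15), (62, 0.12)]
TRAIN_LABELS = [FRIENDLY] * 4 + [HOSTILE] * 4

model = GaussianNaiveBayes().fit(TRAIN_ROWS, TRAIN_LABELS)
for flow in [(505, 2.1), (61, 0.14), (1500, 30.0)]:
    label = model.classify(flow, min_log_likelihood=-50.0)
    print(flow, label, "alert" if should_alert(label) else "ok")
```

With this toy data the three probe flows come out as friendly, hostile, and unknown respectively, and the last two raise an alert. The sketch also shows the weakness raised in the quoted message: any legitimate flow that does not resemble the training data (P2P on a campus network, for instance) lands in the unknown class and is flagged.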
_______________________________________________ Sent through the dev mailing list http://nmap.org/mailman/listinfo/dev Archived at http://seclists.org/nmap-dev/