Nmap Development mailing list archives

Re: Request for Comments: New IPv6 OS detection machine learning engine


From: Mathias Morbitzer <m.morbitzer () runbox com>
Date: Fri, 20 Jan 2017 18:31:38 +0100

Hi everyone,


Meanwhile, I managed to get feedback on the new implementation we started last summer from some people who know their ML.

Let me start by saying that we are doing quite well! :) Of course, there are still things we could improve, and comments on future work:


1) Considering the size of our DB (300+ fingerprints), the random forest model is a good choice. To make use of more complex models,

such as neural networks or deep learning, we would need a much bigger database. Therefore, I suggest sticking with random forest

for the near future and instead focusing on improving in other areas.
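To make the setup concrete, here is a minimal sketch of training a random forest with scikit-learn. The feature matrix and labels are random stand-ins for our fingerprint DB (the real feature extraction is not shown), so only the shapes mirror our situation:

```python
# Sketch: random forest on a DB-sized problem (~300 fingerprints, 695 features).
# X and y are synthetic placeholders, not real Nmap fingerprint data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 695))          # ~300 fingerprints, 695 features each
y = rng.integers(0, 5, size=300)    # 5 OS classes as a stand-in

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]).shape)     # (3,)
```

A forest of 100 trees trains in well under a second at this scale, which is part of why it suits a 300-entry DB better than a deep net would.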


2) Since a random forest (like other ensemble models) already combines many models, a multi-stage setup would not improve accuracy.

However, it should not make it worse either. Since we have multiple reasons to prefer the multi-stage approach, this is good news.

The reason why the multi-stage approach performed slightly worse in our tests is probably the way we ran the test, which brings me to


3) In terms of evaluation, the 80:20 split is not a good idea: the test set is too small, which creates high variance in the precision.

It would be better to re-run the tests multiple times with a 50:50 split, and then check the mean and variance of the precision.
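This evaluation scheme could be sketched with scikit-learn's ShuffleSplit, which redraws a fresh 50:50 split on every iteration (again on synthetic placeholder data, and macro-averaged precision is just one reasonable choice of scoring):

```python
# Sketch: repeated 50:50 splits to estimate mean and variance of precision.
# Data is synthetic; each of the 10 iterations uses a freshly drawn split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((300, 695))
y = rng.integers(0, 5, size=300)

splitter = ShuffleSplit(n_splits=10, test_size=0.5, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0),
                         X, y, cv=splitter, scoring="precision_macro")
print(len(scores), round(scores.mean(), 3), round(scores.std(), 3))
```

Reporting both mean and standard deviation over the 10 runs is exactly what the single 80:20 split cannot give us.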

Also, for the multi-stage approach, it would be interesting to analyze whether wrong classifications are already misclassified in stage 1,

or only in stage 2. This brings me to


4) We could reconsider our choice of stage-1 classifier. For the current first stage, we took the 4 main operating systems plus

a group "others". It could make more sense to create groups based on similar behavior instead.
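One way to derive such groups would be to cluster the fingerprints by their feature vectors instead of assigning OS families by hand. A minimal sketch, again with synthetic placeholder features and an arbitrary choice of 5 groups:

```python
# Sketch: forming stage-1 groups by behavioral similarity via k-means clustering.
# Features are synthetic placeholders; k=5 is an arbitrary illustrative choice.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((300, 695))

groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print(len(set(groups)))
```

The cluster assignments would then replace the hand-picked "4 OS families + others" as the stage-1 labels; whether the resulting groups are interpretable would need checking against the known OS names.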


5) As we already suspected, having 695 features is quite a lot. Approaches to reduce the number of features include, for example,

neural networks or principal component analysis (PCA). We played around with such things a bit before, but it might be interesting

to have another look.
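For reference, the PCA route is a one-liner in scikit-learn. The data here is a synthetic stand-in and 50 components is an arbitrary target dimensionality; in practice we would pick it from the explained-variance curve:

```python
# Sketch: compressing the 695 features with PCA before classification.
# Synthetic data; n_components=50 is an arbitrary choice for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((300, 695))

X_reduced = PCA(n_components=50).fit_transform(X)
print(X_reduced.shape)   # (300, 50)
```

The reduced matrix would then feed into the random forest in place of the raw 695-column one.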


6) I also learned that ML might not always be the best solution when it comes to finding exactly one perfect match. ML is good at

providing the top k results, among which there is a high probability that one is correct. So this might also be something to consider

in the future for our tests (evaluating the top k results will give a better overview of how the model performs), and also when OS

detection is performed, we could give the user the top k OS guesses, or at least offer this option.
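Getting the top k guesses out of a random forest is straightforward via its class probabilities. A minimal sketch on synthetic placeholder data, with k=3 as an arbitrary example:

```python
# Sketch: returning the top-k OS guesses from class probabilities instead of
# a single best match. Model and data are synthetic placeholders; k=3.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 695))
y = rng.integers(0, 5, size=300)

clf = RandomForestClassifier(random_state=0).fit(X, y)
proba = clf.predict_proba(X[:1])[0]          # one probability per known class
top_k = np.argsort(proba)[::-1][:3]          # indices of the 3 best guesses
print(top_k.shape)                           # (3,)
```

The same ranking would serve both purposes mentioned above: a top-k accuracy metric for our tests, and a ranked list of OS guesses for the user.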


7) And finally, I've been told that we could also try the non-ML approach of signature-based checking.
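In its simplest form, signature-based checking is just an exact lookup against known fingerprints. The signatures and probe values below are invented for illustration, not real Nmap fingerprint fields:

```python
# Sketch: a trivial exact-match signature lookup, the non-ML baseline.
# Signature keys and OS names are hypothetical, not real Nmap fingerprints.
signatures = {
    ("ttl=64", "win=65535"): "Linux 4.x",
    ("ttl=128", "win=8192"): "Windows 10",
}

observed = ("ttl=64", "win=65535")
print(signatures.get(observed, "unknown"))   # Linux 4.x
```

A real implementation would of course need fuzzy matching to tolerate small deviations, which is exactly the gap the ML approach is meant to fill.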


So that's it for the feedback. I hope we can increase accuracy even further with this information!


Cheers,

Mathias

_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/

