Nmap Development mailing list archives

Re: Request for Comments: New IPv6 OS detection machine learning engine


From: Mathias Morbitzer <m.morbitzer () runbox com>
Date: Mon, 20 Feb 2017 10:55:20 +0100

On 02/10/2017 02:07 AM, Fyodor wrote:

    1) Considering the size of our DB (300+ fingerprints), the random
    forest model is a good choice. To make use of more complex models,
    such as neural networks or deep learning, we would need a much
    bigger database. Therefore, I suggest sticking with random forest
    for the near future and instead focusing on improving other areas.


That makes sense. The IPv6 DB is definitely not mature yet (in terms of number of fingerprints), but for comparison we can look at IPv4. It has 5,336 fingerprints, but in terms of unique class lines (which I think is a more apt comparison to what we use as IPv6 fingerprints) we have 1,384. So I'd say our IPv6 OS DB will stay below 2,000 prints for the foreseeable future.

This seems like a legitimate assumption. I'm wondering if this will change once all our light bulbs have IPv6 Internet access... :)
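As an aside, the ensemble idea behind a random forest can be shown in a few lines. This is only a toy sketch: a real forest learns hundreds of decision trees from the training data, while here three hand-written "stump" rules stand in for trees, and all feature names and thresholds are made up for illustration (they are not Nmap's actual features).

```python
# Toy illustration of ensemble voting, the idea behind a random forest.
# The three "stump" rules below stand in for learned decision trees;
# all features and thresholds are hypothetical.

def stump_ttl(fp):
    # high initial-TTL-style values hint at Windows (made-up rule)
    return "Windows" if fp["ttl"] >= 128 else "Linux"

def stump_win(fp):
    return "Windows" if fp["window"] == 8192 else "Linux"

def stump_opts(fp):
    return "Linux" if fp["sack"] else "Windows"

def ensemble_predict(fp, stumps):
    votes = [s(fp) for s in stumps]
    # majority vote across the ensemble
    return max(set(votes), key=votes.count)

fingerprint = {"ttl": 64, "window": 8192, "sack": True}
label = ensemble_predict(fingerprint, [stump_ttl, stump_win, stump_opts])
print(label)  # two of the three stumps vote "Linux"
```

The point is that individually weak, noisy rules can combine into a robust classifier, which is why the approach holds up even with only a few hundred fingerprints.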

    2) Since the random forest (and other types of ensemble models)
    are already multi-model, multi-stage would not improve accuracy.
    However, it should also not make it worse. Since we have multiple
    reasons to prefer the multi-stage approach, this is good news.


What are the other reasons to prefer the multistage approach if it doesn't improve accuracy? I guess maybe as an easy way to give a broad/rough match of the OS family (such as "Windows") even if we don't have full confidence in a precise version?

I'm quoting Prabhjyot's "RFC" email [1] here:

"Advantages of MSRF:
i) It represents the actual hierarchy of operating systems more closely. What I mean is that two Linux kernels are more similar than a Linux kernel and a Windows system.
ii) The combined size of all models in MSRF is 3.6 MB, which is half of RF's (8 MB).
iii) There is a lot of scope for plugging in more features when using MSRF. For example, we may choose to send a different set of probes if the first stage tells us that it seems like a Linux device."

I also like the idea of giving a rough match of the OS family if we are not confident in a precise version.

However, my personal favorite is iii): having two stages allows us to completely rethink our probes. With the two-stage model, we could, for example, send only five probes in the first stage to determine the OS family. Once we have done that, we can send out more probes that help us nail down the exact version.

Currently, we usually consider a probe a "good" probe if it is able to distinguish between more than just two operating systems. Let's say we have a probe that can distinguish between Windows 8 and Windows 8.1, but nothing else. This probe wouldn't make it into our current system, because we only take the most powerful probes. With multistaging, we could send such a probe once we have determined that we are dealing with a Windows host, and only for a Windows host.

This could allow us to decrease the number of probes we send (for example, 5 probes to determine the OS family and another 5 to determine the OS version), because we wouldn't have to send probes like the Windows-only one mentioned above once we have figured out that we are dealing with a Linux host. Further, since we could include probes that were previously excluded for not being powerful enough, this would hopefully also increase accuracy.
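The two-stage flow described above can be sketched in a few lines. This is purely hypothetical code, not Nmap's implementation: the classifiers are stand-in lookup functions, and the probe names and values are invented. The point is only the control flow, where stage two is chosen (and its probes sent) based on stage one's answer.

```python
# Hypothetical sketch of a two-stage (MSRF-style) classification flow.
# Stage 1 predicts the OS family from a small probe set; stage 2 then
# uses family-specific probes to pick a version. All probe names and
# decision rules below are made up for illustration.

def classify_family(stage1_probes):
    # pretend first-stage model: a few cheap probes -> OS family
    return "Windows" if stage1_probes["tcp_flags"] == 0x12 else "Linux"

def classify_version(family, stage2_probes):
    # family-specific second stage: a probe that only separates
    # Windows 8 from 8.1 is only ever consulted for Windows hosts
    if family == "Windows":
        return "Windows 8.1" if stage2_probes["win_quirk"] else "Windows 8"
    return "Linux 4.x"

probes1 = {"tcp_flags": 0x12}
family = classify_family(probes1)
probes2 = {"win_quirk": True}   # sent only after stage 1 has answered
print(family, classify_version(family, probes2))
```

A Linux answer in stage one would mean the Windows-only probe is never sent at all, which is where the probe savings come from.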

But I'm also aware of the fact that messing around with our probes could invalidate our current fingerprint database. Maybe we could come up with an imputation solution or some other way to transfer fingerprints from the old to a new scheme....


    3) In terms of evaluation, the 80:20 split is not a good idea,
    since the test set is too small; this will create variance in the
    precision. It would be better to re-run the tests multiple times
    with a 50:50 split, and then check the mean average precision and
    variance.


I'm not exactly sure what this means but will take your word for it :).

An 80:20 split means that we take 20% of our fingerprints and try to classify them with the remaining 80%. Taking 50% and trying to classify them with the other 50%, and repeating this process multiple times, would give us more accurate test results.
From what I understand, this is mostly because averaging over many random splits reduces the variance of the precision estimate.
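The repeated 50:50 evaluation might look like the following sketch. Everything here is a stand-in: the "fingerprints" are a synthetic toy dataset and the classifier is a simple 1-nearest-neighbour rule rather than the real random forest; only the evaluation loop itself is the point.

```python
# Repeated 50:50 split evaluation on a toy dataset, with a
# 1-nearest-neighbour stand-in for the real classifier.
import random

random.seed(7)
# toy fingerprints: (feature vector, label); made up for illustration
data = [([i % 5, (i * 3) % 7, i % 2], "Linux" if i % 2 else "Windows")
        for i in range(60)]

def nn_predict(train, x):
    # 1-NN by squared Euclidean distance over the feature vectors
    return min(train, key=lambda t: sum((a - b) ** 2
                                        for a, b in zip(t[0], x)))[1]

scores = []
for _ in range(20):                      # 20 repeated 50:50 splits
    random.shuffle(data)
    half = len(data) // 2
    train, test = data[:half], data[half:]
    hits = sum(nn_predict(train, x) == y for x, y in test)
    scores.append(hits / len(test))

mean = sum(scores) / len(scores)
var = sum((s - mean) ** 2 for s in scores) / len(scores)
print(round(mean, 3), round(var, 5))
```

Reporting both the mean and the variance across splits is what makes the comparison between models meaningful, which I take to be the point of the suggestion.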


    5) As we already thought, having 695 features is quite a lot.
    Approaches to reduce the number of features could be, for example,
    using neural networks or principal component analysis (PCA). We
    did play around with such things a bit before, but it might be
    interesting to have another look.


I'm not exactly familiar with these either, but it definitely sounds like it's worth a look!

Unfortunately, me neither. Anyone who is is welcome to apply to the next Google Summer of Code!


    6) I also learned that ML might not always be the best solution
    when it comes to figuring out exactly one perfect match. ML is good
    at providing the top k results, from which there is a high
    probability that one is correct. So this might be also something
    to consider in


Interesting...

It is indeed something to consider. I could, for example, imagine giving the user the top 3 matches (if they have a certain score), or having an option to do so.
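That top-3-with-threshold idea amounts to very little code. A minimal sketch, with invented OS names and scores (not real model output):

```python
# Report the top-k matches instead of a single guess: keep the k
# best-scoring classes that clear a minimum score, sorted best-first.
# Scores and OS names are fabricated for illustration.

def top_matches(scores, k=3, threshold=0.10):
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(os, s) for os, s in ranked[:k] if s >= threshold]

scores = {"Linux 4.9": 0.55, "Linux 4.4": 0.25,
          "FreeBSD 11": 0.12, "Windows 10": 0.08}
print(top_matches(scores))
# -> [('Linux 4.9', 0.55), ('Linux 4.4', 0.25), ('FreeBSD 11', 0.12)]
```

The threshold keeps us from padding the output with low-confidence noise, which matters if users end up trusting the third line as much as the first.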


    7) And finally, I've been told that we could also try the non-ML
    approach of signature based checking.


Well, we at least have the signature-based IPv4 OS detection system for comparison. That has worked pretty well for us, although our hope was that the machine learning IPv6 system would prove to be a more powerful (and easier to maintain) method than relying on our own experts to create signatures.

From what I understood, the idea was more to have a database along the lines of "Linux 4.9.1 = SHA256(response1|response2|...|response16)". Of course, this wouldn't work once a single bit is changed in the response, for example by an intermediary node. Maybe more something like "Linux 4.9.1: Response 1 = 0x12345..., Response 2 = 0x6789..."? Then we could calculate the distance between the fingerprint and the responses we got, and decide if we have a match. Not sure how much work it would be to maintain such a system...
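The distance-based variant could look roughly like this. All signatures and responses below are fabricated byte strings, and the bit-distance cutoff is an arbitrary choice; the sketch just shows how a single flipped bit would still match, where the SHA256 scheme would fail outright.

```python
# Distance-based signature matching: store raw expected responses per
# OS and pick the signature with the smallest total Hamming (bit)
# distance, accepting a match only below a cutoff. All signatures and
# responses here are made up for illustration.

def hamming(a: bytes, b: bytes) -> int:
    # count differing bits between two equal-length responses
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

signatures = {
    "Linux 4.9.1": [b"\x12\x34\x56", b"\x67\x89\xab"],
    "Windows 10":  [b"\xff\x00\x10", b"\x01\x02\x03"],
}

def best_match(responses, max_bits=8):
    dists = {os: sum(hamming(r, e) for r, e in zip(responses, sig))
             for os, sig in signatures.items()}
    os, d = min(dists.items(), key=lambda kv: kv[1])
    return (os, d) if d <= max_bits else (None, d)

# one bit flipped in the first response (e.g. by a middlebox)
seen = [b"\x12\x34\x57", b"\x67\x89\xab"]
print(best_match(seen))  # -> ('Linux 4.9.1', 1)
```

The maintenance cost would be in collecting and updating the raw per-OS responses, which is essentially the expert-driven work the ML approach was meant to avoid.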


Cheers,
Mathias

[1] http://seclists.org/nmap-dev/2016/q3/82

_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
