Nmap Development mailing list archives

Re: Request for Comments: New IPv6 OS detection machine learning engine


From: Mathias Morbitzer <m.morbitzer () runbox com>
Date: Mon, 20 Feb 2017 10:55:20 +0100

On 02/10/2017 02:07 AM, Fyodor wrote:

    1) Considering the size of our DB (300+ fingerprints), the random
    forest model is a good choice. To make use of more complex models,
    such as neural networks or deep learning, we would need a much
    bigger database. Therefore, I suggest sticking with random forest
    for the near future and instead focusing on improving other areas.


That makes sense. The IPv6 DB is definitely not mature yet (in terms of number of fingerprints), but for comparison we can look at IPv4. It has 5,336 fingerprints, but in terms of unique class lines (which I think is a more apt comparison to what we use as IPv6 fingerprints) we have 1,384. So I'd say our IPv6 OS DB will stay below 2,000 prints for the foreseeable future.

This seems like a legitimate assumption. I'm wondering if this will change once all our light bulbs have IPv6 Internet access... :)
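As an aside, the ensemble idea behind a random forest can be shown in a few lines. This is only a toy sketch: a real forest learns hundreds of decision trees from the training data, while here three hand-written "stump" rules stand in for trees, and all feature names and thresholds are made up for illustration (they are not Nmap's actual features).

```python
# Toy illustration of ensemble voting, the idea behind a random forest.
# The three "stump" rules below stand in for learned decision trees;
# all features and thresholds are hypothetical.

def stump_ttl(fp):
    # high initial-TTL-style values hint at Windows (made-up rule)
    return "Windows" if fp["ttl"] >= 128 else "Linux"

def stump_win(fp):
    return "Windows" if fp["window"] == 8192 else "Linux"

def stump_opts(fp):
    return "Linux" if fp["sack"] else "Windows"

def ensemble_predict(fp, stumps):
    votes = [s(fp) for s in stumps]
    # majority vote across the ensemble
    return max(set(votes), key=votes.count)

fingerprint = {"ttl": 64, "window": 8192, "sack": True}
label = ensemble_predict(fingerprint, [stump_ttl, stump_win, stump_opts])
print(label)  # two of the three stumps vote "Linux"
```

The point is that individually weak, noisy rules can combine into a robust classifier, which is why the approach holds up even with only a few hundred fingerprints.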

    2) Since the random forest (and other types of ensemble models)
    are already multi-model, multi-stage would not improve accuracy.
    However, it should also not make it worse. Since we have multiple
    reasons to prefer the multi-stage approach, this is good news.


What are the other reasons to prefer the multistage approach if it doesn't improve accuracy? I guess maybe as an easy way to give a broad/rough match of the OS family (such as "Windows") even if we don't have full confidence in a precise version?

I'm quoting Prabhjyot's "RFC" email [1] here:

"Advantages of MSRF:
i) It represents the actual hierarchy of operating systems more closely. What I mean is that two Linux kernels are more similar than a Linux kernel and a Windows system.
ii) The combined size of all models in MSRF is 3.6 MB, which is half of RF's (8 MB).
iii) There is a lot of scope for plugging in more features when using MSRF. For example, we may choose to send a different set of probes if the first stage tells us that it seems like a Linux device."

I also like the idea of giving a rough match of the OS family if we are not confident in a precise version.

However, my personal favorite is iii): having two stages allows us to completely rethink our probes. With the two-stage model, we could, for example, send only five probes in the first stage to determine the OS family. Once we have done that, we can send out more probes that help us nail down the exact version.

Currently, we usually consider a probe a "good" probe if it is able to distinguish between more than just two operating systems. Let's say we have a probe that can distinguish between Windows 8 and Windows 8.1, but nothing else. This probe wouldn't make it into our current system, because we only take the most powerful probes. With multistaging, we could send such a probe once we have determined that we are dealing with a Windows host, and only for a Windows host.

This could allow us to decrease the number of probes we send (for example, 5 probes to determine the OS family and another 5 to determine the OS version), because we wouldn't have to send probes like the Windows-only one mentioned above once we have figured out that we are dealing with a Linux host. Further, since we could include probes that were previously excluded for not being powerful enough, this would hopefully also increase accuracy.
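The two-stage flow described above can be sketched in a few lines. This is purely hypothetical code, not Nmap's implementation: the classifiers are stand-in lookup functions, and the probe names and values are invented. The point is only the control flow, where stage two is chosen (and its probes sent) based on stage one's answer.

```python
# Hypothetical sketch of a two-stage (MSRF-style) classification flow.
# Stage 1 predicts the OS family from a small probe set; stage 2 then
# uses family-specific probes to pick a version. All probe names and
# decision rules below are made up for illustration.

def classify_family(stage1_probes):
    # pretend first-stage model: a few cheap probes -> OS family
    return "Windows" if stage1_probes["tcp_flags"] == 0x12 else "Linux"

def classify_version(family, stage2_probes):
    # family-specific second stage: a probe that only separates
    # Windows 8 from 8.1 is only ever consulted for Windows hosts
    if family == "Windows":
        return "Windows 8.1" if stage2_probes["win_quirk"] else "Windows 8"
    return "Linux 4.x"

probes1 = {"tcp_flags": 0x12}
family = classify_family(probes1)
probes2 = {"win_quirk": True}   # sent only after stage 1 has answered
print(family, classify_version(family, probes2))
```

A Linux answer in stage one would mean the Windows-only probe is never sent at all, which is where the probe savings come from.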

But I'm also aware of the fact that messing around with our probes could invalidate our current fingerprint database. Maybe we could come up with an imputation solution or some other way to transfer fingerprints from the old to a new scheme....


    3) In terms of evaluation, the 80:20 split is not a good idea,
    since the test set is too small; this will create variance in the
    precision. It would be better to re-run the tests multiple times
    with a 50:50 split, and then check the mean average precision and
    variance.


I'm not exactly sure what this means but will take your word for it :).

An 80:20 split means that we take 20% of our fingerprints and try to classify them with the remaining 80%. Taking 50% and trying to classify them with the other 50%, and repeating this process multiple times, would give us more accurate test results.
From what I understand, this is mostly because averaging over many random splits reduces the variance of the precision estimate.
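The repeated 50:50 evaluation might look like the following sketch. Everything here is a stand-in: the "fingerprints" are a synthetic toy dataset and the classifier is a simple 1-nearest-neighbour rule rather than the real random forest; only the evaluation loop itself is the point.

```python
# Repeated 50:50 split evaluation on a toy dataset, with a
# 1-nearest-neighbour stand-in for the real classifier.
import random

random.seed(7)
# toy fingerprints: (feature vector, label); made up for illustration
data = [([i % 5, (i * 3) % 7, i % 2], "Linux" if i % 2 else "Windows")
        for i in range(60)]

def nn_predict(train, x):
    # 1-NN by squared Euclidean distance over the feature vectors
    return min(train, key=lambda t: sum((a - b) ** 2
                                        for a, b in zip(t[0], x)))[1]

scores = []
for _ in range(20):                      # 20 repeated 50:50 splits
    random.shuffle(data)
    half = len(data) // 2
    train, test = data[:half], data[half:]
    hits = sum(nn_predict(train, x) == y for x, y in test)
    scores.append(hits / len(test))

mean = sum(scores) / len(scores)
var = sum((s - mean) ** 2 for s in scores) / len(scores)
print(round(mean, 3), round(var, 5))
```

Reporting both the mean and the variance across splits is what makes the comparison between models meaningful, which I take to be the point of the suggestion.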


    5) As we already thought, having 695 features is quite a lot.
    Approaches to reduce the number of features could be, for example,
    using neural networks or principal component analysis (PCA). We
    did play around with such things a bit before, but it might be
    interesting to have another look.


I'm not exactly familiar with these either, but it definitely sounds like it's worth a look!

Unfortunately, me neither. Anyone who is is welcome to apply to the next Google Summer of Code!


    6) I also learned that ML might not always be the best solution
    when it comes to figuring out exactly one perfect match. ML is good
    at providing the top k results, from which there is a high
    probability that one is correct. So this might be also something
    to consider in


Interesting...

It is indeed something to consider. I could, for example, imagine giving the user the top 3 matches (if they have a certain score), or having an option to do so.
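That top-3-with-threshold idea amounts to very little code. A minimal sketch, with invented OS names and scores (not real model output):

```python
# Report the top-k matches instead of a single guess: keep the k
# best-scoring classes that clear a minimum score, sorted best-first.
# Scores and OS names are fabricated for illustration.

def top_matches(scores, k=3, threshold=0.10):
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(os, s) for os, s in ranked[:k] if s >= threshold]

scores = {"Linux 4.9": 0.55, "Linux 4.4": 0.25,
          "FreeBSD 11": 0.12, "Windows 10": 0.08}
print(top_matches(scores))
# -> [('Linux 4.9', 0.55), ('Linux 4.4', 0.25), ('FreeBSD 11', 0.12)]
```

The threshold keeps us from padding the output with low-confidence noise, which matters if users end up trusting the third line as much as the first.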


    7) And finally, I've been told that we could also try the non-ML
    approach of signature based checking.


Well, we at least have the signature-based IPv4 OS detection system for comparison. That has worked pretty well for us, although our hope was that the machine learning IPv6 system would prove to be a more powerful (and easier to maintain) method than relying on our own experts to create signatures.

From what I understood, the idea was more to have a database along the lines of "Linux 4.9.1 = SHA256(response1|response2|...|response16)". Of course, this wouldn't work once a single bit is changed in the response, for example by an intermediary node. Maybe more something like "Linux 4.9.1: Response 1 = 0x12345..., Response 2 = 0x6789..."? Then we could calculate the distance between the fingerprint and the responses we got, and decide if we have a match. Not sure how much work it would be to maintain such a system...
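The distance-based variant could look roughly like this. All signatures and responses below are fabricated byte strings, and the bit-distance cutoff is an arbitrary choice; the sketch just shows how a single flipped bit would still match, where the SHA256 scheme would fail outright.

```python
# Distance-based signature matching: store raw expected responses per
# OS and pick the signature with the smallest total Hamming (bit)
# distance, accepting a match only below a cutoff. All signatures and
# responses here are made up for illustration.

def hamming(a: bytes, b: bytes) -> int:
    # count differing bits between two equal-length responses
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

signatures = {
    "Linux 4.9.1": [b"\x12\x34\x56", b"\x67\x89\xab"],
    "Windows 10":  [b"\xff\x00\x10", b"\x01\x02\x03"],
}

def best_match(responses, max_bits=8):
    dists = {os: sum(hamming(r, e) for r, e in zip(responses, sig))
             for os, sig in signatures.items()}
    os, d = min(dists.items(), key=lambda kv: kv[1])
    return (os, d) if d <= max_bits else (None, d)

# one bit flipped in the first response (e.g. by a middlebox)
seen = [b"\x12\x34\x57", b"\x67\x89\xab"]
print(best_match(seen))  # -> ('Linux 4.9.1', 1)
```

The maintenance cost would be in collecting and updating the raw per-OS responses, which is essentially the expert-driven work the ML approach was meant to avoid.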


Cheers,
Mathias

[1] http://seclists.org/nmap-dev/2016/q3/82

_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
