Nmap Development mailing list archives
Re: IPv6 fingerprint database imputation of missing values
From: Alexandru Geana <alex () alegen net>
Date: Mon, 7 Sep 2015 12:46:55 +0200
Hello list,

After some discussions on IRC, we decided to check what would make a suitable value for the novelty threshold if the imputation feature is included in the training stage. To do this, I wrote a script which trains a classifier 90 times (there are 90 OS groups in the training set), each time excluding one group. Each print in the excluded group is then classified and its novelty score saved. The idea was to derive a new threshold from the minimum of these scores.

The results were the following:

*) without imputation, the smallest score is 3.95506283351, obtained by the second print in group "Equinox CCM4850 ..." (index 14, 0-based).
*) with imputation, the smallest score is 4.68074027082, obtained by the only print in group "HP OfficeJet 8500 printer" (index 40).

Both of these values are below the current threshold, so I decided to look a bit deeper to find a good value for a new threshold. I made two histogram plots of the scores. While both plots have a similar mean, the bins from the imputed scores follow the distribution more closely. Based on the plot for the imputed scores, I would suggest a new novelty threshold of 25, which is roughly just before the highest bins start.

I am attaching the script I used to perform the calculations, two python pickle files with the results and two images of the plots.

Best regards,
Alexandru Geana
alegen.net

On 06/30, Alexandru Geana wrote:
Hello list,

Last time I was busy with finding the right parameters to apply imputation. Today I am submitting a new set of patches (minor modifications) and explaining some of my findings.

One of the early issues I discovered was that there is a lot of variability with fingerprinting and even more with imputation. My workflow was to generate an imputed feature matrix, train the model on it, recompile nmap and fire a scan. This was not an optimal approach, since I later found out that scanning the same host in the same environment twice may yield different results w.r.t. reported accuracy. As a result, I changed my approach to reusing the same fingerprint(s) and checking results with the predict.py script.

There was no straightforward way to search for which imputation method to apply to which sets of features, or for the adequate number of imputed sets and iterations per set. An exhaustive search was too expensive, so instead I considered the following:

1) What is the value range for a feature and how many different values can be found? The answer tells me whether I can treat the feature as continuous or categorical. Based on this choice, I select the imputation method.

2) Post-imputation, what does predict.py print? Multiple things influenced my decision here. I discovered that running predict.py with the same test fingerprint but different imputed feature matrices would yield different accuracy values and/or different reported OS classes. My aim here was to "stabilize" the results, meaning that if I run imputation 10 times and test the same print, I get 10 roughly similar results back.

Each feature is a bit different, and after some educated trial and error I would find adequate parameters. Imputing categorical variables (e.g. TC, HLIM) is easier to stabilize than imputing continuous variables (e.g. TCP_WINDOW). When imputing the latter, I can obtain either very bad decreases or impressive increases in accuracy.
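To make the two questions above concrete, here is a minimal Python sketch. It is not taken from the patches: the function names, the MISSING marker and the max_levels cut-off of 10 are assumptions made purely for illustration, and the real feature matrix may encode missing values differently. The first function counts distinct observed values to guess whether a feature is better treated as categorical or continuous; the second summarizes how much the accuracy reported for one test print varies across repeated imputation runs.

```python
from statistics import mean, pstdev

MISSING = None  # hypothetical marker for a value that was not observed


def guess_feature_type(values, max_levels=10):
    """Guess whether a feature column is 'categorical' or 'continuous'.

    max_levels is an arbitrary cut-off chosen for this sketch: a column
    with at most that many distinct observed values is called categorical.
    """
    observed = [v for v in values if v is not MISSING]
    distinct = set(observed)
    return "categorical" if len(distinct) <= max_levels else "continuous"


def stability(accuracies):
    """Return (mean, population standard deviation) over repeated runs.

    A small standard deviation means the repeated imputation runs gave
    roughly similar accuracies for the same test print.
    """
    return mean(accuracies), pstdev(accuracies)
```

A hop-limit style column such as [64, 64, MISSING, 255, 128] comes out as categorical, while a column with many distinct window sizes comes out as continuous. Passing the ten accuracy values from repeated runs to stability() gives a single number to compare between parameter choices.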
While trying to decrease this variability for continuous variables, I tried two things: a) for the purpose of imputation, replace MISSING with the average of the class values and b) when integrating the labels into the imputed matrix, instead of having one column with values ranging from 0 to $no_classes, have $no_classes columns with values of 0 and 1 depending on which class a print belongs to. While these did show improvements, the overall performance was not satisfactory and the variability was still too great.

The source of the variability lies with the mice library, which performs the actual imputation. Based on some papers I read (some of which I shared in my previous email), there is an initialization step which I believe is the cause of the randomness. I have not had time to go through the mice code and see exactly where this takes place.

Enough talk, now for some results. I will show the results of applying the complete imputation process 10 times. This is to show that the results are generally stable.

1) Fedora VM

Without imputation:
8.17% 19.12 Linux 3.12 - 3.18

With imputation:
63.14% 23.66 Linux 3.12 - 3.18
22.57% 23.66 Linux 3.12 - 3.18
22.48% 23.66 Linux 3.12 - 3.18
22.80% 23.66 Linux 3.12 - 3.18
22.46% 23.66 Linux 3.12 - 3.18
22.45% 23.66 Linux 3.12 - 3.18
22.69% 23.66 Linux 3.12 - 3.18
22.52% 23.66 Linux 3.12 - 3.18
22.64% 23.66 Linux 3.12 - 3.18
22.37% 23.66 Linux 3.12 - 3.18

Minor increase in the novelty factor, but a larger one in accuracy. The first entry shows a much higher accuracy than the others as a result of the variability, but the rest are quite stable.
2) scanme.nmap.org from a hetzner dedicated

Without imputation:
76.59% 5.58 Linux 3.13 - 3.19

With imputation:
85.50% 15.01 Linux 3.13 - 3.19
85.44% 15.01 Linux 3.13 - 3.19
85.35% 15.01 Linux 3.13 - 3.19
85.38% 15.01 Linux 3.13 - 3.19
86.92% 15.01 Linux 3.13 - 3.19
85.43% 15.01 Linux 3.13 - 3.19
85.47% 15.01 Linux 3.13 - 3.19
85.42% 15.01 Linux 3.13 - 3.19
97.34% 15.01 Linux 3.13 - 3.19
85.43% 15.01 Linux 3.13 - 3.19

This follows the same pattern as the previous result.

3) Windows 8.1 VM

Without imputation:
99.67% 2.96 Microsoft Windows Vista SP2 or Windows 7 SP1 or Windows Server 2008 R2 SP1 or Windows 8 Consumer Preview

With imputation:
99.66% 16.46 Microsoft Windows Vista SP2 or Windows 7 SP1 or Windows Server 2008 R2 SP1 or Windows 8 Consumer Preview
99.66% 16.46 Microsoft Windows Vista SP2 or Windows 7 SP1 or Windows Server 2008 R2 SP1 or Windows 8 Consumer Preview
99.66% 16.46 Microsoft Windows Vista SP2 or Windows 7 SP1 or Windows Server 2008 R2 SP1 or Windows 8 Consumer Preview
99.66% 16.46 Microsoft Windows Vista SP2 or Windows 7 SP1 or Windows Server 2008 R2 SP1 or Windows 8 Consumer Preview
99.66% 16.46 Microsoft Windows Vista SP2 or Windows 7 SP1 or Windows Server 2008 R2 SP1 or Windows 8 Consumer Preview
99.55% 15.27 Microsoft Windows Vista SP2 or Windows 7 SP1 or Windows Server 2008 R2 SP1 or Windows 8 Consumer Preview
99.66% 16.46 Microsoft Windows Vista SP2 or Windows 7 SP1 or Windows Server 2008 R2 SP1 or Windows 8 Consumer Preview
99.66% 16.46 Microsoft Windows Vista SP2 or Windows 7 SP1 or Windows Server 2008 R2 SP1 or Windows 8 Consumer Preview
99.67% 16.46 Microsoft Windows Vista SP2 or Windows 7 SP1 or Windows Server 2008 R2 SP1 or Windows 8 Consumer Preview
99.66% 16.46 Microsoft Windows Vista SP2 or Windows 7 SP1 or Windows Server 2008 R2 SP1 or Windows 8 Consumer Preview

The accuracy is already rather high and only the novelty is slightly increased.

You can see the results of these tests (e.g. nmap.model files, used fingerprints) here:

Fedora: https://github.com/alegen/nmap/blob/7ce1a7791fd78ddf858824f2d6412021164ae7d1/ipv6tests/fedora_test.tar.gz
Scanme: https://github.com/alegen/nmap/blob/7ce1a7791fd78ddf858824f2d6412021164ae7d1/ipv6tests/scanme_test.tar.gz
Win 8: https://github.com/alegen/nmap/blob/7ce1a7791fd78ddf858824f2d6412021164ae7d1/ipv6tests/windows_test.tar.gz

Let me know what you think and if you have any further suggestions!

Best regards,
Alexandru Geana
alegen.net
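For readers who want to picture the two pre-processing ideas a) and b) from the quoted message, a rough pure-Python sketch follows. It is an illustration only and not the code from the patches: the function names and the MISSING marker are hypothetical, and the actual feature matrix is handed to the mice library rather than processed like this.

```python
from collections import defaultdict

MISSING = None  # hypothetical marker; the real matrix may encode this differently


def fill_with_class_mean(column, labels):
    """Idea a): replace MISSING entries with the mean of the observed
    values belonging to the same class (label)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(column, labels):
        if value is not MISSING:
            sums[label] += value
            counts[label] += 1
    filled = []
    for value, label in zip(column, labels):
        if value is MISSING and counts[label]:
            filled.append(sums[label] / counts[label])
        else:
            # Observed value, or a class with no observed values at all.
            filled.append(value)
    return filled


def one_hot_labels(labels, no_classes):
    """Idea b): turn one integer label column into no_classes columns
    holding 1 for the class a print belongs to and 0 elsewhere."""
    return [[1 if label == k else 0 for k in range(no_classes)]
            for label in labels]
```

With a column [10, MISSING, 30] whose prints all belong to class 0, the missing entry becomes 20.0. The indicator columns avoid implying an ordering between OS classes, which a single integer label column would suggest to the imputation model.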
Attachment: novelty_threshold.py
Attachment: imputed_novelty_score_statistics.pickle
Attachment: unimputed_novelty_score_statistics.pickle
Attachment: signature.asc (Digital signature)
_______________________________________________ Sent through the dev mailing list https://nmap.org/mailman/listinfo/dev Archived at http://seclists.org/nmap-dev/