Nmap Development mailing list archives
Re: IPv6 fingerprint database imputation of missing values
From: Alexandru Geana <alex () alegen net>
Date: Wed, 3 Jun 2015 12:08:50 +0200
Hello list,

I have been quite busy with this imputation topic for the past couple of weeks and I have some interesting results that I would like to share. I will explain my findings/code on a per source file/diff basis.

1) impute.py

This contains the main entry point for handling imputation of the feature matrix. I was told that during integration of submitted fingerprints the logistic regression model is trained multiple times. In order to reduce the time required, I coded imputation to reuse results from previous executions (stored in a file called features.npy) if they are available.

Another decision I made was to impute a minimal number of features that offer the maximum benefit. In order to find the features with the highest merit (for classification), I used scikit-learn and its implementation of a feature ranking algorithm [1]. This RFE algorithm also takes a lot of time, so it likewise stores/reuses intermediate results if available (in a file called features_to_impute). The nice part about scikit-learn is that it also uses liblinear.

Imputation can have different results depending on how features are grouped together, since during imputation each feature influences all other features. Throughout my testing, I discovered that grouping these features is an important decision (more on this later). I added the concept of imputation stages: one stage is equivalent to applying imputation to one group of features. One feature can be part of only one group. This means that groups are mutually exclusive and, theoretically, stages could be parallelized, but unfortunately rpy2 does not allow this (found out after implementing it).

2) impute_mice.R

The actual imputation is done with an R library called mice [2]. It uses the multiple imputation technique, which is more of a framework for applying imputation than an algorithm in itself. The glue between Python and R is done via rpy2. An imputation strategy can be specified for each variable depending on the type of the variable.
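To illustrate the feature ranking step mentioned under impute.py, here is a minimal scikit-learn sketch of the two variants (fixed N versus cross-validated N). The synthetic data and parameter values are made up for illustration; the real code runs on the nmap fingerprint feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the feature matrix; not the actual nmap data.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# scikit-learn's logistic regression with the liblinear solver.
estimator = LogisticRegression(solver="liblinear")

# "Give me the best N features": N is fixed by the caller.
rfe = RFE(estimator, n_features_to_select=3).fit(X, y)
best_n = [i for i, keep in enumerate(rfe.support_) if keep]

# Cross-validated variant: RFECV chooses N by itself.
rfecv = RFECV(estimator, cv=5).fit(X, y)
print(best_n, rfecv.n_features_)
```

Both fit calls are slow on the real data, which is why the results are cached to disk and reused.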
These strategies are listed and explained in chapter 3 (page 16) of the very well written manual of this library [3]. I kept the original names throughout the code.

One of the problems I had with mice was that the values of imputed categorical variables are remapped (e.g. values 64, 128 and 255 become 1, 2 and 3). In theory this should not be a huge problem, but in our case we need the original values to be present during training to reflect the values that are seen during classification. For this, I wrote some extra code to remap the values back, so everything should be fine now.

3) rfe.py

This file contains the code for handling the calls to scikit-learn. It is used within impute.py, but can also be executed as a script on its own. I tried to make '--help' as self-explanatory as possible. There are two functions which give the features with the highest merit: one of them is "give me the best N features" and accepts N as an argument, and the other uses cross-validation to find N. The algorithm takes a bit of time to finish, so I am including a precomputed "features_to_impute" file obtained with the second approach.

4) parse.py.diff

Added the code to parse nmap.set so that the imputation strategies and stages are also returned. The format is:

<feature> / <imp_strategy_1> [ | <imp_strategy_2> ] / <imp_stage>

5) train.py.diff

Some minor changes to account for new function signatures, some extra CLI arguments, and writing nmap.model to a file instead of stdout.

6) nmap.set.diff

A proof of concept of how nmap.set would look.

7) nmap.diff (optional)

Extra debugging information about which probes receive answers. This is not (directly) related to imputation, but I used it for my testing and thought I would share it.

Furthermore, I also want to share some theoretical findings. This multiple imputation technique applies learning algorithms to groups of features and tries to "learn" the missing values from the existing ones.
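Going back to the nmap.set format under point 4, parsing one such line could be sketched roughly like this. The function name is hypothetical (the real code is in the patched parse.py), and the feature name in the example is made up; "pmm" and "polyreg" are strategy names from the mice manual:

```python
def parse_set_line(line):
    """Parse '<feature> / <strategy1> [| <strategy2>] / <stage>'.

    Hypothetical helper, not the actual parse.py code.
    Returns (feature, list_of_strategies, stage).
    """
    feature, strategies, stage = (part.strip() for part in line.split("/"))
    return feature, [s.strip() for s in strategies.split("|")], int(stage)

# Made-up feature name; two alternative imputation strategies, stage 2.
print(parse_set_line("TCP_WIN / pmm | polyreg / 2"))
```

The stage number is what groups features together: all features sharing a stage are imputed in the same mice run.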
These are some of the conclusions that I have reached:

1) Features need to be grouped together such that there is relevant information for the algorithm to learn the missing values from. If irrelevant features are imputed together, then the algorithm will just learn a model that seems to fit, but the imputed values will not be realistic. My hypothesis is that this problem shrinks as the number of examples in the training set grows, but currently there are not enough examples to get away from it. The exact grouping of features still needs a bit more research.

2) As said in previous emails, multiple iterations are performed per imputed set. An iteration is equivalent to applying a learning algorithm to each feature (explained very well in [4], page 3, MICE steps). After a certain number of iterations, the values should stabilize and converge, meaning that after each iteration the missing values change very little or not at all. The exact minimum number of iterations still needs a bit more work. I thought that the number had to be very high (~100), but it seems that it is better to group the features in the right way and use a smaller number of iterations to achieve better results.

[1] http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html
[2] http://cran.r-project.org/web/packages/mice/index.html
[3] http://www.jstatsoft.org/v45/i03/paper
[4] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/pdf/nihms267760.pdf

Best regards,
Alexandru Geana
alegen.net

On 04/22, Alexandru Geana wrote:
Hello devs,

Attached to this e-mail I am sending a complete set of patches for imputation of missing values. There are not many differences from the previous versions, just a mechanism for reusing previous output, cleaner code, and diffs against the newest versions of the Python scripts in nmap-exp. Some of the files are diffs while others are the complete new versions of the files, since this makes them more readable considering the ratio between existing and added code. I am open to new suggestions and more feedback!

Best regards,
Alexandru Geana
alegen.net
Attachments:
- features_to_impute
- impute_mice.R
- impute.py
- nmap.diff
- nmap.set.diff
- parse.py.diff
- rfe.py
- train.py.diff
- signature.asc (digital signature)
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
Current thread:
- IPv6 fingerprint database imputation of missing values Alexandru Geana (Apr 10)
- Re: IPv6 fingerprint database imputation of missing values David Fifield (Apr 10)
- Re: IPv6 fingerprint database imputation of missing values Alexandru Geana (Apr 13)
- Re: IPv6 fingerprint database imputation of missing values Alexandru Geana (Apr 22)
- Re: IPv6 fingerprint database imputation of missing values Alexandru Geana (Jun 03)
- Re: IPv6 fingerprint database imputation of missing values Alexandru Geana (Jun 30)
- Re: IPv6 fingerprint database imputation of missing values Alexandru Geana (Apr 13)
- Re: IPv6 fingerprint database imputation of missing values David Fifield (Apr 10)