Nmap Development mailing list archives
Re: IPv6 fingerprint database imputation of missing values
From: Alexandru Geana <alex () alegen net>
Date: Wed, 3 Jun 2015 12:08:50 +0200
Hello list,

I have been quite busy with this imputation topic for the past couple of weeks and I have some interesting results that I would like to share. I will explain my findings/code on a per source file/diff basis.

1) impute.py

This contains the main entry point for handling imputation of the feature matrix. I was told that during integration of submitted fingerprints the logistic regression model is trained multiple times. In order to reduce the time required, I coded imputation to reuse results from previous executions (stored in a file called features.npy) if they are available.

Another decision I made was to impute a minimal number of features that offer the maximum benefit. In order to find the features with the highest merit (for classification), I used scikit-learn and its implementation of a feature ranking algorithm [1]. This RFE algorithm also takes a lot of time, so it likewise stores/reuses intermediate results if available (in a file called features_to_impute). The nice part about scikit-learn is that it also uses liblinear.

Imputation can have different results depending on how features are grouped together, since during imputation each feature influences all other features. Throughout my testing, I discovered that grouping these features is an important decision (more on this later). I added the concept of imputation stages: one stage is equivalent to applying imputation to one group of features. One feature can be part of only one group. This means that groups are mutually exclusive and, theoretically, stages could be parallelized, but unfortunately rpy2 does not allow this (found out after implementing it).

2) impute_mice.R

The actual imputation is done with an R library called mice [2]. It uses the multiple imputation technique, which is more of a framework for applying imputation than an algorithm in itself. The glue between Python and R is done via rpy2. An imputation strategy can be specified for each variable depending on the type of the variable.
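To illustrate the feature ranking step mentioned under impute.py, here is a minimal scikit-learn sketch of the two variants (fixed N versus cross-validated N). The synthetic data and parameter values are made up for illustration; the real code runs on the nmap fingerprint feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the feature matrix; not the actual nmap data.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# scikit-learn's logistic regression with the liblinear solver.
estimator = LogisticRegression(solver="liblinear")

# "Give me the best N features": N is fixed by the caller.
rfe = RFE(estimator, n_features_to_select=3).fit(X, y)
best_n = [i for i, keep in enumerate(rfe.support_) if keep]

# Cross-validated variant: RFECV chooses N by itself.
rfecv = RFECV(estimator, cv=5).fit(X, y)
print(best_n, rfecv.n_features_)
```

Both fit calls are slow on the real data, which is why the results are cached to disk and reused.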
These strategies are listed and explained in chapter 3 (page 16) of the very well written manual of this library [3]. I kept the original names throughout the code.

One of the problems I had with mice was that the values of imputed categorical variables are remapped (e.g. values 64, 128 and 255 become 1, 2 and 3). In theory this should not be a huge problem, but in our case we need the original values to be present during training to reflect the values that are seen during classification. For this, I wrote some extra code to remap the values back, so everything should be fine now.

3) rfe.py

This file contains the code for handling the calls to scikit-learn. It is used within impute.py, but can also be executed as a script on its own. I tried to make '--help' as self-explanatory as possible. There are two functions which give the features with the highest merit: one of them is "give me the best N features" and accepts N as an argument, and the other uses cross-validation to find N. The algorithm takes a bit of time to finish, so I am including a precomputed "features_to_impute" file obtained with the second approach.

4) parse.py.diff

Added the code to parse nmap.set so that the imputation strategies and stages are also returned. The format is:

<feature> / <imp_strategy_1> [ | <imp_strategy_2> ] / <imp_stage>

5) train.py.diff

Some minor changes to account for new function signatures, some extra CLI arguments, and writing nmap.model to a file instead of stdout.

6) nmap.set.diff

A proof of concept of how nmap.set would look.

7) nmap.diff (optional)

Extra debugging information about which probes receive answers. This is not (directly) related to imputation, but I used it for my testing and thought I would share it.

Furthermore, I also want to share some theoretical findings. This multiple imputation technique applies learning algorithms to groups of features and tries to "learn" the missing values from the existing ones.
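Going back to the nmap.set format under point 4, parsing one such line could be sketched roughly like this. The function name is hypothetical (the real code is in the patched parse.py), and the feature name in the example is made up; "pmm" and "polyreg" are strategy names from the mice manual:

```python
def parse_set_line(line):
    """Parse '<feature> / <strategy1> [| <strategy2>] / <stage>'.

    Hypothetical helper, not the actual parse.py code.
    Returns (feature, list_of_strategies, stage).
    """
    feature, strategies, stage = (part.strip() for part in line.split("/"))
    return feature, [s.strip() for s in strategies.split("|")], int(stage)

# Made-up feature name; two alternative imputation strategies, stage 2.
print(parse_set_line("TCP_WIN / pmm | polyreg / 2"))
```

The stage number is what groups features together: all features sharing a stage are imputed in the same mice run.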
These are some of the conclusions that I have reached:

1) Features need to be grouped together such that there is relevant information for the algorithm to learn the missing values from. If irrelevant features are imputed together, then the algorithm will just learn a model that seems to fit, but the imputed values will not be realistic. My hypothesis is that this problem shrinks as the number of examples in the training set grows, but currently there are not enough examples to get away from it. The exact grouping of features still needs a bit more research.

2) As said in previous emails, multiple iterations are performed per imputed set. An iteration is equivalent to applying a learning algorithm to each feature (explained very well in [4], page 3, MICE steps). After a certain number of iterations, the values should stabilize and converge, meaning that after each iteration the missing values change very little or not at all. The exact minimum number of iterations still needs a bit more work. I thought that the number had to be very high (~100), but it seems that it is better to group the features in the right way and use a smaller number of iterations to achieve better results.

[1] http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html
[2] http://cran.r-project.org/web/packages/mice/index.html
[3] http://www.jstatsoft.org/v45/i03/paper
[4] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/pdf/nihms267760.pdf

Best regards,
Alexandru Geana
alegen.net

On 04/22, Alexandru Geana wrote:
Hello devs,

Attached to this e-mail I am sending a complete set of patches for imputation of missing values. There are not many differences from the previous versions, just a mechanism for reusing previous output, cleaner code, and diffs against the newest versions of the Python scripts in nmap-exp. Some of the files are diffs while others are the complete new versions of the files, since this makes them more readable considering the ratio between existing and added code. I am open to new suggestions and more feedback!

Best regards,
Alexandru Geana
alegen.net
Attachments:
- features_to_impute
- impute_mice.R
- impute.py
- nmap.diff
- nmap.set.diff
- parse.py.diff
- rfe.py
- train.py.diff
- signature.asc (digital signature)
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
Current thread:
- IPv6 fingerprint database imputation of missing values Alexandru Geana (Apr 10)
- Re: IPv6 fingerprint database imputation of missing values David Fifield (Apr 10)
- Re: IPv6 fingerprint database imputation of missing values Alexandru Geana (Apr 13)
- Re: IPv6 fingerprint database imputation of missing values Alexandru Geana (Apr 22)
- Re: IPv6 fingerprint database imputation of missing values Alexandru Geana (Jun 03)
- Re: IPv6 fingerprint database imputation of missing values Alexandru Geana (Jun 30)
- Re: IPv6 fingerprint database imputation of missing values Alexandru Geana (Apr 13)
- Re: IPv6 fingerprint database imputation of missing values David Fifield (Apr 10)