Nmap Development mailing list archives

Prabhjyot's Status Report #8 of 17

From: Prabhjyot Singh Sodhi <prabhjyotsingh95 () gmail com>
Date: Tue, 21 Jun 2016 11:25:27 +0530

Hi list!

It was another great week. I have finished changing the database
representation.

Accomplishments:
- Figured out a seg fault that I was stuck on when trying to get the top 3
predictions from opencv (as of now opencv just returns one predicted class)
- Figured out the bare minimum set of opencv modules that we would need to
ship if we choose to go with the random forest model. (Shipping the entire
library would be too costly for nmap, memory-wise.)
- As I mentioned above, finished changing db representation. After the
change, we have 73 groups instead of the original 96.
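
To make the "top 3 predictions" idea concrete, here is a minimal sketch of
one way to rank classes for a random forest: count the per-tree votes and
keep the most-voted groups. This is not the opencv API; it assumes the
individual tree votes can be obtained somehow, and the function name and
the group ids in the example are made up for illustration.

```python
from collections import Counter


def top_k_classes(tree_votes, k=3):
    """Return up to k (class, vote_fraction) pairs, most-voted first.

    tree_votes is a list with one predicted class per tree (hypothetical
    input; how to obtain it from a real model is not covered here).
    """
    if not tree_votes:
        return []
    counts = Counter(tree_votes)
    total = len(tree_votes)
    # Sort by descending vote count, then by label so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(label, count / total) for label, count in ranked[:k]]


# Made-up votes from a 10-tree forest for a single fingerprint.
votes = [12, 12, 40, 12, 7, 40, 12, 7, 3, 12]
print(top_k_classes(votes))
```

The vote fraction is only a rough confidence score, but it is enough to
show a second and third guess next to the best match.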

I have been working on changing the database representation for the past
few weeks, but why change it at all?
The current database representation, which is used by the logistic
regression model, is based on the fact that all prints that are members of
a group are very similar to each other (value-wise). This is in contrast to
how classes are defined in a typical learning system (a learning system is
one where you are trying to teach a system to do something; prediction of
the OS in our case). Usually we'd have a target variable (the operating
system in our case), one group per operating system (or per set of
operating systems), and all prints corresponding to that operating system
would go into its group. This is exactly what we have attempted with the
new representation.

Now, to achieve this, one simple solution would have been to have one
group for each operating system version (so one per Linux kernel). Given
the low number of prints, this would have resulted in a very high number of
groups with very few prints in each, which would have made prediction more
difficult. That is why we tried to keep similar versions of the same
operating system in the same group.
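
The trade-off above can be sketched with a toy example. The snippet below
groups prints either per exact version or per major version and shows how
the group count changes. Truncating the version string is only an
assumption made for illustration; the actual criteria used to decide which
versions are "similar" are not described here, and the prints listed are
made up.

```python
from collections import defaultdict


def group_key(os_family, version, granularity):
    """Build a group label from the first `granularity` version components."""
    parts = version.split(".")
    return (os_family, ".".join(parts[:granularity]))


def build_groups(prints, granularity):
    """Map each group label to the list of print ids that fall into it."""
    groups = defaultdict(list)
    for print_id, os_family, version in prints:
        groups[group_key(os_family, version, granularity)].append(print_id)
    return dict(groups)


# Made-up prints: (print id, OS family, version).
prints = [
    ("p1", "FreeBSD", "9.1"),
    ("p2", "FreeBSD", "9.3"),
    ("p3", "FreeBSD", "10.0"),
    ("p4", "FreeBSD", "10.2"),
    ("p5", "FreeBSD", "10.3"),
]

per_version = build_groups(prints, granularity=2)  # one group per version
merged = build_groups(prints, granularity=1)       # similar versions merged

print(len(per_version), "groups with one print each")
print(len(merged), "groups:", merged)
```

With one group per version every group holds a single print, which leaves
the model almost nothing to learn from; merging gives fewer, larger groups.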

We were able to do this for Windows, IBM, Macintosh, and FreeBSD-type
systems. For Linux, we decided to stick with the existing representation
(with small changes) because of the complexity of how its groups were made.

- I did some basic tests (training and testing on the entire db) and the
model was able to identify all prints correctly. (This is misleading,
though, since the model is tested on the same prints it was trained on.
I'll have better test results (80:20 training:testing split) and some
comparison test results by next week.)
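
For reference, an 80:20 split could look roughly like the sketch below. It
splits per group so that small groups are not left out of the test set by
chance. This is only an illustration of the idea, not the procedure that
will actually be used; the function name, the seed, and the rule of holding
out at least one print from every group with two or more prints are all
assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split print indices into train/test lists, group by group.

    labels[i] is the group of print i. Groups with a single print stay
    entirely in the training set (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for index, label in enumerate(labels):
        by_group[label].append(index)

    train, test = [], []
    for label in sorted(by_group):
        indices = by_group[label]
        rng.shuffle(indices)
        if len(indices) > 1:
            # Hold out roughly test_fraction of the group, at least one print.
            n_test = max(1, int(round(len(indices) * test_fraction)))
        else:
            n_test = 0
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Made-up group labels for 16 prints.
labels = ["A"] * 10 + ["B"] * 5 + ["C"]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), "training prints,", len(test_idx), "test prints")
```

Accuracy measured on the held-out prints says how well the model
generalizes, which testing on the training data cannot show.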

Priorities:
- Try to figure out top 3 output in opencv model
- Test the model on the new db with an 80:20 training:testing split.
- Compare with existing logistic regression model.
- Document the procedure to go from the previous db representation to the
new one (and the decisions that were taken on the way)

I know I have written a lot in this report. Please feel free to ask away if
something doesn't make sense or if you are interested in discussing
something.

Cheers,
Prabhjyot

_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
