Nmap Development mailing list archives
Prabhjyot's Status Report #8 of 17
From: Prabhjyot Singh Sodhi <prabhjyotsingh95 () gmail com>
Date: Tue, 21 Jun 2016 11:25:27 +0530
Hi list!

It was another great week. I have finished changing the database representation.

Accomplishments:
- Figured out a seg fault that I was stuck on when trying to get the top 3 predictions from OpenCV (as of now, OpenCV returns only one predicted class).
- Figured out the bare minimum modules (from OpenCV) that will be required for shipping if we choose to go with the random forest model. (Shipping the entire library would be too costly for Nmap memory-wise.)
- As mentioned above, finished changing the db representation. After the change, we have 73 groups instead of the original 96.

I have been working on changing the database representation for the past few weeks, but why change it at all?

The current database representation, which is used by the logistic regression model, is based on the fact that all prints that are members of a group are very similar to each other (value-wise). This is in contrast to how classes are defined in normal learning systems (a learning system is one in which you are trying to teach a system to do something; in our case, to predict the OS). Usually we'd have a target variable (the operating system in our case) and one group per operating system (or per set of operating systems), and all prints corresponding to that operating system would go into that group. This is exactly what we have attempted with the new representation.

To achieve this, one simple solution would have been to have one group for each operating system version (so one per Linux kernel). Given the low number of prints, this would have resulted in a very high number of groups with very few prints in each, which would have made prediction more difficult. That is why we tried to keep similar versions of the same operating system in the same group. We were able to do this for Windows, IBM, Macintosh, and FreeBSD type systems.
For Linux, we decided to stick with the existing representation (with small changes) because of the complexity of the way the groups were made.

- I did some basic tests (training and testing on the entire db), and the model was able to identify all prints correctly. (This is misleading, though, because the model was tested on the same prints it was trained on. I'll have better test results (80:20 training:testing split) and some comparison results by next week.)

Priorities:
- Try to figure out top 3 output in the OpenCV model.
- Test the model on the new db with an 80:20 training:testing split.
- Compare with the existing logistic regression model.
- Document the procedure for going from the previous db representation to the new one (and the decisions taken along the way).

I know I have written a lot in this report. Please feel free to ask away if something doesn't make sense or if you are interested in discussing something.

Cheers,
Prabhjyot
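
For readers who want to picture the 80:20 test mentioned above, here is a minimal Python sketch. It is not the code used in this project. It assumes each print is a (features, group) pair, and the function name, group names, and toy data are all hypothetical. The split is done per group so that small groups still keep prints on the training side, which is one reasonable choice when there are few prints per group.

```python
import random
from collections import defaultdict


def stratified_split(prints, test_fraction=0.2, seed=0):
    """Split (features, group) pairs so that each group contributes
    roughly test_fraction of its prints to the test set."""
    by_group = defaultdict(list)
    for features, group in prints:
        by_group[group].append((features, group))

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for group in sorted(by_group):
        members = by_group[group]
        rng.shuffle(members)
        if len(members) >= 2:
            # Hold out at least one print, but never empty the training side.
            n_test = round(len(members) * test_fraction)
            n_test = min(max(n_test, 1), len(members) - 1)
        else:
            # A single-print group can only be used for training.
            n_test = 0
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical toy data: 10 + 5 + 1 prints in three made-up groups.
prints = (
    [((i, i % 3), "windows") for i in range(10)]
    + [((i, i % 5), "freebsd") for i in range(5)]
    + [((99, 0), "rare")]
)
train, test = stratified_split(prints)
print(len(train), len(test))
```

Training on `train` and scoring only on `test` avoids the overly optimistic result that comes from testing on the same prints the model was trained on.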
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/