Nmap Development mailing list archives

RE: Developing html parsing- question


From: Giacomo Mantani <giacomo.mantani () studio unibo it>
Date: Thu, 14 Jul 2016 16:40:59 +0000

Hi Daniel and Johanna,

I worked on an HTML parsing library using LPeg some time ago (in order to participate in GSoC).
I have attached the code and a TODO. In the TODO, [v] marks items already implemented in html-parser.lua.

Cheers
________________________________________
From: dev [dev-bounces () nmap org] on behalf of Johanna Curiel [johannapcuriel () gmail com]
Sent: Thursday, July 14, 2016 4:50 PM
To: Daniel Miller
Cc: nmap list
Subject: Re: Developing html parsing- question

Thank you Daniel for the feedback.

> We'd be glad to help answer any other questions you might have or offer feedback on early drafts. Even a library that
> collects the existing parsing functions together would be useful, as we can then make incremental improvements that
> would apply across all the scripts using it.

Given the challenges, we are looking into building this core library with specific functions (in C or C++) that can be 
used by other modules such as the NSE engine; correct me if I'm wrong.

I haven't become as familiar with development of the core as with NSE scripting. I'm quite familiar with developing in 
C and I would like to try it out.

Do you have any guidelines regarding core library development/code conventions that should be used for proper 
development?

Cheers


On Thu, Jul 14, 2016 at 9:34 AM, Daniel Miller <bonsaiviking () gmail com> wrote:
Johanna,

Thanks for putting thought towards this problem. We aren't looking for a script, but for a library of functions to 
replace and improve the parsing portions of existing scripts.

We are also not looking for a DOM-model parser, since in most cases that would be overkill. NSE's needs are for quick 
extraction of values and location of forms, comments, and "interesting" data from HTML pages. We want a library of 
functions to handle these kinds of tasks that could be used to replace the various pattern matching portions of 
existing http-* scripts as well as the HTML-parsing code in http.lua and httpspider.lua.
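
As a rough illustration of the "quick extraction" style described above, here is a minimal sketch in plain Lua. The 
function names get_title and get_comments are hypothetical and are not part of http.lua, httpspider.lua, or any 
existing NSE library; the sketch only uses standard Lua string patterns.

```lua
-- Hypothetical helpers for illustration only; not an existing NSE API.

-- Return the text of the first <title> element, or nil if none is found.
local function get_title(body)
  -- Character classes such as [Tt] make the tag name case-insensitive;
  -- [^>]* skips any attributes on the opening tag.
  local title = body:match("<[Tt][Ii][Tt][Ll][Ee][^>]*>(.-)</[Tt][Ii][Tt][Ll][Ee]%s*>")
  if not title then
    return nil
  end
  -- Collapse runs of whitespace, then trim both ends.
  title = title:gsub("%s+", " "):gsub("^ ", ""):gsub(" $", "")
  return title
end

-- Return a list (possibly empty) of the contents of all HTML comments.
local function get_comments(body)
  local comments = {}
  -- "-" is a magic character in Lua patterns, so it is escaped as "%-".
  for text in body:gmatch("<!%-%-(.-)%-%->") do
    comments[#comments + 1] = text
  end
  return comments
end
```

Both functions work on the raw response body as a byte string, so UTF-8 content passes through untouched. They make no 
attempt to skip markup that appears inside <script> blocks, which a real library would have to consider.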

The challenges we face:

* Unicode and other multi-byte encodings. We should at least be robust enough to handle UTF-8, since the HTML tags 
would still be ASCII-equivalent.
* Quirks-mode HTML. That means improperly nested tags like <font><a>text</font></a> or unescaped entities like & or < 
within quoted attributes, and other things that would generally break an XML parser. These can introduce ambiguity in 
the DOM model, which is part of why we would avoid that method.
* Mixed-case, strange whitespace, irregular use of quote characters, HTML within javascript strings between <script> 
tags, XHTML vs HTML4 vs HTML5, and other general weirdness.
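
To make a few of these challenges concrete, the following is a minimal sketch, in plain Lua, of a tolerant scanner 
that finds opening tags by name regardless of case and reads attributes that are double-quoted, single-quoted, or 
unquoted, including a ">" or "&" inside a quoted value. The names parse_attributes and find_tags are hypothetical, 
not existing NSE functions, and the sketch deliberately ignores comments and <script> contents.

```lua
-- Hypothetical helpers for illustration only; not an existing NSE API.

-- Parse attributes starting at byte offset `pos` (just after the tag name).
-- Returns a table keyed by lowercased attribute name, and the offset just
-- past the closing ">" (or past the end of input if the tag never closes).
local function parse_attributes(body, pos)
  local attrs = {}
  while true do
    -- Skip whitespace and stray slashes (covers XHTML-style "/>").
    pos = body:match("^[%s/]*()", pos)
    local c = body:sub(pos, pos)
    if c == "" then return attrs, pos end        -- ran off the end of input
    if c == ">" then return attrs, pos + 1 end   -- end of the tag
    local name, next_pos = body:match("^([^%s=/>]+)%s*()", pos)
    if not name then
      pos = pos + 1                              -- stray "=": skip it
    else
      local value = ""                           -- valueless attribute
      pos = next_pos
      if body:sub(pos, pos) == "=" then
        pos = body:match("^=%s*()", pos)
        local quote = body:sub(pos, pos)
        if quote == '"' or quote == "'" then
          -- Quoted value: everything up to the matching quote, so ">",
          -- "<" and "&" inside the value do not end the tag.
          local close = body:find(quote, pos + 1, true)
          value = body:sub(pos + 1, (close or #body + 1) - 1)
          pos = (close or #body) + 1
        else
          -- Unquoted value: runs until whitespace or ">".
          value = body:match("^[^%s>]*", pos)
          pos = pos + #value
        end
      end
      attrs[name:lower()] = value
    end
  end
end

-- Return a list of attribute tables, one per opening tag named `tag_name`.
local function find_tags(body, tag_name)
  local tags = {}
  -- Byte-wise lowercasing keeps offsets aligned with the original body.
  local lowered = body:lower()
  local needle = "<" .. tag_name:lower()
  local pos = 1
  while true do
    local s, e = lowered:find(needle, pos, true)  -- plain-text search
    if not s then break end
    pos = e + 1
    -- Require a delimiter after the name so "<a" does not match "<abbr".
    if lowered:find("^[%s/>]", pos) then
      local attrs
      attrs, pos = parse_attributes(body, pos)
      tags[#tags + 1] = attrs
    end
  end
  return tags
end
```

Because the scanner never builds a tree, improperly nested tags such as <font><a>text</font></a> do not affect it: 
each opening tag is reported on its own, which matches the non-DOM approach described above.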

We'd be glad to help answer any other questions you might have or offer feedback on early drafts. Even a library that 
collects the existing parsing functions together would be useful, as we can then make incremental improvements that 
would apply across all the scripts using it.

Dan


On Tue, Jul 12, 2016 at 10:45 PM, Johanna Curiel <johannapcuriel () gmail com> wrote:
Hello,

I have been taking a look at the priority list for NSE scripts:
https://secwiki.org/w/Nmap/Script_Ideas#HTML_parsing

I checked the http-title.nse script; correct me if I'm wrong, but that script does not seem to be using the slaxml 
library.

Is the idea to create this html-parsing script for the entire DOM (not just title or per HTML-tag) using slaxml?

For example, the script could be called http-parsing.nse, and it would dissect an entire HTML page with its tags.

Your feedback is appreciated.

cheers


_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/


Attachment: html-parser.lua
Description: html-parser.lua

Attachment: TODO
Description: TODO

