Nmap Development mailing list archives

Re: Developing html parsing- question


From: Johanna Curiel <johannapcuriel () gmail com>
Date: Thu, 14 Jul 2016 10:50:10 -0400

Thank you Daniel for the feedback.

> We'd be glad to help answer any other questions you might have or offer
> feedback on early drafts. Even a library that collects the existing parsing
> functions together would be useful, as we can then make incremental
> improvements that would apply across all the scripts using it.

Given the challenges, we are looking at building this as a core library of
specific functions (in C or C++) that can be used by other modules such as
the NSE engine; correct me if I'm wrong.

I am not yet as familiar with development of the core as I am with NSE
scripting. I'm quite familiar with developing in C and I would like to try
it out.

Do you have any guidelines regarding core library development/code
conventions that should be used for proper development?

Cheers


On Thu, Jul 14, 2016 at 9:34 AM, Daniel Miller <bonsaiviking () gmail com>
wrote:

Johanna,

Thanks for putting thought towards this problem. We aren't looking for a
script, but for a library of functions to replace and improve the parsing
portions of existing scripts.

We are also not looking for a DOM-model parser, since in most cases that
would be overkill. NSE's needs are for quick extraction of values and
location of forms, comments, and "interesting" data from HTML pages. We
want a library of functions to handle these kinds of tasks that could be
used to replace the various pattern matching portions of existing http-*
scripts as well as the HTML-parsing code in http.lua and httpspider.lua.
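
To illustrate the kind of "quick extraction" described above, here is a
minimal sketch of pattern-based helpers that pull a page title and HTML
comments without building a DOM. It is written in Python purely for
illustration (an NSE library would be Lua), and the names extract_title and
extract_comments are hypothetical, not existing Nmap functions.

```python
import re

# Hypothetical helpers for illustration; not part of Nmap or NSE.
# Case-insensitive, tolerant of whitespace/attributes inside the tags.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    m = TITLE_RE.search(html)
    if m is None:
        return None
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(m.group(1).split())


def extract_comments(html):
    """Return the stripped text of every HTML comment in the page."""
    return [c.strip() for c in COMMENT_RE.findall(html)]


page = ("<HTML><Head><TITLE >\n  Admin   Login </title></head>"
        "<!-- TODO: remove debug --><body></body></html>")
print(extract_title(page))
print(extract_comments(page))
```

A shared library of small functions like these could replace the ad hoc
patterns in individual scripts, so a fix made once applies everywhere.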

The challenges we face:

* Unicode and other multi-byte encodings. We should at least be robust
enough to handle UTF-8, since the HTML tags would still be ASCII-equivalent.
* Quirks-mode HTML. That means improperly nested tags like
<font><a>text</font></a> or unescaped entities like & or < within quoted
attributes, and other things that would generally break an XML parser.
These can introduce ambiguity in the DOM model, which is part of why we
would avoid that method.
* Mixed-case, strange whitespace, irregular use of quote characters, HTML
within javascript strings between <script> tags, XHTML vs HTML4 vs HTML5,
and other general weirdness.
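
As a rough sketch of how some of these challenges might be handled without
a DOM, the Python below (illustration only; an NSE version would be Lua)
scans raw bytes for opening tags. It assumes UTF-8 input, treats tag and
attribute names as ASCII, accepts mixed case, odd whitespace, and single,
double, or missing quotes, and allows unescaped & or < inside quoted
values. The name find_tags is hypothetical, and the sketch does not try to
handle HTML embedded in <script> strings.

```python
import re

# Opening tag: name, then any mix of quoted strings and other non-">" bytes.
# Quoted values may therefore contain "<", ">" or "&" without ending the tag.
TAG_RE = re.compile(rb"<([a-zA-Z][a-zA-Z0-9]*)((?:\"[^\"]*\"|'[^']*'|[^'\">])*)>")
# Attribute: name, optionally followed by = and a quoted or bare value.
ATTR_RE = re.compile(
    rb"([^\s=/\"'<>]+)(?:\s*=\s*(?:\"([^\"]*)\"|'([^']*)'|([^\s\"'>]+)))?"
)


def find_tags(html, name):
    """Return a list of attribute dicts, one per opening tag called `name`.

    `html` is bytes; tag and attribute names are matched case-insensitively.
    """
    wanted = name.lower().encode("ascii")
    results = []
    for m in TAG_RE.finditer(html):
        if m.group(1).lower() != wanted:
            continue
        attrs = {}
        for a in ATTR_RE.finditer(m.group(2)):
            key = a.group(1).lower().decode("ascii", "replace")
            raw = a.group(2) or a.group(3) or a.group(4) or b""
            # Keep the first occurrence of a duplicated attribute.
            attrs.setdefault(key, raw.decode("utf-8", "replace"))
        results.append(attrs)
    return results


sample = ("<FORM  Action = '/login?next=/a&b=<c>'\n  method=POST>"
          "<input NAME=\"user\" value='Jos\u00e9'></form>").encode("utf-8")
print(find_tags(sample, "form"))
print(find_tags(sample, "input"))
```

Working on bytes keeps multi-byte UTF-8 sequences intact, since the
delimiters being matched are all single ASCII bytes.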

We'd be glad to help answer any other questions you might have or offer
feedback on early drafts. Even a library that collects the existing parsing
functions together would be useful, as we can then make incremental
improvements that would apply across all the scripts using it.

Dan


On Tue, Jul 12, 2016 at 10:45 PM, Johanna Curiel <johannapcuriel () gmail com>
wrote:

Hello,

Taking a look at the priority list for NSE scripts:
https://secwiki.org/w/Nmap/Script_Ideas#HTML_parsing

I checked the http-title.nse script; correct me if I'm wrong, but that
script does not seem to be using the slaxml library.

Is the idea to create this HTML-parsing script for the entire DOM (not
just the title or individual HTML tags) using slaxml?

For example, the script could be called http-parsing.nse and it would
dissect an entire HTML page with its tags.

Your feedback is appreciated.

cheers


_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/


