Nmap Development mailing list archives

Re: Developing html parsing- question


From: Daniel Miller <bonsaiviking () gmail com>
Date: Thu, 14 Jul 2016 08:34:50 -0500

Johanna,

Thanks for putting thought towards this problem. We aren't looking for a
script, but for a library of functions to replace and improve the parsing
portions of existing scripts.

We are also not looking for a DOM-model parser, since in most cases that
would be overkill. NSE's needs are for quick extraction of values and
location of forms, comments, and "interesting" data from HTML pages. We
want a library of functions to handle these kinds of tasks that could be
used to replace the various pattern matching portions of existing http-*
scripts as well as the HTML-parsing code in http.lua and httpspider.lua.
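
To illustrate what "quick extraction of values" without a DOM could look like, here is a minimal sketch. It is written in Python only to keep it short and runnable; an NSE library would be Lua. The function names extract_title and extract_comments are hypothetical and do not come from any existing Nmap library.

```python
import re

# Hypothetical helpers for illustration; not an existing NSE API.
# Tag names are matched case-insensitively, and DOTALL lets the
# content span multiple lines.
_TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>",
                       re.IGNORECASE | re.DOTALL)
_COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    m = _TITLE_RE.search(html)
    if m is None:
        return None
    # Collapse runs of whitespace so a multi-line title comes back on one line.
    return " ".join(m.group(1).split())


def extract_comments(html):
    """Return the stripped body of every HTML comment, in document order."""
    return [c.strip() for c in _COMMENT_RE.findall(html)]
```

Each function scans the raw text once and returns plain values, so a script needing only the title or the comments never has to build or walk a tree.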

The challenges we face:

* Unicode and other multi-byte encodings. We should at least be robust
enough to handle UTF-8, since the HTML tags would still be ASCII-equivalent.
* Quirks-mode HTML. That means improperly nested tags like
<font><a>text</font></a> or unescaped entities like & or < within quoted
attributes, and other things that would generally break an XML parser.
These can introduce ambiguity in the DOM model, which is part of why we
would avoid that method.
* Mixed-case, strange whitespace, irregular use of quote characters, HTML
within javascript strings between <script> tags, XHTML vs HTML4 vs HTML5,
and other general weirdness.
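
As a rough sketch of how a function-based approach might tolerate some of these quirks (mixed case, irregular quoting, and unescaped > or < inside quoted attributes), consider the following. It is Python for brevity, not Lua, and find_tags is a hypothetical name, not part of any existing library. It does not try to skip markup inside <script> strings or comments; that would need extra handling.

```python
import re

# Hypothetical helper for illustration; not an existing NSE API.
_ATTR_RE = re.compile(
    r"""([^\s"'<>/=]+)          # attribute name
        (?:\s*=\s*
           (?:"([^"]*)"         # double-quoted value
             |'([^']*)'         # single-quoted value
             |([^\s"'>]+)       # unquoted value
           )
        )?""",
    re.VERBOSE,
)


def find_tags(html, name):
    """Yield a dict of attributes for each opening tag called `name`."""
    # Quoted runs are consumed whole, so a '>' or '<' inside an attribute
    # value does not end the tag early.
    tag_re = re.compile(
        r"<" + re.escape(name) + r"""\b((?:"[^"]*"|'[^']*'|[^'">])*)>""",
        re.IGNORECASE,
    )
    for m in tag_re.finditer(html):
        attrs = {}
        for a in _ATTR_RE.finditer(m.group(1)):
            key = a.group(1).lower()  # attribute names are case-insensitive
            value = a.group(2) or a.group(3) or a.group(4) or ""
            attrs.setdefault(key, value)  # first occurrence wins
        yield attrs
```

For example, list(find_tags("<FORM action='/a>b' METHOD=post >", "form")) yields a single dict with action set to "/a>b" and method set to "post". Because nothing here depends on correct nesting, input like <font><a>text</font></a> does not affect the result.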

We'd be glad to help answer any other questions you might have or offer
feedback on early drafts. Even a library that collects the existing parsing
functions together would be useful, as we can then make incremental
improvements that would apply across all the scripts using it.

Dan


On Tue, Jul 12, 2016 at 10:45 PM, Johanna Curiel <johannapcuriel () gmail com>
wrote:

Hello,

I took a look at the priority list for NSE scripts:
https://secwiki.org/w/Nmap/Script_Ideas#HTML_parsing

I checked the http-title.nse script. Correct me if I'm wrong, but that
script does not seem to be using the slaxml library.

Is the idea to create this html-parsing script for the entire DOM (not
just title or per HTML-tag) using slaxml?

For example, the script could be called http-parsing.nse and it would
dissect an entire HTML page with its tags.

Your feedback is appreciated.

cheers


_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/

