Nmap Development mailing list archives

Re: [NSE] XML Parser RFC


From: Daniel Miller <bonsaiviking () gmail com>
Date: Wed, 29 Jun 2011 16:25:24 -0500

Patrick,

This sounds like a neat idea. I don't have specific plans to use such a
parser, but a crucial design decision you'll have to make early on is
whether to design a DOM or SAX parser. DOM parsers do what you described:
load the whole document into a data structure. This is simple to use, but
it will not work for malformed or incomplete documents, and it can use a lot
of memory if the document is large enough. The alternative, SAX parsing,
treats each element as an event and lets the user register callbacks
for start and end tags (a simplified explanation). This has the
advantage of being able to handle incomplete documents, as well as reduced
memory requirements (think of registering a callback for the <host> element
in Nmap XML--no need to store the whole scan in memory). The downside is that
it is sequential-access, not random-access, so it's not suited to all document
types or processing schemes, and it generally takes more programming know-how
to implement and to use.
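
To make the <host> callback idea concrete, here is a minimal sketch in Python using the standard library's xml.sax module (not lpeg or NSE, just an illustration of the event model). The names HostHandler and collect_hosts, and the abridged hand-written sample document, are assumptions made for this example.

```python
import xml.sax


class HostHandler(xml.sax.ContentHandler):
    """Builds one small record per <host> and hands it to a callback.

    Only the host currently being parsed is held in memory.
    """

    def __init__(self, on_host):
        super().__init__()
        self.on_host = on_host   # hypothetical user callback
        self.current = None      # record for the <host> being parsed, if any

    def startElement(self, name, attrs):
        if name == "host":
            self.current = {"addresses": [], "ports": []}
        elif self.current is not None:
            if name == "address":
                self.current["addresses"].append(attrs.get("addr"))
            elif name == "port":
                self.current["ports"].append(attrs.get("portid"))

    def endElement(self, name):
        if name == "host" and self.current is not None:
            self.on_host(self.current)
            self.current = None


def collect_hosts(data):
    """Return (hosts seen so far, parse error or None)."""
    hosts = []
    error = None
    try:
        xml.sax.parseString(data, HostHandler(hosts.append))
    except xml.sax.SAXParseException as exc:
        # Callbacks for every completed <host> have already fired by now.
        error = exc
    return hosts, error


# Abridged, hand-written sample in the general shape of Nmap XML output.
SAMPLE = (
    b'<nmaprun>'
    b'<host><address addr="192.0.2.1" addrtype="ipv4"/>'
    b'<ports><port protocol="tcp" portid="22"/></ports></host>'
    b'<host><address addr="192.0.2.2" addrtype="ipv4"/>'
    b'<ports><port protocol="tcp" portid="80"/></ports></host>'
    b'</nmaprun>'
)

# Simulate an interrupted scan: cut the document off inside the second host.
TRUNCATED = SAMPLE[:SAMPLE.rindex(b"<ports>")]

if __name__ == "__main__":
    print(collect_hosts(SAMPLE))
    print(collect_hosts(TRUNCATED))
```

With the truncated input, the first host is still delivered to the callback before the parser reports the error, which is the behavior a DOM-style loader cannot offer.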

Just some things to think about!
Dan

On Wed, Jun 29, 2011 at 3:32 PM, Patrick Donnelly <batrick () batbytes com> wrote:

Hi list,

I'm working on an XML parser for NSE, written in lpeg [1], that should
be done pretty soon. I'm admittedly not an XML savant and rarely use it
myself, so I'm relying heavily on the standards and their description of
the syntax, which fortunately maps easily to an lpeg grammar.

Most parsers I've seen just decompose the XML file into (nested)
tables using basic pattern matching. The new lpeg module would require
XML well-formedness (at least to the degree of syntax, not necessarily
semantics; e.g. a closing tag may not need to match an opening tag).
I'm thinking of just extracting the information into tables
(similarly).

For example, a simple XML parser [2], given:

<methodCall kind="xuxu">
 <methodName>examples.getStateName</methodName>
 <params>
   <param>
     <value><i4>41</i4></value>
   </param>
 </params>
</methodCall>

would produce this table (printed for humans):

[1] => table
   (
      [1] => table
          (
             [1] => examples.getStateName
             [xarg] => table
                 (
                 )
             [label] => methodName
          )
      [2] => table
          (
             [1] => table
                 (
                    [1] => table
                        (
                           [1] => table
                               (
                                  [1] => 41
                                  [xarg] => table
                                      (
                                      )
                                  [label] => i4
                               )
                           [xarg] => table
                               (
                               )
                           [label] => value
                        )
                    [xarg] => table
                        (
                        )
                    [label] => param
                 )
             [xarg] => table
                 (
                 )
             [label] => params
          )
      [xarg] => table
          (
             [kind] => xuxu
          )
      [label] => methodCall
   )
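
For readers who want to experiment with this layout outside Lua, the following is a rough Python sketch that builds a comparable nested structure with the standard xml.etree.ElementTree module. It is not the lpeg parser under discussion. The function name to_table and the "children" key (standing in for the Lua table's numbered entries) are assumptions for illustration; "label" and "xarg" follow the printed table above.

```python
import xml.etree.ElementTree as ET


def to_table(elem):
    """Convert an Element into a dict with 'label', 'xarg' and 'children'.

    'children' holds text strings and nested dicts in document order,
    playing the role of the [1], [2], ... entries in the Lua table.
    """
    children = []
    if elem.text and elem.text.strip():
        children.append(elem.text.strip())
    for child in elem:
        children.append(to_table(child))
        # Text that follows a child element belongs to the parent.
        if child.tail and child.tail.strip():
            children.append(child.tail.strip())
    return {"label": elem.tag, "xarg": dict(elem.attrib), "children": children}


DOC = """<methodCall kind="xuxu">
 <methodName>examples.getStateName</methodName>
 <params>
   <param>
     <value><i4>41</i4></value>
   </param>
 </params>
</methodCall>"""

if __name__ == "__main__":
    import pprint
    pprint.pprint(to_table(ET.fromstring(DOC)))
```

Note that ElementTree loads the whole document before to_table runs and rejects input that is not well-formed, so this sketch only illustrates the output shape.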

I'm planning to have similar output but also account for all of XML's
various oddities.

I'm hoping the list can provide feedback on what they would like to
see come out of the parser and in particular what they would like to
see supported.

[1] http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
[2] http://lua-users.org/wiki/LuaXml

--
- Patrick Donnelly
_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/
