Nmap Development mailing list archives

Re: [NSE] XML Parser RFC


From: David Fifield <david () bamsoftware com>
Date: Fri, 1 Jul 2011 17:28:10 -0700

On Wed, Jun 29, 2011 at 04:32:07PM -0400, Patrick Donnelly wrote:
Hi list,

I'm working on an XML parser that should be done pretty soon in lpeg
[1] for NSE. I'm admittedly not an XML savant and
rarely use it myself so I'm relying heavily on the standards and their
description of the syntax which fortunately maps easily to lpeg as a
grammar.

Most parsers I've seen just decompose the XML file into (nested)
tables using basic pattern matching. The new lpeg module would require
XML well-formedness (at least to the degree of syntax, not necessarily
semantics; e.g. a closing tag may not need to match an opening tag).
I'm thinking of just extracting the information into tables
(similarly).

An example of a simple XML parser [2] given:

<methodCall kind="xuxu">
  <methodName>examples.getStateName</methodName>
  <params>
    <param>
      <value><i4>41</i4></value>
    </param>
  </params>
</methodCall>

would produce this table (printed for humans):

[1] => table
    (
       [1] => table
           (
              [1] => examples.getStateName
              [xarg] => table
                  (
                  )
              [label] => methodName
           )
       [2] => table
           (
              [1] => table
                  (
                     [1] => table
                         (
                            [1] => table
                                (
                                   [1] => 41
                                   [xarg] => table
                                       (
                                       )
                                   [label] => i4
                                )
                            [xarg] => table
                                (
                                )
                            [label] => value
                         )
                     [xarg] => table
                         (
                         )
                     [label] => param
                  )
              [xarg] => table
                  (
                  )
              [label] => params
           )
       [xarg] => table
           (
              [kind] => xuxu
           )
       [label] => methodCall
    )

I'm planning to have similar output but also account for all of XML's
various oddities.
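
To make the table layout above concrete, here is a rough sketch of the same decomposition. It is written in Python with the standard library's ElementTree purely for illustration (it is not the proposed lpeg/NSE module), and the `to_table` name and the `children` key are assumptions of this sketch; the dump above stores children under integer keys instead.

```python
import xml.etree.ElementTree as ET

def to_table(elem):
    """Convert an element into a nested dict shaped like the dump above."""
    node = {"label": elem.tag, "xarg": dict(elem.attrib), "children": []}
    text = (elem.text or "").strip()
    if text:
        node["children"].append(text)
    for child in elem:
        node["children"].append(to_table(child))
        # Text that follows a child element belongs to the parent.
        tail = (child.tail or "").strip()
        if tail:
            node["children"].append(tail)
    return node

DOC = """<methodCall kind="xuxu">
  <methodName>examples.getStateName</methodName>
  <params>
    <param>
      <value><i4>41</i4></value>
    </param>
  </params>
</methodCall>"""

table = to_table(ET.fromstring(DOC))
print(table["label"], table["xarg"])
```

Whitespace-only text is dropped here, which is one of the choices a real parser would have to make explicit.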

I'm hoping the list can provide feedback on what they would like to
see come out of the parser and in particular what they would like to
see supported.

I will second the call for an event-driven parser like SAX versus one
that parses an entire document en masse. This is especially important
for the kinds of scraping/grepping things we often do in NSE.
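
As an illustration of the event-driven style, here is a minimal sketch using Python's standard `xml.sax` module (not an NSE API). The `TextGrabber` class is a hypothetical name; it collects the text of one kind of element as events arrive, without ever building a tree.

```python
import xml.sax

class TextGrabber(xml.sax.ContentHandler):
    """Collect the text of every element with a given tag name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.inside = False   # True while inside a wanted element
        self.buf = []
        self.found = []

    def startElement(self, name, attrs):
        if name == self.wanted:
            self.inside = True
            self.buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.inside:
            self.buf.append(content)

    def endElement(self, name):
        if name == self.wanted and self.inside:
            self.found.append("".join(self.buf))
            self.inside = False

handler = TextGrabber("methodName")
xml.sax.parseString(
    b"<methodCall><methodName>examples.getStateName</methodName></methodCall>",
    handler,
)
print(handler.found)  # ['examples.getStateName']
```

This sketch ignores nested elements of the same name, which is acceptable for grepping but would need a depth counter in general.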

I think the table representation above would be fairly hard to use.
There is an implicit tree structure in the tables but I have to do the
bookkeeping myself; to know the siblings of an element I also have to
know its parent. Forward/back links would be easier than array indexes.
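
A sketch of what I mean by forward/back links, again in Python for illustration only. The `Node` class and `link` function are hypothetical names, and ElementTree is used just to get at the elements.

```python
import xml.etree.ElementTree as ET

class Node:
    """Tree node that knows its parent and its neighbouring siblings."""

    def __init__(self, label, attrs, parent=None):
        self.label = label
        self.attrs = attrs
        self.parent = parent
        self.prev = None      # previous sibling, if any
        self.next = None      # next sibling, if any
        self.children = []

def link(elem, parent=None):
    """Build linked Nodes from an ElementTree element."""
    node = Node(elem.tag, dict(elem.attrib), parent)
    prev = None
    for child in elem:
        cur = link(child, node)
        cur.prev = prev
        if prev is not None:
            prev.next = cur
        node.children.append(cur)
        prev = cur
    return node

root = link(ET.fromstring(
    "<param><name>configuration</name><value>on</value></param>"))
name = root.children[0]
print(name.next.label)    # value (reached without consulting the parent)
print(name.parent.label)  # param
```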

Relying on well-formedness for XML is a slight handicap. For better or
worse, the XML specification requires processors to halt when they
encounter a well-formedness error (search for "XML draconian"), which is
why you get a useless yellow error page instead of graceful degradation
when someone makes a mistake in an XHTML document as opposed to plain HTML.
But: requiring well-formedness means that the same library can't be used
to parse HTML.
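
The difference is easy to see with Python's standard library, used here only as a stand-in for any strict versus lenient parser. The `strict_parse` and `TagLogger` names are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

BROKEN = "<p>unclosed <b>bold</p>"   # <b> is never closed

def strict_parse(markup):
    """Return 'ok' if the markup is well-formed XML, else 'halted'."""
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError:
        return "halted"

class TagLogger(HTMLParser):
    """Lenient parser that just records the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

logger = TagLogger()
logger.feed(BROKEN)
logger.close()

print(strict_parse(BROKEN))  # halted
print(logger.tags)           # ['p', 'b']
```

The XML parser gives up at the mismatched tag, while the HTML parser keeps reporting what it can.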

The nicest XML/HTML parsing API I have used is called Beautiful Soup. To
be equivalent to it is probably way too ambitious for an NSE parser. It
makes it easy, for example, to find an HTML element with a given id,
then iterate over its children.

We need to answer the question: what will people want to accomplish
with this library? Unfortunately, both SAX and DOM are far less than
ideal for processing XML. They seem to be biased toward ease of
implementation rather than ease of use. Here are a few simple use cases:

1. Given an element, find its parent element.
2. Given an element, iterate over its children.
3. Given an element, print all its attributes.
4. Iterate over all <a> elements with a href attribute ending in ".html".
5. Find a <name> element with text contents of "configuration", and
   return the contents of its sibling <value> element (both are children
   of a <param> element).
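
For comparison, here is roughly how the five use cases look with Python's ElementTree. This is only a sketch of the kind of API convenience I have in mind, not a proposal for the NSE interface; the sample document, the `parent_of` map, and the `sibling_value` helper are all made up for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<root>
  <a href="index.html">home</a>
  <a href="logo.png">logo</a>
  <a>no link</a>
  <param><name>timeout</name><value>30</value></param>
  <param><name>configuration</name><value>strict</value></param>
</root>"""

root = ET.fromstring(DOC)
first_a = root.find("a")

# 1. Parent lookup. ElementTree keeps no parent pointers, so build a map.
parent_of = {child: parent for parent in root.iter() for child in parent}
parent = parent_of[first_a]

# 2. Iterate over an element's children.
child_tags = [child.tag for child in root]

# 3. All attributes of an element.
attrs = dict(first_a.attrib)

# 4. All <a> elements with an href attribute ending in ".html".
html_links = [a.get("href") for a in root.iter("a")
              if a.get("href", "").endswith(".html")]

# 5. <value> sibling of the <name> whose text matches.
def sibling_value(tree, wanted):
    for param in tree.iter("param"):
        name = param.find("name")
        if name is not None and name.text == wanted:
            value = param.find("value")
            return None if value is None else value.text
    return None

print(parent.tag, child_tags, attrs, html_links)
print(sibling_value(root, "configuration"))  # strict
```

Note that even here case 1 needs extra bookkeeping, which is the kind of thing I would like the library to do for me.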

David Fifield