Nmap Development mailing list archives
Re: [NSE] XML Parser RFC
From: David Fifield <david () bamsoftware com>
Date: Fri, 1 Jul 2011 17:28:10 -0700
On Wed, Jun 29, 2011 at 04:32:07PM -0400, Patrick Donnelly wrote:
> Hi list,
>
> I'm working on an XML parser that should be done pretty soon in lpeg
> [1] for NSE. I'm admittedly not an XML savant and rarely use it myself
> so I'm relying heavily on the standards and their description of the
> syntax which fortunately maps easily to lpeg as a grammar.
>
> Most parsers I've seen just decompose the XML file into (nested)
> tables using basic pattern matching. The new lpeg module would require
> XML well-formedness (at least to the degree of syntax, not necessarily
> semantics; e.g. a closing tag may not need to match an opening tag).
> I'm thinking of just extracting the information into tables
> (similarly). An example of a simple XML parser [2] given:
>
> <methodCall kind="xuxu">
>   <methodName>examples.getStateName</methodName>
>   <params>
>     <param>
>       <value><i4>41</i4></value>
>     </param>
>   </params>
> </methodCall>
>
> would produce this table (printed for humans):
>
> [1] => table
>   (
>     [1] => table
>       (
>         [1] => examples.getStateName
>         [xarg] => table ( )
>         [label] => methodName
>       )
>     [2] => table
>       (
>         [1] => table
>           (
>             [1] => table
>               (
>                 [1] => table
>                   (
>                     [1] => 41
>                     [xarg] => table ( )
>                     [label] => i4
>                   )
>                 [xarg] => table ( )
>                 [label] => value
>               )
>             [xarg] => table ( )
>             [label] => param
>           )
>         [xarg] => table ( )
>         [label] => params
>       )
>     [xarg] => table
>       (
>         [kind] => xuxu
>       )
>     [label] => methodCall
>   )
>
> I'm planning to have similar output but also account for all of XML's
> various oddities. I'm hoping the list can provide feedback on what
> they would like to see come out of the parser and in particular what
> they would like to see supported.
I will second the call for an event-driven parser like SAX versus one that parses an entire document en masse. This is especially important for the kinds of scraping/grepping things we often do in NSE.

I think the table representation above would be fairly hard to use. There is an implicit tree structure in the tables, but I have to do the bookkeeping myself; to know the siblings of an element I also have to know its parent. Forward/back links would be easier than array indexes.

Relying on well-formedness for XML is a slight handicap. For better or worse, the XML specification requires processors to halt when they encounter a well-formedness error (search for "XML draconian"), which is why you get a useless yellow error page instead of graceful degradation when someone makes a mistake in an XHTML document as opposed to plain HTML. But requiring well-formedness means that the same library can't be used to parse HTML.

The nicest XML/HTML parsing API I have used is called Beautiful Soup. To be equivalent to it is probably way too ambitious for an NSE parser. It makes it easy, for example, to find an HTML element with a given id, then iterate over its children.

We need to answer the question: what will people want to accomplish with this library? Unfortunately, both SAX and DOM are far less than ideal for processing XML. They seem to be biased toward ease of implementation rather than ease of use. Here are a few simple use cases:

1. Given an element, find its parent element.
2. Given an element, iterate over its children.
3. Given an element, print all its attributes.
4. Iterate over all <a> elements with a href attribute ending in ".html".
5. Find a <name> element with text contents of "configuration", and return the contents of its sibling <value> element (both are children of a <param> element).
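
To make the five use cases above concrete, here is a rough sketch in Python, using the standard library's xml.etree.ElementTree rather than anything that exists in NSE. The sample document and the helper names (build_parent_map, links_ending_in, sibling_value) are made up for illustration only. ElementTree elements carry no parent pointer, so the sketch has to build the back links itself, which is exactly the kind of bookkeeping a friendlier API would hide.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
DOC = """<root>
  <links>
    <a href="index.html">home</a>
    <a href="logo.png">logo</a>
    <a>no href</a>
  </links>
  <param>
    <name>configuration</name>
    <value>enabled</value>
  </param>
</root>"""


def build_parent_map(root):
    """Map every element to its parent (the 'back links')."""
    return {child: parent for parent in root.iter() for child in parent}


def links_ending_in(root, suffix):
    """Use case 4: all <a> elements whose href ends with the suffix."""
    return [a for a in root.iter("a") if a.get("href", "").endswith(suffix)]


def sibling_value(root, parents, wanted):
    """Use case 5: text of the <value> sibling of a matching <name>."""
    for name in root.iter("name"):
        if (name.text or "").strip() != wanted:
            continue
        parent = parents.get(name)
        if parent is not None and parent.tag == "param":
            value = parent.find("value")
            if value is not None:
                return value.text
    return None


root = ET.fromstring(DOC)
parents = build_parent_map(root)
first_link = root.find("links/a")

# Use case 1: find the parent of an element.
link_parent = parents.get(first_link)

# Use case 2: iterate over the children of an element.
param_children = [child.tag for child in root.find("param")]

# Use case 3: collect all attributes of an element.
link_attributes = dict(first_link.attrib)

# Use cases 4 and 5, via the helpers above.
html_links = [a.get("href") for a in links_ending_in(root, ".html")]
config_value = sibling_value(root, parents, "configuration")
```

Use cases 2 through 4 come almost for free from a tree API; 1 and 5 need the parent map, so an interface with built-in parent and sibling links would make them one-liners as well.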
David Fifield
_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/