Nmap Development mailing list archives
Re: [NSE] Sketch for XML/HTML parsing API
From: David Fifield <david () bamsoftware com>
Date: Wed, 1 Feb 2012 15:28:43 -0800
On Thu, Jan 19, 2012 at 12:08:53PM +0200, Lauri Kokkonen wrote:
> Hi,
>
> First off, I am a student inspired by the possible GSoC money opportunity :P
>
> I have come up with a sketch for an XML/HTML parsing API. The idea is to have a method next() that returns the next bit of XML (start tag, attribute name, etc.) from the input string. Along with next() there is state information for keeping track of whether we are inside a tag or between tags (basically). Then we could build a set of useful methods around the core. For example, find_start_tag() could find the next occurrence of the given start tag, and parse_attributes() could return a set of attributes given that we are currently inside a tag.
>
> If needed, it should be possible to extend the interface with a SAX-style facility or even add DOM-like features such as parsing a subtree into a data structure (as was sketched in another related thread on this list [1]).
I think you're right that the XML parser should exist in at least two layers. A low layer like you have described, which ideally uses a constant amount of memory (or perhaps linear in nesting depth). And then a higher layer (or two) that allows things like finding elements with a given name and building tables of element attributes.

Where does the XML parser draw its input from? It would not be okay, for example, to require loading the whole document into memory before parsing. We are typically reading data in chunks from a socket, so it would be really nice if it were possible to feed variable-sized chunks of data into the XML parser, and have it report some error code when it doesn't have enough data to continue. For example,

    repeat
      status, data = socket:receive_bytes(1024)
      if status then
        xp:feed(data)
      elseif data == "EOF" then
        xp:finish()  -- no more input ("end" is a reserved word in Lua)
      else
        error("socket error")
      end
      while xp:next() do
        ...
      end
      -- Out of input.
    until not status and data == "EOF"

It is obviously a bit cumbersome to interleave reading and processing in this way, so we would make some kind of wrapper coroutine that constructs this loop and only exposes a next method.
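A minimal sketch of the feed-and-pull idea, in plain Lua. Everything in it is an assumption made for illustration: the names XmlPull, feed, finish, next, and tokens are hypothetical and not part of any existing NSE library (finish stands in for an end-of-input call, since "end" is a reserved word in Lua). It only recognizes start tags, end tags, and text. Comments, CDATA, and processing instructions are not handled, and a text run may be split across tokens at chunk boundaries.

```lua
-- Hypothetical incremental tokenizer; not an existing NSE API.
local XmlPull = {}
XmlPull.__index = XmlPull

function XmlPull.new()
  return setmetatable({ buf = "", eof = false }, XmlPull)
end

-- Append a chunk of input of any size.
function XmlPull:feed(data) self.buf = self.buf .. data end

-- Signal that no more input will arrive.
function XmlPull:finish() self.eof = true end

-- Return the next token, or nil plus "more" (feed more data) or "eof" (done).
function XmlPull:next()
  local buf = self.buf
  local starved = self.eof and "eof" or "more"
  if buf == "" then return nil, starved end
  if buf:sub(1, 1) ~= "<" then
    -- Emit whatever text is available up to the next tag.
    local lt = buf:find("<", 1, true)
    self.buf = lt and buf:sub(lt) or ""
    return { type = "text", data = lt and buf:sub(1, lt - 1) or buf }
  end
  local close = buf:find(">", 1, true)
  if not close then return nil, starved end  -- incomplete tag; dropped at eof
  local body = buf:sub(2, close - 1)
  self.buf = buf:sub(close + 1)
  local name = body:match("^/%s*([%w_:.%-]+)")
  if name then return { type = "end", name = name } end
  name = body:match("^([%w_:.%-]+)")
  if name then
    return { type = "start", name = name, rest = body:sub(#name + 1) }
  end
  return { type = "other", data = body }
end

-- Wrapper coroutine: hides the feed/next loop behind a plain iterator.
-- read_chunk is any function returning the next chunk, or nil at end of input.
local function tokens(read_chunk)
  return coroutine.wrap(function()
    local xp = XmlPull.new()
    while true do
      local tok, why = xp:next()
      if tok then
        coroutine.yield(tok)
      elseif why == "eof" then
        return
      else
        local chunk = read_chunk()
        if chunk then xp:feed(chunk) else xp:finish() end
      end
    end
  end)
end

-- Usage: the document arrives in three arbitrary pieces.
local chunks = { "<a hr", "ef='x'>hi</", "a>" }
local i = 0
local function read_chunk() i = i + 1; return chunks[i] end

local seen = {}
for tok in tokens(read_chunk) do
  seen[#seen + 1] = tok.type .. ":" .. (tok.name or tok.data)
end
```

In this sketch the buffer holds at most one unfinished tag plus the latest chunk, because text is emitted as soon as it is available and complete tags are consumed before more data is requested.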
> Something like the following would be useful for httpspider.lua:
>
>     while x:find_start_tag({"a","img","script"}) do
>       a = x:parse_attributes()
>       if a["href"] then ... end
>       if a["src"] then ... end
>     end
>
> or maybe:
>
>     while x:find_attribute({"href","src"}) do
>       url = x:next().data
>       ...
>     end
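To make the first loop concrete, here is a rough sketch of how find_start_tag() and parse_attributes() might behave. It is an assumption-laden illustration, not the proposed implementation: the Cursor class is hypothetical, it works on a whole string in memory rather than a stream, it uses Lua patterns instead of a real grammar, it only handles quoted attribute values, and it breaks on attribute values that contain ">".

```lua
-- Hypothetical cursor over an in-memory document; not an existing NSE API.
local Cursor = {}
Cursor.__index = Cursor

function Cursor.new(doc)
  return setmetatable({ doc = doc, pos = 1, tag = nil }, Cursor)
end

-- Advance to the next start tag whose name is in `names`.
-- Returns the tag name, or nil when the input is exhausted.
function Cursor:find_start_tag(names)
  local wanted = {}
  for _, n in ipairs(names) do wanted[n:lower()] = true end
  while true do
    local s, e, name, rest =
      self.doc:find("<([%a_][%w_:.%-]*)([^>]*)>", self.pos)
    if not s then return nil end
    self.pos = e + 1
    if wanted[name:lower()] then
      self.tag = rest  -- raw attribute text of the tag we are "inside"
      return name
    end
  end
end

-- Parse the attributes of the tag last found by find_start_tag.
function Cursor:parse_attributes()
  local attrs = {}
  local pattern = "([%w_:.%-]+)%s*=%s*([\"'])(.-)%2"
  for key, _, value in (self.tag or ""):gmatch(pattern) do
    attrs[key:lower()] = value
  end
  return attrs
end

-- Usage, following the loop quoted above.
local x = Cursor.new(
  [[<p><a href="/index.html">home</a> <img src="logo.png"/> <script src='app.js'></script></p>]])
local urls = {}
while x:find_start_tag({ "a", "img", "script" }) do
  local a = x:parse_attributes()
  if a["href"] then urls[#urls + 1] = a["href"] end
  if a["src"] then urls[#urls + 1] = a["src"] end
end
```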
Okay, those are believable. How would you write these test cases I outlined before?

1. Given an element, find its parent element.
2. Given an element, iterate over its children.
3. Given an element, print all its attributes.
4. Iterate over all <a> elements with a href attribute ending in ".html".
5. Find a <name> element with text contents of "configuration", and return the contents of its sibling <value> element (both are children of a <param> element).

They don't all have to be one-liners, and it's acceptable to build a small DOM tree when necessary.

You should also think about/demonstrate how these scripts would be written using an XML parser:

    servicetags (simple iteration of elements and attributes)
    voldemort-info (element bodies; need to know parent)

Parsing HTML is almost a separate question. Using an XML parser will work for the tiny fraction of HTML documents that also happen to be XML. But even leaving aside the question of broken or nonstandard HTML, even 100% correct HTML doesn't have to look like XML; for example, you don't have to quote all attribute values in HTML. I almost think we should ignore HTML parsing for now and only do XML parsing, which is easier. An HTML parser having the same programmer interface would be very nice.
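One possible way to approach the five test cases is to build a small DOM-like tree with parent pointers. The sketch below is only an illustration under stated assumptions: parse and walk are hypothetical helpers, the sample document is made up, the input is well-formed XML held in one string, and comments, CDATA, and processing instructions are not recognized.

```lua
-- Build a tree of { name, attrs, children, parent, text } nodes.
local function parse(doc)
  local root = { name = "#root", attrs = {}, children = {}, text = "" }
  local cur, pos = root, 1
  while true do
    local s, e, slash, name, rest =
      doc:find("<(/?)([%a_][%w_:.%-]*)([^>]*)>", pos)
    if not s then break end
    cur.text = cur.text .. doc:sub(pos, s - 1)  -- text before this tag
    pos = e + 1
    if slash == "/" then
      cur = cur.parent or root                  -- end tag: go back up
    else
      local node = { name = name, attrs = {}, children = {},
                     parent = cur, text = "" }
      for k, _, v in rest:gmatch("([%w_:.%-]+)%s*=%s*([\"'])(.-)%2") do
        node.attrs[k] = v
      end
      cur.children[#cur.children + 1] = node
      if rest:sub(-1) ~= "/" then cur = node end  -- not self-closing: descend
    end
  end
  return root
end

-- Depth-first visit of every element below `node`.
local function walk(node, visit)
  for _, child in ipairs(node.children) do
    visit(child)
    walk(child, visit)
  end
end

local doc = [[
<config>
  <a href="index.html">one</a>
  <a href="logo.png">two</a>
  <param><name>timeout</name><value>30</value></param>
  <param><name>configuration</name><value>prod</value></param>
</config>]]

local root = parse(doc)
local links, answer = {}, nil
walk(root, function(el)
  -- Case 4: <a> elements whose href ends in ".html".
  if el.name == "a" and (el.attrs.href or ""):match("%.html$") then
    links[#links + 1] = el.attrs.href
  end
  -- Case 5: <name>configuration</name>, then its sibling <value>.
  if el.name == "name" and el.text == "configuration"
     and el.parent.name == "param" then           -- case 1: parent
    for _, sib in ipairs(el.parent.children) do   -- case 2: children
      if sib.name == "value" then answer = sib.text end
    end
  end
end)

-- Case 3: print all attributes of the first <a> element.
local first_a = root.children[1].children[1]
for k, v in pairs(first_a.attrs) do print(k, v) end
```

The tree costs memory proportional to the document, so a design along these lines would probably build it only for the subtree a script actually needs.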
> One option is to implement this completely in Lua, maybe with the help of LPeg. Another option is to use a combination of C/C++ and Lua. Is XML parsing needed elsewhere in Nmap? Looking at a few scripts that parse XML/HTML files, I think that libraries like expat and libxml2, at least, are overkill for the purpose. For reference, that approach was suggested in threads [2] and [3].
I think I prefer that it be written all in Lua. LPeg is a fine way to do it.

David Fifield
_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/
Current thread:
- [NSE] Sketch for XML/HTML parsing API Lauri Kokkonen (Jan 19)
- Re: [NSE] Sketch for XML/HTML parsing API David Fifield (Feb 01)
- Re: [NSE] Sketch for XML/HTML parsing API Lauri Kokkonen (Feb 06)
- Re: [NSE] Sketch for XML/HTML parsing API David Fifield (Feb 01)