Nmap Development mailing list archives

Re: [NSE] Sketch for XML/HTML parsing API


From: David Fifield <david () bamsoftware com>
Date: Wed, 1 Feb 2012 15:28:43 -0800

On Thu, Jan 19, 2012 at 12:08:53PM +0200, Lauri Kokkonen wrote:
Hi,

First off, I am a student inspired by the possible GSOC money opportunity :P

I have come up with a sketch for XML/HTML parsing API. The idea is to have a
method next() that returns the next bit of XML (start tag, attribute name,
etc) from the input string. Along with next() there is state information for
keeping track of whether we are inside a tag or between tags (basically).

Then we could build a set of useful methods around the core. For example,
find_start_tag() could find the next occurrence of the given start tag and
parse_attributes() could return a set of attributes given that we are
currently inside a tag. If needed, it should be possible to extend the
interface with a SAX-style facility or even add DOM-like features such as
parsing a subtree into a data structure (as was sketched in another
related thread on this list [1]).

I think you're right that the XML parser should exist in at least two
layers: a low layer like you have described, which ideally uses a
constant amount of memory (or perhaps memory linear in nesting depth),
and then a higher layer (or two) that allows things like finding
elements with a given name and building tables of element attributes.
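As a rough illustration of the low layer (the field and event names here
are pure guesses, not a proposed API), next() might return one small
token record per call, so memory use doesn't depend on document size:

  -- Hypothetical token stream for: <param name="configuration"/>
  token = xp:next()   --> { type = "start_tag", name = "param" }
  token = xp:next()   --> { type = "attribute", name = "name", value = "configuration" }
  token = xp:next()   --> { type = "tag_end_empty" }
  token = xp:next()   --> nil (out of input)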

Where does the XML parser draw its input from? It would not be okay, for
example, to require loading the whole document into memory before
parsing. We are typically reading data in chunks from a socket, so it
would be really nice if it were possible to feed variable-sized chunks
of data into the XML parser, and have it report some error code when it
doesn't have enough data to continue. For example,
        repeat
                local status, data = socket:receive_bytes(1024)
                if not status and data == "EOF" then
                        -- No more input; tell the parser to finish.
                        xp:finish()
                elseif not status then
                        error("socket error")
                else
                        xp:feed(data)
                end
                while xp:next() do
                        ...
                end
                -- Out of input; wait for more.
        until not status and data == "EOF"
It is obviously a bit cumbersome to interlace reading and processing
in this way, so we would make some kind of wrapper coroutine that
constructs this loop and only exposes a next method.
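A rough sketch of such a wrapper, assuming the hypothetical
feed()/finish()/next() methods used above (the names are placeholders,
not a settled API):

  -- Wrapper coroutine: hides the read-and-feed loop and yields one
  -- parsed token at a time. xp is the streaming parser object.
  local function xml_tokens(socket, xp)
    return coroutine.wrap(function ()
      repeat
        local status, data = socket:receive_bytes(1024)
        if not status and data == "EOF" then
          xp:finish()
        elseif not status then
          error("socket error")
        else
          xp:feed(data)
        end
        local token = xp:next()
        while token do
          coroutine.yield(token)
          token = xp:next()
        end
      until not status and data == "EOF"
    end)
  end

  -- A script then only sees a plain iterator:
  -- for token in xml_tokens(socket, xp) do ... end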

Something like the following would be useful for httpspider.lua:

  while x:find_start_tag({"a","img","script"}) do
    a = x:parse_attributes()
    if a["href"] then ... end
    if a["src"] then ... end
  end

or maybe:

  while x:find_attribute({"href","src"}) do
    url = x:next().data
    ...
  end

Okay, those are believable. How would you write these test cases I
outlined before?
1. Given an element, find its parent element.
2. Given an element, iterate over its children.
3. Given an element, print all its attributes.
4. Iterate over all <a> elements with a href attribute ending in ".html".
5. Find a <name> element with text contents of "configuration", and
   return the contents of its sibling <value> element (both are children
   of a <param> element).
They don't all have to be one-liners, and it's acceptable to build a
small DOM tree when necessary.
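For instance, case 4 might look roughly like this with the
find_start_tag()/parse_attributes() methods you sketched (the names are
still hypothetical):

  -- Case 4: all <a> elements whose href ends in ".html".
  while x:find_start_tag({"a"}) do
    local attrs = x:parse_attributes()
    if attrs["href"] and attrs["href"]:match("%.html$") then
      -- found a matching link
    end
  end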

You should also think about/demonstrate how these scripts would be
written using an XML parser:
        servicetags     (simple iteration of elements and attributes)
        voldemort-info  (element bodies; need to know parent)

Parsing HTML is almost a separate question. Using an XML parser will
work for the tiny fraction of HTML documents that also happen to be XML.
Even leaving aside the question of broken or nonstandard HTML, 100%
correct HTML doesn't have to look like XML; for example, you don't have
to quote all attribute values in HTML (<td align=center> is valid HTML
but not well-formed XML). I almost think we should ignore HTML parsing
for now and only do XML parsing, which is easier. An HTML parser with
the same programmer interface would be very nice.

One option is to implement this completely in Lua, maybe with the help of
LPeg. Another option is to use a combination of C/C++ and Lua. Is XML
parsing needed elsewhere in Nmap? Looking at a few scripts that parse
XML/HTML files, I think that at least libraries like expat and libxml2 are
overkill for the purpose. For reference, that approach was suggested in
threads [2] and [3].

I think I prefer that it be written all in Lua. LPeg is a fine way to do
it.
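To give an idea of the direction (just a sketch in plain Lua with the
lpeg module, not a proposed grammar), a few lines of LPeg are enough to
tokenize something like an XML start tag name:

  local lpeg = require "lpeg"
  local P, R, S, C = lpeg.P, lpeg.R, lpeg.S, lpeg.C

  -- Very rough: an XML name and the opening of a start tag.
  local namestart = R("az", "AZ") + S("_:")
  local namechar = namestart + R("09") + S(".-")
  local name = C(namestart * namechar^0)
  local start_tag_open = P("<") * name

  print(start_tag_open:match("<param name='configuration'>"))  --> param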

David Fifield
_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/

