Nmap Development mailing list archives

Re: [NSE] Sketch for XML/HTML parsing API


From: Lauri Kokkonen <lauri.u.kokkonen () gmail com>
Date: Mon, 6 Feb 2012 14:09:22 +0200

On Wed, Feb 01, 2012 at 03:28:43PM -0800, David Fifield wrote:
On Thu, Jan 19, 2012 at 12:08:53PM +0200, Lauri Kokkonen wrote:
I have come up with a sketch for XML/HTML parsing API. The idea is to have a
method next() that returns the next bit of XML (start tag, attribute name,
etc) from the input string. Along with next() there is state information for
keeping track whether we are inside a tag or between tags (basically).

Then we could build a set of useful methods around the core. For example,
find_start_tag() could find the next occurrence of the given start tag and
parse_attributes() could return a set of attributes given that we are
currently inside a tag. If needed it should be possible to extend the
interface with a SAX-style facility or even add DOM-like features such as
parsing a subtree into a data structure (like it was sketched in another
related thread on this list [1]).
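A very rough, self-contained sketch of what the next() core could look like (the token kinds and field names here are placeholders, not a settled API, and attribute tokens are left out for brevity):

```lua
-- Minimal pull-style tokenizer sketch. next() returns the next bit of
-- XML as { kind = ..., data = ... }, or nil when out of input.
local XMLParser = {}
XMLParser.__index = XMLParser

function XMLParser.new(s)
  return setmetatable({ buf = s, pos = 1 }, XMLParser)
end

function XMLParser:next()
  local buf, pos = self.buf, self.pos
  if pos > #buf then return nil end
  if buf:sub(pos, pos) == "<" then
    -- Capture the tag name and the position just past the closing ">".
    local tag, after = buf:match("^<(/?[%w:%-]+)[^>]*>()", pos)
    if not tag then return nil end -- incomplete tag; wait for more input
    self.pos = after
    if tag:sub(1, 1) == "/" then
      return { kind = "end-tag", data = tag:sub(2) }
    else
      return { kind = "start-tag", data = tag }
    end
  else
    -- Everything up to the next "<" is text content.
    local text, after = buf:match("^([^<]+)()", pos)
    self.pos = after
    return { kind = "text", data = text }
  end
end
```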

I think you're right that the XML parser should exist in at least two
layers. A low layer like you have described, which ideally uses a
constant amount of memory (or perhaps linear in nesting depth). And then
a higher layer (or two) that allows things like finding elements with a
given name and building tables of element attributes.

After looking at voldemort-info, for instance, I'm starting to see that
indeed a method like parse() that creates a tree structure of a subtree
would be convenient. This tree structure could be an instance of a class
that has methods for finding elements and getting elements and parents.
This would allow the creation of small trees of only those parts of the
document that are needed.
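For illustration, such a subtree could be represented as nested tables with parent links (the node shape here is just one possibility), built from the token stream:

```lua
-- Hypothetical node shape: { name = ..., parent = ..., children = {...} };
-- text content is stored as plain strings among the children.
local function build_tree(tokens)
  local root = { name = "#root", parent = nil, children = {} }
  local cur = root
  for _, t in ipairs(tokens) do
    if t.kind == "start-tag" then
      local node = { name = t.data, parent = cur, children = {} }
      table.insert(cur.children, node)
      cur = node
    elseif t.kind == "end-tag" then
      cur = cur.parent -- pop back to the enclosing element
    else -- text
      table.insert(cur.children, t.data)
    end
  end
  return root
end
```

Methods like findElement() and getParent() would then just be thin wrappers over this structure.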

Where does the XML parser draw its input from? It would not be okay, for
example, to require loading the whole document into memory before
parsing. We are typically reading data in chunks from a socket, so it
would be really nice if it were possible to feed variable-sized chunks
of data into the XML parser, and have it report some error code when it
doesn't have enough data to continue. For example,
      repeat
              status, data = socket:receive_bytes(1024)
              if not status and data == "EOF" then
                      xp:finish() -- "end" is a reserved word in Lua
              elseif not status then
                      error("socket error")
              else
                      xp:feed(data)
                      while xp:next() do
                              ...
                      end
                      -- Out of input.
              end
      until not status and data == "EOF"
It is obviously a bit cumbersome to interleave reading and processing
in this way, so we would make some kind of wrapper coroutine that
constructs this loop and only exposes a next method.

Yes, sounds like a good idea.
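For illustration, such a wrapper might look roughly like this (xp:feed(), xp:next(), and socket:receive_bytes() are the hypothetical names from the loop above):

```lua
-- Sketch of the wrapper coroutine: it hides the feed loop and exposes
-- a single iterator function that yields one token per call.
local function token_iterator(socket, xp)
  return coroutine.wrap(function ()
    while true do
      local status, data = socket:receive_bytes(1024)
      if not status then
        if data ~= "EOF" then error("socket error") end
        return nil -- end of input; the iterator returns nil
      end
      xp:feed(data)
      local t = xp:next()
      while t do
        coroutine.yield(t)
        t = xp:next()
      end
      -- Out of buffered input; loop back and read more from the socket.
    end
  end)
end
```

A script would then simply write "for token in token_iterator(socket, xp) do ... end".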

Something like the following would be useful for httpspider.lua:

  while x:find_start_tag({"a","img","script"}) do
    a = x:parse_attributes()
    if a["href"] then ... end
    if a["src"] then ... end
  end

or maybe:

  while x:find_attribute({"href","src"}) do
    url = x:next().data
    ...
  end

Okay, those are believable. How would you write these test cases I
outlined before?
1. Given an element, find its parent element.
2. Given an element, iterate over its children.
3. Given an element, print all its attributes.
4. Iterate over all <a> elements with a href attribute ending in ".html".
5. Find a <name> element with text contents of "configuration", and
   return the contents of its sibling <value> element (both are children
   of a <param> element).
They don't all have to be one-liners, and it's acceptable to build a
small DOM tree when necessary.

You should also think about/demonstrate how these scripts would be
written using an XML parser:
      servicetags     (simple iteration of elements and attributes)
      voldemort-info  (element bodies; need to know parent)

(5): Let's assume that <param> elements are children of <tree>.
  xp:find_start_tag({"tree"})
  tree = xp:parse("tree")
  value = tree:findElement("name","configuration"):getParent():getElement("value")
where tree is an instance of XMLTree. A similar technique can be used for
voldemort-info and (1).

(2):
  xp:find_start_tag(element)
  while xp:is_next() and not xp:is_next("end-tag", element) do
    c = xp:next()
    ...
  end
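For comparison, (4) could presumably be written with the find_start_tag() and parse_attributes() methods sketched earlier (a guess at usage, not a tested implementation):

```lua
-- (4): iterate over all <a> elements whose href ends in ".html".
while xp:find_start_tag({"a"}) do
  local a = xp:parse_attributes()
  if a["href"] and a["href"]:match("%.html$") then
    ...
  end
end
```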

Parsing HTML is almost a separate question. Using an XML parser will
work for the tiny fraction of HTML documents that also happen to be XML.
But even leaving aside broken or nonstandard HTML, perfectly valid HTML
doesn't have to look like XML; for example, you don't have to quote all
attribute values. I almost think we should ignore HTML parsing for now
and only do XML parsing, which is easier. An HTML parser with the same
programmer interface would be very nice.

Ah, so it seems. I researched this a bit further and found this:
* HTML allows attributes without values
* <script> and <style> elements are handled like CDATA in HTML

These alone are enough to create dangerous special cases for the XML
parser, so I agree that the next()-level parsers have to be separated.


I will come back later with bare-bones Lua classes (XMLParser and XMLTree,
I think), presenting an initial version of the API in the hope of getting
comments on it.

Lauri
_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/

