Nmap Development mailing list archives

Re: Web crawling library proposal


From: Patrik Karlsson <patrik () cqure net>
Date: Wed, 30 Nov 2011 21:23:01 +0100

On Sat, Nov 5, 2011 at 6:08 PM, David Fifield <david () bamsoftware com> wrote:

On Wed, Oct 19, 2011 at 12:25:19AM -0700, Paulino Calderon wrote:
Hi list,

I'm attaching my working copies of the web crawling library and a
few scripts that use it. It would be great if I could get some
feedback.

All the documentation is here:
https://secwiki.org/w/Nmap/Spidering_Library

I'm including 3 scripts using the library:
* http-sitemap - Returns a list of URIs found. (Useful for target
enumeration.)
* http-phpselfxss-scan - Returns a list of PHP files vulnerable to
cross-site scripting via injection into the $_SERVER["PHP_SELF"]
variable.
* http-email-harvest - Returns a list of email addresses found on
the web server.

NSE scripts would start a crawling process and then get a list of
URIs to be processed as the programmer wishes. For example, if we
wanted to write a script to look for backup files, we could simply
do:

  httpspider.crawl(host, port)
  local uris = httpspider.get_sitemap()
  for _, uri in pairs(uris) do
    local obj = http.get(uri .. ".bak")
    if page_exists(obj and other params...) then
      results[#results+1] = uri
    end
  end

I'll repeat others' sentiments that the crawler needs to work
incrementally, rather than building a whole site map in advance. In
other words, the code you wrote above should look like this instead:

for obj in httpspider.crawl(host, port) do
 if page_exists(obj) then
   results[#results+1] = obj.uri
 end
end

I'm attaching two scripts that show what I think spidering scripts
should look like. The first is a simple port of http-email-harvest. The
second, http-find, shows one reason why the spidering has to be
incremental: the script needs to stop after finding the first page that
matches a pattern.
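
Roughly, the loop in http-find looks like this (the field names
response.url and response.body and the pattern variable are
illustrative placeholders; the attached script is the authoritative
version):

  for response in httpspider.crawl(host, port) do
    if response.body and response.body:match(pattern) then
      -- stop crawling as soon as the first matching page is found
      return ("Pattern found at %s"):format(response.url)
    end
  end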

In http-email-harvest, I included a proposed object-oriented API sketch.
You would use the crawler like this:

crawler = httpspider.Crawler:new()
crawler:set_timeout(600)
crawler:set_url_filter(url_filter)
for response in crawler:crawl(host, port) do
 ...
end

http-find uses a convenience function httpspider.crawl, which will
create a Crawler with default options and run it.
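
The wrapper itself can be tiny; something along these lines (passing
an options table to Crawler:new is an assumption on my part):

  function httpspider.crawl(host, port, options)
    -- build a Crawler with default options and hand back its iterator
    local crawler = httpspider.Crawler:new(options)
    return crawler:crawl(host, port)
  end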

The way I think this should work is, the crawler should only keep a
short queue of pages fetched in advance (maybe 10 of them). When you
call the crawler iterator, it will just return the first element of the
queue (and signal a condition variable to allow another page to be
fetched), or else block until a page is available. We will rely on the
HTTP cache to prevent downloading a file twice when there is more than
one crawler operating. Multiple crawlers shouldn't need to interact
otherwise.
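
A rough sketch of that producer/consumer arrangement, using NSE's
nmap.condvar and stdnse.new_thread, and assuming the Crawler object
from the attached sketch (MAX_PREFETCH and the fetch_next helper are
made-up names for illustration):

  local nmap = require "nmap"
  local stdnse = require "stdnse"

  local MAX_PREFETCH = 10

  function Crawler:crawl(host, port)
    local queue = {}
    local condvar = nmap.condvar(queue)

    -- worker thread: keep at most MAX_PREFETCH responses fetched ahead
    local function worker()
      while true do
        while #queue >= MAX_PREFETCH do condvar("wait") end
        local response = self:fetch_next(host, port) -- hypothetical
        if not response then break end
        queue[#queue + 1] = response
        condvar("signal")
      end
      queue.done = true
      condvar("signal")
    end
    stdnse.new_thread(worker)

    -- iterator: return the head of the queue, or block until a page
    -- is available
    return function()
      while #queue == 0 and not queue.done do condvar("wait") end
      local response = table.remove(queue, 1)
      condvar("signal") -- let the worker fetch another page
      return response
    end
  end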

The default crawler would use a blacklist or whitelist of extensions,
and a default time limit. But you should be able to set your own options
before beginning the crawl. You should be able to control which URLs get
downloaded by setting a predicate function. And ideally, you should be
able to modify what pages get crawled during the crawl itself. What I
mean is, if you change the predicate function with set_url_filter, the
crawler will remove anything from the queue of downloaded pages that
doesn't match, and continue with the new options. And really great would
be a way to call crawler:stop_recursion_this_page() and have it stop
following links from the page most recently returned.
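
Put together, a script could then shape the crawl like this (the
response.body field and the filter body are just examples):

  local crawler = httpspider.Crawler:new()

  -- predicate deciding which URLs are worth downloading
  crawler:set_url_filter(function(url)
    return not (url:match("%.jpg$") or url:match("%.css$"))
  end)

  for response in crawler:crawl(host, port) do
    if response.body and response.body:match("logout") then
      -- keep crawling, but don't follow links found on this page
      crawler:stop_recursion_this_page()
    end
  end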

In summary, iterative crawling is an essential feature. Interactive
control of a running crawl would be nice, and probably wouldn't be too
hard to add on top of the iterative design.

David Fifield


I was experimenting a little with this today but ran into some problems.
If I run http.get from within a closure, I get the following error message:
"attempt to yield across metamethod/C-call boundary"
The same happens if I simply try to connect a socket. Have you had any
success with this, Paulino? Does anyone else have any ideas on how to get
around this? The attached script can be used to trigger the error.
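
For reference, the pattern is essentially this (a stripped-down guess
at what test-iter-sock.nse does; the attachment is the real test
case):

  local nmap = require "nmap"
  local shortport = require "shortport"

  portrule = shortport.http

  action = function(host, port)
    -- iterator whose body performs a yielding socket operation
    local iter = coroutine.wrap(function()
      local sock = nmap.new_socket()
      -- connect() yields to the NSE engine; doing it from inside this
      -- wrapped coroutine seems to be what triggers the boundary error
      local status, err = sock:connect(host, port)
      coroutine.yield(status, err)
      sock:close()
    end)
    local status, err = iter()
    return ("connect returned: %s %s"):format(tostring(status), tostring(err))
  end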

-- 
Patrik Karlsson
http://www.cqure.net
http://twitter.com/nevdull77

Attachment: test-iter-sock.nse

_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/
