Nmap Development mailing list archives
Re: Web crawling library proposal
From: David Fifield <david () bamsoftware com>
Date: Sat, 5 Nov 2011 10:08:57 -0700
On Wed, Oct 19, 2011 at 12:25:19AM -0700, Paulino Calderon wrote:
> Hi list,
>
> I'm attaching my working copies of the web crawling library and a few
> scripts that use it. It would be great if I could get some feedback.
> All the documentation is here:
> https://secwiki.org/w/Nmap/Spidering_Library
>
> I'm including 3 scripts using the library:
>
> * http-sitemap - Returns a list of URIs found. (Useful for target
>   enumeration.)
> * http-phpselfxss-scan - Returns a list of PHP files vulnerable to
>   Cross Site Scripting via injection into the variable
>   $_SERVER["PHP_SELF"].
> * http-email-harvest - Returns a list of the email accounts found on
>   the web server.
>
> NSE scripts would start a crawling process and then get a list of URIs
> to be processed as the programmer wishes. For example, if we wanted to
> write a script to look for backup files we could simply do:
>
>   httpspider.crawl(host, port)
>   local uris = httpspider.get_sitemap()
>   for _, uri in pairs(uris) do
>     local obj = http.get(uri .. ".bak")
>     if page_exists(obj and other params...) then
>       results[#results+1] = uri
>     end
>   end
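For reference, a minimal self-contained version of that sketch might look like the following; httpspider.crawl and httpspider.get_sitemap are the proposed calls from the quoted message, while the status-code check and the (host, port, path) arguments to http.get are assumptions standing in for the elided page_exists test.

  -- Sketch only: enumerate crawled URIs and probe for ".bak" copies.
  local http = require "http"

  httpspider.crawl(host, port)
  local uris = httpspider.get_sitemap()
  local results = {}
  for _, uri in pairs(uris) do
    -- Standard NSE http.get signature: host, port, path.
    local response = http.get(host, port, uri .. ".bak")
    if response and response.status == 200 then
      results[#results + 1] = uri .. ".bak"
    end
  end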
I'll repeat others' sentiments that the crawler needs to work incrementally, rather than building a whole site map in advance. In other words, the code you wrote above should look like this instead:

  for obj in httpspider.crawl(host, port) do
    if page_exists(obj) then
      results[#results+1] = obj.uri
    end
  end

I'm attaching two scripts that show what I think spidering scripts should look like. The first is a simple port of http-email-harvest. The second, http-find, shows one reason why the spidering has to be incremental: the script needs to stop after finding the first page that matches a pattern.

In http-email-harvest, I included a proposed object-oriented API sketch. You would use the crawler like this:

  crawler = httpspider.Crawler:new()
  crawler:set_timeout(600)
  crawler:set_url_filter(url_filter)
  for response in crawler:crawl(host, port) do
    ...
  end

http-find uses a convenience function, httpspider.crawl, which creates a Crawler with default options and runs it.

The way I think this should work is that the crawler keeps only a short queue of pages fetched in advance (maybe 10 of them). When you call the crawler iterator, it returns the first element of the queue (and sets a condition variable to allow another page to be fetched), or else blocks until a page is available.

We will rely on the HTTP cache to prevent downloading a file twice when more than one crawler is operating; multiple crawlers shouldn't need to interact otherwise.

The default crawler would use a blacklist or whitelist of extensions, and a default time limit, but you should be able to set your own options before beginning the crawl. You should be able to control which URLs get downloaded by setting a predicate function, and ideally you should also be able to modify which pages get crawled during the crawl itself: if you change the predicate function with set_url_filter, the crawler removes anything from the queue of downloaded pages that doesn't match and continues with the new options. It would also be really useful to be able to call crawler:stop_recursion_this_page() and have the crawler stop following links from the page most recently returned.

In summary, iterative crawling is an essential feature. Interactive control of a running crawl would be nice, and probably wouldn't be too hard to add on top of the iterative design.

David Fifield
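To make the bounded-queue and condition-variable idea concrete, here is a rough sketch of how the iterator side of such a crawler might be wired up in NSE Lua. Only nmap.condvar and stdnse.new_thread are existing NSE facilities; the Crawler fields (queue, done) and the fetch_next_page helper are hypothetical placeholders, not part of any httpspider code.

  -- Sketch only: a fetcher thread fills a bounded queue, and the crawl
  -- iterator pops pages from it, blocking on a condition variable when
  -- the queue is empty.
  local nmap   = require "nmap"
  local stdnse = require "stdnse"

  local MAX_QUEUE = 10   -- "a short queue of pages fetched in advance"

  -- Producer: fetch pages and park them in crawler.queue.
  local function fetcher(crawler)
    local condvar = nmap.condvar(crawler.queue)
    while true do
      while #crawler.queue >= MAX_QUEUE do
        condvar("wait")                        -- queue full; wait for the consumer
      end
      local page = crawler:fetch_next_page()   -- hypothetical helper
      if not page then
        break                                  -- nothing left to crawl
      end
      table.insert(crawler.queue, page)
      condvar("signal")                        -- a page is now available
    end
    crawler.done = true
    condvar("broadcast")                       -- wake any consumer still waiting
  end

  -- Consumer side: the iterator a script would use in a generic for loop.
  local function crawl_iterator(crawler)
    local condvar = nmap.condvar(crawler.queue)
    stdnse.new_thread(fetcher, crawler)
    return function ()
      while #crawler.queue == 0 and not crawler.done do
        condvar("wait")                        -- block until a page arrives
      end
      local page = table.remove(crawler.queue, 1)
      condvar("signal")                        -- a slot freed up; resume fetching
      return page                              -- nil ends the for loop
    end
  end

A script in the style of http-find could then consume pages and stop as soon as one matches (page.body, page.uri, and pattern are likewise assumed names):

  for page in crawl_iterator(crawler) do
    if page.body and page.body:match(pattern) then
      return page.uri                          -- stop after the first match
    end
  end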
Attachment:
http-email-harvest.nse
Attachment:
http-find.nse
_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/
Current thread:
- Re: Web crawling library proposal Paulino Calderon (Oct 18)
- Re: Web crawling library proposal Patrick Donnelly (Oct 19)
- Re: Web crawling library proposal Paulino Calderon (Oct 19)
- Re: Web crawling library proposal Patrick Donnelly (Oct 19)
- Re: Web crawling library proposal Paulino Calderon (Oct 19)
- Re: Web crawling library proposal Paulino Calderon (Oct 19)
- Re: Web crawling library proposal Patrick Donnelly (Oct 19)
- Re: Web crawling library proposal Patrik Karlsson (Oct 19)
- Re: Web crawling library proposal Fyodor (Nov 01)
- Re: Web crawling library proposal David Fifield (Nov 05)
- Re: Web crawling library proposal Paulino Calderon (Nov 07)
- Re: Web crawling library proposal Patrik Karlsson (Nov 30)