Nmap Development mailing list archives

Re: Proposed improvements on httpspider.lua


From: George Chatzisofroniou <sophron () latthi com>
Date: Mon, 24 Jun 2013 14:00:18 +0300

On Sun, Jun 23, 2013 at 04:59:42PM -0700, David Fifield wrote:
> Even better, can't you use a design where you provide a callback
> function that is called for each resource? The function returns true if
> spidering should continue or false if not. A script that just wants to
> list all the external resources can make note of them in the callback,
> and return false for them. The default callback would just exclude other
> domains, the way it works now.

I don't think a callback function that returns a single boolean can do the job.

There are two different operations that need approval:

spidering - If this is true, the crawler will return the resource.

scraping - If this is true, the crawler will scrape the resource for any links.

There are cases where we want to spider a link but not scrape it, e.g. an
external resource that should be reported to the script without having its
links followed.

To make this work with a single callback function, the function would have to
return two boolean values (one for each operation). For example, returning
(true, true) would mean that both spidering and scraping are enabled for this
resource.
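
A minimal sketch of what such a callback might look like, assuming a
hypothetical process_url hook and url/base tables with a host field (none of
this is existing httpspider.lua API):

    -- Sketch only: the hook name and the url/base table layout are
    -- assumptions, not part of the current httpspider.lua API.
    local function process_url(url, base)
      -- Always report the resource back to the calling script.
      local spider = true
      -- Only follow links found within the starting host; external
      -- resources get listed but are never scraped for more links.
      local scrape = (url.host == base.host)
      return spider, scrape
    end

Since Lua functions return multiple values natively, no tuple type is needed;
the crawler could consume both decisions in one call:
local spider, scrape = process_url(url, base)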

Alternatively, we could use two different callback functions (one for each
operation), or provide the user with a method like "urlqueue:add(links)" to
scrape the links and add them to the URL queue manually, as sketched below.
But isn't this making things more complicated?
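
For comparison, a script using the manual-queue alternative might look roughly
like this. The crawl() loop follows how existing httpspider scripts consume
results; urlqueue:add() is only the method proposed above, and within_host()
and extract_links() are hypothetical helpers:

    local httpspider = require "httpspider"

    -- Sketch of the manual-queue alternative: the script, not the
    -- library, decides which scraped links enter the queue.
    local crawler = httpspider.Crawler:new(host, port, '/',
      { scriptname = SCRIPT_NAME })
    while true do
      local status, r = crawler:crawl()
      if not status then break end
      if within_host(r.url) then
        crawler.urlqueue:add(extract_links(r.response.body))
      end
    end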

-- 
George Chatzisofroniou
http://sophron.latthi.com


_______________________________________________
Sent through the dev mailing list
http://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
