Nmap Development mailing list archives

Re: Web crawling library proposal


From: Paulino Calderon <paulino () calderonpale com>
Date: Wed, 19 Oct 2011 15:17:18 -0700

On 10/19/2011 12:45 PM, Patrick Donnelly wrote:
On Wed, Oct 19, 2011 at 3:25 AM, Paulino Calderon
<paulino () calderonpale com>  wrote:
Hi list,

I'm attaching my working copies of the web crawling library and a few
scripts that use it. It would be great if I could get some feedback.
For the library itself:

o I'm not convinced a Queue implementation is necessary. I'd prefer
just using table.insert/table.remove until evidence is presented it is
a performance block.
I thought table.insert adds the element at the end and table.remove removes the last element. The purpose of this implementation was to have a FIFO mechanism (the oldest item inserted is removed first). I'll look into whether I can use these standard table methods instead.
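For reference, table.remove takes an optional position argument, so a FIFO can be had with the standard functions alone: table.insert(t, v) appends at the tail and table.remove(t, 1) pops the head. A minimal sketch (the names here are illustrative, not the library's actual API):

    -- FIFO queue using only the standard table functions.
    local queue = {}

    local function enqueue(item)
      table.insert(queue, item)      -- append at the tail
    end

    local function dequeue()
      return table.remove(queue, 1)  -- remove and return the oldest item
    end

    enqueue("http://scanme.nmap.org/")
    enqueue("http://scanme.nmap.org/index.html")
    print(dequeue())  --> http://scanme.nmap.org/

Note that table.remove(queue, 1) shifts every remaining element, so for very large queues a dedicated structure with head/tail indices may still pay off; that is presumably the performance question behind keeping or dropping the Queue implementation.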
o Libraries should not use the registry. Provide an interface to
access private data instead.
The registry was used to keep a record of state between multiple instances of the library. For example, if I run scripts A and B, script B checks whether a crawler is already running by looking for the registry entry first created by script A. I did not find another way of knowing that the library had already been called, to avoid running multiple crawlers. How can I use a private data interface to know when a copy of the library is already running?
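One possible shape for that interface, sketched under the assumption that NSE shares a library's module table across all script threads: keep the state in a table local to the library and expose a single entry point, serialized with nmap.mutex. The names (crawler_state, crawl_once) are made up for illustration and are not the attached library's API:

    local nmap = require "nmap"

    -- Private to the library; not reachable through nmap.registry.
    local crawler_state = { results = nil }
    local state_mutex   = nmap.mutex("httpcrawler-state")  -- illustrative mutex id

    -- Scripts call this instead of checking a registry entry. The first
    -- caller becomes the crawler; later callers block on the mutex and
    -- then reuse its results.
    local function crawl_once(crawl_fn)
      state_mutex "lock"
      if crawler_state.results == nil then
        crawler_state.results = crawl_fn()   -- only the first caller crawls
      end
      state_mutex "done"
      return crawler_state.results
    end

Because the mutex makes later callers wait until the first one finishes, this gives the "only one crawler runs" behavior without any registry entry, and other scripts cannot accidentally clobber the state.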
o is_url_absolute should anchor the pattern search to the beginning of the URI
o Make get_sitemap return an iterator instead of a table of results.
o Does get_sitemap return the URI for every site that's been crawled?
Shouldn't it return what we requested it to crawl? It would appear if
two scripts try to crawl at the same time, bad things happen with the
global queue structures (among other things)
It does return the list of URIs we requested. When two scripts are running at the same time, only one of them runs the crawler and the other one waits for the results. That means that even when running multiple scripts, the web crawler only runs once. The disadvantage is that we can't have different "crawling profiles" for different scripts right now. It's a simple thing to add, but during my tests I found no use for it.
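On the is_url_absolute and get_sitemap points above, a rough sketch of what an anchored check and an iterator-returning get_sitemap could look like; these are illustrative rewrites under assumed signatures, not the attached code (crawled_uris stands in for the library's result list):

    -- Anchor the absolute-URL test to the start of the string so that a
    -- scheme appearing later in the URI does not give a false positive.
    local function is_url_absolute(url)
      return url:match("^%a[%w+.%-]*://") ~= nil
    end

    local crawled_uris = {}  -- hypothetical: filled in by the crawler

    -- Returning an iterator instead of the whole table lets callers start
    -- processing pages without copying the result list around.
    local function get_sitemap()
      return coroutine.wrap(function()
        for _, uri in ipairs(crawled_uris) do
          coroutine.yield(uri)
        end
      end)
    end

    -- Usage from a script:
    --   for uri in crawler.get_sitemap() do ... end

In the real library the yields would presumably come from inside the crawl loop itself, so that results stream out while the crawl is still in progress.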

Did you have a chance to try out the scripts and library? Thank you for the feedback!

Cheers.

--
Paulino Calderón Pale
Web: http://calderonpale.com
Twitter: http://www.twitter.com/paulinocalderon

_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/
