Nmap Development mailing list archives

Re: Web crawling library proposal


From: Fyodor <fyodor@insecure.org>
Date: Fri, 29 Jul 2011 00:15:58 -0700

On Thu, Jul 28, 2011 at 11:38:27AM -0700, Paulino Calderon wrote:
> Hi nmap-dev,
>
> I've created a wiki page where you can find a proposal for the web
> crawling library I started working on.  It'd be awesome if the people
> who have looked into it before shared their thoughts about this draft.
>
> https://secwiki.org/w/Nmap/Spidering_Library

Thanks Paulino.  I think the spidering system will be a great
addition, as indicated by the list of potential scripts on this page.
And this is a good start toward a spidering proposal.

A few notes:

o It would be good to expand the "function list" and "options"
  sections with descriptions/parameters.  As is, they are probably
  useful for you to jog your memory so you don't forget them, but the
  rest of us don't really know what you mean by these one- or two-word
  entries.  It is easy to just write "crawl" for the main function,
  but harder to define its exact API.

o A usage example (e.g. the API calls that a very trivial script would
  make) would be very useful to see what you have in mind and help flesh
  out the details.  I've put a rough sketch of what I mean after these
  notes.

o The idea here seems to be that a script calls crawl(), then the system
  crawls and saves the whole web site, then returns all of that to the
  script after it is done.  This might not scale well to huge web
  sites.  It would probably be better for the scripts to be able to
  obtain one page at a time.  I suppose the spidering system could then
  avoid getting too far ahead of what the script has requested.

o This is a minor point, but the "MT checks sitemap.xml" part of the
  algorithm should probably be an option.  I don't know if it should
  be on or off by default, but people who want a pure crawling
  experience (based only on parsing discovered pages) should be able to
  get that.  If we let the crawler use other mechanisms to find pages,
  there are a lot of directions we could take with that (Google "site:"
  queries, etc.).

o Regarding 'When the "not visited" list is empty we are done.', I
  suppose there will have to be other (optional) criteria too.  For
  example, we tend to use time-based limits for the brute force
  scripts.  The second sketch after these notes shows the sort of
  thing I mean.

Anyway, I'm looking forward to seeing where this goes!

Cheers,
Fyodor
_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/

