Nmap Development mailing list archives

Re: Web crawling library proposal


From: Brendan Coles <bcoles () gmail com>
Date: Sat, 30 Jul 2011 03:40:54 +1000

Hi Paulino,

Nice work. Just briefly:

   - It might be worth elaborating on what the default and configurable
   crawl options are likely to be. For example, I noticed you haven't
   specifically mentioned white/blacklisting of sub-domains, IPs, vhosts,
   and ports beyond the "path blacklist" (a hypothetical sketch of such
   options appears after this list).


   - HTTP headers can sometimes yield additional paths, in particular the
   "Location", "PHP-Warning" and "PHP-Error" headers, amongst others. This
   might be worth looking into.
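
For illustration only, a minimal Lua sketch of the header idea above. It
assumes a response object shaped like the one the existing NSE http library
returns (a "header" table keyed by lowercased header names); the helper name
and the header list are placeholders rather than anything from the draft.

  -- Pull candidate paths out of a few interesting response headers.
  -- "response.header" is assumed to use lowercased names, as in http.lua.
  local function paths_from_headers(response)
    local candidates = {}
    for _, name in ipairs({ "location", "php-warning", "php-error" }) do
      local value = response.header[name]
      if value then
        -- Collect anything in the header value that looks like a path.
        for path in value:gmatch("/[%w%._%-/]+") do
          candidates[#candidates + 1] = path
        end
      end
    end
    return candidates
  end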
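
And a hypothetical sketch of how the scope options from the first point might
be expressed; every field name below is invented purely to make the
suggestion concrete.

  -- Invented names only: one way the crawl scope options could be shaped.
  local options = {
    maxdepth            = 3,
    withinhost          = true,                    -- stay on the starting vhost
    subdomain_whitelist = { "www", "dev" },        -- sub-domains that may be crawled
    ip_blacklist        = { "192.0.2.10" },        -- addresses never to touch
    port_whitelist      = { 80, 443, 8080 },       -- only follow links to these ports
    path_blacklist      = { "/logout", "/admin" }, -- the "path blacklist" from the draft
  }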


b


On Fri, Jul 29, 2011 at 5:15 PM, Fyodor <fyodor () insecure org> wrote:

On Thu, Jul 28, 2011 at 11:38:27AM -0700, Paulino Calderon wrote:
Hi nmap-dev,

I've created a wiki page where you can find a proposal for the web
crawling library I started working on. It'd be awesome if the people
who have looked into it before shared their thoughts on this draft.

https://secwiki.org/w/Nmap/Spidering_Library

Thanks Paulino.  I think the spidering system will be a great
addition, as indicated by the list of potential scripts on this page.
And this is a good start toward a spidering proposal.

A few notes:

o It would be good to expand the "function list" and "options"
 sections with descriptions/parameters.  As they stand, they are
 probably useful for jogging your own memory so you don't forget
 anything, but the rest of us don't really know what you mean by these
 one- or two-word entries.  It is easy to just write "crawl" for the
 main function, but harder to define its exact API.

o A usage example (e.g. the API calls that a very trivial script would
 make) would be very useful to see what you have in mind and help flesh
 out the details (a hypothetical sketch along these lines follows these
 notes).

o The idea here seems to be that a script calls crawl(), the system
 crawls and saves the whole web site, and then returns all of that to
 the script once it is done.  This might not scale well to huge web
 sites.  It would probably be better for scripts to be able to obtain
 one page at a time.  I suppose the spidering system could then avoid
 getting too far ahead of what the script has requested (a
 page-at-a-time variant is also sketched after these notes).

o This is a minor point, but the "MT checks sitemap.xml" part of the
 algorithm should probably be an option.  I don't know if it should
 be on or off by default, but people who want a pure crawling
 experience (based on parsing discovered pages) should be able to get
 that.  If we let the crawler use other mechanisms to find pages,
 there are a lot of directions we could take with that (Google "site:"
 queries, etc.).

o Regarding 'When the "not visited" list is empty we are done.', I
 suppose there will have to be other (optional) criteria too.  For
 example, we tend to use time-based limits for the brute force
 scripts.
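
For what it is worth, here is a deliberately trivial sketch of the kind of
usage example asked for in the second note, written against an invented API:
the httpspider module name, the constructor, and the crawl() call below are
all made up to make the discussion concrete, not taken from the draft.

  -- Trivial NSE script against an *invented* crawler API; none of the
  -- httpspider names below come from the draft.
  local httpspider = require "httpspider"   -- hypothetical library name
  local shortport  = require "shortport"
  local stdnse     = require "stdnse"

  description = [[Lists every URL discovered while spidering the web server.]]
  categories = { "discovery", "safe" }

  portrule = shortport.http

  action = function(host, port)
    local crawler = httpspider.new(host, port, { maxdepth = 2, maxpagecount = 50 })
    local urls = {}
    -- crawl() here follows the "crawl everything, then return it" model
    -- described in the draft (and questioned in the third note).
    for _, page in ipairs(crawler:crawl()) do
      urls[#urls + 1] = page.url
    end
    return stdnse.format_output(true, urls)
  end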
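
And a variant of the same action showing the page-at-a-time model suggested
in the third note, again with invented names (it reuses the requires from the
previous sketch); the sitemap toggle and time limit are only there to show
how the optional behaviours from the last two notes could surface as options.

  -- Page-at-a-time variant (invented names): the script pulls one page per
  -- call, so the spider never runs far ahead of what has been consumed.
  action = function(host, port)
    local crawler = httpspider.new(host, port, {
      use_sitemap = false,   -- the "MT checks sitemap.xml" step as an opt-in
      maxtime     = 120,     -- stop after two minutes, as the brute scripts do
    })
    local urls = {}
    local page = crawler:next_page()   -- hypothetical: fetch at most one page
    while page do
      urls[#urls + 1] = page.url
      page = crawler:next_page()
    end
    return stdnse.format_output(true, urls)
  end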

Anyway, I'm looking forward to seeing where this goes!

Cheers,
Fyodor
_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/


