Nmap Development mailing list archives
Re: Web crawling library proposal
From: Brendan Coles <bcoles () gmail com>
Date: Sat, 30 Jul 2011 03:40:54 +1000
Hi Paulino,

Nice work. Just briefly:

- It might be worth elaborating on what the default and configurable crawl options are likely to be. I noticed you haven't specifically mentioned white/black listing sub-domains, IPs, vhosts and ports beyond the "path blacklist", for example.

- HTTP headers can sometimes yield additional paths, in particular the "Location", "PHP-Warning" and "PHP-Error" headers, amongst others. This might be worth looking into (a short sketch of the idea follows this message).

b
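[Editor's illustration] To make the header suggestion above concrete, here is a minimal, standalone Lua sketch that pulls candidate paths out of a table of response headers. The header names are the ones Brendan mentions; everything else (the function names, the patterns, the example values) is an illustrative assumption, not part of the proposed library.

    -- Minimal sketch (not part of the proposed library): collect candidate
    -- paths from HTTP response headers.  Parsing rules and example values
    -- are illustrative assumptions.
    local function candidate_paths(headers)
      local seen, found = {}, {}
      local function add(path)
        if path and not seen[path] then
          seen[path] = true
          found[#found + 1] = path
        end
      end

      -- Location: keep only the path component of an absolute URL,
      -- or take the value as-is when it is already a path.
      local location = headers["location"]
      if location then
        add(location:match("^https?://[^/]+(/.*)$") or location:match("^/.*"))
      end

      -- PHP warnings/errors often leak server-side file paths.
      for _, name in ipairs({ "php-warning", "php-error" }) do
        local value = headers[name]
        if value then
          for path in value:gmatch("(/[%w%._/%-]+%.%a+)") do
            add(path)
          end
        end
      end

      return found
    end

    -- Tiny self-test with fabricated header values.
    local headers = {
      ["location"]    = "http://target.example/admin/login.php",
      ["php-warning"] = "include(): failed opening '/var/www/html/inc/config.php'",
    }
    for _, path in ipairs(candidate_paths(headers)) do
      print(path)   --> /admin/login.php, /var/www/html/inc/config.php
    end

In an NSE script the headers table could come from something like response.header as returned by the existing http library, and any paths found this way could be fed back into the crawler's "not visited" queue.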
On Fri, Jul 29, 2011 at 5:15 PM, Fyodor <fyodor () insecure org> wrote:

On Thu, Jul 28, 2011 at 11:38:27AM -0700, Paulino Calderon wrote:

Hi nmap-dev, I've created a wiki page where you can find a proposal for the web crawling library I started working on. It'd be awesome if the people who have looked into it before shared their thoughts about this draft. https://secwiki.org/w/Nmap/Spidering_Library

Thanks Paulino. I think the spidering system will be a great addition, as indicated by the list of potential scripts on this page. And this is a good start toward a spidering proposal. A few notes:

o It would be good to expand the "function list" and "options" sections with descriptions/parameters. As is, they are probably useful for you to jog your mind so you don't forget them, but the rest of us don't really know what you mean by these one or two word entries. It is easy to just write "crawl" for the main function, but harder to define its exact API.

o A usage example (e.g. the API calls that a very trivial script would make) would be very useful to see what you have in mind and help flesh out the details (a hypothetical sketch appears after this message).

o The idea here seems to be that a script calls crawl(), then the system crawls and saves the whole web site, then returns all of that to the script after it is done. This might not scale well to huge web sites. It would probably be better for the scripts to be able to obtain one page at a time. I suppose the spidering system could then avoid getting too far ahead of what the script has requested.

o This is a minor point, but the "MT checks sitemap.xml" part of the algorithm should probably be an option. I don't know if it should be on or off by default, but people who want a pure crawling experience (based on parsing discovered pages) should be able to get that. If we let the crawler use other mechanisms to find pages, there are a lot of directions we could go with that (Google "site:" queries, etc.).

o Regarding 'When the "not visited" list is empty we are done.', I suppose there will have to be other (optional) criteria too. For example, we tend to use time-based limits for the brute force scripts.

Anyway, I'm looking forward to seeing where this goes!

Cheers, Fyodor
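[Editor's illustration] As a companion to the usage-example and one-page-at-a-time points above, here is a hypothetical sketch of a trivial NSE script built on such a library. Every crawler-side identifier (the webcrawler module, Crawler:new, next_page, and the option names maxdepth, blacklist, timelimit, sitemap) is an assumption made up for illustration; only shortport and stdnse are real NSE libraries. This is not the proposal's final API.

    local shortport  = require "shortport"
    local stdnse     = require "stdnse"
    local webcrawler = require "webcrawler"   -- hypothetical module name

    description = [[
    Trivial example script: lists every URL the crawler discovers.
    ]]

    categories = { "discovery", "safe" }

    portrule = shortport.http

    action = function(host, port)
      -- The option names below are made up to illustrate the kind of
      -- configurable limits discussed in this thread.
      local crawler = webcrawler.Crawler:new(host, port, {
        maxdepth  = 3,
        blacklist = { "/logout" },
        timelimit = 120,    -- seconds; a time-based stop criterion
        sitemap   = false,  -- whether to seed the queue from sitemap.xml
      })

      local urls = {}
      -- Pull pages one at a time so the crawler never runs far ahead of
      -- what the script has actually consumed.
      while true do
        local page = crawler:next_page()   -- hypothetical; nil when finished
        if not page then break end
        urls[#urls + 1] = page.url
      end

      return stdnse.format_output(true, urls)
    end

An iterator-style interface like this keeps memory bounded: the library only needs to hold the crawl frontier and whatever look-ahead it chooses, rather than a saved copy of the whole site, and depth or time limits provide the additional stop criteria mentioned above.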
_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/
Current thread:
- Web crawling library proposal Paulino Calderon (Jul 28)
- Re: Web crawling library proposal Fyodor (Jul 29)
- Re: Web crawling library proposal Brendan Coles (Jul 29)
- Re: Web crawling library proposal Paulino Calderon (Aug 01)
- Re: Web crawling library proposal Paul Johnston (Aug 02)
- Re: Web crawling library proposal Martin Holst Swende (Aug 02)
- Re: Web crawling library proposal Fyodor (Jul 29)