Nmap Development mailing list archives
Re: Web crawling library proposal
From: Paul Johnston <paj () pajhome org uk>
Date: Tue, 2 Aug 2011 11:23:18 +0100
Hi,

It looks like a good start.

Regarding the options, I think it's important to separate general HTTP options from options specific to the crawler. Ideally, we would have a general HTTP layer that is used by all HTTP plugins. Here we'd have options like cookie support, user agent, path blacklist, etc. In the crawler we'd have options like crawl_depth.

Cookie support - I suggest for version 1 you just have the option for the user to manually specify one or more cookies, just as skipfish works. There is future potential for nmap keeping its own "cookie jar" - but that is a major area in itself.

Randomize user agent - not sure this would be a common requirement. Suggest the default agent is something like "nmap". Have an option that makes it easy to impersonate IE, and another option to manually specify the full UA string.

I suggest you include an option about whether to attempt POST requests as part of a crawl. This is generally necessary for a thorough crawl - but it can cause issues, like defacing a message board with test data.

A major issue for a crawler is how to handle the query string. Ignoring the query string is not a good option - consider sites that have URLs like "/app?page=main" and "/app?page=inbox". Similarly, treating every query string as unique can result in very long crawls - consider "/news?item=xx" on a site with 1000s of news items. Some crawlers deal with this by having a limit on the number of times they'll request a particular page - 30 may be a good default.

Another major issue is how to extract links from a page. Extracting static HTML links is relatively straightforward. Of course, many sites use JavaScript, and advanced commercial crawlers include an embedded browser to run the JavaScript and extract links this way. I suggest only doing static links for version 1.

Many plugins that depend on the crawler will need to know the form parameters that are posted to a page - e.g. XSS and SQL injection detectors.
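As a rough illustration of two of the points above (static link and form extraction, and a per-page request cap), here is a minimal Python sketch. It is not NSE code and not part of the proposal: the names LinkFormParser and PageLimiter are hypothetical, a "page" is assumed to mean host plus path with the query string ignored, and the default of 30 is simply the figure suggested above.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkFormParser(HTMLParser):
    """Collect static <a href> links and <form> targets with field names.

    Hypothetical helper for illustration; it does not run JavaScript.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []    # absolute URLs taken from <a href="...">
        self.forms = []    # dicts with "action", "method" and "fields"
        self._form = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "form":
            self._form = {
                # An empty or missing action submits to the page itself.
                "action": urljoin(self.base_url, attrs.get("action") or ""),
                "method": (attrs.get("method") or "get").upper(),
                "fields": [],
            }
            self.forms.append(self._form)
        elif tag in ("input", "select", "textarea") and self._form is not None:
            if attrs.get("name"):
                self._form["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._form = None


class PageLimiter:
    """Cap how many times one page (host + path) is requested.

    Assumption: URLs that differ only in their query string count as the
    same page, so "/news?item=1" and "/news?item=2" share one counter.
    """

    def __init__(self, max_requests_per_page=30):
        self.max_requests_per_page = max_requests_per_page
        self.counts = Counter()

    def allow(self, url):
        parts = urlsplit(url)
        key = (parts.netloc, parts.path)
        if self.counts[key] >= self.max_requests_per_page:
            return False
        self.counts[key] += 1
        return True
```

A crawler loop would feed each fetched page to LinkFormParser, queue only the links for which PageLimiter.allow() returns true, and hand the recorded forms to whichever plugins need them. A separate set of already-visited URLs would still be needed to avoid requesting the exact same URL twice.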
Where you encounter a link as a form submission target, the crawler will need to record these parameters.

Some plugins will also depend on page content, e.g. to find email addresses. If the user has not opted to save the page contents, these kinds of plugins will need to be called at the time the page is fetched. I think w3af calls this type of plugin a "grep plugin" and it may be worth looking at how they architect this.

Hope this helps,

Paul

On Mon, Aug 1, 2011 at 10:52 PM, Paulino Calderon <paulino () calderonpale com> wrote:
On 07/28/2011 11:38 AM, Paulino Calderon wrote:

Hi nmap-dev,

I've created a wiki page where you can find a proposal for the web crawling library I started working on. It'd be awesome that the people who have looked into it before share their thoughts about this draft.

https://secwiki.org/w/Nmap/Spidering_Library

Cheers.

I've updated the proposal, any more suggestions?

https://secwiki.org/w/Nmap/Spidering_Library

--
Paulino Calderón Pale
Web: http://calderonpale.com
Twitter: http://www.twitter.com/paulinocaIderon

_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/
Current thread:
- Web crawling library proposal Paulino Calderon (Jul 28)
- Re: Web crawling library proposal Fyodor (Jul 29)
- Re: Web crawling library proposal Brendan Coles (Jul 29)
- Re: Web crawling library proposal Paulino Calderon (Aug 01)
- Re: Web crawling library proposal Paul Johnston (Aug 02)
- Re: Web crawling library proposal Martin Holst Swende (Aug 02)
- Re: Web crawling library proposal Fyodor (Jul 29)