Nmap Development mailing list archives

Re: Web crawling library proposal


From: Paul Johnston <paj () pajhome org uk>
Date: Tue, 2 Aug 2011 11:23:18 +0100

Hi,

It looks like a good start.

Regarding the options, I think it's important to separate general HTTP
options from options specific to the crawler. Ideally, we would have a
general HTTP layer that is used by all HTTP plugins. Here we'd have options
like: cookie support, user agent, path blacklist, etc. In the crawler we'd
have options like crawl_depth.
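
To make the separation concrete, something along these lines (rough Lua
sketch - all option names are placeholders, not a proposed API):

    -- General HTTP settings, reusable by any HTTP script:
    local http_options = {
      cookies        = nil,            -- user-supplied cookie string
      user_agent     = "nmap",         -- default UA
      path_blacklist = { "/logout" },  -- paths the crawler must never request
    }

    -- Settings that only make sense for the crawler itself:
    local crawler_options = {
      crawl_depth = 3,     -- how many link levels to follow
      do_post     = false, -- whether to submit forms with POST (see below)
    }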

Cookie support - I suggest for version 1 you just have the option for the
user to manually specify one or more cookies, just as skipfish works. There
is future potential for nmap keeping its own "cookie jar" - but that is a
major area in itself.
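
For example, a manually specified cookie could simply be passed through as a
header on every crawler request (sketch only - the script-arg name is made
up):

    -- e.g. the user passes --script-args cookies='PHPSESSID=abc123; lang=en'
    -- (argument name is hypothetical) and the crawler sends it verbatim:
    local function cookie_header(user_cookies)
      if user_cookies and #user_cookies > 0 then
        return { ["Cookie"] = user_cookies }
      end
      return {}
    end

    print(cookie_header("PHPSESSID=abc123; lang=en")["Cookie"])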

Randomize user agent - not sure this would be a common requirement. I suggest
the default agent is something like "nmap". Have an option that makes it easy
to impersonate IE, and another option to manually specify the full UA string.
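
In other words, something like this (the IE string below is only an example
value, not a recommendation):

    local function pick_user_agent(impersonate_ie, custom_ua)
      if custom_ua then return custom_ua end  -- full UA string given by the user
      if impersonate_ie then                  -- easy IE impersonation
        return "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"
      end
      return "nmap"                           -- default
    end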

I suggest you include an option about whether to attempt POST requests as
part of a crawl. This is generally necessary for a thorough crawl - but it
can cause issues, like defacing a message board with test data.

A major issue for a crawler is how to handle the query string. Ignoring the
query string is not a good option - consider sites that have URLs like
"/app?page=main" "/app?page=inbox". Similarly, treating every query string
as unique can result in very long crawls - consider "/news?item=xx" on a
site with 1000s of news items. Some crawlers deal with this by having a limit
on the number of times they'll request a particular page - 30 may be a good
default.
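
A simple way to implement that limit is to count requests per path with the
query string stripped (rough sketch - the cap of 30 is just the suggested
default):

    local MAX_VISITS_PER_PATH = 30
    local visit_count = {}

    local function should_fetch(url)
      local path = url:match("^[^?#]*") or url   -- drop query string / fragment
      visit_count[path] = (visit_count[path] or 0) + 1
      return visit_count[path] <= MAX_VISITS_PER_PATH
    end

    -- "/app?page=main" and "/app?page=inbox" are both fetched (they are only
    -- the first two hits on "/app"), but "/news?item=1" .. "/news?item=1000"
    -- stops after 30 requests to "/news".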

Another major issue is how to extract links from a page. Extracting static
HTML links is relatively straightforward. Of course, many sites use
JavaScript, and advanced commercial crawlers include an embedded browser to
run the JavaScript and extract links this way. I suggest only doing static
links for version 1.
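
For static extraction, even simple Lua patterns get a long way (sketch only -
a real implementation would want a more forgiving HTML parser and
relative-URL resolution):

    local function extract_links(body)
      local links = {}
      for href in body:gmatch('href%s*=%s*["\']([^"\']+)["\']') do
        links[#links + 1] = href
      end
      return links
    end

    local sample = '<a href="/app?page=main">Main</a> <a href="/news?item=1">News</a>'
    for _, link in ipairs(extract_links(sample)) do print(link) end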

Many plugins that depend on the crawler will need to know the form parameters
that are posted to a page - e.g. XSS and SQL injection detectors. Where the
crawler encounters a form submission target, it will need to record these
parameters.
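
Roughly, for each form the crawler would keep the action URL plus the list of
input names (simplistic patterns, just to show the data worth recording):

    local function extract_forms(body)
      local forms = {}
      for form in body:gmatch("<form.-</form>") do
        local action = form:match('action%s*=%s*["\']([^"\']+)["\']') or ""
        local params = {}
        for name in form:gmatch('name%s*=%s*["\']([^"\']+)["\']') do
          params[#params + 1] = name
        end
        forms[#forms + 1] = { action = action, params = params }
      end
      return forms
    end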

Some plugins will also depend on page content, e.g. finding email addresses.
If the user has not opted to save the page contents, this kind of plugin will
need to be called at the time the page is fetched. I think w3af calls this
type of plugin a "grep plugin" and it may be worth looking at how they
architect this.
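
One way to support this is a simple callback registry: scripts register a
function and the crawler calls it with every response as it is fetched (names
are made up, just to illustrate the hook):

    local page_callbacks = {}

    local function register_page_callback(fn)
      page_callbacks[#page_callbacks + 1] = fn
    end

    local function on_page_fetched(url, body)
      for _, fn in ipairs(page_callbacks) do fn(url, body) end
    end

    -- Example consumer: pull email addresses out of every fetched page.
    register_page_callback(function(url, body)
      for addr in body:gmatch("[%w%._%-]+@[%w%.%-]+") do
        print(url .. " -> " .. addr)
      end
    end)

    on_page_fetched("/contact", "Mail us at team@example.com")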

Hope this helps,

Paul


On Mon, Aug 1, 2011 at 10:52 PM, Paulino Calderon
<paulino () calderonpale com> wrote:

On 07/28/2011 11:38 AM, Paulino Calderon wrote:

Hi nmap-dev,

I've created a wiki page where you can find a proposal for the web
crawling library I started working on. It'd be awesome if the people who
have looked into it before shared their thoughts about this draft.

https://secwiki.org/w/Nmap/Spidering_Library

Cheers.

I've updated the proposal. Any more suggestions?

https://secwiki.org/w/Nmap/Spidering_Library

--
Paulino Calderón Pale
Web: http://calderonpale.com
Twitter: http://www.twitter.com/paulinocalderon

_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/

