Nmap Development mailing list archives

Re: Web crawling library proposal


From: Martin Holst Swende <martin () swende se>
Date: Tue, 02 Aug 2011 19:21:56 +0200

On 08/01/2011 11:52 PM, Paulino Calderon wrote:
On 07/28/2011 11:38 AM, Paulino Calderon wrote:
Hi nmap-dev,

I've created a wiki page where you can find a proposal for the web
crawling library I started working on. It'd be awesome if the people
who have looked into it before shared their thoughts about this draft.

https://secwiki.org/w/Nmap/Spidering_Library

Cheers.

I've updated the proposal, any more suggestions?

https://secwiki.org/w/Nmap/Spidering_Library


Regarding the algorithm, I have some thoughts... Would it make sense to
instead make it a producer/consumer setup based on coroutines?
(http://www.lua.org/pil/9.2.html)
For example, let's say I want to use the library to get links. I could do:
spider = spider.initialise(<all my options and base url>)
status, link = spider.next()

next would fetch the site and start parsing until it came across a
link. Once it finds a new link, the link is stored internally in the
spider's 'to-visit' list and then yielded, which halts the
processing and sends the link to the caller. To continue, the app would
call 'resume' and the parsing would continue.
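A minimal Lua sketch of that producer/consumer idea follows. The names
(Spider, init, next) and the fetch/extract_links callbacks are illustrative
only, not the actual library API from the proposal:

  -- Sketch of a coroutine-based spider; names are illustrative only.
  local Spider = {}

  function Spider.init(base_url, fetch, extract_links)
    local s = { to_visit = { base_url }, visited = {} }
    -- Producer: fetches pages and yields each newly discovered link.
    s.producer = coroutine.create(function()
      while #s.to_visit > 0 do
        local url = table.remove(s.to_visit, 1)
        if not s.visited[url] then
          s.visited[url] = true
          local body = fetch(url)
          for _, link in ipairs(extract_links(body)) do
            if not s.visited[link] then
              table.insert(s.to_visit, link)
              coroutine.yield(link)   -- halt here until the caller resumes
            end
          end
        end
      end
    end)
    return s
  end

  function Spider.next(s)
    local ok, link = coroutine.resume(s.producer)
    return ok and link ~= nil, link
  end

A caller could then just loop on Spider.next(s) and stop as soon as it has
seen enough links, so no more data is fetched than the script actually asks for.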

Using such an approach, it would be up to the caller to determine how
much spidering is to be performed, and the library will not start
fetching very large amounts of data that are not needed.

To achieve multi-threaded spidering, the spidering library could just
provide a method which starts a number of threads, each of which uses the
approach described above to fetch in parallel.
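A rough sketch of how such parallel fetching could look with NSE worker
threads. stdnse.new_thread and nmap.condvar are existing NSE facilities, but
the queue handling and synchronisation here are simplified and illustrative:

  local nmap   = require "nmap"
  local stdnse = require "stdnse"

  -- fetch_and_parse(url, queue) is assumed to fetch the page and push any
  -- newly found links onto the shared queue.
  local function crawl_parallel(queue, visited, fetch_and_parse, num_workers)
    local condvar = nmap.condvar(queue)

    local function worker()
      while true do
        local url = table.remove(queue, 1)
        if not url then break end   -- simplified: a real worker would wait
        if not visited[url] then    -- for links produced by other workers
          visited[url] = true
          fetch_and_parse(url, queue)
        end
      end
      condvar("signal")
    end

    local threads = {}
    for i = 1, num_workers do
      local co = stdnse.new_thread(worker)
      threads[co] = true
    end

    -- Wait until all workers have finished.
    repeat
      for co in pairs(threads) do
        if coroutine.status(co) == "dead" then threads[co] = nil end
      end
      if next(threads) then condvar("wait") end
    until next(threads) == nil
  end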


The spidering library could also have a wrapper around http, so that
instead of using the http library to fetch data, the spidering library
is used - in exactly the same manner. The only difference being that the
spidering library would do two things:
- For each fetched resource, add it to the "visited" list
- For each fetched resource, parse the response data and add links,
redirects etc. to the "to-visit" list.
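Roughly, such a wrapper might look like this. http.get is the stock NSE call;
the bookkeeping tables and the naive get_links helper are illustrative only:

  local http = require "http"

  local spider_http = { visited = {}, to_visit = {} }

  -- Naive link extraction; a real spider would use a proper parser.
  local function get_links(body)
    local links = {}
    for href in body:gmatch('href%s*=%s*"([^"]+)"') do
      links[#links + 1] = href
    end
    return links
  end

  -- Same signature as http.get, but records visits and queues new links.
  function spider_http.get(host, port, path, options)
    local response = http.get(host, port, path, options)
    spider_http.visited[path] = true
    if response and response.body then
      for _, link in ipairs(get_links(response.body)) do
        if not spider_http.visited[link] then
          table.insert(spider_http.to_visit, link)
        end
      end
    end
    return response
  end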

This way, it would perhaps be possible to make scripts which already use
the http library help populate the spider. In a pre-rule script, a
spider could be initialised, and it could 'replace' the http module. All
scripts which make use of the http module would unknowingly use the
spider instead, which transparently picks up interesting data as the
different scripts do their thing.
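One conceivable way to do that replacement would be through package.loaded,
so that later require("http") calls hand back the spider-aware wrapper.
Whether NSE actually shares library loading between scripts like this would
need checking; the sketch below is purely illustrative:

  prerule = function() return true end

  action = function()
    local real_http = require "http"
    -- Fall through to the real library for everything we don't override.
    local wrapper = setmetatable({}, { __index = real_http })
    wrapper.get = function(host, port, path, options)
      local response = real_http.get(host, port, path, options)
      -- record the visit and harvest links here (see the wrapper sketch above)
      return response
    end
    -- Assumption: subsequent require("http") in other scripts returns this.
    package.loaded["http"] = wrapper
  end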

Just some random ideas, hope you can make sense of it :)

Regards,

/Martin

_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/