Nmap Development mailing list archives
Re: Web crawling library proposal
From: Martin Holst Swende <martin () swende se>
Date: Tue, 02 Aug 2011 19:21:56 +0200
On 08/01/2011 11:52 PM, Paulino Calderon wrote:
On 07/28/2011 11:38 AM, Paulino Calderon wrote:
Hi nmap-dev, I've created a wiki page where you can find a proposal for the web crawling library I started working on. It'd be awesome if the people who have looked into it before shared their thoughts about this draft. https://secwiki.org/w/Nmap/Spidering_Library Cheers.

I've updated the proposal; any more suggestions? https://secwiki.org/w/Nmap/Spidering_Library
Regarding the algorithm, I have some thoughts... Would it make sense to instead make it a producer/consumer setup based on coroutines? (http://www.lua.org/pil/9.2.html)

For example, let's say I want to use the library to get links. I could do:

  spider = spider.initialise(<all my options and base url>)
  status, link = spider.next()

next would fetch the site and start parsing until it came across a link. Once it finds a new link, the link is stored internally in the spider's 'to-visit' list and then yielded, which halts the processing and sends the link to the caller. To continue, the app would call 'resume' and the parsing would continue. Using such an approach, it would be up to the caller to determine how much spidering is to be performed, and the library would not start fetching large amounts of data that are not needed.

To achieve multi-threaded spidering, the spidering library could just provide a method which starts a number of threads that use the approach described above to fetch in parallel.

The spidering library could also have a wrapper around http, so that instead of using the http library to fetch data, the spidering library is used - in exactly the same manner. The only difference would be that the spidering library does two things:
- For each fetched resource, add it to the "visited" list.
- For each fetched resource, parse the response data and add links, redirects, etc. to the "to-visit" list.

This way, it would perhaps be possible to make scripts which already use the http library help populate the spider. In a pre-rule script, a spider could be initialised, and it could 'replace' the http module. All scripts which make use of the http module would unknowingly use the spider instead, which transparently picks up interesting data as the different scripts do their thing.

Just some random ideas, hope you can make sense of it :)

Regards,
/Martin

_______________________________________________
Sent through the nmap-dev mailing list
http://cgi.insecure.org/mailman/listinfo/nmap-dev
Archived at http://seclists.org/nmap-dev/
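A minimal Lua sketch of the coroutine-based producer/consumer idea from the message above. The names (spider.initialise, the returned object's next method), the host/port/start_path parameters and the naive href pattern are assumptions made for illustration, not an existing library; only http.get is a real NSE call.

  local http = require "http"

  local spider = {}

  -- Hypothetical constructor: builds the crawl state and a producer
  -- coroutine that yields one discovered link at a time.
  function spider.initialise(host, port, start_path)
    local s = { to_visit = { start_path or "/" }, visited = {} }

    s.producer = coroutine.create(function()
      while #s.to_visit > 0 do
        local path = table.remove(s.to_visit, 1)
        if not s.visited[path] then
          s.visited[path] = true
          local response = http.get(host, port, path)
          if response and response.body then
            -- Deliberately naive link extraction; a real parser would do more.
            for link in response.body:gmatch('href%s*=%s*"(/[^"]*)"') do
              table.insert(s.to_visit, link)
              coroutine.yield(link)  -- hand one link to the caller and pause
            end
          end
        end
      end
    end)

    -- Resumes the producer; returns true plus the next link (nil once the
    -- crawl is finished), or false plus an error message.
    function s.next()
      local ok, link = coroutine.resume(s.producer)
      if ok then return true, link end
      return false, link
    end

    return s
  end

  return spider

A caller would then decide for itself how much spidering to perform:

  local s = spider.initialise(host, port, "/")
  local status, link = s.next()
  while status and link do
    -- use the link; stop whenever enough has been crawled
    status, link = s.next()
  end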
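The multi-threaded variant could be sketched with NSE's stdnse.new_thread and a condition variable. crawl_parallel and the shared state table s (with a to_visit list, as in the sketch above) are again assumptions for the example; the thread/condvar wait loop follows the pattern documented for stdnse.new_thread.

  local http = require "http"
  local nmap = require "nmap"
  local stdnse = require "stdnse"

  -- Hypothetical helper: start n worker threads that drain a shared
  -- to-visit list in parallel. 's' is a state table as in the sketch above.
  local function crawl_parallel(host, port, s, n)
    local condvar = nmap.condvar(s)
    local threads = {}

    local function worker()
      while #s.to_visit > 0 do
        local path = table.remove(s.to_visit, 1)
        local response = http.get(host, port, path)
        -- link extraction into s.to_visit omitted; same as the sketch above
      end
      condvar("signal")  -- tell the base thread this worker is done
    end

    for _ = 1, n do
      local co = stdnse.new_thread(worker)
      threads[co] = true
    end

    -- Wait until every worker coroutine has died.
    repeat
      condvar("wait")
      for co in pairs(threads) do
        if coroutine.status(co) == "dead" then threads[co] = nil end
      end
    until next(threads) == nil
  end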
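Finally, the "wrapper around http" idea might look roughly like this; the module name spiderhttp and the link/redirect extraction are invented for the example. Scripts would call spiderhttp.get exactly as they call http.get today, while the wrapper records visited resources and candidates for the to-visit list as a side effect.

  local http = require "http"

  local spiderhttp = { visited = {}, to_visit = {} }

  -- Same signature as http.get, so callers need not change.
  function spiderhttp.get(host, port, path, options)
    local response = http.get(host, port, path, options)
    spiderhttp.visited[path] = true
    if response and response.body then
      -- Record links for later crawling (naive extraction).
      for link in response.body:gmatch('href%s*=%s*"(/[^"]*)"') do
        if not spiderhttp.visited[link] then
          table.insert(spiderhttp.to_visit, link)
        end
      end
      -- Redirect targets are interesting too.
      if response.header and response.header.location then
        table.insert(spiderhttp.to_visit, response.header.location)
      end
    end
    return response
  end

  return spiderhttp

Because the lists fill up as a side effect of normal http use, a pre-rule script could initialise them once, and later the spider (or any script) could drain to_visit to drive the actual crawl.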
Current thread:
- Web crawling library proposal Paulino Calderon (Jul 28)
- Re: Web crawling library proposal Fyodor (Jul 29)
- Re: Web crawling library proposal Brendan Coles (Jul 29)
- Re: Web crawling library proposal Paulino Calderon (Aug 01)
- Re: Web crawling library proposal Paul Johnston (Aug 02)
- Re: Web crawling library proposal Martin Holst Swende (Aug 02)
- Re: Web crawling library proposal Fyodor (Jul 29)