Nmap Development mailing list archives
Proposed improvements on httpspider.lua
From: George Chatzisofroniou <sophron () latthi com>
Date: Sun, 23 Jun 2013 21:47:02 +0300
While I was developing http-referer-checker [1], I encountered some flaws in the httpspider library. I made some improvements (see the attached patch) to make my script work properly, but more needs to be done to make this library really flexible.

I'll try to introduce the basic problem I encountered using an example. Let's say I want to spider "example.com". The main page ("/") contains a link to the "/register" page. The "/register" page contains a link to the "captcha.com" page, and "captcha.com" contains a link to the "captcha.com/howtouse" page. If you can't imagine it, maybe this will help you:

 <example.com>            <example.com>
 +------+                 +---------------+
 |  /   | -------------→  |   /register   | -------->
 +------+                 +---------------+
 Resource 1               Resource 2

          <example.com>              <captcha.com>
         +--------------------+      +-------------------+
 ------> |    captcha.com     | ---> |     /howtouse     |
         +--------------------+      +-------------------+
          Resource 3                  Resource 4

So, if I run the crawler with no options, it will return the first two resources ("/" and "/register"). It won't go outside the host by default, so it won't return the next two resources. If I want to go outside the host, the crawler provides a number of options, like "withinhost", "withindomain" and "noblacklist". Combining these options effectively forms a blacklist that adjusts the crawler's behavior while spidering. So, if I run the crawler with the option "withinhost = 0", the crawler will successfully bring back all four resources.

The problem is the following. What if I want to bring back all the resources that are referenced from URLs inside the target host? I don't care whether the resources lie within or outside the host, but I want them to be referenced from within the target host. In the above example, that means resources 1, 2 and 3 but *not* 4.

Another example is http-referer-checker [1]. I wanted to bring back all the JavaScript resources that are referenced from URLs inside the host, for example "http://code.jquery.com/jquery-latest.js". But by just setting the withinhost option to zero, the crawler would also bring back all the resources referenced in "http://code.jquery.com/jquery-latest.js" (like "http://jquery.com"), which I didn't care about. This is not possible with the current implementation.

So, I added a new option that lets you form a second blacklist, one for scraping. The resources listed there will be returned by the crawler, but they won't be scraped (the crawler won't search for URLs within them).

To summarize the two blacklists:

blacklist for spidering - The crawler will not return the resources defined in this blacklist at all. Currently supported by the "withinhost", "withindomain" and "noblacklist" options.

blacklist for scraping - The crawler will return the resources defined in this blacklist, but it won't scrape them (meaning it won't search for links within them). Currently supported by the "blacklistforscraping" option.

Two different blacklists may confuse users, so we could instead use one blacklist with an option = {spidering, scraping} that adjusts the crawler's behavior.

Finally, I propose a more flexible syntax for the crawler's options. For example, the blacklist option I've added follows a syntax like this:

blacklistforscraping = {
    { extension = "css", location = "withinhost" },
    { extension = "*", location = "outsidehost" },
}

That means: add to the blacklist all the CSS files within the host and all the files (the character "*" stands for all files) outside the host.
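For illustration, here is a minimal sketch of an NSE script that would use this. The crawler setup (Crawler:new, crawl()) is the library's existing usage pattern; the "blacklistforscraping" table is the new option proposed above, so its name and exact syntax exist only in the attached patch:

local httpspider = require "httpspider"
local shortport = require "shortport"
local stdnse = require "stdnse"
local table = require "table"

portrule = shortport.http

action = function(host, port)
  local crawler = httpspider.Crawler:new(host, port, "/", {
    scriptname = SCRIPT_NAME,
    -- lift the same-host restriction, as in the example above
    withinhost = 0,
    -- proposed option (attached patch): return these resources,
    -- but do not search for links inside them
    blacklistforscraping = {
      { extension = "css", location = "withinhost" },
      { extension = "*", location = "outsidehost" },
    },
  })

  local urls = {}
  while true do
    local status, r = crawler:crawl()
    if not status then
      break -- no more resources, or the crawler hit an error
    end
    -- blacklisted-for-scraping resources still end up here; the
    -- crawler just never followed the links inside them
    table.insert(urls, tostring(r.url))
  end
  return stdnse.format_output(true, urls)
end

With the example hosts above, this should return resources 1, 2 and 3, but since resource 3 is outside the host it is never scraped, so resource 4 is never discovered.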
Again, that means the crawler will return these files, but it won't go any further by searching for URLs within them. I think it would be more flexible to have a similar syntax for the blacklist for spidering as well.

Does the above make any sense? Let me know what you think,

[1]: http://seclists.org/nmap-dev/2013/q2/495

--
George Chatzisofroniou
sophron.latthi.com
Attachment: adding_blacklist_for_scraping.diff