Nmap Development mailing list archives

Proposed improvements to httpspider.lua


From: George Chatzisofroniou <sophron () latthi com>
Date: Sun, 23 Jun 2013 21:47:02 +0300

While I was developing http-referer-checker [1], I encountered some
flaws in the httpspider library. I made some improvements (see the
attached patch) to make my script work properly, but more needs to be
done to make this library really flexible.

I'll try to introduce the basic problem I encountered using an
example. Let's say I want to spider "example.com". The main page ("/")
links to the "/register" page. The "/register" page links to a page on
"captcha.com", and that "captcha.com" page in turn links to the
"captcha.com/howtouse" page. If you can't picture it, maybe this will
help you:

  <example.com>              <example.com>
  +-----------+            +---------------+
  |     /     | ---------> |   /register   | -------->
  +-----------+            +---------------+
    Resource 1                 Resource 2

            <example.com>               <captcha.com>
          +---------------+           +---------------+
  ------> |  captcha.com  | --------> |   /howtouse   |
          +---------------+           +---------------+
             Resource 3                  Resource 4

So, if I run the crawler with no options, it will return the first
two resources ("/" and "/register"). It won't go outside the host by
default, so it won't return the next two resources. If I want to go
outside the host, the crawler provides a number of options, like
"withinhost", "withindomain" and "noblacklist". Combining these options
effectively forms a blacklist that adjusts the behavior of the crawler
while spidering. So, if I run the crawler with the option
"withinhost = 0", it will successfully bring back all four resources.
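For reference, this is roughly how a script drives the crawler and how
the "withinhost = 0" case above is requested (a minimal sketch based on
the usual NSE crawling pattern, not a drop-in script):

  local httpspider = require "httpspider"
  local shortport = require "shortport"
  local stdnse = require "stdnse"

  portrule = shortport.http

  action = function(host, port)
    -- withinhost = 0 lifts the same-host restriction, so the crawler
    -- will also follow links that point outside the target host
    local crawler = httpspider.Crawler:new(host, port, "/", {
      scriptname = SCRIPT_NAME,
      withinhost = 0,
    })
    crawler:set_timeout(10000)

    local urls = {}
    while true do
      local status, r = crawler:crawl()
      if not status then
        if r.err then
          return stdnse.format_output(false, r.reason)
        end
        break
      end
      -- r.url is the discovered resource, r.response the fetched page
      table.insert(urls, tostring(r.url))
    end
    return stdnse.format_output(true, urls)
  end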

The problem is the following. What if I want to fetch all the resources
that are referenced from URLs inside the target host? I don't care
whether the resources lie within or outside the host, but I want them
to be referenced from within the target host. In the above example,
that means resources 1, 2 and 3 but *not* 4.

Another example is http-referer-checker [1]. I wanted to fetch all
the JavaScript resources that are referenced from URLs inside the
host, like for example "http://code.jquery.com/jquery-latest.js". But
by just setting the withinhost option to zero, the crawler would also
bring back all the resources referenced in
"http://code.jquery.com/jquery-latest.js" (like "http://jquery.com"),
which I didn't care about.

This is not possible with the current implementation, so I added a
new option that lets you form a second blacklist that is used for
scraping. The resources listed in there will be returned by the
crawler, but they won't be scraped (the crawler won't search for URLs
within them).

To summarize the two blacklists:

- blacklist for spidering - The crawler will not return the resources
  defined in this blacklist at all. Currently supported by the
  "withinhost", "withindomain" and "noblacklist" options.

- blacklist for scraping - The crawler will return the resources defined
  in this blacklist but it won't scrape them (meaning it won't search
  for links within them). Currently supported by the
  "blacklistforscraping" option.

Two different blacklists may confuse users, so we could instead use one
blacklist where each entry carries an option = {spidering, scraping}
that adjusts the crawler's behavior.
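Purely as an illustration (these field names are hypothetical, not
something the patch implements), such a unified option could look like:

  blacklist = {
    -- never return these at all ("spidering" blacklist behavior)
    { extension = "*", location = "outsidedomain", action = "spidering" },
    -- return these, but don't search them for further links
    { extension = "*", location = "outsidehost",   action = "scraping" },
  }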

Finally, I propose a more flexible syntax for the crawler's options.
For example, the blacklist option I've added follows a syntax like
this:

  blacklistforscraping = { { extension = "css", location = "withinhost" },
                           { extension = "*",   location = "outsidehost" },
                         }

That means: add to the blacklist all the CSS files within the host and
all files (the character "*" stands for any file) outside the host.
Again, that means the crawler will return these files, but it won't go
any further by searching for URLs within them.
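With the attached patch applied, the http-referer-checker case above
would then look roughly like this (only a sketch; the exact option
handling is what the patch implements):

  local crawler = httpspider.Crawler:new(host, port, "/", {
    scriptname = SCRIPT_NAME,
    -- follow links that leave the host...
    withinhost = 0,
    -- ...but return external resources without scraping them for more URLs
    blacklistforscraping = {
      { extension = "*", location = "outsidehost" },
    },
  })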

I think it would be more flexible to have a similar syntax for the
blacklist for spidering as well.
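For example (again, a hypothetical option name, mirroring the one
above), the spidering blacklist could accept the same rule format:

  blacklistforspidering = {
    -- neither return nor follow anything outside the target domain
    { extension = "*", location = "outsidedomain" },
  }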

Does the above make any sense?

Let me know what you think,

[1]: http://seclists.org/nmap-dev/2013/q2/495

--
George Chatzisofroniou
sophron.latthi.com

Attachment: adding_blacklist_for_scraping.diff

_______________________________________________
Sent through the dev mailing list
http://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
