Nmap Development mailing list archives

[RFC] Mirroring in http-fetch


From: Gyanendra Mishra <anomaly.the () gmail com>
Date: Sun, 12 Jul 2015 22:06:43 +0530

Hi list,

Through this post I would like to discuss the http-mirror implementation in
http-fetch. I will walk you through the current implementation and then
move on to the alternative implementations that I discussed with Dan. We
hope to discuss the need for a mirror script, how efficient (or accurate)
it should be, and any other implementations you might suggest.

---Current Implementation--- [1]

--Terms--
 * A relative URL is any URL that doesn't have the protocol or the domain
name specified; "/a/b/h.html" and "h.html" are both relative URLs.
 * An absolute URL is a URL with the protocol and the domain name, e.g.
http://example.com/a/b/h.html.
 * A localized URL is a URL that contains the path to the file in the file
system, e.g. /home/user/Documents/mirror/example.com

--args--
 * destination - The destination directory of the mirror.
 * mirror - Enables mirroring and overrides any 'files' or 'builtins' you
might have specified.
 * preserve - Preserves the relative URLs on a page. This can be used to
create copies that can be browsed when hosted on localhost; the working
section below explains this in detail. Copies can be moved from one system
to another.
 * localize - Converts all downloaded URLs to localized URLs, suitable for
non-hosted viewing in, say, Chrome. Copies cannot be moved from one system
to another, or even within the same system.

--working--
The script calls the spider with maxpagecount=200 and maxdepth=5 and allows
non-web files by default. Once it finds a page that has a 200 response and
contains a body, it calls the relative_to_absolute function on it. This
function takes the entire body of the page and converts all the relative
URLs, as defined above, to absolute URLs.
It stores the relation between the downloaded absolute URLs and localized
URLs in one table (Table 1) and the relation between absolute URLs and
relative URLs in another table (Table 2).
This is done for all the ~200 pages. After this is done there are three
cases:

* Case 1: "preserve" and "localize" are both nil: The mirroring is over.
All the relative URLs in all web pages are now absolute URLs.
* Case 2: "preserve" = true and "localize" = nil: The script goes through
all the downloaded pages and converts all the absolute URLs back to
relative URLs using the relations stored in Table 2. The script doesn't
touch URLs that haven't been downloaded, as doing that would lead to bad
and inaccessible links.
* Case 3: "preserve" = nil and "localize" = true: The script goes through
all the downloaded pages and converts all the absolute URLs to localized
URLs using the relations stored in Table 1. Again, all the URLs to pages
that haven't been downloaded are left untouched, for the same reason as
stated in Case 2.
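
The flow above can be sketched roughly as follows. This is only an
illustration in Python, not the NSE code from [1]: the regex stands in for
real HTML parsing, and the names LINK_RE, localized_url and finalize, as
well as keying Table 2 by (page, absolute URL), are assumptions made for
the example.

```python
import re
from urllib.parse import urljoin, urlparse

# Simplification: a regex over href/src attributes stands in for HTML parsing.
LINK_RE = re.compile(r'(href|src)=(["\'])(.*?)\2', re.IGNORECASE)

def relative_to_absolute(body, page_url, abs_to_rel):
    """Make every link in body absolute; record the original form (Table 2)."""
    def repl(m):
        attr, quote, link = m.groups()
        absolute = urljoin(page_url, link)
        if absolute != link:
            # Assumption: keyed per page, since two pages may spell the
            # same target differently ("x.html" vs "../dir/x.html").
            abs_to_rel[(page_url, absolute)] = link
        return f"{attr}={quote}{absolute}{quote}"
    return LINK_RE.sub(repl, body)

def localized_url(absolute, destination):
    """Map an absolute URL to a path under destination (a Table 1 entry)."""
    parts = urlparse(absolute)
    return f"{destination.rstrip('/')}/{parts.netloc}{parts.path or '/'}"

def finalize(body, page_url, abs_to_local, abs_to_rel,
             preserve=False, localize=False):
    """Apply Case 1, 2 or 3 to a body whose links are already absolute."""
    def repl(m):
        attr, quote, link = m.groups()
        if link in abs_to_local:  # only links to downloaded pages change
            if preserve:
                link = abs_to_rel.get((page_url, link), link)
            elif localize:
                link = abs_to_local[link]
        return f"{attr}={quote}{link}{quote}"
    return LINK_RE.sub(repl, body)

page = "http://example.com/dir/page.html"
html = '<a href="other.html">x</a> <a href="/gone.html">y</a>'
table2 = {}
absolute_html = relative_to_absolute(html, page, table2)
# Pretend that only other.html was actually downloaded.
downloaded = ["http://example.com/dir/other.html"]
table1 = {u: localized_url(u, "/tmp/mirror") for u in downloaded}
print(finalize(absolute_html, page, table1, table2, preserve=True))
print(finalize(absolute_html, page, table1, table2, localize=True))
```

In both printed results the link to /gone.html stays absolute, because that
page is not in Table 1.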

--Some Limitations and Problems--
* Having two mutually exclusive booleans may be confusing for the user.
* Pages with GET parameters aren't handled. Say a page looks like
example.com/page?author=1. The page would get stored as page?author=1, as
the website will be served directly. "?" is an invalid character in file
names on Windows [2]. I will add a function that encodes file names, either
depending on the platform or in a platform-independent way.
* Another use case could be to download everything as is. Note that this is
not the same as Case 1, as this wouldn't even make a call to the function
"relative_to_absolute()".
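
One possible platform-independent encoding for the GET parameter problem is
to percent-encode the file name. The sketch below is in Python for
illustration only; the name url_to_filename and the "index.html" fallback
for an empty path are assumptions, not part of the current script.

```python
from urllib.parse import quote, urlsplit

def url_to_filename(url):
    """Build a file name from a URL's path and query, percent-encoded."""
    parts = urlsplit(url)
    # Assumption: an empty path is stored as index.html.
    name = parts.path.lstrip("/") or "index.html"
    if parts.query:
        name += "?" + parts.query
    # Keep "/" as the directory separator; "?", "=", "&", ":" and other
    # characters outside letters, digits and "-_.~" become %XX escapes.
    return quote(name, safe="/-_.~")

print(url_to_filename("http://example.com/page?author=1"))
```

The encoding is reversible with urllib.parse.unquote, so the original
path and query can be recovered from the stored name.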

-- Command Line --
* For preserve ==> ./nmap --script http-fetch --script-args
"destination='/home/user/Documents/mirror',mirror=true,preserve=true"
nmap.org -p 80 -d
* For localize ==> ./nmap --script http-fetch --script-args
"destination='/home/user/Documents/mirror',mirror=true,localize=true"
nmap.org -p 80 -d
* Default ==> ./nmap --script http-fetch --script-args
"destination='/home/user/Documents/mirror',mirror=true" nmap.org -p 80 -d


--- Alternate Implementation 1 ---

-- Terms --
* Relative URL: A URL that doesn't start with a "/", e.g.
"relative.html"
* Absolute URL: A URL that starts with a "/", e.g. "/a/b/h.html"
* Localized URL: A URL that has the file system path prepended to the
absolute URL.
* Bad link: A link that links to nothing. Arises mostly when something
gets converted to absolute or localized but the linked page doesn't get
downloaded, or when relative URLs are allowed to stay as is.

-- args--
* "absolute" is like the opposite of "preserve" currently: if enabled, it
converts everything to an absolute url (meaning it starts with a /),
otherwise it leaves everything alone.
* "localize", if true, adds the destination directory before that /,
overwriting the domain name if necessary.
* "mirror" and "destination" are the same as above.

--working--
I'll just run you through the various cases that might exist and what
enabling the various arguments might do. Most of the text written here has
been copied from my chat with Dan, who suggested this implementation. "/tmp"
in the cases below is the directory provided by the destination argument.

CASE 1: Given a page at /dir/page.html that contains a relative link,
"script.js" (script.js is located on the server at /dir/script.js):
 * with neither absolute nor localize: "script.js"
 * absolute=1,localize=nil: "/dir/script.js"
 * absolute=nil, localize=1: "script.js"
 * absolute=1, localize=1: "/tmp/dir/script.js"

CASE 2: Now given an absolute link, "/dir/page2.html". Absolute does
nothing as the link is already absolute.
 * absolute=nil, localize=nil: "/dir/page2.html"
 * absolute=1, localize=nil: "/dir/page2.html"
 * absolute=nil, localize=1: "/tmp/dir/page2.html"
 * absolute=1, localize=1: "/tmp/dir/page2.html"

CASE 3: Given a domain name link on the same host,
"http://example.com/page.html". Absolute does nothing as the link is already
absolute.
 * absolute=nil, localize=nil: "http://example.com/page.html"
 * absolute=1, localize=nil: "http://example.com/page.html"
 * absolute=nil, localize=1: "/tmp/page.html"
 * absolute=1, localize=1: "/tmp/page.html"

So absolute + localize results in an output that is similar to the output
of "localize" in the previous script. Just absolute would result in an
output like the "preserve" argument in the previous implementation. The
problem of "bad links" still exists.
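
The three cases above can be condensed into one function. This is a rough
Python sketch of the rules as listed, not code from the script; the
function name rewrite_link and its parameters (page_path, host, dest) are
made up for the example, and only path-style links are handled.

```python
import posixpath
from urllib.parse import urlsplit

def rewrite_link(link, page_path, host, dest, absolute=False, localize=False):
    """Rewrite one link following CASE 1-3 of Alternate Implementation 1."""
    parts = urlsplit(link)
    root = dest.rstrip("/")
    if parts.netloc:
        # CASE 3: domain name link; only same-host links can be localized.
        if parts.netloc == host and localize:
            return root + (parts.path or "/")
        return link
    if parts.scheme:
        return link  # e.g. mailto: links are left alone
    if link.startswith("/"):
        # CASE 2: already absolute, so "absolute" has nothing to do.
        return root + link if localize else link
    # CASE 1: relative link; "localize" alone leaves it untouched.
    if not absolute:
        return link
    resolved = posixpath.normpath(
        posixpath.join(posixpath.dirname(page_path), link))
    return root + resolved if localize else resolved

for a in (False, True):
    for l in (False, True):
        print(a, l, rewrite_link("script.js", "/dir/page.html",
                                 "example.com", "/tmp", a, l))
```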

--- Alternate Implementation 2 ---

--Terms--
 * A relative URL is any URL that doesn't have the protocol or the domain
name specified; "/a/b/h.html" and "h.html" are both relative URLs.
 * An absolute URL is a URL with the protocol and the domain name, e.g.
http://example.com/a/b/h.html.
 * A localized URL is a URL that contains the path to the file in the file
system, e.g. /home/user/Documents/mirror/example.com

--args--
* "localize": Creates a locally browsable copy of a web page. If false, the
script downloads everything as is (even relative_to_absolute isn't run).
* "mirror" and "destination" are the same as before.

--working--
This is probably the simplest case to explain, with the fewest arguments. If
localize is set to true then the script adds a base href in the head
section of the web page, like <base href="/tmp/">, so that all URLs simply
link to "/tmp/dir/". This would again lead to bad links, as in the previous
cases, if relative_to_absolute isn't called on the body first. Putting
<base href="/tmp/dir/"> would not allow the page to be hosted from the
directory of the mirror; rather, the host would have to serve from "/", or
else all links would lead to /tmp/dir/page while they should lead to "/page".
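
Adding the base element could look roughly like this. It is a Python
sketch for illustration only; the name add_base_href and the fallback of
prepending the tag when a page has no head section are assumptions.

```python
import re

# Matches an opening <head ...> tag but not <header>.
HEAD_RE = re.compile(r"<head\b[^>]*>", re.IGNORECASE)

def add_base_href(body, base):
    """Insert <base href=...> right after the opening <head> tag."""
    tag = f'<base href="{base}">'
    if HEAD_RE.search(body):
        # A function replacement avoids escape handling in the tag text.
        return HEAD_RE.sub(lambda m: m.group(0) + tag, body, count=1)
    # Assumption: pages without a head section get the tag prepended.
    return tag + body

print(add_base_href("<html><head><title>t</title></head></html>", "/tmp/"))
```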

--suggestion--
Maybe we can have an argument called localize: if set to true, it adds <base
href="/tmp/dir/"> (output similar to localize in the present implementation),
and if it is not enabled, it still runs the make_relative_to_absolute
function so as to avoid bad links (output would be similar to "preserve" in
the current implementation). We could then have another argument called
"download_everything", which just downloads everything with no changes made
and which would be exclusive to mirror.

Please throw in some feedback on the need for the script and on the
implementations. A possible use case is running http-grep on an already
downloaded mirror, or running other scripts that focus on the static part
of the website rather than the dynamic part. Please comment on the different
implementations proposed, and if you have another possible implementation
in mind, please suggest it.

tl;dr : How do you want mirroring in http-fetch to be implemented?

Thanks!
Gyani
[1]https://svn.nmap.org/nmap-exp/gyani/drafts/http-fetch.nse
[2]https://kb.acronis.com/content/39790
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
