Nmap Development mailing list archives
[RFC] Mirroring in http-fetch
From: Gyanendra Mishra <anomaly.the () gmail com>
Date: Sun, 12 Jul 2015 22:06:43 +0530
Hi list,

Through this post I wish to discuss the http-mirror implementation in http-fetch. I will walk you through the current implementation and then move on to the possible alternatives that I discussed with Dan. We hope to discuss the need for a mirror script, how efficient (or accurate) it should be, and any other implementations you might suggest.

---Current Implementation--- [1]

--Terms--

* Relative URL: any URL that doesn't have the protocol or the domain name specified. "/a/b/h.html" and "h.html" are both relative URLs.
* Absolute URL: a URL with the protocol and the domain name, e.g. http://example.com/a/b/h.html.
* Localized URL: a URL that is the path to the file in the file system, e.g. /home/user/Documents/mirror/example.com

--args--

* destination - The destination directory of the mirror.
* mirror - Enables mirroring and overrides any 'files' or 'builtins' you might have specified.
* preserve - Preserves the relative URLs on a page. This can be used to create copies that can be browsed when hosted on localhost. I'll explain how it works in detail in the working section. Copies can be moved from one system to another.
* localize - Converts all downloaded URLs to localized URLs, suitable for non-hosted viewing in, say, Chrome. Copies cannot be moved from one system to another, or even within the same system.

--working--

The script calls the spider with maxpagecount=200 and maxdepth=5 and allows non-web files by default. Once it finds a page that has a 200 response and contains a body, it calls the relative_to_absolute function on it. This function takes the entire body of the page and converts all the relative URLs, as defined above, to absolute URLs. It stores the relation between the downloaded absolute URLs and localized URLs in one table (Table 1) and the relation between absolute URLs and relative URLs in another table (Table 2). This is done for all the ~200 pages.
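To make the working step above concrete, here is a minimal sketch in Python (not the NSE Lua code from [1]) of what a relative_to_absolute pass with the two tables could look like. The regular expression, the parameter names and the way the localized path is built from the destination are all assumptions made for illustration; the draft script may do each of these differently.

```python
import re
from urllib.parse import urljoin, urlparse

# Assumption: only href= and src= attributes are rewritten in this sketch.
LINK_ATTR = re.compile(
    r'(?P<attr>\b(?:href|src))=(?P<q>["\'])(?P<url>.*?)(?P=q)',
    re.IGNORECASE,
)

def relative_to_absolute(body, page_url, dest, abs_to_local, abs_to_rel):
    """Rewrite relative links in body to absolute URLs and fill both tables.

    abs_to_local plays the role of Table 1 (absolute -> localized),
    abs_to_rel plays the role of Table 2 (absolute -> relative).
    """
    def rewrite(match):
        url = match.group("url")
        # Links with a scheme (http:, mailto:, ...) or a leading // are
        # already absolute in the sense used above, so leave them alone.
        if urlparse(url).scheme or url.startswith("//"):
            return match.group(0)
        absolute = urljoin(page_url, url)
        abs_to_rel[absolute] = url
        quote = match.group("q")
        return f'{match.group("attr")}={quote}{absolute}{quote}'

    # Hypothetical layout: <destination>/<host>/<path on the server>.
    parsed = urlparse(page_url)
    abs_to_local[page_url] = f"{dest}/{parsed.netloc}{parsed.path}"
    return LINK_ATTR.sub(rewrite, body)
```

Calling this once per spidered page with shared table1 and table2 dictionaries would leave every page body with absolute links, and the tables would then hold what is needed to map those links back to relative or localized form.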
After this is done there are three cases:

* Case 1: "preserve" and "localize" are both nil: The mirroring is over. All the relative URLs in all web pages are now absolute URLs.
* Case 2: "preserve" = true and "localize" = nil: The script goes through all the downloaded pages and converts all the absolute URLs back to relative URLs using the relations stored in Table 2. The script doesn't touch URLs that haven't been downloaded, as doing that would lead to bad and inaccessible links.
* Case 3: "preserve" = nil and "localize" = true: The script goes through all the downloaded pages and converts all the absolute URLs to localized URLs using the relations stored in Table 1. Again, URLs to pages that haven't been downloaded are left untouched, for the same reason as in Case 2.

--Some Limitations and Problems--

* Having two mutually exclusive booleans may be confusing for the user.
* Pages with GET parameters aren't handled. Say a page looks like example.com/page?author=1. The page would get stored as page?author=1, as the website will be served directly. "?" is an invalid character in file names on Windows [2]. I will add a function that encodes file names depending on the platform, or maybe in a platform-independent way.
* Another use case could be to download everything as is. Note that this is not the same as Case 1, as it wouldn't even make a call to the function "relative_to_absolute()".

--Command Line--

* For preserve ==> ./nmap --script http-fetch --script-args "destination='/home/user/Documents/mirror',mirror=true,preserve=true" nmap.org -p 80 -d
* For localize ==> ./nmap --script http-fetch --script-args "destination='/home/user/Documents/mirror',mirror=true,localize=true" nmap.org -p 80 -d
* Default ==> ./nmap --script http-fetch --script-args "destination='/home/user/Documents/mirror',mirror=true" nmap.org -p 80 -d

---Alternate Implementation 1---

--Terms--

* Relative URL: a URL that doesn't start with a "/".
  e.g. "relative.html"
* Absolute URL: a URL that starts with a "/", e.g. "/a/b/h.html"
* Localized URL: a URL that has the file system path prepended to the absolute URL.
* Bad link: a link that links to nothing. These arise mostly when something gets converted to absolute or localized but the linked page doesn't get downloaded, or when relative URLs are allowed to stay as is.

--args--

* "absolute" is like the opposite of the current "preserve": if enabled, it converts everything to an absolute URL (meaning it starts with a /); otherwise it leaves everything alone.
* "localize", if true, adds the destination directory before that /, overwriting the domain name if necessary.
* "mirror" and "destination" are the same as above.

--working--

I'll just run you through the various cases that might exist and what enabling the various arguments would do. Most of the text here has been copied from my chat with Dan, who suggested this implementation. "/tmp" in the cases below is the directory provided by the destination argument.

CASE 1: Given a page at /dir/page.html that contains a relative link, "script.js" (script.js is located on the server at /dir/script.js):

* absolute=nil, localize=nil: "script.js"
* absolute=1, localize=nil: "/dir/script.js"
* absolute=nil, localize=1: "script.js"
* absolute=1, localize=1: "/tmp/dir/script.js"

CASE 2: Now given an absolute link, "/dir/page2.html". Absolute does nothing as the link is already absolute.

* absolute=nil, localize=nil: "/dir/page2.html"
* absolute=1, localize=nil: "/dir/page2.html"
* absolute=nil, localize=1: "/tmp/dir/page2.html"
* absolute=1, localize=1: "/tmp/dir/page2.html"

CASE 3: Given a domain name link on the same host, "http://example.com/page.html". Absolute does nothing as the link is already absolute.
* absolute=nil, localize=nil: "http://example.com/page.html"
* absolute=1, localize=nil: "http://example.com/page.html"
* absolute=nil, localize=1: "/tmp/page.html"
* absolute=1, localize=1: "/tmp/page.html"

So absolute + localize results in output similar to that of "localize" in the previous script. Just absolute would result in output like that of the "preserve" argument in the previous implementation. The problem of "bad links" still exists.

---Alternate Implementation 2---

--Terms--

* Relative URL: any URL that doesn't have the protocol or the domain name specified. "/a/b/h.html" and "h.html" are both relative URLs.
* Absolute URL: a URL with the protocol and the domain name, e.g. http://example.com/a/b/h.html.
* Localized URL: a URL that is the path to the file in the file system, e.g. /home/user/Documents/mirror/example.com

--args--

* "localize": creates a locally browsable copy of a web page. If false, everything is downloaded as is (not even relative_to_absolute is run).
* "mirror" and "destination" are the same as before.

--working--

This is probably the simplest case to explain, with the fewest arguments. If localize is set to true, the script inserts a base href in the head section of the web page, like <base href="/tmp/">, so that all URLs simply link to "/tmp/dir/". This would again lead to bad links, as in the previous cases, if relative_to_absolute isn't called on the body first. Putting <base href="/tmp/dir/"> would not allow the page to be hosted in the directory of the mirror; rather, the host would have to be on "/", or else all links would lead to /tmp/dir/page while they should lead to "/page".
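As a rough illustration of the base href idea, the following Python sketch (again, not the NSE Lua code) inserts a <base> tag right after the opening <head> tag. The function name add_base_href and the regex-based approach are assumptions for illustration only; a real implementation would also have to decide what to do with pages that have no <head> or that already carry a <base> tag.

```python
import re

# Matches the opening <head> tag (with or without attributes), but not <header>.
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)

def add_base_href(body, base):
    """Return body with <base href="..."> inserted after the first <head> tag.

    If the page has no <head> tag, the body is returned unchanged.
    """
    tag = f'<base href="{base}">'
    # A function is used as the replacement so that the base path is
    # inserted literally, without any backslash-escape processing.
    new_body, _count = HEAD_OPEN.subn(lambda m: m.group(0) + tag, body, count=1)
    return new_body
```

One thing to keep in mind with this approach: a browser resolves only relative references such as "page.html" against the base path, while links that start with "/" keep pointing at the root, so they would still need to be rewritten separately.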
--suggestion--

Maybe we can have an argument called localize: if set to true, it inserts <base href="/tmp/dir/"> (output similar to localize in the present implementation), and if it is not enabled, it still runs the make_relative_to_absolute function so as to avoid bad links (output similar to "preserve" in the current implementation). We could also have another argument called "download_everything", which just downloads everything with no changes made and which would be exclusive to mirror.

Please give some feedback on the need for the script and on the implementations. A possible use case is running http-grep on an already downloaded mirror, or other scripts that focus on the static part of a website rather than the dynamic part. Please comment on the different implementations proposed, and if you have another possible implementation, please suggest it.

tl;dr: How do you want mirroring in http-fetch to be implemented?

Thanks!
Gyani

[1] https://svn.nmap.org/nmap-exp/gyani/drafts/http-fetch.nse
[2] https://kb.acronis.com/content/39790
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/