Nmap Development mailing list archives
Re: [RFC] Mirroring in http-fetch
From: Gyanendra Mishra <anomaly.the () gmail com>
Date: Fri, 17 Jul 2015 15:05:55 +0530
Hi,

On Fri, Jul 17, 2015 at 1:28 AM, Fyodor <fyodor () nmap org> wrote:
> On Sun, Jul 12, 2015 at 9:36 AM, Gyanendra Mishra <anomaly.the () gmail com> wrote:
>> Hi list,
>> Through this post I wish to discuss the http-mirror implementation in http-fetch.
>
> Hi Gyani. Thanks for sending this! Regarding your ideas:
>
>> We hope to discuss the need for a mirror script,
>
> Personally I'm a big fan of having a mirror script. It's great to be able to download a site for more effective and faster searching, for archival purposes, etc.
Great! It would make things like http-grep easier. :)
>> ---Current Implementation--- [1]
>> --Terms--
>
> To be honest, this redefining of terms for each approach made your proposal difficult to read. When "relative URL" and "absolute URL" mean different things in each of the sections, it can be hard to follow. It would probably work better to use different terms for each different meaning.
Sorry about that.
>> * A relative URL is any URL that doesn't have the protocol or the domain name specified: "/a/b/h.html" and "h.html" are both relative URLs.
>> * An absolute URL is a URL with the protocol and the domain name, e.g. http://example.com/a/b/h.html.
>> * Localized URL: a URL that has the path to the file in the file system, e.g. /home/user/Documents/mirror/example.com
>
> I'm not sure why we would ever want the "localized URL"? It means everything would break if I move it to a different directory on my filesystem, or a different machine, and also has minor privacy implications. Wouldn't relative URLs using "../" style paths work just as well and avoid these problems? Maybe there is some benefit to the localized URLs that I haven't thought of.
As you said, the clone created by the localize argument would not be portable. Its main use is to have a mirror that can be accessed without hosting it on localhost: it would be accessible by simply opening the files in, say, Google Chrome. The absolute paths are a quick fix that avoids computing relative paths. Say I am in a/b/c/d/ and I want to access a file in b: I would have to write functions to generate paths like ../../file_in_b.html, while hard-coded paths skip the need for "../../". This is not the main use of the mirror, though. The mirror would mostly be used after hosting it on localhost.
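To illustrate how small the "../../" helper could be, here is a minimal sketch in Python (not NSE Lua, and not what http-fetch currently does). The function name relative_link and the example file names are hypothetical; both paths are assumed to be relative to the mirror root.

```python
import posixpath


def relative_link(from_page: str, to_file: str) -> str:
    """Return a "../"-style link from from_page to to_file.

    Both arguments are assumed to be paths relative to the mirror root
    (hypothetical layout, for illustration only).
    """
    # Links are resolved against the directory that contains the page.
    from_dir = posixpath.dirname(from_page) or "."
    return posixpath.relpath(to_file, start=from_dir)


# A page in a/b/c/d/ linking to a file in a/b/
print(relative_link("a/b/c/d/index.html", "a/b/file_in_b.html"))
# prints ../../file_in_b.html
```

Links produced this way keep working if the mirror directory is moved, which is the portability concern raised above.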
>> * Case 1: "preserve" and "localize" are both nil: The mirroring is over. All the relative URLs in all web pages are now absolute URLs.
>> * Case 2: "preserve" = true and "localize" = nil: The script goes through all the downloaded pages and converts all the absolute URLs to relative URLs using the relations stored in Table 2. The script doesn't touch URLs that haven't been downloaded, as doing that would lead to bad and inaccessible links.
>> * Case 3: "preserve" = nil and "localize" = true: The script goes through all the downloaded pages and converts all the absolute URLs to localized URLs using the relations stored in Table 3. Again, all URLs to pages that haven't been downloaded are left untouched, for the same reason as stated in Case 2.
>
> What do other mirroring tools such as wget and curl do by default? And by URLs, are we talking about both <a href> style links to other pages as well as embedded content such as <img src>? You've probably thought about it more than me, but my initial thought would be:
>
> o The default would not touch pages at all, but an option would be available which converts the links to relative paths ("../" ones, not starting with "/") if they are to pages/resources we have downloaded, and would convert other (non-downloaded) URLs to absolute ones.
>
> I can also see an argument for swapping those so the default is to convert and there is also the option to preserve. It might be nice if we can just have one option with the common cases rather than require them to use the right combinations of four options. But I guess that depends on whether there is really a need to operate all the options independently.
Wget, one of the more popular tools for mirroring, currently downloads everything as is and converts links only when "--convert-links" is specified.
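The conversion rule discussed above (relative links for what we downloaded, absolute links for everything else) could look roughly like the following Python sketch. This is only an illustration, not wget's or http-fetch's actual code; convert_link and the downloaded mapping (absolute URL to local path under the mirror root) are hypothetical names, and fragments and query strings are ignored for brevity.

```python
import posixpath
from urllib.parse import urljoin


def convert_link(page_url, link, downloaded):
    """Rewrite one link found on page_url.

    downloaded is a hypothetical dict mapping absolute URLs to local
    paths relative to the mirror root.
    """
    # First make the link absolute, whatever form it had in the page.
    target = urljoin(page_url, link)
    if page_url in downloaded and target in downloaded:
        # Both ends are in the mirror: emit a portable "../"-style path.
        page_dir = posixpath.dirname(downloaded[page_url]) or "."
        return posixpath.relpath(downloaded[target], start=page_dir)
    # Not downloaded: point at the original site so the link still works.
    return target


downloaded = {
    "http://example.com/a/b/c/d/index.html": "example.com/a/b/c/d/index.html",
    "http://example.com/a/b/h.html": "example.com/a/b/h.html",
}
page = "http://example.com/a/b/c/d/index.html"
print(convert_link(page, "/a/b/h.html", downloaded))   # ../../h.html
print(convert_link(page, "missing.html", downloaded))  # http://example.com/a/b/c/d/missing.html
```

With a rule like this, a single "convert links" option would cover both the downloaded and the non-downloaded case.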
> Also, keep in mind that pages can specify a base href tag (http://www.w3schools.com/tags/tag_base.asp) to specify what relative URLs are based on. So rather than convert all the relative URLs to absolute ones, it may be better to just specify a base href with the original source URL.
Base href is very useful if we want to convert all links to absolute ones. But consider the "preserve" option (the one that keeps links as is for a copy that is hosted on localhost): say some file "a.html" isn't downloaded while "b.html" is. Base href would link a.html to the right page on the domain specified in the base href, but b.html, which has a local copy, would also be linked to the one on the domain.

I'll discuss the mirror more with Dan in my meeting tonight. I like the download-everything-as-is approach, with link conversion being an option. This covers the most important use cases. We can add more options later, I guess.
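A small Python sketch of the base href problem described above. It only mimics how a browser resolves relative links against a base URL, using urllib.parse.urljoin; the base URL and the resolve_with_base helper are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical original source URL placed in <base href="...">
BASE_HREF = "http://example.com/a/b/"


def resolve_with_base(link, base_href=BASE_HREF):
    """Resolve a relative link against the base href."""
    return urljoin(base_href, link)


# Assume only b.html was downloaded into the mirror.
downloaded_locally = {"b.html"}

for link in ("a.html", "b.html"):
    status = "local copy exists" if link in downloaded_locally else "not downloaded"
    print(link, "->", resolve_with_base(link), "(" + status + ")")
```

Both links resolve to example.com, so the local copy of b.html is never used. That is why base href alone does not fit the "preserve" case.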
> Cheers,
> Fyodor
_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/