Nmap Development mailing list archives

Re: [RFC] Mirroring in http-fetch


From: Gyanendra Mishra <anomaly.the () gmail com>
Date: Fri, 17 Jul 2015 15:05:55 +0530

Hi,

On Fri, Jul 17, 2015 at 1:28 AM, Fyodor <fyodor () nmap org> wrote:

On Sun, Jul 12, 2015 at 9:36 AM, Gyanendra Mishra <anomaly.the () gmail com>
wrote:

Hi list,

Through this post I wish to discuss the http-mirror implementation in
http-fetch.


Hi Gyani.  Thanks for sending this!  Regarding your ideas:

We hope to discuss the need for a mirror script,


Personally I'm a big fan of having a mirror script.  It's great to be able
to download a site for more effective and faster searching, for archival
purposes, etc.


Great! It would make things like http-grep easier. :)



---Current Implementation--- [1]

--Terms--


To be honest, this redefining of terms for each approach made your
proposal difficult to read.  When "relative URL" and "absolute URL" mean
different things in each of the sections, it can be hard to follow.  It
would probably work better to use different terms for each different
meaning.


Sorry about that.



 * A relative URL is any URL that doesn't have the protocol or the domain
name specified; "/a/b/h.html" and "h.html" are both relative URLs.
 * An absolute URL is a URL with the protocol and the domain name, e.g.
http://example.com/a/b/h.html.
 * A localized URL is a URL that holds the path to the file on the local
file system, e.g. /home/user/Documents/mirror/example.com
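
As a rough illustration of the three terms (a Python sketch for brevity,
not code from http-fetch): the function names are hypothetical, and the
<mirror_root>/<domain>/<path> layout is an assumption based only on the
example path above.

```python
import posixpath
from urllib.parse import urlsplit


def classify_url(url: str) -> str:
    """Return "absolute" or "relative" in the sense defined above."""
    parts = urlsplit(url)
    # Absolute: both the protocol (scheme) and the domain name are present.
    if parts.scheme and parts.netloc:
        return "absolute"
    return "relative"


def localized_url(absolute_url: str, mirror_root: str) -> str:
    """Map an absolute URL to a file system path under mirror_root."""
    parts = urlsplit(absolute_url)
    # Assumed layout: <mirror_root>/<domain>/<path on the server>.
    return posixpath.join(mirror_root, parts.netloc, parts.path.lstrip("/"))


print(classify_url("http://example.com/a/b/h.html"))  # absolute
print(classify_url("/a/b/h.html"))                    # relative
print(classify_url("h.html"))                         # relative
print(localized_url("http://example.com/a/b/h.html",
                    "/home/user/Documents/mirror"))
```

Note that a localized URL cannot be told apart from a relative one by
syntax alone; it only makes sense together with the mirror root.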


I'm not sure why we would ever want the "localized URL"?  It means
everything would break if I move it to a different directory on my
filesystem, or a different machine, and also has minor privacy
implications.  Wouldn't relative URLs using "../" style paths work just as
well and avoid these problems?  Maybe there is some benefit to the
localized URLs that I haven't thought of.


The clone created by the localize argument would not be portable, as you
said. Its main use is to have a mirror that can be accessed without
hosting it on localhost: it would be accessible by simply opening
the files in, say, Google Chrome. The absolute paths are a quick fix for
logical paths like the following.

Say I am in a/b/c/d/ and I want to access a file in b. I would have to write
functions to build paths like ../../file_in_b.html, while hard-coded paths
skip the need for "../../". This is not the main use of the mirror, though.
The mirror would mostly be used after hosting it on localhost.
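
For what it's worth, building the "../../" style path is a small
computation. A minimal Python sketch (not NSE code), assuming both
arguments are site-rooted paths starting with "/"; the function name is
hypothetical:

```python
import posixpath


def relative_link(from_page: str, to_page: str) -> str:
    """Return the path to to_page as seen from the directory of from_page."""
    # Links are resolved against the directory that contains the page.
    start_dir = posixpath.dirname(from_page)
    return posixpath.relpath(to_page, start=start_dir)


# A page in a/b/c/d/ linking to a file in a/b/:
print(relative_link("/a/b/c/d/index.html", "/a/b/file_in_b.html"))
# ../../file_in_b.html
```

Links built this way keep working if the mirror directory is moved,
which the hard-coded file system paths do not.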


* Case 1: "preserve" and "localize" both are nil : The mirroring is over.
All the relative URLs in all webpages are now absolute URLs.

* Case 2: "preserve" = true and "localize" = nil : The script goes
through all the downloaded pages and converts all the absolute URLs to
relative URLs using relations stored in Table 2. The script doesn't touch
the URLs that haven't been downloaded, as doing that would lead to bad and
inaccessible links.

* Case 3: "preserve" = nil and "localize" = true : The script goes
through all the downloaded pages and converts all the absolute URLs to
localized URLs using relations stored in Table 3. Again, all the URLs to
pages that haven't been downloaded are left untouched for the same reason
as stated in Case 2.
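
The three cases can be read as one lookup per link. A Python sketch under
stated assumptions (not the http-fetch code): Table 2 and Table 3 are
modelled as plain dictionaries keyed by absolute URL, the function name is
hypothetical, and raising an error when both options are set is an
assumption, since that combination is not described above.

```python
from typing import Mapping


def rewrite_link(absolute_url: str,
                 preserve: bool,
                 localize: bool,
                 relative_of: Mapping[str, str],
                 localized_of: Mapping[str, str]) -> str:
    """Pick the final form of one link after the download phase."""
    if preserve and localize:
        # Assumption: this combination is not covered by the three cases.
        raise ValueError("preserve and localize are not combined here")
    if preserve:
        # Case 2: absolute -> relative, only for pages that were downloaded.
        return relative_of.get(absolute_url, absolute_url)
    if localize:
        # Case 3: absolute -> localized, only for pages that were downloaded.
        return localized_of.get(absolute_url, absolute_url)
    # Case 1: links stay absolute.
    return absolute_url


table2 = {"http://example.com/a/b/h.html": "/a/b/h.html"}
table3 = {"http://example.com/a/b/h.html":
          "/home/user/Documents/mirror/example.com/a/b/h.html"}

print(rewrite_link("http://example.com/a/b/h.html", True, False, table2, table3))
print(rewrite_link("http://example.com/missing.html", True, False, table2, table3))
```

A URL that is missing from the table comes back unchanged, which matches
the "don't touch what wasn't downloaded" rule in Cases 2 and 3.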


What do other mirroring tools such as wget and curl do by default?  And by
URLs, are we talking about both <a href> style links to other pages as well
as embedded content such as <img src>?

You've probably thought about it more than me, but my initial thought would be:

o The default would not touch pages at all, but an option would be
available which converts the links to relative paths ("../" ones, not
ones starting with "/") if they are to pages/resources we have downloaded,
and would convert other (non-downloaded) URLs to absolute ones.
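
A sketch of what such a conversion option might do for a single link, in
Python rather than NSE Lua. The assumptions: the mirror covers one host,
query strings and fragments are ignored, and "downloaded" is a set of
absolute URLs that were fetched. The function name is hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def convert_link(link: str, page_url: str, downloaded: set) -> str:
    """Relative "../" path for downloaded targets, absolute URL otherwise."""
    # Resolve the link the way a browser would, against the page's own URL.
    target = urljoin(page_url, link)
    if target not in downloaded:
        # Not mirrored: point at the original site with an absolute URL.
        return target
    page_dir = posixpath.dirname(urlsplit(page_url).path) or "/"
    target_path = urlsplit(target).path or "/"
    # Mirrored: express the target relative to the page's directory.
    return posixpath.relpath(target_path, start=page_dir)


downloaded = {"http://example.com/a/b/h.html"}
page = "http://example.com/a/b/c/d/page.html"
print(convert_link("/a/b/h.html", page, downloaded))  # ../../h.html
print(convert_link("other.html", page, downloaded))
# http://example.com/a/b/c/d/other.html
```

The same function would apply to both <a href> and <img src> values,
since it only looks at the URL string.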

I can also see an argument for swapping those so the default is to convert
and there is also the option to preserve.  It might be nice if we can just
have one option with the common cases rather than require them to use the
right combinations of four options.  But I guess that depends if there is
really a need to operate all the options independently.


Wget, one of the more popular tools for mirroring, currently downloads
everything as is and converts links only when "--convert-links" is specified.


Also, keep in mind that pages can specify a base href tag (
http://www.w3schools.com/tags/tag_base.asp) to specify what relative URLs
are based on.  So rather than convert all the relative URLs to absolute
ones, it may be better to just specify a base href with the original source
URL.
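
To make the base href idea concrete, here is a naive Python sketch that
adds the tag to a saved page. It uses a regular expression instead of a
real HTML parser, leaves pages without a <head> tag unchanged, and the
function name is hypothetical.

```python
import html
import re


def add_base_href(page: str, source_url: str) -> str:
    """Insert <base href="source_url"> right after the opening <head> tag."""
    if re.search(r"<base\b", page, flags=re.IGNORECASE):
        # The page already declares its own base; leave it alone.
        return page
    tag = '<base href="%s">' % html.escape(source_url, quote=True)
    return re.sub(r"(<head\b[^>]*>)",
                  lambda m: m.group(1) + tag,
                  page, count=1, flags=re.IGNORECASE)


saved = "<html><head><title>t</title></head><body></body></html>"
print(add_base_href(saved, "http://example.com/a/b/h.html"))
```

With the tag in place, every relative link in the saved page resolves
against the original site rather than against the local copy.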


Base href is very useful if we want to convert all links to absolute ones.
But consider the "preserve" option (the one that keeps links as is for a
copy that is hosted on localhost): say some file "a.html" isn't downloaded
while "b.html" is. Base href would link a.html to the right page on the
domain given in the base href, but b.html, which has a local copy, would
also be linked to the one on the domain.

I'll discuss the mirror more with Dan in my meeting tonight. I like the
"download everything as is" approach, with link conversion as an option.
This covers the most important use cases. We can add more options later, I
guess.



Cheers,
Fyodor


_______________________________________________
Sent through the dev mailing list
https://nmap.org/mailman/listinfo/dev
Archived at http://seclists.org/nmap-dev/
