
How to make wget exclude a particular link when mirroring

Suramya Tomar [security at suramya.com]


Wed, 04 Feb 2009 18:36:50 +0530

Hey Everyone,

I am trying to mirror an Invision Power Board forum locally on my system (with permission from the admin) using wget, and I am having issues.

When I start downloading, wget visits each and every link and makes a local copy (like it's supposed to), but in the process it also visits the "Log out" link, which logs me out of the site, and then I am unable to download the remaining links.

So I need to figure out how to exclude the Logout link from the process. The logout link looks like www.website.com/index.php?act=Login&CODE=03, so I tried the following:

wget -X "*CODE*" --mirror --load-cookies=/var/www/cookiefile.txt \
    http://www.website.com

but it didn't work.

I can't exclude index.php itself because all the links are based on index.php with different parameters.

I have tried searching the web but didn't find anything relevant.

Any ideas on how to do it?

Thanks,

Suramya




Ben Okopnik [ben at linuxgazette.net]


Wed, 4 Feb 2009 08:54:07 -0500

On Wed, Feb 04, 2009 at 06:36:50PM +0530, Suramya Tomar wrote:

> 
> When I start downloading wget visits each and every link and makes a 
> local copy (like its supposed to) but in this process it also visits the 
> "Log out" link which logs me out from the site and then I am unable to 
> download the remaining links.
> 
> So I need to figure out how to exclude the Logout link from the process. 
> The logout link looks like: www.website.com/index.php?act=Login&CODE=03

Seems like the '-R' option should do it. From the "wget" man page:

  -R rejlist --reject rejlist
	  Specify comma-separated lists of file name suffixes or patterns to
      accept or reject (@pxref{Types of Files} for more details).
-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Suramya Tomar [security at suramya.com]


Wed, 04 Feb 2009 19:38:11 +0530

Hi Ben,

>> So I need to figure out how to exclude the Logout link from the process. 
>> The logout link looks like: www.website.com/index.php?act=Login&CODE=03
> 
> Seems like the '-R' should do it. From the "wget" man page:
> 
> ``
>   -R rejlist --reject rejlist
> 	  Specify comma-separated lists of file name suffixes or patterns to
>       accept or reject (@pxref{Types of Files} for more details).
> ''

Unfortunately that didn't work. It still logged me out.

According to: http://www.mail-archive.com/wget@sunsite.dk/msg10956.html

-------
As I currently understand it from the code, at least for Wget 1.11,
matching is against the _URL_'s filename portion (and only that portion:
no query strings, no directories) when deciding whether it should
download something through a recursive descent (the relevant spot in the
code is in recur.c, marked by a comment starting "6. Check for
acceptance/rejection rules.").
-------

Is there any other way to do this? Maybe some other tool?

- Suramya




Ben Okopnik [ben at linuxgazette.net]


Wed, 4 Feb 2009 09:28:46 -0500

On Wed, Feb 04, 2009 at 07:38:11PM +0530, Suramya Tomar wrote:

> Hi Ben,
> 
> >> So I need to figure out how to exclude the Logout link from the process. 
> >> The logout link looks like: www.website.com/index.php?act=Login&CODE=03
> > 
> > Seems like the '-R' should do it. From the "wget" man page:
> > 
> > ``
> >   -R rejlist --reject rejlist
> > 	  Specify comma-separated lists of file name suffixes or patterns to
> >       accept or reject (@pxref{Types of Files} for more details).
> > ''
> 
> Unfortunately that didn't work. It still logged me out.

What didn't work? What, exactly, did you try?

> According to: http://www.mail-archive.com/wget@sunsite.dk/msg10956.html
> 
> -------
> As I currently understand it from the code, at least for Wget 1.11,
> matching is against the _URL_'s filename portion (and only that portion:
> no query strings, no directories) when deciding whether it should
> download something through a recursive descent (the relevant spot in the
> code is in recur.c, marked by a comment starting "6. Check for
> acceptance/rejection rules.").
> -------

I've just looked at the source, and it seems to me that the rule immediately above that one contradicts this.

  /* 5. If the file does not match the acceptance list, or is on the
     rejection list, chuck it out.  The same goes for the directory
     exclusion and inclusion lists.  */

I didn't dig into the code (I've forgotten C to such an extent that even reading it is very difficult for me), but this seems like a rejection mechanism that runs in addition to the one in #6.

> Is there any other way to do this? Maybe some other tool?

No, you can't use another tool. Since this is a decision that is made internally by "wget", you need to instruct "wget" itself to make that decision.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Kapil Hari Paranjape [kapil at imsc.res.in]


Wed, 4 Feb 2009 20:12:41 +0530

Hello,

On Wed, 04 Feb 2009, Suramya Tomar wrote:

> > ``
> >   -R rejlist --reject rejlist
> > 	  Specify comma-separated lists of file name suffixes or patterns to
> >       accept or reject (@pxref{Types of Files} for more details).
> > ''
> 
> Unfortunately that didn't work. It still logged me out.

1. The "info" pages have more information than the man pages (I think).

2. In particular, note that the -R argument is treated as a pattern if it contains the '?' character, so you will need to escape that character. Did you?
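
For concreteness, the escaping Kapil is describing might look something like this (the pattern is purely illustrative, and whether it actually helps here is what the rest of the thread goes on to test):

  wget -R 'index.php\?act=Login*' --mirror \
      --load-cookies=/var/www/cookiefile.txt http://www.website.com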

Kapil. --




Suramya Tomar [security at suramya.com]


Wed, 04 Feb 2009 20:39:23 +0530

Hi Ben,

>> Unfortunately that didn't work. It still logged me out.
> 
> What didn't work? What, exactly, did you try?

The command I tried was:

wget -R "*CODE*" --mirror --load-cookies=/var/www/cookiefile.txt \
    http://www.website.com/index.php

The download started fine, but as soon as it hit the logout link, I got logged out, and the remaining pages it downloaded kept showing me the login page instead of the content.

> No, you can't use another tool. Since this is a decision that is made
> internally by "wget", you need to instruct "wget" itself to make that
> decision.

What I meant was: if wget doesn't support this, is there some other program that I can use to mirror the site?

Thanks, Suramya




Ben Okopnik [ben at linuxgazette.net]


Wed, 4 Feb 2009 10:10:08 -0500

On Wed, Feb 04, 2009 at 08:12:41PM +0530, Kapil Hari Paranjape wrote:

> Hello,
> 
> On Wed, 04 Feb 2009, Suramya Tomar wrote:
> > > ``
> > >   -R rejlist --reject rejlist
> > > 	  Specify comma-separated lists of file name suffixes or patterns to
> > >       accept or reject (@pxref{Types of Files} for more details).
> > > ''
> > 
> > Unfortunately that didn't work. It still logged me out.
> 
> 1. The "info" pages have more information than the man pages (I
> think). 
> 
> 2. In particular, note that -R something is treated as a pattern if
> it contains the ? character so you will need to escape that
> character. Did you?

Actually, except in odd cases, that specific one won't matter - since '?' in the shell means 'any single character'... which would include '?'. It can, however, give you false positives: i.e., "foo?bar" will match the literal string specified, but it will also match "foo=bar" and so on.

In any case - I've just tested this out, and '-R' does indeed work as it should, at least on recursive retrievals. First, I made up a little CGI proglet, something that would return actual output when given parameters, and placed it in my WEBROOT/test directory:

#!/usr/bin/perl -w
# Created by Ben Okopnik on Wed Feb  4 09:32:13 EST 2009
use CGI qw/:standard :cgi-lib/;
 
my %params = %{Vars()};
 
print header, start_html,
    map({"$_ => $params{$_}<br>"} keys %params), end_html;

Then, I created a file containing a list of URLs to download:

http://localhost/test/foo.cgi?a=b
http://localhost/test/foo.cgi?c=d
http://localhost/test/foo.cgi?foo=bar

Last, I ran "wget" with the appropriate options: '-nd' for "no directories" - I want to see the downloaded files in the current dir; '-r' for recursive - accept and reject lists only work with recursive retrievals; '-R foo=bar' - to ignore all URLs containing that string; and '-i input_file' to read in the above URLs.

 wget -nd -rR 'foo=bar' -i input_file

Result:

ben@Tyr:/tmp/testwget$ ls -l
total 12
-rw-r--r-- 1 ben ben 355 2009-02-04 10:01 foo.cgi?a=b
-rw-r--r-- 1 ben ben 355 2009-02-04 10:01 foo.cgi?c=d
-rw-r--r-- 1 ben ben 106 2009-02-04 09:46 input_file

Three URLs, two downloaded files.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Ben Okopnik [ben at linuxgazette.net]


Wed, 4 Feb 2009 10:20:35 -0500

On Wed, Feb 04, 2009 at 08:39:23PM +0530, Suramya Tomar wrote:

> Hi Ben,
> 
> >> Unfortunately that didn't work. It still logged me out.
> > 
> > What didn't work? What, exactly, did you try?
> 
> The command I tried was:
> 
> wget -R "*CODE*" --mirror --load-cookies=/var/www/cookiefile.txt 
> http://www.website.com/index.php

Oh... dear. Suramya, you're not supposed to retype what you entered; you should always copy and paste your original entry. After being here in TAG for so long, I figured you'd know that; we regularly ding people for doing that.

Nobody wants to troubleshoot retyping errors or deal with the poster skipping the "non-important" parts - and that's exactly what happened here. I'm sure that you didn't actually use the string '*CODE*' in your original test - and the one thing that I actually needed to know was what you did type there. Since you retyped, I have to ask again.

> > No, you can't use another tool. Since this is a decision that is made
> > internally by "wget", you need to instruct "wget" itself to make that
> > decision.
> 
> What I meant was, if wget doesn't support this then is there some other 
> program that I can use to mirror the site?

FTP or 'rsync', since you have the admin's permission? Those wouldn't be doing any interpretation; they'd just download the files.
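
If the admin is willing to hand out a shell account or rsync access, a straight pull would do it; a minimal sketch (the user, host, and document-root path are made up for illustration):

  rsync -avz admin@www.website.com:/var/www/forum/ ./forum-mirror/

Note that for a dynamic forum this copies the PHP sources and attachments rather than the rendered pages.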

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Suramya Tomar [security at suramya.com]


Wed, 04 Feb 2009 20:54:35 +0530

Hi Ben,

> Last, I ran "wget" with the appropriate options: '-nd' for "no
> directories" - I want to see the downloaded files in the current dir;
> '-r' for recursive - accept and reject lists only work with recursive
> retrievals; '-R foo=bar' - to ignore all URLs containing that string;
> and '-i input_file' to read in the above URLs.

It worked when I used the -r option instead of --mirror. I thought that the -R option would override the --mirror option, but I guess that's not the case.

The command I used to start the download was:

wget -R "*CODE*" -r --load-cookies=/var/www/cookiefile.txt \
    http://www.website.com/index.php

It correctly rejected all URLs with 'CODE' in them.

Thanks for the help.

- Suramya




Suramya Tomar [security at suramya.com]


Wed, 04 Feb 2009 21:01:43 +0530

Hi Ben,

> Oh... dear. Suramya, you're not supposed to retype what you entered; you
> should always copy and paste your original entry. After being here in

I did copy and paste. All I changed was the name of the website.

> Nobody wants to troubleshoot retyping errors or deal with the poster
> skipping the "non-important" parts - and that's exactly what happened
> here. I'm sure that you didn't actually use the string '*CODE*' in your
> original test - and the one thing that I actually needed to know was
> what you did type there. Since you retyped, I have to ask again.

Actually, that's exactly what I typed, because I wanted it to skip links like:

index.php?act=Msg&CODE=01

and keep links like:

index.php?showforum=1

so I used CODE as my skip term.

Thanks,

Suramya




Suramya Tomar [security at suramya.com]


Wed, 04 Feb 2009 21:08:18 +0530

Hey,

> It worked when I used it -r option instead of --mirror. I thought that 
> the -R option would override the --mirror option but I guess thats not 
> the case.

I spoke a bit too soon. It downloaded the logout link to my system and then removed it, so in the process it still logged me out of the system. :(

-------
Saving to: `www.website.com/index.php?act=Search&CODE=getnew'
 
     [   <=>                                          ] 19,492 
26.9K/s   in 0.7s
 
2009-02-04 21:01:19 (26.9 KB/s) - 
`www.website.com/index.php?act=Search&CODE=getnew' saved [19492]
 
Removing www.website.com/index.php?act=Search&CODE=getnew since it 
should be rejected.
--------

I am going to try using httrack and see if that works better.

Thanks,

Suramya




Ben Okopnik [ben at linuxgazette.net]


Wed, 4 Feb 2009 22:20:25 -0500

On Wed, Feb 04, 2009 at 08:54:35PM +0530, Suramya Tomar wrote:

> Hi Ben,
> 
> > Last, I ran "wget" with the appropriate options: '-nd' for "no
> > directories" - I want to see the downloaded files in the current dir;
> > '-r' for recursive - accept and reject lists only work with recursive
> > retrievals; '-R foo=bar' - to ignore all URLs containing that string;
> > and '-i input_file' to read in the above URLs.
> 
> It worked when I used it -r option instead of --mirror. I thought that 
> the -R option would override the --mirror option but I guess thats not 
> the case.

Odd, since '--mirror' includes '-r' - at least according to the man page.

> The command I used to start the download was:
> 
> wget -R "*CODE*" -r --load-cookies=/var/www/cookiefile.txt 
> http://www.website.com/index.php
> 
> It correctly rejected all URL's with 'CODE' in them.

Ah - my confusion, then. In theory, typing 'CODE' would have worked just fine; the pattern match succeeds if the pattern is anywhere within the URL. This, of course, means that it's better to be overly specific; otherwise, you end up ignoring more than you expected.

> Thanks for the help.

You're welcome - glad it worked for you!

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Ben Okopnik [ben at linuxgazette.net]


Wed, 4 Feb 2009 22:21:12 -0500

On Wed, Feb 04, 2009 at 09:01:43PM +0530, Suramya Tomar wrote:

> Hi Ben,
> 
> > Oh... dear. Suramya, you're not supposed to retype what you entered; you
> > should always copy and paste your original entry. After being here in
> 
> I did copy and paste. All I changed was the name of the website.

Ah. That was what confused me.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Francis Daly [francis at daoine.org]


Thu, 5 Feb 2009 13:56:08 +0000

On Wed, Feb 04, 2009 at 06:36:50PM +0530, Suramya Tomar wrote:

Hi there,

> I am trying to mirror an Invision Powerboard forum locally on my system 
> (With permission from the admin) using wget and I am having issues.
> So I need to figure out how to exclude the Logout link from the process. 
> The logout link looks like: www.website.com/index.php?act=Login&CODE=03
> So I tried the following:
> 
> wget -X "*CODE*" --mirror --load-cookies=/var/www/cookiefile.txt 
> http://www.website.com
> 
> but it didn't work.

As the rest of the thread has shown, wget doesn't let you do this.

(The short version is: wget considers the url to be

scheme://host/directory/file?query#fragment

-X filters on "directory", -R filters on "file", nothing filters on "query", which is where you need it.)

So, the choices are:

* use something instead of wget. You mentioned you'll try httrack. I don't have a better suggestion.

* use something as well as wget. You could use a proxy server which you configure to prevent access to the specific "logout" url, so that all other requests go to the origin server (a sketch follows below).

* reconsider the original question. What is it you want to achieve? The best local version of the website is probably a similarly-configured web server with the same content on the backend. "Access to the useful information" could just be a dump of the backend content in a suitable file-based format. Straight http access to the front-end web server will probably not give you that content easily.

Trying to mirror a dynamic website into local files is not always easy. You end up with weird filenames and potentially duplicated content. If you can get it to work for your case, go for it; but I'd be slow to try it (again :-() unless I was sure it was the best method.
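
As a sketch of the proxy idea above (assuming Squid, with the ACL names and the regex invented purely for illustration), the relevant squid.conf lines might look like:

  # block only the logout URL; pass everything else through
  acl local_client src 127.0.0.1/32
  acl forum_logout url_regex -i act=Login&CODE
  http_access deny forum_logout
  http_access allow local_client

wget can then be pointed at the proxy through the usual environment variable:

  http_proxy=http://127.0.0.1:3128/ wget --mirror \
      --load-cookies=/var/www/cookiefile.txt http://www.website.com/index.php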

Good luck,

f

-- 
Francis Daly        francis@daoine.org




Ben Okopnik [ben at linuxgazette.net]


Thu, 5 Feb 2009 09:50:01 -0500

On Thu, Feb 05, 2009 at 01:56:08PM +0000, Francis Daly wrote:

> 
> (The short version is: wget considers the url to be
> 
> scheme://host/directory/file?query#fragment
> 
> -X filters on "directory", -R filters on "file", nothing filters on
> "query", which is where you need it.)

'-R' definitely filters on both "file" and "query". Again, using the CGI script that I put together earlier, and a list of URLs that looks like this ('Tyr' is my local hostname):

http://localhost/test/foo.cgi?a=b
http://Tyr/test/foo.cgi?c=d
http://localhost/test/foo.cgi?xyz=zap

ben@Tyr:/tmp/test_wget$ wget -q -r -nd -nv -R 'foo.cgi*' -i list; ls -1; rm -f foo*
list

Filtering on the filename succeeds - since 'foo.cgi*' matches all the URLs, none are retrieved.

ben@Tyr:/tmp/test_wget$ wget -q -r -nd -nv -R 'xyz=zap' -i list; ls -1; rm -f foo*
foo.cgi?a=b
foo.cgi?c=d
list

So does filtering on the query; since "xyz=zap" matches the last URL, only the first two are retrieved.

I think your suggestion of setting up a proxy server is excellent, though. If there are no tools that will do this kind of precise filtering, that would be the right answer.

On a slightly different topic, given Suramya's experience (i.e., "wget" still retrieves the '-R' excluded file but deletes it afterwards), it would make sense to file a bug report with the 'wget' maintainers. That method definitely fails the "least surprise" test.

http://www.faqs.org/docs/artu/ch11s01.html

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Francis Daly [francis at daoine.org]


Thu, 5 Feb 2009 16:13:10 +0000

On Thu, Feb 05, 2009 at 09:50:01AM -0500, Ben Okopnik wrote:

> On Thu, Feb 05, 2009 at 01:56:08PM +0000, Francis Daly wrote:

Hi there,

> > (The short version is: wget considers the url to be
> > 
> > scheme://host/directory/file?query#fragment
> > 
> > -X filters on "directory", -R filters on "file", nothing filters on
> > "query", which is where you need it.)
> 
> '-R' definitely filters on both "file" and "query". 

No, it doesn't.

Or at least: it doesn't before deciding whether or not to get the url.

Check your web server logs.

> ```
> ben@Tyr:/tmp/test_wget$ wget -q -r -nd -nv -R 'xyz=zap' -i list; ls -1; rm -f foo*
> foo.cgi?a=b
> foo.cgi?c=d
> list
> '''
> 
> So does filtering on the query; since "xyz=zap" matches the last URL,
> only the first two are retrieved.

Or all are retrieved, and then some of the stored files are deleted. That is what seems to happen when I try the same test, according to access.log and a "grep unlink" on the strace output.

And in the original case, it is the GET to the server which induces the "logout" (presumably the invalidation of the current cookie).

Arguably, using a GET for "logout" is unwise, since it effectively changes state somewhere. But it is idempotent -- make the same request repeatedly and nothing extra should happen -- so is justifiable on that basis.

The combination of GET for logout and the somewhat-unexpected observed wget behaviour does seem to break this use-case, sadly.

f

-- 
Francis Daly        francis@daoine.org




Neil Youngman [Neil.Youngman at youngman.org.uk]


Thu, 5 Feb 2009 16:22:50 +0000

On Thursday 05 February 2009 14:50:01 Ben Okopnik wrote:

> On a slightly different topic, given Suramya's experience (i.e., "wget"
> still retrieves the '-R' excluded file but deletes it afterwards), it
> would make sense to file a bug report with the 'wget' maintainers. That
> method definitely fails the "least surprise" test.

This could have unfortunate consequences.

Consider somebody trying to exclude large files, e.g. .iso files, because they have limited bandwidth and/or per MB bandwidth charges.

Neil




Ben Okopnik [ben at linuxgazette.net]


Thu, 5 Feb 2009 11:28:55 -0500

On Thu, Feb 05, 2009 at 04:13:10PM +0000, Francis Daly wrote:

> On Thu, Feb 05, 2009 at 09:50:01AM -0500, Ben Okopnik wrote:
> > On Thu, Feb 05, 2009 at 01:56:08PM +0000, Francis Daly wrote:
> 
> Hi there,
> 
> > > (The short version is: wget considers the url to be
> > > 
> > > scheme://host/directory/file?query#fragment
> > > 
> > > -X filters on "directory", -R filters on "file", nothing filters on
> > > "query", which is where you need it.)
> > 
> > '-R' definitely filters on both "file" and "query". 
> 
> No, it doesn't.
> 
> Or at least: it doesn't before deciding whether or not to get the url.

I think we can agree that 1) "wget" does apply the filter to the 'file' and 'query' parts of the URL as evidenced by the results - but 2) does the wrong thing when processing those filter results.

It "doesn't", in that it fails to stop the retrieval. It "does", in that the retrieved file is not present on your system after "wget" is done. The real answer here is that the method is broken; "does" and "doesn't" are not nuanced enough to fully describe the actual problem.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Ben Okopnik [ben at linuxgazette.net]


Thu, 5 Feb 2009 11:32:23 -0500

On Thu, Feb 05, 2009 at 04:22:50PM +0000, Neil Youngman wrote:

> On Thursday 05 February 2009 14:50:01 Ben Okopnik wrote:
> > On a slightly different topic, given Suramya's experience (i.e., "wget"
> > still retrieves the '-R' excluded file but deletes it afterwards), it
> > would make sense to file a bug report with the 'wget' maintainers. That
> > method definitely fails the "least surprise" test.
> 
> This could have unfortunate consequences. 

I assume you mean "this bug", not "filing this bug report". :)

> Consider somebody trying to exclude large files, e.g. .iso files, because they 
> have limited bandwidth and/or per MB bandwidth charges.

Good point!

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Francis Daly [francis at daoine.org]


Thu, 5 Feb 2009 17:19:58 +0000

On Thu, Feb 05, 2009 at 11:28:55AM -0500, Ben Okopnik wrote:

> On Thu, Feb 05, 2009 at 04:13:10PM +0000, Francis Daly wrote:
> > On Thu, Feb 05, 2009 at 09:50:01AM -0500, Ben Okopnik wrote:
> > > On Thu, Feb 05, 2009 at 01:56:08PM +0000, Francis Daly wrote:

Hi there,

> > > > (The short version is: wget considers the url to be
> > > > 
> > > > scheme://host/directory/file?query#fragment
> > > > 
> > > > -X filters on "directory", -R filters on "file", nothing filters on
> > > > "query", which is where you need it.)
> > > 
> > > '-R' definitely filters on both "file" and "query". 
> > 
> > No, it doesn't.
> > 
> > Or at least: it doesn't before deciding whether or not to get the url.
> 
> I think we can agree that 1) "wget" does apply the filter to the
> 'file' and 'query' parts of the URL as evidenced by the results - but 2)
> does the wrong thing when processing those filter results.

Yes to 1); maybe to 2).

I think we're actually testing two different things. The original case was "fetch this url, and get everything in it recursively". In that case, -X and -R should work the way they are supposed to, but the manual does say that HTML files will be fetched anyway. (It doesn't immediately or obviously say what an HTML file is, though.)

This case is "here is a list of urls; please fetch them". It is not unreasonable for wget to believe that request, and possibly apply the -R/-X rules to anything fetched recursively from them (or patch up afterwards, as it appears to do).

But the possibilities are numerous, and the "right" behaviour is unclear to me, so I'll leave it at that and read the fine manual later.

All the best,

f

-- 
Francis Daly        francis@daoine.org




Francis Daly [francis at daoine.org]


Thu, 5 Feb 2009 17:31:31 +0000

On Thu, Feb 05, 2009 at 04:22:50PM +0000, Neil Youngman wrote:

> This could have unfortunate consequences. 
> 
> Consider somebody trying to exclude large files, e.g. .iso files, because they 
> have limited bandwidth and/or per MB bandwidth charges.

I suspect that it does work fine for someone who says "-R iso" when recursively fetching a different url (or else surely someone would have noticed!).

I suspect that it may not work as maybe-hoped if you try

  wget -R iso http://example.com/that.iso

The obvious easy "fix" is "don't do that then".

f

-- 
Francis Daly        francis@daoine.org




Ben Okopnik [ben at linuxgazette.net]


Thu, 5 Feb 2009 15:10:18 -0500

On Thu, Feb 05, 2009 at 05:31:31PM +0000, Francis Daly wrote:

> On Thu, Feb 05, 2009 at 04:22:50PM +0000, Neil Youngman wrote:
> 
> > This could have unfortunate consequences. 
> > 
> > Consider somebody trying to exclude large files, e.g. .iso files, because they 
> > have limited bandwidth and/or per MB bandwidth charges.
> 
> I suspect that it does work fine for someone who says "-R iso" when
> recursively fetching a different url (or else surely someone would
> have noticed!).

Francis, you've lost me completely. The behavior of "wget -rR <pattern>" is to do exactly as Neil describes (i.e., if "wget" was asked to ignore some file in a recursive download, it would first fetch it, then delete it.) Why would you expect it not to do that when that's exactly the problem that was demonstrated here?

ben@Tyr:/tmp/test_wget$ mkdir /var/www/test/abc
ben@Tyr:/tmp/test_wget$ cd $_
ben@Tyr:/var/www/test/abc$ head -c 100M /dev/full > large.iso
ben@Tyr:/var/www/test/abc$ head -c 10M /dev/full > medium.iso
ben@Tyr:/var/www/test/abc$ head -c 1M /dev/full > small.iso
ben@Tyr:/var/www/test/abc$ cd /tmp/test_wget/
ben@Tyr:/tmp/test_wget$ wget -nd -rR 'large*' http://Tyr/test/abc/{large,medium,small}.iso
--15:04:12--  http://tyr/test/abc/large.iso
           => `large.iso'
Resolving tyr... 127.0.0.1
Connecting to tyr|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104,857,600 (100M) [application/x-iso9660-image]
 
100%[=====================================================================================>] 104,857,600    6.86M/s    ETA 00:00
 
15:04:26 (7.58 MB/s) - `large.iso' saved [104857600/104857600]
 
Removing large.iso since it should be rejected.
--15:04:26--  http://tyr/test/abc/medium.iso
           => `medium.iso'
 
[ snipped ]
 
FINISHED --15:04:28--
Downloaded: 116,391,936 bytes in 3 files
ben@Tyr:/tmp/test_wget$ ls
medium.iso  small.iso

The "Removing large.iso since it should be rejected" line makes it pretty obvious: "wget" does the wrong thing in recursive fetches when asked to reject a file.

> I suspect that it may not work as maybe-hoped if you try
> 
>   wget -R iso http://example.com/that.iso
> 
> The obvious easy "fix" is "don't do that then".

'-R' only works on recursive fetches anyway.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Francis Daly [francis at daoine.org]


Thu, 5 Feb 2009 21:15:41 +0000

On Thu, Feb 05, 2009 at 03:10:18PM -0500, Ben Okopnik wrote:

> On Thu, Feb 05, 2009 at 05:31:31PM +0000, Francis Daly wrote:
> > On Thu, Feb 05, 2009 at 04:22:50PM +0000, Neil Youngman wrote:
> > I suspect that it does work fine for someone who says "-R iso" when
> > recursively fetching a different url (or else surely someone would
> > have noticed!).
> 
> Francis, you've lost me completely. The behavior of "wget -rR <pattern>"
> is to do exactly as Neil describes (i.e., if "wget" was asked to ignore
> some file in a recursive download, it would first fetch it, then delete
> it.) Why would you expect it not to do that when that's exactly the
> problem that was demonstrated here? 

I guess that in your examples, I'm not seeing a recursive download.

Just sticking in "-r" doesn't make it recurse -- unless the url content is html and contains links through which wget can recurse.

For example...

> ben@Tyr:/tmp/test_wget$ mkdir /var/www/test/abc
> ben@Tyr:/tmp/test_wget$ cd $_
> ben@Tyr:/var/www/test/abc$ head -c 100M /dev/full > large.iso
> ben@Tyr:/var/www/test/abc$ head -c 10M /dev/full > medium.iso
> ben@Tyr:/var/www/test/abc$ head -c 1M /dev/full > small.iso
> ben@Tyr:/var/www/test/abc$ cd /tmp/test_wget/

Here, try

 wget -np -nd -rR 'large*' http://Tyr/test/abc/

(assuming you have directory indexing enabled, so that the content of that url includes some "a href" links to the .iso files).

(-np means "don't include the parent directory", which is worth including here.)

Now wget will fetch the url given on the command line, will check it for links to which to recurse, will find three, will discard that one where the filename matches the pattern given, and will only get the other two.

(And if you have a typical apache setup, will also save a bunch of files with names like index.html?C=N;O=D)

> ben@Tyr:/tmp/test_wget$ wget -nd -rR 'large*' http://Tyr/test/abc/{large,medium,small}.iso
> --15:04:12--  http://tyr/test/abc/large.iso

The request to wget is "get these three urls, and don't get ones with a filename like large". It gets the three urls. It then deletes the large one. Since it had explicitly been told to get that one, I'm not sure deleting it is ideal, but it is not unreasonable.

What my request above is, is "get this url, and get everything it links to. But don't get ones with a filename like large". And it gets the url, identifies the links, discards the ones that match the pattern (unless they also look like html-containing urls), and only fetches the others.

> The "Removing large.iso since it should be rejected" line makes it
> pretty obvious: "wget" does the wrong thing in recursive fetches when
> asked to reject a file.

I think this case is user error. If you don't want wget to fetch large* files, don't explicitly tell it to fetch large.iso.

I'm happy that wget does the right thing in recursive fetches when asked to reject a file extension or pattern. To me, the later "get this explicit url" overrides the previous "don't get ones that match this" instruction. Any recursively-sought urls that match large* are correctly not requested.

I'm also happy to accept that the -R option only refers to the filename, not the query string, in recursive queries. The behaviour for explicitly-named urls is a bit confusing, as it looks like it deletes based on the saved filename.

But having looked through the info pages, the observed behaviour matches my (current) expectations.

f

-- 
Francis Daly        francis@daoine.org




Predrag Ivanovic [predivan at nadlanu.com]


Thu, 5 Feb 2009 22:48:29 +0100

On Wed, 04 Feb 2009 19:38:11 +0530 Suramya Tomar wrote:

>Hi Ben,
>
>>> So I need to figure out how to exclude the Logout link from the process. 
>>> The logout link looks like: www.website.com/index.php?act=Login&CODE=03
>
>Is there any other way to do this? Maybe some other tool?
>
>- Suramya

Have you tried Pavuk, http://pavuk.sourceforge.net/?

From its man page (http://pavuk.sourceforge.net/man.html)

" You can use regular expressions to help pavuk select and filter content[...]"

HTH

Pedja

-- 
 not approved by the FCC




Ben Okopnik [ben at linuxgazette.net]


Fri, 6 Feb 2009 14:19:45 -0500

On Thu, Feb 05, 2009 at 09:15:41PM +0000, Francis Daly wrote:

> On Thu, Feb 05, 2009 at 03:10:18PM -0500, Ben Okopnik wrote:
> > On Thu, Feb 05, 2009 at 05:31:31PM +0000, Francis Daly wrote:
> > > On Thu, Feb 05, 2009 at 04:22:50PM +0000, Neil Youngman wrote:
> 
> > > I suspect that it does work fine for someone who says "-R iso" when
> > > recursively fetching a different url (or else surely someone would
> > > have noticed!).
> > 
> > Francis, you've lost me completely. The behavior of "wget -rR <pattern>"
> > is to do exactly as Neil describes (i.e., if "wget" was asked to ignore
> > some file in a recursive download, it would first fetch it, then delete
> > it.) Why would you expect it not to do that when that's exactly the
> > problem that was demonstrated here? 
> 
> I guess that in your examples, I'm not seeing a recursive download.

"wget" certainly sees it as a request for a recursive download; '-R' does nothing without that '-r'. That behavior, as specified, is broken - and whether it's possible to make it succeed in some other type of recursive retrieval isn't the issue: Suramya's problem comes from exactly this misbehavior of "wget".

> Just sticking in "-r" doesn't make it recurse -- unless the url content
> is html and contains links through which wget can recurse.

Assuming that this is right, how would that help? In Suramya's case, the "Logout" link is a link - and "wget" does try to traverse it. The fact that it will remove the results from its blacklist later doesn't prevent the problem from happening.

> Now wget will fetch the url given on the command line, will check it for
> links to which to recurse, will find three, will discard that one where
> the filename matches the pattern given, and will only get the other two.

In the scenario you proposed, you're right - but it doesn't help Suramya's problem, as you saw. It still traverses the link, and only then removes the bits from its blacklist. Which brings us back to the original problem.

> What my request above is is "get this url, and get everything it links
> to. But don't get ones with a filename like large". And it gets the url,
> identifies the links, discards the ones that match the pattern
> (unless they also look like html-containing urls), and only fetches the others.
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
And that is the problem in a nutshell.

> [...] having looked through the info pages, the observed behaviour matches
> my (current) expectations.

Perhaps your reading comprehension is much higher than mine, but I didn't find either the man page or the info pages at all informative on this point. In addition to that, and far more damning, is the counterintuitive nature of the '-R' operation: it doesn't do the obvious, the least-surprising thing.

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




clarjon1 [clarjon1 at gmail.com]


Fri, 6 Feb 2009 14:32:38 -0500

On Wed, Feb 4, 2009 at 8:54 AM, Ben Okopnik <ben@linuxgazette.net> wrote:

> On Wed, Feb 04, 2009 at 06:36:50PM +0530, Suramya Tomar wrote:
>>
>> When I start downloading wget visits each and every link and makes a
>> local copy (like its supposed to) but in this process it also visits the
>> "Log out" link which logs me out from the site and then I am unable to
>> download the remaining links.
>>
>> So I need to figure out how to exclude the Logout link from the process.
>> The logout link looks like: www.website.com/index.php?act=Login&CODE=03

-SNIP-

Oh, I actually know the answer to this one! I just had this exact same problem, to be honest.

Here's what you do:

You export the cookie from the site, stick it in a cookies.txt file, and have wget load the cookies from there.

Then, you monitor the site's download progress in the terminal and wait until you see it hit the logout page. Once that happens, hit Ctrl-Z to pause it, then log in in your browser again.

Once you're logged in, go back to the terminal window and type "fg" to continue wget.

This took me a couple of hours to figure out, but once I figured it out, boy, was I happy!

I hope this helps!
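
In shell terms, the workaround described above amounts to something like this (the URL and cookie-file name are placeholders):

  wget --mirror --load-cookies=cookies.txt http://www.website.com/index.php
  # watch the output; when the act=Login&CODE=... URL scrolls past:
  #   press Ctrl-Z to suspend wget,
  #   log back in with the browser,
  #   then run 'fg' to resume the suspended job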

-- 
Jon




clarjon1 [clarjon1 at gmail.com]


Fri, 6 Feb 2009 14:34:40 -0500

On Fri, Feb 6, 2009 at 2:32 PM, clarjon1 <clarjon1@gmail.com> wrote:

> On Wed, Feb 4, 2009 at 8:54 AM, Ben Okopnik <ben@linuxgazette.net> wrote:
>> On Wed, Feb 04, 2009 at 06:36:50PM +0530, Suramya Tomar wrote:
>>>
>>> When I start downloading wget visits each and every link and makes a
>>> local copy (like its supposed to) but in this process it also visits the
>>> "Log out" link which logs me out from the site and then I am unable to
>>> download the remaining links.
>>>
>>> So I need to figure out how to exclude the Logout link from the process.
>>> The logout link looks like: www.website.com/index.php?act=Login&CODE=03
> -SNIP-
>
> Oh, i actually know the answer to this one!
> I just had this exact same problem, to be honest.
>
> Heres what you do:
> You export the cookie from the site, stick it in a cookies .txt, and
> have it load the cookies from there.
> Then, you monitor the site's download progress in the terminal, and
> wait until you see it hit the logout page. Once that's done, hit
> ctrl-z to pause, then login in your browser again.
> Once you're logged in, go back to the terminal window, and type "fg"
> to continue wg.
>
> This took me a couple of hours to figure out, but once i figured it
> out, boy, was i happy!
>
> I hope this helps!

This is assuming, of course, that the PHPSESSION variable doesn't change in the cookie... else you may have problems again... I luckily didn't run into that, but it is a thing to keep in mind.

-- 
Jon




Ben Okopnik [ben at linuxgazette.net]


Fri, 6 Feb 2009 19:05:27 -0500

On Fri, Feb 06, 2009 at 02:32:38PM -0500, clarjon1 wrote:

> On Wed, Feb 4, 2009 at 8:54 AM, Ben Okopnik <ben@linuxgazette.net> wrote:
> > On Wed, Feb 04, 2009 at 06:36:50PM +0530, Suramya Tomar wrote:
> >>
> >> When I start downloading wget visits each and every link and makes a
> >> local copy (like its supposed to) but in this process it also visits the
> >> "Log out" link which logs me out from the site and then I am unable to
> >> download the remaining links.
> >>
> >> So I need to figure out how to exclude the Logout link from the process.
> >> The logout link looks like: www.website.com/index.php?act=Login&CODE=03
> -SNIP-
> 
> Oh, i actually know the answer to this one!
> I just had this exact same problem, to be honest.
> 
> Heres what you do:
> You export the cookie from the site, stick it in a cookies .txt, and
> have it load the cookies from there.
> Then, you monitor the site's download progress in the terminal, and
> wait until you see it hit the logout page. Once that's done, hit
> ctrl-z to pause, then login in your browser again.
> Once you're logged in, go back to the terminal window, and type "fg"
> to continue wg.
> 
> This took me a couple of hours to figure out, but once i figured it
> out, boy, was i happy!

Eeep. I'm imagining a site with several thousand pages, each of which has a logout link...

-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




Francis Daly [francis at daoine.org]


Sat, 7 Feb 2009 02:13:07 +0000

On Fri, Feb 06, 2009 at 02:19:45PM -0500, Ben Okopnik wrote:

> On Thu, Feb 05, 2009 at 09:15:41PM +0000, Francis Daly wrote:

Hi there,

> > I guess that in your examples, I'm not seeing a recursive download.
> 
> "wget" certainly sees it as a request for a recursive download; '-R'
> does nothing without that '-r'. That behavior, as specified, is broken

I suspect I'm interpreting the terms differently to you.

To me "-r" means "get these starting urls. Then get the links in the html content fetched so far". "-R" applies to the links, not to the starting urls[*]. "-R" decides whether the link will not be followed (unless it looks like it will be html, explained below).

  * the tests here show that "-R" does determine which saved files of
  the starting urls are deleted after having been got. It's not obvious
  to me that deleting them is right, but it seems justifiable.

> - and whether it's possible to make it succeed in some other type of
> recursive retrieval isn't the issue: Suramya's problem comes from
> exactly this misbehavior of "wget".

Suramya's problem is in the recursion -- in following the links, not in getting the starting url.

The original problem is "mirror this website", which can be considered to be

  wget -r /index.php

The content at /index.php includes links of the form /index.php?do_get_this and /index.php?logout and /index.php?do_get_this_too. The hope is to be able to invite wget not to follow the ?logout link, but to follow all of the others.

Everything I see says that you can't do that with wget, because "-R" does not apply to the query string, and so there is no "-R" argument that can differentiate between the three included links.

> > Just sticking in "-r" doesn't make it recurse -- unless the url content
> > is html and contains links through which wget can recurse.
> 
> Assuming that this is right, how would that help? In Suramya's case, 
> the "Logout" link is a link - and "wget" does try to traverse it.

wget traverses it because it is a link and "-R" cannot prevent it from being traversed, because "-R" only considers the "index.php" part of the url.

There is no wget argument that filters on the query string when deciding what links to follow. Which is the quick answer to the original problem.

[ pavuk was mentioned elsewhere in the thread. That does include skip_url_pattern and skip_url_rpattern options which are documented to include the query string when matching ]
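
A minimal sketch of that, using only the option named above (cookie handling and any recursion settings are omitted here and would need to be checked against the pavuk documentation):

  pavuk -skip_url_pattern '*CODE=*' http://www.website.com/index.php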

> > What my request above is is "get this url, and get everything it links
> > to. But don't get ones with a filename like large". And it gets the url,
> > identifies the links, discards the ones that match the pattern
> > (unless they also look like html-containing urls), and only fetches the others.
>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> And that is the problem in a nutshell.

Actually, no.

To wget, that line just means that if the filename suffix in the url is "htm", "html", or "Xhtml" (for any single X), then the link will be followed irrespective of any -R argument.

Otherwise, "-R" links are not followed, whatever the content-type of the response might be.

> > [...] having looked through the info pages, the observed behaviour matches
> > my (current) expectations.
> 
> Perhaps your reading comprehension is much higher than mine, but I
> didn't find either the man page or the info pages at all informative on
> this point. In addition to that, and far more damning, is the counter-
> intuitive nature of the '-R' operation: it doesn't do the obvious, the
> least-surprising thing.

It makes sense to me, once I accept that "-R" applies to links within content, and not to starting urls. I'm not sure that deleting some saved files by default is great behaviour, but that's a separate issue.

All the best,

f

-- 
Francis Daly        francis@daoine.org




clarjon1 [clarjon1 at gmail.com]


Sat, 7 Feb 2009 09:45:36 -0500

On Fri, Feb 6, 2009 at 7:05 PM, Ben Okopnik <ben@linuxgazette.net> wrote:

> On Fri, Feb 06, 2009 at 02:32:38PM -0500, clarjon1 wrote:
>> On Wed, Feb 4, 2009 at 8:54 AM, Ben Okopnik <ben@linuxgazette.net> wrote:
>> > On Wed, Feb 04, 2009 at 06:36:50PM +0530, Suramya Tomar wrote:
>> >>
>> >> When I start downloading wget visits each and every link and makes a
>> >> local copy (like its supposed to) but in this process it also visits the
>> >> "Log out" link which logs me out from the site and then I am unable to
>> >> download the remaining links.
>> >>
>> >> So I need to figure out how to exclude the Logout link from the process.
>> >> The logout link looks like: www.website.com/index.php?act=Login&CODE=03
>> -SNIP-
>>
>> Oh, i actually know the answer to this one!
>> I just had this exact same problem, to be honest.
>>
>> Heres what you do:
>> You export the cookie from the site, stick it in a cookies .txt, and
>> have it load the cookies from there.
>> Then, you monitor the site's download progress in the terminal, and
>> wait until you see it hit the logout page. Once that's done, hit
>> ctrl-z to pause, then login in your browser again.
>> Once you're logged in, go back to the terminal window, and type "fg"
>> to continue wg.
>>
>> This took me a couple of hours to figure out, but once i figured it
>> out, boy, was i happy!
>
> Eeep. I'm imagining a site with several thousand pages, each of which
> has a logout link...

Actually, that's not how wget works.

It notes that it had already downloaded the logout link earlier, so it will ignore it after downloading it the first time.

