#!/usr/bin/perl eval 'exec /usr/bin/perl -S $0 ${1+"$@"}' if 0; # not running under some shell # Perl 5.002 or later. w3mir is mostly tested with perl 5.004 # use lib '/usr/lib/perl5/site_perl/5.8.0'; # # Once upon a long time ago this was Oscar Nierstrasz's # htget script. # # Retrieves HTML pages, creating local copies in the _current_ # directory. The script will check for the last-modified stamp on the # document, and will not fetch it if the document isn't changed. # # Bug list is in w3mir-README. # # Test cases for janl to use: # w3mir -r -fs http://www.eff.org/ - infinite recursion! # --- but cursory examination seems to indicate confused server... # http://java.sun.com/progGuide/index.html check out the img things. # # Copyright Holders: # Nicolai Langfeldt, janl@ifi.uio.no # Gorm Haug Eriksen, gorm@usit.uio.no # Chris Szurgot, szurgot@itribe.net # Ed Jordan, ed@olympus.itl.net # Alex Knowles, aknowles@avs.com aka ark. # Copying and modification is governed by the "Artistic License" enclosed in # the w3mir distribution # # History (European format date: dd/mm/yy): # oscar 25/03/94 -- added -s option to send output to stdout # oscar 28/03/94 -- made HTTP 1.0 the default # oscar 30/05/94 -- special handling of directory URLs missing a trailing "/" # gorm 20/02/95 -- added mirror capacity + fixed a couple of bugs # janl 28/03/95 -- added a working commandline parser. # janl 18/09/95 -- Changed to use a net http library. Removed dependency of # url.pl. # janl 19/09/95 -- Extensive rewrite. Simplified a lot, works better. # HTML files are now saved in a new and improved manner, # which means they can be recognized as such w/o fancy # filename extention type rules. # szurgot 27/01/96-- Added "Plaintextmode" wrapper to binmode PAGE. # binmode page is required under Win32, but broke modified # checking # -- Minor change added ; to "# '" strings for Emacs cperl-mode # szurgot 07/02/96-- When reading in local file for checking of URLs changed # local ($/) =0; to equal undef; # janl 08/02/96 -- Added szurgot's changes and changed them :-) # szurgot 09/02/96-- Added code to strip /#.*$/ from urls when reading from # local file # -- Added hasAlarm variable to w3http.pl. Set to 1 if you have # alarm(). 0 otherwise. # -- Moved code setting up the valid extensions list into the # args processing where it belonged # janl 20/02/96 -- Added szurgot changes again. # -- Make timeout code work. # -- and made another win32 test. # janl 19/03/96 -- Worked through the code for handling not-modified # documents, it was a bit shabby after htmlop was intro'ed. # janl 20/03/96 -- -l fix # janl 23/04/96 -- Added -fs by request (by Rik Faith) # janl 16/05/96 -- Made -R mandatory, added use and support for # w3http::SAVEBIN # szurgot 19/05/96-- Win95 adaptions. # janl 19/05/96 -- -C did not exactly work as expected. Thanks to Petr # Novak for bug descriptions. # janl 19/05/96 -- Changed logic for @didntget, @got and so on to use # @queue and %urlstat. # janl 09/09/96 -- Removed -R switch. # janl 14/09/96 -- Added ir (initial referer) switch # janl 21/09/96 -- Made retry code saner. There probably needs to be a # sleep before retry comencing switch. When no tty is # present it should be fairly long. # gorm 15/09/96 -- Added cr (check robot) switch. Default to 1 (on) # janl 22/09/96 -- Modified gorms patch to use WWW::RobotRules. Changed # robot switch to be consistent with current w3mir # practice. # janl 27/09/96 -- Spelling corrections from charles.curran@ox.ac.uk # -- Folded in manual diffs from ark. 
# ark 24/09/96 -- Simple facilities to edit the incomming file(s) # janl 27/09/96 -- Added switch to enable editing and # foolproofed ark's patch a bit. # janl 02/10/96 -- Added -umask switch. # -- Redirected documents did not have a meaningful referer # value (it was undefined). # -- Got w3mir into strict discipline, found some typos... # janl 20/10/96 -- Mtime is preserved # janl 21/10/96 -- -lc switch added. Mtime preservation works better. # janl 06/11/96 -- Treat 301 like 302. # janl 02/12/96 -- Added config file code, fetch/ignore rules, apply # janl 04/12/96 -- Better checking of config input. # janl 06/12/96 -- Putting together the URL selection/editing brains. # janl 07/12/96 -- Checking out some bugs. Adding multiscope options. # janl 12/12/96 -- Adding to and defeaturing the multiscope options. # janl 13/12/96 -- Continuing work in multiscope stuff # -- Unreferenced file and empty directory removal works. # janl 19/02/97 -- Can extract urls from adobe acrobat pdf files :-) # Important: It does _not_ edit urls, so they still # point at the original site(s). # janl 21/02/97 -- Fix -lc bug related to case and the apply things. # -- only use SAVEURL if needed # janl 11/03/97 -- Finish work on SAVEURL conditional. # -- Fixed directory removal code. # -- parse_args did not abort when unknown option/argument # was specified. # janl 12/03/97 -- Made test case for -lc. Didn't work. Fixed it. I think. # Realized we have bug w.r.t. hostname caseing. # janl 13/03/97 -- All redirected to URLs within scope are now queued. # That should make the mirror more complete, but it won't # help browsability when it comes to the redirected doc. # -- Moved robot retrival to the inside of the mirror loop # since we now possebly mirror several sites. # -- Changed 'fetch-options' to 'options'. # -- Added 'proxy-options'/-pflush to controll proxy server(s). # janl 09/04/97 -- Started using URI::URL. # janl 11/04/97 -- Debugging and using URI::URL more correctly various places # janl 09/05/97 -- Added --agent switch # janl 12/05/97 -- Simplified scope checks for root URL, changed URL 'apply' # processing. # -- Small output formating fix in the robot rules code. # -- Version is now 0.99 # janl 14/05/97 -- htmpop no-longer puts ' 1.0.2 # janl 09/05/98 -- More carefull clean_disk code. # -- Check if the redirected URL was a root url, if so # issue a warning and exit. # janl 12/05/98 -- use ->unix_path instead of ->as_string to derive local # filename. # janl 25/05/98 -- -B didn't work too well. # janl 09/07/98 -- Redirect to fragment broke us, less broken now -> 1.0.4 # janl 24/09/98 -- Better errormessages on errors -> 1.0.5 # janl 21/11/98 -- Fix errormessages better. # janl 05/01/99 -- Drop 'Referer: (commandline)' # janl 13/04/99 -- Add initial referer to root urls in batch mode. # janl 15/01/00 -- Remove some leftover print statements # -- Fix also-queue problem as suggested by Sven Koch # janl 04/02/01 -- Use epath instead of path quite often -> 1.0.10 # # Variable name discipline: # - remote, umodified URL. Variables prefixed 'rum_' # - local, filesystem. Variables prefixed 'lf_'. # Use these prefixes so we know what we're working with at all times. # Also, URL objects are postfixed _o # # The apply rules and scope rules work this way: # - First apply the user rules to the remote url. # - Check if document is within scope after this. # - Then apply w3mir's rules to the result. This results is the local, # filesystem, name. # # We use features introduced in 5.002. 
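#
# A small, hypothetical illustration of the naming discipline and the
# apply/scope pipeline described above.  Assume the configuration
# directive "URL: http://www.example.com/docs/ local/" (host and paths
# are made up for the example; see apply() below for the real code path):
#
#   my $rum_url = 'http://www.example.com/docs/a/b.html'; # remote, unmodified
#   my $lf_url  = user_apply($rum_url);  # user 'Apply:' rules, often a no-op
#   if (want_this($lf_url)) {            # scope rules, then Fetch/Ignore rules
#     $lf_url = internal_apply($lf_url); # w3mir's rules strip the base and
#   }                                    # prefix 'local/', giving the local
#                                        # filesystem name 'local/a/b.html'
#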
require 5.002; # win32 and $nulldevice need to be globals, other modules use them. use vars qw($win32 $nulldevice); # To figure out what kind of system this is BEGIN { use Config; $win32 = ( $Config{'osname'} eq 'MSWin32' ); } # More ways to die: use Carp; # Http module: use w3http; # html url extraction and manupulation: use htmlop; # Extract urls from adobe acrobat pdf files: use w3pdfuri; # Date computer: use HTTP::Date; # URLs: use URI::URL; # For flush method use FileHandle; eval ' use URI; $URI::ABS_ALLOW_RELATIVE_SCHEME=1; $URI::ABS_REMOTE_LEADING_DOTS=1; '; # Full discipline: use strict; # Set params in the http package, HTTP protocol version: $w3http::version="1.0"; # The defaults should be for a robotic http agent on good behaviour. my $debug=0; # Debug level my $verbose=0; # Verbosity level, -1 = quiet, 0 = normal, 1... my $pause=0; # Pause between http requests my $retryPause=600; # Pause between retries. 10 minutes. my $retry=3; # Max 3 stabs pr. url. my $r=0; # Recurse? no recursion = absolutify links my $remove=0; # Remove files that are not there? my $s=0; # 0: save on disk 1: stdout 2: just forget 'em my $useauth=0; # Use authorization my %authdata; # Authorization data my $check_robottxt = 1; # Check robots.txt my $do_referer = 1; # Send referers header my $do_user = 1; # Send user header my $cache_header = ''; # The cache-control/pragma: no-cache header my $using_proxy = 0; # Using proxy server or not? my $batch=0; # Batch get URLs? my $read_urls=0; # Get urls from STDIN? my $abs=0; # Absolutify URLs? my $immediate_redir=0; # Immediately follow a redirect? my @root_urls; # This is where we start, the root documents my @root_dirs; # The corresponding directories. for remove my $chdirto=''; # Place to chdir to after reading config file my %nodelete=(); # Files that should not be deleted my $numarg=0; # Number of arguments accepted. my $list_nomir=0; # List of files not mirrored # Fixup related things my $fixrc=''; # Name of w3mfix config file my $fixup=1; # Do things needed to run fixup my $runfix=0; # Run w3mfix for user? my $fixopen=0; # Fixup files open? my $indexname='index.html'; my $VERSION; $VERSION='1.0.10'; $w3http::agent = my $w3mir_agent = "w3mir/$VERSION-2001-01-20"; my $iref=''; # Initial referer. Must evaluate to false # Derived settings my $mine_urls=0; # Mine URLs from documents? my $process_urls=0; # Perform (URL) processing of documents? # Queue of urls to get. my @rum_queue = (); my @urls = (); # URL status map. my %rum_urlstat = (); # Status codes: my $QUEUED = 0; # Queued but not gotten yet. my $TERROR = 100; # Transient error, retry later my $HLERR = 101; # Permanent error, give up my $GOTIT = 200; # Gotten. Note similarity to http result code my $NOTMOD = 304; # Not modified. # Negative codes for nonexistent files, easier to check. my $NEVERMIND= -1; # Don't want it my $REDIR = -302; # Does not exist, redirected my $ENOTFND = -404; # Does not exist. my $OTHERERR = -600; # Some other error happened my $FROBOTS = -601; # Forbidden by robots.txt rule # Directory/files survey: my %lf_file; # What files are present in FS? Disposition? One of: my $FILEDEL=0; # Delete file my $FILEHERE=1; # File present in filesystem only my $FILETHERE=2; # File present on server too. my %lf_dir; # Number of files/dirs in dir. If 0 dir is # eligible for deletion. my %fiddled=(); # If a file becomes a directory or a directory # becomes a file it is considered fiddled and # w3mir will not fiddle with it again in this # run. # Bitbucket device, very OS dependent. 
$nulldevice='/dev/null'; $nulldevice='nul:' if ($win32); # What to get, and not. # Text of user supplied fetch/ignore rules my $rule_text=" # User defined fetch/ignore rules\n"; # Code ref to the rule procedure my $rule_code; # Code to prefix and postfix the generated code. Prefix should make # $_ contain the url to match. Postfix should return 1, the default # is to get the url/file. my $rule_prefix='$rule_code = sub { local($_) = shift;'."\n"; my $rule_postfix=" return 1;\n}"; # Scope tests generated by URL/Also directives in cfg. The scope code # is just like the rule code, but used for program generated # fetch/ignore rules related to multiscope retrival. my $scope_fetch=" # Automatic fetch rules for multiscope retrival\n"; my $scope_ignore=" # Automatic ignore rules for multiscope retrival\n"; my $scope_code; my $scope_prefix='$scope_code = sub { local($_) = shift;'."\n"; my $scope_postfix=" return 0;\n}"; # Function to apply to urls, se rule comments. my $user_apply_code; # User specified apply code my $apply_code; # w3mirs apply code my $apply_prefix='$apply_code = sub { local($_) = @_;'."\n"; my $apply_lc=' $_ = lc $_; '; my $apply_postfix=' return $_;'."\n}"; my @user_apply; # List of users apply rules. my @internal_apply; # List of w3mirs apply rules. my $infoloss=0; # 1 if any URL translations (which cause # information loss) are in effect. If this is # true we use the SAVEURL operation. my $list; # List url on STDOUT? my $edit; # Edit doc? Remove my $header; # Text to insert in header my $lc=0; # Convert urls/filenames to lowercase? my $fetch=0; # What to fetch: -1: Some, 0: not modified 1: all my $convertnl=1; # Convert newlines? # Non text/html formats we can extract urls from. Function must take one # argument: the filename. my %knownformats = ( 'application/pdf', \&w3pdfuri::list, 'application/x-pdf', \&w3pdfuri::list, ); # Known 'magic numbers' of the known formats. The value is used as # key in %knownformats. the key part is a exact match for the # following beginning at the first byte of the file. # This should probably be made more flexible, but not until we need it. my %knownmagic = ( '%PDF-', 'application/pdf' ); my $iinline=''; # inline RE code to make RE caseinsensitive my $ipost=''; # RE postfix to make it caseinsensitive usage() unless parse_args(@ARGV); { my $w3mirc='.w3mirc'; $w3mirc='w3mir.ini' if $win32; if (-f $w3mirc) { parse_cfg_file($w3mirc); $nodelete{$w3mirc}=1; } } # Check arguments and options if ($#root_urls>=0) { # OK } else { print "URLs: $#rum_queue\n"; usage("No URLs given"); } # Are we converting newlines today? $w3http::convert=0 unless $convertnl; if ($chdirto) { &mkdir($chdirto.'/this-is-not-created-odd-or-what'); chdir($chdirto) || die "w3mir: Can't change working directory to '$chdirto': $!\n"; } $SIG{'INT'}=sub { print STDERR "\nCaught SIGINT!\n"; exit 1; }; $SIG{'QUIT'}=sub { print STDERR "\nCaught SIGQUIT!\n"; exit 1; }; $SIG{'HUP'}=sub { print STDERR "\nCaught SIGHUP!\n"; exit 1; }; &open_fixup if $fixup; # Derive how much document processing we should do. $mine_urls=( $r || $list ); $process_urls=(!$batch && !$edit && !$header); # $abs can be set explicitly with -abs, and implicitly if not recursing $abs = 1 unless $r; print "Absolute references\n" if $abs && $debug; # Cache_controll specified but proxy not in use? 
die "w3mir: If you want to control a cache, use a proxy server!\n" if ($cache_header && !$using_proxy); # Compile the second order code # - The rum scope tests my $full_rules=$scope_prefix.$scope_fetch.$scope_ignore.$scope_postfix; # warn "Scope rules:\n-------------\n$full_rules\n---------------\n"; eval $full_rules; die "$@" if $@; die "w3mir: Program generated rules did not compile.\nPlease report to w3mir-core\@usit.uio.no. The code is:\n----\n". $full_rules."\n----\n" if !defined($scope_code); $full_rules=$rule_prefix.$rule_text.$rule_postfix; # warn "Fetch rules:\n-------------\n$full_rules\n---------------\n"; eval $full_rules; die "$@!" if $@; # - The user specified rum tests die "w3mir: Ignore/Fetch rules did not compile.\nPlease report to w3mir-core\@usit.uio.no. The code is:\n----\n". $full_rules."\n----\n" if !defined($rule_code); # - The user specified apply rules # $SIG{__WARN__} = sub { print "$_[0]\n"; confess ""; }; my $full_apply=$apply_prefix.($lc?$apply_lc:''). join($ipost.";\n",@user_apply).(($#user_apply>=0)?$ipost:"").";\n". $apply_postfix; eval $full_apply; die "$@!" if $@; die "w3mir: User apply rules did not compile.\nPlease report to w3mir-core\@usit.uio.no. The code is: ---- ".$full_apply." ----\n" if !defined($apply_code); # print "user apply: $full_apply\n"; $user_apply_code=$apply_code; # - The w3mir generated apply rules $full_apply=$apply_prefix.($lc?$apply_lc:''). join($ipost.";\n",@internal_apply).(($#internal_apply>=0)?$ipost:"").";\n". $apply_postfix; eval $full_apply; die "$@!" if $@; die "Internal apply rules did not compile. The code is: ---- ".$full_apply." ----\n" if !defined($apply_code); # - Information loss via -lc? There are other sources as well. $infoloss=1 if $lc; warn "Infoloss is $infoloss\n" if $debug; # More setup: $w3http::debug=$debug; $w3http::verbose=$verbose; my %rum_referers=(); # Array of referers, key: rum_url my $Robot_Blob; # WWW::RobotsRules object, decides if rum_url is # forbidden to access for us. my $rum_url_o; # rum url, mostly the current, the one we're getting my %gotrobots; # Did I get robots.txt from site? key: url->netloc my($authuser,$authpass);# Username and password for authentication with server my @rum_newurls; # List of rum_urls in document if ($check_robottxt) { # Eval is only way to defer loading of module until we know it's needed? eval 'use WWW::RobotRules;'; die "Could not load WWW::RobotRules, try -drr switch\n" unless defined(&WWW::RobotRules::parse); $Robot_Blob = new WWW::RobotRules $w3mir_agent; } # We have several main-modes of operation. Here we select one if ($r) { die "w3mir: No URLs? 
Try 'w3mir -h' for help.\n" if $#root_urls==-1; warn "Recursive retrival comencing\n" if $debug; die "w3mir: Sorry, you cannot combine -r/recurse with -I/read_urls\n" if $read_urls; # Recursive my $url; foreach $url (@root_urls) { warn "Root url dequeued: $url\n" if $debug; if (want_this($url)) { queue($url); &add_referer($url,$iref); } else { die "w3mir: Inconsistent configuration: Specified $url is not inside retrival scope\n"; } } mirror(); } else { if ($batch) { warn "Batch retrival commencing\n" if $debug; # Batch get if ($read_urls) { # Get URLs from while () { chomp; &add_referer($_,$iref); batch_get($_); } } else { # Get URLs from commandline my $url; foreach $url (@root_urls) { &add_referer($url,$iref); } foreach $url (@root_urls) { batch_get($url); } } } else { warn "Single url retrival commencing\n" if $debug; # A single URL, with all processing on die "w3mir: You specified several URLs and not -B/batch\n" if $#root_urls>0; queue($root_urls[0]); &add_referer($root_urls[0],$iref); mirror(); } } &close_fixup if $fixup; # This should clean up files: &clean_disk if $remove; warn "w3mir: That's all (".$w3http::xfbytes.'+',$w3http::headbytes. " bytes of it).\n" unless $verbose<0; if ($runfix) { eval 'use Config;'; warn "Running w3mfix\n"; if ($win32) { CORE::system($Config{'perlpath'}." w3mfix $fixrc"); } else { CORE::system("w3mfix $fixrc"); } } exit 0; sub get_document { # Get one document by HTTP ($1/rum_url_o). Save in given filename ($2). # Possebly returning references found in the document. Caller must # set up referer array, check wantedness and everything else. We # handle authentication here though. my($rum_url_o)=shift; my($lf_url)=shift; croak("\$rum_url_o is empty") if !defined($rum_url_o) || !$rum_url_o; croak("$lf_url is empty") if !defined($lf_url) || !$lf_url; # Make sure it's an object $rum_url_o = url $rum_url_o unless ref $rum_url_o; # Derive a filename from the url, the filename contains no URL-quoting my($lf_name) = (url "file:$lf_url")->unix_path; # Make all intermediate directories &mkdir($lf_name) if $s==0; my($rum_as_string) = $rum_url_o->as_string; print STDERR "GET_DOCUMENT: '",$rum_as_string,"' -> '",$lf_name,"'\n" if $debug; my $hostport; my $www_auth=''; # Value of that http reply header my $page_ref; my @rum_newurls; # List of URLs extracted my $url_extractor; my $do_query; # Do query or not? if (defined($rum_urlstat{$rum_as_string}) && $rum_urlstat{$rum_as_string}>0) { warn "w3mir: Internal error, ".$rum_as_string. " queued several times\n"; next; } # Goto here if we want to retry b/c of authentication try_again: # Start building the extra http::query arguments again my @EXTRASTUFF=(); # We'll start by assuming that we're doing the query. $do_query=1; # If we're not checking the timestamp, or the file does not exist # then we get the file unconditionally. Otherwise we only want it # if it's updated. if ($fetch==1) { # Nothing do do? } else { if (-f $lf_name) { if ($fetch==-1) { print STDERR "w3mir: ".($infoloss?$rum_as_string:$lf_name). ", already have it" if $verbose>=0; if (!$mine_urls) { # If -fs and the file exists and we don't need to mine URLs # we're finished! warn "Already have it, no mining, returning!\n" if $debug; print STDERR "\n" if $verbose>=0; return; } $w3http::result=1304; # Pretend it was 'not modified' $do_query=0; } else { push(@EXTRASTUFF,$w3http::IFMODF,$lf_name); } } } if ($do_query) { # Does the server want authorization for this file? $www_auth is # only set if authentication was requested the first time around. 
# For testing: # $www_auth='Basic realm="foo"'; if ($www_auth) { my($authdata,$method,$realm); ($method,$realm)= $www_auth =~ m/^(\S+)\s+realm=\"([^\"]+)\"/i; $method=lc $method; $realm=lc $realm; die "w3mir: '$method' authentication needed, don't know that.\n" if ($method ne 'basic'); $hostport = $rum_url_o->netloc; $authdata=$authdata{$hostport}{$realm} || $authdata{$hostport}{'*'} || $authdata{'*'}{$realm} || $authdata{'*'}{'*'}; if ($authdata) { push(@EXTRASTUFF,$w3http::AUTHORIZ,$authdata); } else { print STDERR "w3mir: No authorization data for $hostport/$realm\n"; $rum_urlstat{$rum_as_string}=$NEVERMIND; next; } } push(@EXTRASTUFF,$w3http::FREEHEAD,$cache_header) if ($cache_header); # Insert referer header data if at all push(@EXTRASTUFF,$w3http::REFERER,$rum_referers{$rum_as_string}[0]) if ($do_referer && exists($rum_referers{$rum_as_string})); push(@EXTRASTUFF,$w3http::NOUSER) unless ($do_user); # YES, $lf_url is right, w3http::query handles this like an url so # the quoting must all be in place. my $binfile=$lf_url; $binfile='-' if $s==1; $binfile=$nulldevice if $s==2; if ($pause) { print STDERR "w3mir: sleeping\n" if $verbose>0; sleep($pause); } print STDERR "w3mir: ".($infoloss?$rum_as_string:$lf_name) unless $verbose<0; print STDERR "\nFile: $lf_name\n" if $debug; &w3http::query($w3http::GETURL,$rum_as_string, $w3http::SAVEBIN,$binfile, @EXTRASTUFF); print STDERR "w3http::result: '",$w3http::result, "' doc size: ", length($w3http::document), " doc type: -",$w3http::headval{'CONTENT-TYPE'}, "- plaintexthtml: ",$w3http::plaintexthtml,"\n" if $debug; print "Result: ",$w3http::result," Recurse: $r, html: ", $w3http::plaintexthtml,"\n" if $debug; } # if $do_query if ($w3http::result==200) { # 200 OK $rum_urlstat{$rum_as_string}=$GOTIT; if ($mine_urls || $process_urls) { if ($w3http::plaintexthtml) { # Only do URL manipulations if this is a html document with no # special content-encoding. We do not handle encodings, yet. my $page; print STDERR ($process_urls)?", processing":", url mining" if $verbose>0; print STDERR "\nurl:'$lf_url'\n" if $debug; print "\nMining URLs: $mine_urls, Process: $process_urls\n" if $debug; ($page,@rum_newurls) = &htmlop::process($w3http::document, # Only get a new document if wanted $process_urls?():($htmlop::NODOC), $htmlop::CANON, $htmlop::ABS,$rum_url_o, # Only list urls if wanted $mine_urls?($htmlop::LIST):(), # If user wants absolute URLs do not # relativize them $abs? (): ( $htmlop::TAGCALLBACK,\&process_tag,$lf_url, ) ); # print "URL: ",join("\nURL: ",@rum_newurls),"\n"; if ($process_urls) { $page_ref=\$page; $w3http::document=''; } else { $page_ref=\$w3http::document; } } elsif ($s == 0 && ($url_extractor = $knownformats{$w3http::headval{'CONTENT-TYPE'}})) { # The knownformats extractors only work on disk files so write # doc to disk if not there already (non-html text will not be) write_page($lf_name,$w3http::document,1); # Now we try our hand at fetching URIs from non-html files. print STDERR ", mining URLs" if $verbose>=1; @rum_newurls = &$url_extractor($lf_name); # warn "URLs from PDF: ",join(', ',@rum_newurls),"\n"; } } # if ($mine_urls || $process_urls) # print "page_ref defined: ",defined($page_ref),"\n"; # print "plaintext: ",$w3http::plaintext,"\n"; $page_ref=\$w3http::document if !defined($page_ref) && $w3http::plaintexthtml; if ($w3http::plaintexthtml) { # ark: this is where I want to do my changes to the page strip # out the ... Stuff. 
$$page_ref=~ s/<(!--)?\s*NO\s*MIRROR\s*(--)?>[^\000]*?<(!--)?\s*\/NO\s*MIRROR\s*(--)?>//g if $edit; if ($header) { # ark: insert a header string at the start of the page my $mirrorstr=$header; $mirrorstr =~ s/\$url/$rum_as_string/g; insert_at_start( $mirrorstr, $page_ref ); } } write_page($lf_name,$page_ref,0); # print "New urls: ",join("\n",@rum_newurls),"\n"; return @rum_newurls; } if ($w3http::result==304 || # 304 Not modified $w3http::result==1304) { # 1304 Have it { # last = out of nesting my $rum_urlstat; my $rum_newurls; @rum_newurls=(); print STDERR ", not modified" if $verbose>=0 && $w3http::result==304; $rum_urlstat{$rum_as_string}=$NOTMOD; last unless $mine_urls; $rum_newurls=get_references($lf_name); # print "New urls: ",ref($rum_newurls),"\n"; if (!ref($rum_newurls)) { last; } elsif (ref($rum_newurls) eq 'SCALAR') { $page_ref=$rum_newurls; } elsif (ref($rum_newurls) eq 'ARRAY') { @rum_newurls=@$rum_newurls; last; } else { die "\nw3mir: internal error: Unknown return type from get_references\n"; } # Check if it's a html file. I know this tag is in all html # files, because I put it there as I pull them in. last unless $$page_ref =~ /=1; # This will give us a list of absolute urls (undef,@rum_newurls) = &htmlop::process($$page_ref,$htmlop::NODOC, $htmlop::ABS,$rum_as_string, $htmlop::USESAVED,'W3MIR', $htmlop::LIST); } print STDERR "\n" if $verbose>=0; return @rum_newurls; } if ($w3http::result==302 || $w3http::result==301) { # Redirect # Cern and NCSA httpd sends 302 'redirect' if a ending / is # forgotten on a url. More recent httpds send 301 'permanent # redirect' in this case. Here we check if the difference in URLs # is just a / and if so push the url again with the / added. This # code only works if the http server has the right idea about its # own name. # # 18/3/97: Added code to queue redirected-to-URLs that are within # the scope of the retrival. my $new_rum_url; $rum_urlstat{$rum_as_string}=$REDIR; # Absolutify the new url, it might be relative to the requested # document. That's a ugly wart on some servers/admins. $new_rum_url=url $w3http::headval{'location'}; $new_rum_url=$new_rum_url->abs($rum_url_o); print REDIRS $rum_as_string,' -> ',$new_rum_url->as_string,"\n" if $fixup; if ($immediate_redir) { print STDERR " =>> ",$new_rum_url->as_string,", getting that instead\n"; return get_document($new_rum_url,$lf_url); } # Some redirect to a fragment of another doc... $new_rum_url->frag(undef); $new_rum_url=$new_rum_url->as_string; if ($rum_as_string.'/' eq $new_rum_url) { if (grep { $rum_as_string eq $_; } @root_urls) { print STDERR "\nw3mir: missing / in a start URL detected. Please fix commandline/config file.\n"; exit(1); } print STDERR ", missing /\n"; queue($new_rum_url); # Initialize referer to something meaningful $rum_referers{$new_rum_url}=$rum_referers{$rum_as_string}; } else { print STDERR " =>> $new_rum_url"; if (want_this($new_rum_url)) { print STDERR ", getting that\n"; queue($new_rum_url); $rum_referers{$new_rum_url}=$rum_referers{$rum_as_string}; } else { print STDERR ", don't want it\n"; } } return (); } if ($w3http::result==403 || # Forbidden $w3http::result==404 || # Not found $w3http::result==406 || # Not Acceptable, hmm, belongs here? 
$w3http::result==410) { # Gone - no forwarding address known $rum_urlstat{$rum_as_string}=$ENOTFND; &handleerror; print STDERR "Was refered from: ", join(',',@{$rum_referers{$rum_as_string}}), "\n" if defined(@{$rum_referers{$rum_as_string}}); return (); } if ($w3http::result==407) { # Proxy authentication requested die "Proxy server requests authentication but failed to return the\n". "REQUIRED Proxy-Authenticate header for this condition\n" unless exists($w3http::headval{'proxy-authenticate'}); die "Proxy authentication is required for ".$w3http::headval{'proxy-authenticate'}."\n"; } if ($w3http::result==401) { # A www-authenticate reply header should acompany a 401 message. if (!exists($w3http::headval{'www-authenticate'})) { warn "w3mir: Server indicated authentication failure but gave no www-authenticate reply\n"; $rum_urlstat{$rum_as_string}=$NEVERMIND; } else { # Unauthorized if ($www_auth) { # Failed when authorization data was supplied. $rum_urlstat{$rum_as_string}=$NEVERMIND; print STDERR ", authorization failed data needed for ", $w3http::headval{'www-authenticate'},"\n" if ($verbose>=0); } else { if ($useauth) { # First time failure, send back and retry at once with some known # user/passwd. $www_auth=$w3http::headval{'www-authenticate'}; print STDERR ", retrying with authorization\n" unless $verbose<0; goto try_again; } else { print ", authorization needed: ", $w3http::headval{'www-authenticate'},"\n"; $rum_urlstat{$rum_as_string}=$NEVERMIND; } } } return (); } # Something else. &handleerror; } sub robot_check { # Check if URL is allowed by robots.txt, if we respect it at all # that is. Return 1 it allowed, 0 otherwise. my($rum_url_o)=shift; my $hostport; if ($check_robottxt) { $hostport = $rum_url_o->netloc; if (!exists($gotrobots{$hostport})) { # Get robots.txt from the server $gotrobots{$hostport}=1; my $robourl="http://$hostport/robots.txt"; print STDERR "w3mir: $robourl" if ($verbose>=0); &w3http::query($w3http::GETURL,$robourl); $w3http::document='' if ($w3http::result != 200); print STDERR ", processing" if $verbose>=1; print STDERR "\n" if ($verbose>=0); $Robot_Blob->parse($robourl,$w3http::document); } if (!$Robot_Blob->allowed($rum_url_o->as_string)) { # It is forbidden $rum_urlstat{$rum_url_o->as_string}=$FROBOTS; warn "w3mir: ",$rum_url_o->as_string,": forbidden by robots.txt\n"; return 0; } } return 1; } sub batch_get { # Batch get _one_ document. my $rum_url=shift; my $lf_url; $rum_url_o = url $rum_url; return unless robot_check($rum_url_o); ($lf_url=$rum_url) =~ s~.*/~~; if (!defined($lf_url) || $lf_url eq '') { ($lf_url=$rum_url) =~ s~/$~~; $lf_url =~ s~.*/~~; $lf_url .= "-$indexname"; } warn "Batch get: $rum_url -> $lf_url\n" if $debug; $immediate_redir=1; # Do follow redirects immediately get_document($rum_url,$lf_url); } sub mirror { # Mirror (or get) the requested url(s). Possibly recursively. # Working from whatever cwd is at invocation we'll retrieve all # files under it in the file hierarchy. my $rum_url; # URL of the document we're getting now, defined at main level my $lf_url; # rum_url after apply - and my $new_lf_url; my @new_rum_urls; my $rum_ref; while (defined($rum_url = pop(@rum_queue))) { warn "mirror: Poped $rum_url from queue\n" if $debug; # Unwanted URLs should not be queued die "Found url $rum_url that I don't want in queue!\n" unless defined($lf_url=apply($rum_url)); $rum_url_o = url $rum_url; next unless robot_check($rum_url_o); # Figure out the filename for our local filesystem. 
    $lf_url.=$indexname if $lf_url =~ m~/$~ || $lf_url eq '';

    @new_rum_urls = get_document($rum_url_o,$lf_url);

    print join("\n",@new_rum_urls),"\n" if ($list);

    if ($r) {
      foreach $rum_ref (@new_rum_urls) {
        # warn "Recursive url: $rum_ref\n";
        $new_lf_url=apply($rum_ref);
        next unless $new_lf_url;
        # warn "Want it\n";
        $rum_ref =~ s/\#.*$//s;   # Clip off section marks
        add_referer($rum_ref,$rum_url_o->as_string);
        queue($rum_ref);
      }
    }

    @new_rum_urls=();

    # Is the URL queue empty?  Are there outstanding retries?  Refill
    # the queue from the retry list.
    if ($#rum_queue<0 && $retry-->0) {
      foreach $rum_url_o (keys %rum_urlstat) {
        $rum_url_o = url $rum_url_o;
        if ($rum_urlstat{$rum_url_o->as_string}==100) {
          push(@rum_queue,$rum_url_o->as_string);
          $rum_urlstat{$rum_url_o->as_string}=0;
        }
      }
      if ($#rum_queue>=0) {
        warn "w3mir: Sleeping before retrying. $retry more times left\n"
          if $verbose>=0;
        sleep($retryPause);
      }
    }
  }
}


sub get_references {
  # Get references from a non-html-on-disk file.  Return references if
  # we know how to find them.  Return a reference to the complete page
  # if it's html.  Return a single numerical 0 if the format is unknown.

  my($lf_url)=shift;
  my($url_extractor)=shift;     # Optional: caller-supplied URL extractor

  my $read;     # Buffer of stuff read from file to check filetype
  my $magic;
  my $rum_ref;
  my $page;

  warn "w3mir: Looking at local $lf_url\n" if $debug;

  # Open file and read the first 10 kilobytes for file-type-test
  # purposes.
  if (!open(TMPF,$lf_url)) {
    warn "Unable to open $lf_url for reading: $!\n";
    return 0;
  }
  $page=' 'x10240;
  $read=sysread(TMPF,$page,length($page),0);
  close(TMPF);

  die "Error reading $lf_url: $!\n" if (!defined($read));

  if (!defined($url_extractor)) {
    $url_extractor=0;
    # Check file against list of magic numbers.
    foreach $magic (keys %knownmagic) {
      if (substr($page,0,length($magic)) eq $magic) {
        $url_extractor = $knownformats{$knownmagic{$magic}};
        last;
      }
    }
  }

  # Found an extraction method, apply it.
  if ($url_extractor) {
    print STDERR ", mining URLs" if $verbose>=1;
    return [&$url_extractor($lf_url)];
  }

  # Looks like html?  Then read the whole file and return a reference
  # to the complete page.
  if ($page =~ /<html/i) {
    if (!open(TMPF,$lf_url)) {
      warn "Unable to open $lf_url for reading: $!\n";
      return 0;
    }
    local($/)=undef;
    $page=<TMPF>;
    close(TMPF);
    return \$page;
  }

  return 0;
}


sub open_fixup {
  # Open the referers and redirects files

  my $reffile='.referers';
  my $redirfile='.redirs';
  my $removedfile='.notmirrored';

  if ($win32) {
    $reffile="referers";
    $redirfile="redirs";
    $removedfile="notmir";
  }

  $nodelete{$reffile} = $nodelete{$redirfile} = $nodelete{$removedfile} = 1;

  $removedfile=$nulldevice unless $list_nomir;

  open(REDIRS,"> $redirfile") ||
    die "Could not open $redirfile for writing: $!\n";
  autoflush REDIRS 1;

  open(REFERERS,"> $reffile") ||
    die "Could not open $reffile for writing: $!\n";

  $fixopen=1;

  open(REMOVED,"> $removedfile") ||
    die "Could not open $removedfile for writing: $!\n";
  autoflush REMOVED 1;

  eval 'END { close_fixup; 0; }';
}


sub close_fixup {
  # Close the fixup data files.  In the case of the referer file also
  # write the entire content.

  return unless $fixopen;

  my $referer;
  foreach $referer (keys %rum_referers) {
    print REFERERS $referer," <- ",join(' ',@{$rum_referers{$referer}}),"\n";
  }

  close(REFERERS) || warn "Error closing referers file: $!\n";
  close(REDIRS) || warn "Error closing redirects file: $!\n";
  close(REMOVED) || warn "Error closing 'removed' file: $!\n";
  $fixopen=0;
}


sub clean_disk {
  # This procedure removes files that are not present on the server(s)
  # anymore.
  # - To avoid removing files that were not fetched due to network
  #   problems we only do blanket removal IFF all documents were
  #   fetched w/o problems, eventually.
# - In any case we can remove files the server said were not found # The strategy has three main parts: # 1. Find all files we have # 2. Find what files we ought to have # 3. Remove the difference my $complete_retrival=1; # Flag saying IFF all documents were fetched my $urlstat; # Tmp storage my $rum_url; my $lf_url; my $lf_dir; my $dirs_to_remove; # For fileremoval code eval "use File::Find;" unless defined(&find); die "w3mir: Could not load File::Find module. Don't use -R switch.\n" unless defined(&find); # This to shut up -w $lf_dir=$File::Find::dir; # ***** 1. Find out what files we have ***** # # This does two things: For each file or directory found: # - Increases entry count for the container directory # - If it's a file: $lf_file{relative_path}=$FILEHERE; chop(@root_dirs); print STDERR "Looking in: ",join(", ",@root_dirs),"\n" if $debug; find(\&find_files,@root_dirs); # ***** 2. Find out what files we ought to have ***** # # First we loop over %rum_urlstat to determine what files are not # present on the server(s). foreach $rum_url (keys %rum_urlstat) { # Figure out name of local file from rum_url next unless defined($lf_url=apply($rum_url)); $lf_url.=$indexname if $lf_url =~ m~/$~ || $lf_url eq ''; # find prefixes ./, we must too. $lf_url="./".$lf_url unless substr($lf_url,0,1) eq '/'; # Ignore if file does not exist here. next unless exists($lf_file{$lf_url}); # The apply rules can map several remote files to same local # file. If we decided to keep file already we stay with that. next if $lf_file{$lf_url}==$FILETHERE; $urlstat=$rum_urlstat{$rum_url}; # Figure out the status code. if ($urlstat==$GOTIT || $urlstat==$NOTMOD) { # Present on server. Keep. $lf_file{$lf_url}=$FILETHERE; next; } elsif ($urlstat==$ENOTFND || $urlstat==$NEVERMIND ) { # One of: not on server, can't get, don't want, access forbiden: # Schedule for removal. $lf_file{$lf_url}=$FILEDEL if exists($lf_file{$lf_url}); next; } elsif ($urlstat==$OTHERERR || $urlstat==$TERROR) { # Some error occured transfering. $complete_retrival=0; # The retrival was not complete. Delete less } elsif ($urlstat==$QUEUED) { warn "w3mir: Internal inconsistency, $rum_url marked as queued after retrival terminated\n"; $complete_retrival=0; # Fishy. Be conservative about removing } else { $complete_retrival=0; warn "w3mir: Warning: $rum_url is marked as $urlstat.\n". "w3mir: Please report to w3mir-core\@usit.uio.no.\n"; } } # foreach %rum_urlstat # ***** 3. Remove the difference ***** # Loop over all found files: # - Should we have this file? 
# - If not: Remove file and decrease directory entry count # Loop as long as there are directories with 0 entry count: # - Loop over all directories with 0 entry count: # - Remove directory # - Decrease entry count of parent warn "w3mir: Some error occured, conservative file removal\n" if !$complete_retrival && $verbose>=0; # Remove all files we don't want removed from list of files present: foreach $lf_url (keys %nodelete) { print STDERR "Not deleting: $lf_url\n" if $verbose>=1; delete $lf_file{$lf_url} || delete $lf_file{'./'.$lf_url}; } # Remove files foreach $lf_url (keys %lf_file) { if (($complete_retrival && $lf_file{$lf_url}==$FILEHERE) || ($lf_file{$lf_url} == $FILEDEL)) { if (unlink $lf_url) { ($lf_dir)= $lf_url =~ m~^(.+)/~; $lf_dir{$lf_dir}--; $dirs_to_remove=1 if ($lf_dir{$lf_dir}==0); warn "w3mir: removed file $lf_url\n" if $verbose>=0; } else { warn "w3mir: removal of file $lf_url failed: $!\n"; } } } # Remove empty directories while ($dirs_to_remove) { $dirs_to_remove=0; foreach $lf_url (keys %lf_dir) { next if $lf_url eq '.'; if ($lf_dir{$lf_url}==0) { if (rmdir($lf_url)) { warn "w3mir: removed directory $lf_dir\n" if $verbose>=0; delete $lf_dir{$lf_url}; ($lf_dir)= $lf_url =~ m~^(.+)/~; $lf_dir{$lf_dir}--; $dirs_to_remove=1 if ($lf_dir{$lf_dir}==0); } else { warn "w3mir: removal of directory $lf_dir failed: $!\n"; } } } } } sub find_files { # This is called by the find procedure for every file/dir found. # This builds two hashes: # lf_file{}: 1: file exists # lf_dir{): Number of files in directory. lstat($_); $lf_dir{$File::Find::dir}++; if (-f _) { $lf_file{$File::Find::name}=$FILEHERE; } elsif (-d _) { # null # Bug: If an empty directory exists it will not be removed } else { warn "w3mir: File $File::Find::name has unknown type. Ignoring.\n"; } return 0; } sub handleerror { # Handle error status of last http connection, will set the rum_urlstat # appropriately and print a error message. my $msg; if ($verbose<0) { $msg="w3mir: ".$rum_url_o->as_string.": "; } else { $msg=": "; } if ($w3http::result == 98) { # OS/Network error $msg .= "$!"; $rum_urlstat{$rum_url_o->as_string}=$OTHERERR; } elsif ($w3http::result == 100) { # Some kind of error connecting or sending request $msg .= $w3http::restext || "Timeout"; $rum_urlstat{$rum_url_o->as_string}=$TERROR; } else { # Other HTTP error $rum_urlstat{$rum_url_o->as_string}=$OTHERERR; $msg .= " ".$w3http::result." ".$w3http::restext; $msg .= " =>> ".$w3http::headval{'location'} if (defined($w3http::headval{'location'})); } print STDERR "$msg\n"; } sub queue { # Queue given url if appropriate and create a status entry for it my($rum_url_o)=url $_[0]; croak("BUG: undefined \$rum_url_o") if !defined($rum_url_o); croak("BUG: undefined \$rum_url_o->as_string") if !defined($rum_url_o->as_string); croak("BUG: ".$rum_url_o->as_string." (fragnent) queued") if $rum_url_o->as_string =~ /\#/; return if exists($rum_urlstat{$rum_url_o->as_string}); return unless want_this($rum_url_o->as_string); warn "QUEUED: ",$rum_url_o->as_string,"\n" if $debug; # Note lack of scope checks. $rum_urlstat{$rum_url_o->as_string}=$QUEUED; push(@rum_queue,$rum_url_o->as_string); } sub root_queue { # Queue function for root urls and directories. One or the other might # be boolean false, in that case, don't queue it. 
my $root_url_o; my($root_url)=shift; my($root_dir)=shift; die "w3mir: No fragments in start URLs :".$root_url."\n" if $root_url =~ /\#/; if ($root_dir) { print "Root dir: $root_dir\n" if $debug; $root_dir="./$root_dir" unless substr($root_dir,0,1) eq '/' or substr($root_dir,0,2) eq './'; push(@root_dirs,$root_dir); } if ($root_url) { $root_url_o=url $root_url; # URL canonification, or what we do of it at least. $root_url_o->host($root_url_o->host); warn "Root queue: ".$root_url_o->as_string."\n" if $debug; push(@root_urls,$root_url_o->as_string); return $root_url_o; } } sub write_page { # write a retrieved page to wherever it's supposed to be written. # Added difficulty: all files but plaintext files have already been # written to disk in w3http. # $s == 0 save to disk # $s == 1 dump to stdout # $s == 2 forget my($lf_name,$page_ref,$silent) = @_; my($verb); if ($silent) { $verb=-1; } else { $verb=$verbose; } # confess("\n\$page_ref undefined") if !defined($page_ref); if ($w3http::plaintexthtml) { # I have it in memory if ($s==0) { print STDERR ", saving" if $verb>0; while (-d $lf_name) { # This will run once, maybe twice, $fiddled will be canged the # first time if (exists($fiddled{$lf_name})) { warn "Cannot save $lf_name, there is a directory in the way\n"; return; } $fiddled{$lf_name}=1; rm_rf($lf_name); print STDERR "w3mir: $lf_name" if $verbose>=0; } if (!open(PAGE,">$lf_name")) { warn "\nw3mir: can't open $lf_name for writing: $!\n"; return; } if (!$convertnl) { binmode PAGE; warn "BINMODE\n" if $debug; } if ($$page_ref ne '') { print PAGE $$page_ref || die "w3mir: Error writing $lf_name: $!\n"; } close(PAGE) || die "w3mir: Error closing $lf_name: $!\n"; print STDERR ": ", length($$page_ref), " bytes\n" if $verb>=0; setmtime($lf_name,$w3http::headval{'last-modified'}) if exists($w3http::headval{'last-modified'}); } elsif ($s==1) { print $$page_ref ; } elsif ($s==2) { print STDERR ", got and forgot it.\n" unless $verb<0; } } else { # Already written by http module, just emit a message if wanted if ($s==0) { print STDERR ": ",$w3http::doclen," bytes\n" if $verb>=0; setmtime($lf_name,$w3http::headval{'last-modified'}) if exists($w3http::headval{'last-modified'}); } elsif ($s==2) { print STDERR ", got and forgot it.\n" if $verb>=0; } } } sub setmtime { # Set mtime of the given file my($file,$time)=@_; my($tm_sec,$tm_min,$tm_hour,$tm_mday,$tm_mon,$tm_year,$tm_wday,$tm_yday, $tm_isdst,$tics); $tm_isdst=0; $tm_yday=-1; carp("\$time is undefined"),return if !defined($time); $tics=str2time($time); utime(time, $tics, $file) || warn "Could not change mtime of $file: $!\n"; } sub movefile { # Rename a file. Note that copy is not a good alternative, since # copying over NFS is something we want to Avoid. # Returns 0 if failure and 1 in case of sucess. (my $old,my $new) = @_; # Remove anything that might have the name already. if (-d $new) { print STDERR "\n" if $verbose>=0; rm_rf($new); $fiddled{$new}=1; print STDERR "w3mir: $new" if $verbose>=0; } elsif (-e $new) { $fiddled{$new}=1; if (unlink($new)) { print STDERR "\nw3mir: removed $new\nw3mir: $new" if $verbose>=0; } else { return 0; } } if ($new ne '-' && $new ne $nulldevice) { warn "MOVING $old -> $new\n" if $debug; rename($old,$new) || warn "Could not rename $old to $new: $!\n",return 0; } return 1; } sub mkdir { # Make all intermediate directories needed for a file, the file name # is expected to be included in the argument! # Reasons for not using File::Path::mkpath: # - I already wrote this. 
# - I get to be able to produce as good and precise errormessages as # unix and perl will allow me. mkpath will not. # - It's easier to find out if it worked or not. my($file) = @_; my(@dirs) = split("/",$file); my $path; my $dir; my $moved=0; if (!$dirs[0]) { shift @dirs; $path=''; } else { $path = '.'; } # This removes the last element of the array, it's meant to shave # off the file name leaving only the directory name, as a # convenience, for the caller. pop @dirs; foreach $dir (@dirs) { $path .= "/$dir"; stat($path); # only make if it isn't already there next if -d _; while (!-d _) { if (exists($fiddled{$path})) { warn "Cannot make directory $path, there is a file in the way.\n"; return; } $fiddled{$path}=1; if (!-e _) { mkdir($path,0777); last; } if (unlink($path)) { warn "w3mir: removed file $path\n" if $verbose>=0; } else { warn "Unable to remove $path: $!\n"; next; } warn "mkdir $path\n" if $debug; mkdir($path,0777) || warn "Unable to create directory $path: $!\n"; stat($path); } } } sub add_referer { # Add a referer to the list of referers of a document. Unless it's # already there. # Don't mail me if you (only) think this is a bit like a toungetwiser: # Don't remember referers if BOTH fixup and referer header is disabled. return if $fixup==0 && $do_referer==0; my($rum_referee,$rum_referer) = @_ ; my $re_rum_referer; if (exists($rum_referers{$rum_referee})) { $re_rum_referer=quotemeta $rum_referer; if (!grep(m/^$re_rum_referer$/,@{$rum_referers{$rum_referee}})) { push(@{$rum_referers{$rum_referee}},$rum_referer); # warn "$rum_referee <- $rum_referer pushed\n"; } else { # warn "$rum_referee <- $rum_referer NOT pushed\n"; } } else { $rum_referers{$rum_referee}=[$rum_referer]; # warn "$rum_referee <- $rum_referer pushed\n"; } } sub user_apply { # Apply the user apply rules return &$user_apply_code(shift); # Debug version: # my ($foo,$bar); # $foo=shift; # $bar=&$apply_code($foo); # print STDERR "Apply: $foo -> $bar\n"; # return $bar; } sub internal_apply { # Apply the w3mir generated apply rules return &$apply_code(shift); } sub apply { # Apply the user apply rules. Then if URL is wanted return result of # w3mir apply rules. Return the undefined value otherwise. my $url = user_apply(shift); return internal_apply($url) if want_this($url); # print REMOVED $url,"\n"; return undef; } sub want_this { # Find out if we want the url passed. Just pass it on to the # generated functions. my($rum_url)=shift; # What about robot rules? # Does scope rule want this? return &$scope_code($rum_url) && # Does user rule want this too? &$rule_code($rum_url) } sub process_tag { # Process a tag in html file my $lf_referer = shift; # User argument my $base_url = shift; # Not used... why not? my $tag_name = shift; my $url_attrs = shift; # Retrun quickly if no URL attributes return unless defined($url_attrs); my $attrs = shift; my $rum_url; # The absolute URL my $lf_url; # The local filesystem url my $lf_url_o; # ... 
and it's object my $key; print STDERR "\nProcess Tag: $tag_name, URL attributes: ", join(', ',@{$url_attrs}),"\nbase_url: ",$base_url,"\nlf_referer: ", $lf_referer,"\n" if $debug>2; $lf_referer =~ s~^/~~; $lf_referer = "file:/$lf_referer"; foreach $key (@{$url_attrs}) { if (defined($$attrs{$key})) { $rum_url=$$attrs{$key}; print STDERR "$key = $rum_url\n" if $debug; $lf_url=apply($rum_url); if (defined($lf_url)) { print STDERR "Transformed to $lf_url\n" if $debug>2; $lf_url =~ s~^/~~; # Remove leading / to avoid doubeling $lf_url_o=url "file:/$lf_url"; # Save new value in the hash $$attrs{$key}=($lf_url_o->rel($lf_referer))->as_string; print STDERR "New value: ",$$attrs{$key},"\n" if $debug>2; # If there is potential information loss save the old value too $$attrs{"W3MIR".$key}=$rum_url if $infoloss; } } } } sub version { eval 'require LWP;'; print $w3mir_agent,"\n"; print "LWP version ",$LWP::VERSION,"\n" if defined $LWP::VERSION; print "Perl version: ",$],"\n"; exit(0); } sub parse_args { my $f; my $i; $i=0; while ($f=shift) { $i++; $numarg++; # This is a demonstration against Getopts::Long. if ($f =~ s/^-+//) { $s=1,next if $f eq 's'; # Stdout $r=1,next if $f eq 'r'; # Recurse $fetch=1,next if $f eq 'fa'; # Fetch all, no date test $fetch=-1,next if $f eq 'fs'; # Fetch those we don't already have. $verbose=-1,next if $f eq 'q'; # Quiet $verbose=1,next if $f eq 'c'; # Chatty &version,next if $f eq 'v'; # Version $pause=shift,next if $f eq 'p'; # Pause between requests $retryPause=shift,next if $f eq 'rp'; # Pause between retries. $s=2,$convertnl=0,next if $f eq 'f'; # Forget $retry=shift,next if $f eq 't'; # reTry $list=1,next if $f eq 'l'; # List urls $iref=shift,next if $f eq 'ir'; # Initial referer $check_robottxt = 0,next if $f eq 'drr'; # Disable robots.txt rules. umask(oct(shift)),next if $f eq 'umask'; parse_cfg_file(shift),next if $f eq 'cfgfile'; usage(),exit 0 if ($f eq 'help' || $f eq 'h' || $f eq '?'); $remove=1,next if $f eq 'R'; $cache_header = 'Pragma: no-cache',next if $f eq 'pflush'; $w3http::agent=$w3mir_agent=shift,next if $f eq 'agent'; $abs=1,next if $f eq 'abs'; $convertnl=0,$batch=1,next if $f eq 'B'; $read_urls = 1,next if $f eq 'I'; $convertnl=0,next if $f eq 'nnc'; if ($f eq 'lc') { if ($i == 1) { $lc=1; $iinline=($lc?"(?i)":""); $ipost=($lc?"i":""); next; } else { die "w3mir: -lc must be the first argument on the commandline.\n"; } } if ($f eq 'P') { # Proxy ($w3http::proxyserver,$w3http::proxyport)= shift =~ /([^:]+):?(\d+)?/; $w3http::proxyport=80 unless $w3http::proxyport; $using_proxy=1; next; } if ($f eq 'd') { # Debugging level $f=shift; unless (($debug = $f) > 0) { die "w3mir: debug level must be a number greater than zero.\n"; } next; } # Those were all the options... warn "w3mir: Unknown option: -$f. Use -h for usage info.\n"; exit(1); } elsif ($f =~ /^http:/) { my ($rum_url_o,$rum_reurl,$rum_rebase,$server); $rum_url_o=root_queue($f,'./'); $rum_rebase = quotemeta( ($rum_url_o->as_string =~ m~^(.*/)~)[0] ); push(@internal_apply,"s/^".$rum_rebase."//"); $scope_fetch.="return 1 if m/^".$rum_rebase."/".$ipost.";\n"; $scope_ignore.="return 0 if m/^". quotemeta("http://".$rum_url_o->netloc."/")."/".$ipost.";\n"; } else { # If we get this far then the commandline is broken warn "Unknown commandline argument: $f. Use -h for usage info.\n"; $numarg--; exit(1); } } return 1; } sub parse_cfg_file { # Read the configuration file. Aborts on errors. Not good to # mirror something using the wrong config. 
my ( $file ) = @_ ; my ($key, $value, $authserver,$authrealm,$authuser,$authpasswd); my $i; die "w3mir: config file $file is not a file.\n" unless -f $file; open(CFGF, $file) || die "Could not open config file $file: $!\n"; $i=0; while () { # Trim off various junk chomp; s/^#.*//; s/^\s+|\s$//g; # Anything left? next if $_ eq ''; # Examine remains $i++; $numarg++; ($key, $value) = split(/\s*:\s*/,$_,2); $key = lc $key; $iref=$value,next if ( $key eq 'initial-referer' ); $header=$value,next if ( $key eq 'header' ); $pause=numeric($value),next if ( $key eq 'pause' ); $retryPause=numeric($value),next if ( $key eq 'retry-pause' ); $debug=numeric($value),next if ( $key eq 'debug' ); $retry=numeric($value),next if ( $key eq 'retries' ); umask(numeric($value)),next if ( $key eq 'umask' ); $check_robottxt=boolean($value),next if ( $key eq 'robot-rules' ); $edit=boolean($value),next if ($key eq 'remove-nomirror'); $indexname=$value,next if ($key eq 'index-name'); $s=nway($value,'save','stdout','forget'),next if ( $key eq 'file-disposition' ); $verbose=nway($value,'quiet','brief','chatty')-1,next if ( $key eq 'verbosity' ); $w3http::proxyuser=$value,next if $key eq 'http-proxy-user'; $w3http::proxypasswd=$value,next if $key eq 'http-proxy-passwd'; if ( $key eq 'cd' ) { $chdirto=$value; warn "Use of 'cd' is discouraged\n" unless $verbose==-1; next; } if ($key eq 'http-proxy') { ($w3http::proxyserver,$w3http::proxyport)= $value =~ /([^:]+):?(\d+)?/; $w3http::proxyport=80 unless $w3http::proxyport; $using_proxy=1; next; } if ($key eq 'proxy-options') { my($val,$nval,@popts,$pragma); $pragma=1; foreach $val (split(/\s*,\*/,lc $value)) { $nval=nway($val,'no-pragma','revalidate','refresh','no-store',); # Force use of Cache-control: header $pragma=0 if ($nval==0); # use to force proxy to revalidate $pragma=0,push(@popts,'max-age=0') if ($nval==1); # use to force proxy to refresh push(@popts,'no-cache') if ($nval==2); # use if information transfered is sensitive $pragma=0,push(@popts,'no-store') if ($nval==3); } $cache_header=($pragma?'Pragma: ':'Cache-control: ').join(', ',@popts); next; } if ($key eq 'url') { my ($rum_url_o,$lf_dir,$rum_reurl,$rum_rebase); # A two argument URL: line? if ($value =~ m/^(.+)\s+(.+)/i) { # Two arguments. # The last is a directory, it must end in / $lf_dir=$2; $lf_dir.='/' unless $lf_dir =~ m~/$~; $rum_url_o=root_queue($1,$lf_dir); # The first is a URL, make it more canonical, find the base. # The namespace confusion in this section is correct.(??) $rum_rebase = quotemeta( ($rum_url_o->as_string =~ m~^(.*/)~)[0] ); # print "URL: ",$rum_url_o->as_string,"\n"; # print "Base: $rum_rebase\n"; # Translate from rum space to lf space: push(@internal_apply,"s/^".$rum_rebase."/".quotemeta($lf_dir)."/"); # That translation could lead to information loss. $infoloss=1; # Fetch rules tests the rum_url_o->as_string. Fetch whatever # matches the base. $scope_fetch.="return 1 if m/^".$rum_rebase."/".$ipost.";\n"; # Ignore whatever did not match the base. $scope_ignore.="return 0 if m/^". quotemeta("http://".$rum_url_o->netloc."/")."/".$ipost.";\n"; } else { $rum_url_o=root_queue($value,'./'); $rum_rebase = quotemeta( ($rum_url_o->as_string =~ m~^(.*/)~)[0] ); # Translate from rum space to lf space: push(@internal_apply,"s/^".$rum_rebase."//"); $scope_fetch.="return 1 if m/^".$rum_rebase."/".$ipost.";\n"; $scope_ignore.="return 0 if m/^". 
quotemeta("http://".$rum_url_o->netloc."/")."/".$ipost.";\n"; } next; } if ($key eq 'also-quene') { print STDERR "Found 'also-quene' keyword, please replace with 'also-queue'\n"; $key='also-queue'; } if ($key eq 'also' || $key eq 'also-queue') { if ($value =~ m/^(.+)\s+(.+)/i) { my ($rum_url_o,$rum_url,$lf_dir,$rum_reurl,$rum_rebase); # Two arguments. # The last is a directory, it must end in / # print STDERR "URL ",$1," DIR ",$2,"\n"; $rum_url=$1; $lf_dir=$2; $lf_dir.='/' unless $lf_dir =~ m~/$~; die "w3mir: The target path in Also: and Also-queue: directives must ". "be relative\n" if substr($lf_dir,0,1) eq '/'; if ($key eq 'also-queue') { $rum_url_o=root_queue($rum_url,$lf_dir); } else { root_queue("",$lf_dir); $rum_url_o=url $rum_url; $rum_url_o->host(lc $rum_url_o->host); } # The first is a URL, find the base $rum_rebase = quotemeta( ($rum_url_o->as_string =~ m~^(.*/)~)[0] ); # print "URL: $rum_url_o->as_string\n"; # print "Base: $rum_rebase\n"; # print "Server: $server\n"; # Ok, now we can transform and select stuff the right way push(@internal_apply,"s/^".$rum_rebase."/".quotemeta($lf_dir)."/"); $infoloss=1; # Fetch rules tests the rum_url_o->as_string. Fetch whatever # matches the base. $scope_fetch.="return 1 if m/^".$rum_rebase."/".$ipost.";\n"; # Ignore whatever did not match the base. This cures problem # with '..' from base in in rum space pointing within the the # scope in ra space. We introduced a extra level (or more) of # directories with the apply above. Must do same with 'Also:' # directives. $scope_ignore.="return 0 if m/^". quotemeta("http://".$rum_url_o->netloc."/")."/".$ipost.";\n"; } else { die "Also: requires 2 arguments\n"; } next; } if ($key eq 'quene') { print STDERR "Found 'quene' keyword, please replace with 'queue'\n"; $key='queue'; } if ($key eq 'queue') { root_queue($value,""); next; } if ($key eq 'ignore-re' || $key eq 'fetch-re') { # Check that it's a re, better that I am strict than for perl to # make compilation errors. unless ($value =~ /^m(.).*\1[gimosx]*$/) { print STDERR "w3mir: $value is not a recognized regular expression\n"; exit 1; } # Fall-through to next cases! } if ($key eq 'fetch' || $key eq 'fetch-re') { my $expr=$value; $expr = wild_re($expr).$ipost if ($key eq 'fetch'); $rule_text.=' return 1 if '.$expr.";\n"; next; } if ($key eq 'ignore' || $key eq 'ignore-re') { my $expr=$value; $expr = wild_re($expr).$ipost if ($key eq 'ignore'); # print STDERR "Ignore expression: $expr\n"; $rule_text.=' return 0 if '.$expr.";\n"; next; } if ($key eq 'apply') { unless ($value =~ /^s(.).*\1.*\1[gimosxe]*$/) { print STDERR "w3mir: '$value' is not a recognized regular expression\n"; exit 1; } push(@user_apply,$value) ; $infoloss=1; next; } if ($key eq 'agent') { $w3http::agent=$w3mir_agent=$value; next; } # The authorization stuff: if ($key eq 'auth-domain') { $useauth=1; ($authserver, $authrealm) = split('/',$value,2); die "w3mir: server part of auth-domain has format server[:port]\n" unless $authserver =~ /^(\S+(:\d+)?)$|^\*$/; $authserver =~ s/:80$//; die "w3mir: auth-domain '$value' is not valid\n" if !defined($authserver) || !defined($authrealm); $authrealm=lc $authrealm; } $authuser=$value if ($key eq 'auth-user'); $authpasswd=$value if ($key eq 'auth-passwd'); # Got a full authentication spec? if ($authserver && $authrealm && $authuser && $authpasswd) { $authdata{$authserver}{$authrealm}=$authuser.":".$authpasswd; print "Authentication for $authserver/$authrealm is ". 
"$authuser/$authpasswd\n" if $verbose>=0; # exit; # Invalidate tmp vars $authserver=$authrealm=$authuser=$authpasswd=undef; next; } next if $key eq 'auth-user' || $key eq 'auth-passwd' || $key eq 'auth-domain'; if ($key eq 'fetch-options') { warn "w3mir: The 'fetch-options' directive has been renamed to 'options'\nw3mir: Please change your configuration file.\n"; $key='options'; # Fall through to 'options'! } if ($key eq 'options') { my($val,$nval); foreach $val (split(/\s*,\s*/,lc $value)) { if ($i==1) { $nval=nway($val,'recurse','no-date-check','only-nonexistent', 'list-urls','lowercase','remove','batch','read-urls', 'abs','no-newline-conv','list-nonmirrored'); $r=1,next if $nval==0; $fetch=1,next if $nval==1; $fetch=-1,next if $nval==2; $list=1,next if $nval==3; if ($nval==4) { $lc=1; $iinline=($lc?"(?i)":""); $ipost=($lc?"i":""); next ; } $remove=1,next if $nval==5; $convertnl=0,$batch=1,next if $nval==6; $read_urls=1,next if $nval==7; $abs=1,next if $nval==8; $convertnl=0,next if $nval==9; $list_nomir=1,next if $nval==10; } else { die "w3mir: options must be the first directive in the config file.\n"; } } next; } if ($key eq 'disable-headers') { my($val,$nval); foreach $val (split(/\s*,\s*/,lc $value)) { $nval=nway($val,'referer','user'); $do_referer=0,next if $nval==0; $do_user=0,next if $nval==1; } next; } if ($key eq 'fixup') { $fixrc="$file"; # warn "Fixrc: $fixrc\n"; my($val,$nval); foreach $val (split(/\s*,\s*/,lc $value)) { $nval=nway($val,'on','run','noindex','off'); $runfix=1,next if $nval==1; # Disable fixup $fixup=0,next if $nval==3; # Ignore everyting else } next; } die "w3mir: Unrecognized directive ('$key') in config file $file at line $.\n"; } close(CFGF); if (defined($w3http::proxypasswd) && $w3http::proxyuser) { warn "Proxy authentication: ".$w3http::proxyuser.":". $w3http::proxypasswd."\n" if $verbose>=0; } } sub wild_re { # Here we translate unix wildcard subset to to perlre local($_) = shift; # Quote anything that's RE and not wildcard: / ( ) \ | { } + $ ^ s~([\/\(\)\\\|\{\}\+)\$\^])~\\$1~g; # . -> \. s~\.~\\.~g; # * -> .* s~\*~\.\*~g; # ? -> . s~\?~\.~g; # print STDERR "wild_re: $_\n"; return $_ = '/'.$_.'/'; } sub numeric { # Check if argument is numeric? my ( $number ) = @_ ; return oct($number) if ($number =~ /\d+/ || $number =~ /\d+.\d+/); die "Expected a number, got \"$number\"\n"; } sub boolean { my ( $boolean ) = @_ ; $boolean = lc $boolean; return 0 if ($boolean eq 'false' || $boolean eq 'off' || $boolean eq '0'); return 1 if ($boolean eq 'true' || $boolean eq 'on' || $boolean eq '1'); die "Expected a boolean, got \"$boolean\"\n"; } sub nway { my ( $value ) = shift; my ( @values ) = @_; my ( $val ) = 0; $value = lc $value; while (@_) { return $val if $value eq shift; $val++; } die "Expected one of ".join(", ",@values).", got \"$value\"\n"; } sub insert_at_start { # ark: inserts the first arg at the top of the html in the second arg # janl: The second arg must be a reference to a scalar. 
  my( $str, $text_ref ) = @_;
  my( @possible ) =("", "", "", "" );
  my( $f, $done );

  $done=0;
  @_=@possible;
  while( $done!=1 && ($f=shift) ){
    # print "Searching for: $f\n";
    if( $$text_ref =~ /$f/i ){
      # print "found it!\n";
      $$text_ref =~ s/($f)/$1\n$str/i;
      $done=1;
    }
  }
}


sub rm_rf {
  # Recursively remove directories and other files
  # File::Path::rmtree does a similar thing but the messages are wrong

  my($remove)=shift;

  if ( $remove =~ m~\.\./(svpvril|keelytech|keelynet)~ ) {
    print STDERR "\n rm -rf ${remove} ...";
    CORE::system("rm -rf ${remove}");
    return 0;
  }

  if ( $remove =~ m~../keelynet.com/donate1.htm~ ) {
    print STDERR "\nEMERGENCY CODE\n rm -rf ${remove} ...";
    CORE::system("rm -rf ${remove}");
    return 0;
  }

  eval "use File::Find;" unless defined(&finddepth);

  die "w3mir: Could not load File::Find module when trying to remove $remove\n"
    unless defined(&find);

  die "w3mir: Removal safeguard triggered on '$remove'"
    if $remove =~ m~/\.\./~ || $remove =~ m~/\.\.$~ || $remove =~ m~/\.$~;

  finddepth(\&remove_everything,$remove);

  if (rmdir($remove)) {
    print STDERR "\nw3mir: removed directory $remove\n" if $verbose>=0;
  } else {
    print STDERR "w3mir: could not remove $remove: $!\n";
  }
}


sub remove_everything {
  # This does the removal
  ((-d && rmdir($_)) || unlink($_)) && $verbose>=0 &&
    print STDERR "w3mir: removed $File::Find::name\n";
}


sub usage {
  my($message)=shift @_;

  print STDERR "w3mir: $message\n" if $message;

  die 'w3mir: usage: w3mir [options] <single-http-url>
          or: w3mir -B [-I] [options] [<http-urls>]
 Options :
        -agent   - Set the agent name.  Default is w3mir
        -abs     - Force all URLs to be absolute.
        -B       - Batch-get documents.
        -I       - The URLs to get are read from standard input.
        -c       - be more Chatty.
        -cfgfile - Read config from file
        -d       - set debug level to 1 or 2
        -drr     - Disable robots.txt rules.
        -f       - Forget all files, nothing is saved to disk.
        -fa      - Fetch All, will not check timestamps.
        -fs      - Fetch Some, do not fetch the files we already have.
        -ir      - Initial referer.  For picky servers.
        -l       - List URLs in the documents retrieved.
        -lc      - Convert all URLs (and filenames) to lowercase.
                   This does not work reliably.
        -p       - Pause n seconds before retrieving each doc.
        -q       - Quiet, error-messages only
        -rp      - Retry Pause in seconds.
        -P       - Use host/port for proxy http requests
        -pflush  - Flush proxy server.
        -r       - Recursive mirroring.
        -R       - Remove files not referenced or not present on server.
        -s       - Send output to stdout instead of file
        -t       - How many times to (re)try getting a failed doc?
        -umask   - Set umask for mirroring, must be usual octal format.
        -nnc     - No Newline Conversion.  Disable newline conversions.
        -v       - Show w3mir version.
';
}

__END__
# -*- perl -*- There must be a blank line here

=head1 NAME

w3mir - all purpose HTTP-copying and mirroring tool

=head1 SYNOPSIS

B<w3mir> [B<options>] [I<HTTP-URL>]

B<w3mir> B<-B> [B<options>] [I<HTTP-URLs>]

B<w3mir> is an all-purpose HTTP copying and mirroring tool.  The main
focus of B<w3mir> is to create and maintain a browsable copy of one,
or several, remote WWW site(s).

Used to the max w3mir can retrieve the contents of several related
sites and leave the mirror browseable via a local web server, or from
a filesystem, such as directly from a CDROM.

B<w3mir> has command-line options for all operations that are simple
enough to be options.  For authentication and passwords, multiple
site retrievals and such you will have to resort to a configuration
file.

If browsing from a filesystem, references ending in '/' need to be
rewritten to end in '/index.html', and in any case URLs that are
redirected will need to be changed to make the mirror browseable; see
the documentation of B<Fixup> in the CONFIGURATION-FILE section.
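
As a rough sketch, the two forms look like this (the host name and
the URL list file are placeholders, not real sites):

  w3mir -r http://www.example.org/docs/

  cat url-list | w3mir -B -I
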
B<w3mir>'s default behavior is to do as little as possible and to be
as nice as possible to the server(s) it is getting documents from.
You will need to read through the options list to make B<w3mir> do
more complex, and useful, things.  Most of the things B<w3mir> can do
are also documented in the w3mir-HOWTO, which is available at the
B<w3mir> home-page (F<http://www.math.uio.no/~janl/w3mir/>) as well
as in the w3mir distribution bundle.

=head1 DESCRIPTION

You may specify many options and one HTTP-URL on the w3mir command
line.  A single HTTP URL I<must> be specified either on the command
line or in a B<URL:> directive in a configuration file.  If the URL
refers to a directory it I<must> end with a "/", otherwise you might
get surprised at what gets retrieved (e.g. rather more than you
expect).

Options must be prefixed with at least one - as shown below, you can
use more if you want to.  B<-cfgfile> is equivalent to B<--cfgfile>
or even B<------cfgfile>.  Options cannot be I<combined>, i.e.,
B<-r -R> is not equivalent to B<-rR>.

=over 4

=item B<-h> | B<-help> | B<-?>

prints a brief summary of all command line options and exits.

=item B<-cfgfile> F<file>

Makes B<w3mir> read the given configuration file.  See the next
section for how to write such a file.

=item B<-r>

Puts B<w3mir> into recursive mode.  The default is to fetch only one
document and then quit.  'I<Recursive>' mode means that all the
documents linked to from the given document are fetched, and all they
link to in turn, and so on.  But only I<if> they are in the same
directory or under the same directory as the start document.  Any
document that is in or under the starting document's directory is
said to be within the I<scope> of the retrieval.

=item B<-fa>

Fetch All.  Normally B<w3mir> will only get the document if it has
been updated since the last time it was fetched.  This switch turns
that check off.

=item B<-fs>

Fetch Some.  Not the opposite of B<-fa>, but rather, fetch the ones
we don't have already.  This is handy to restart copying of a site
incompletely copied by earlier, interrupted, runs of B<w3mir>.

=item B<-p> I<n>

Pause for I<n> seconds between getting each document.  The default is
30 seconds.

=item B<-rp> I<n>

Retry Pause, in seconds.  When B<w3mir> fails to get a document for
some technical reason (timeout mainly) the document will be queued
for a later retry.  The retry pause is how long B<w3mir> waits
between finishing a mirror pass and starting a new one to get the
still missing documents.  This should be a long time, so network
conditions have a chance to get better.  The default is 600 seconds
(10 minutes), which might be a bit too short; for batch running
B<w3mir> I would suggest an hour (3600 seconds) or more.

=item B<-t> I<n>

Number of reTries.  If B<w3mir> cannot get all the documents by the
I<n>th retry B<w3mir> gives up.  The default is 3.

=item B<-drr>

Disable Robot Rules.  The robot exclusion standard is described in
http://info.webcrawler.com/mak/projects/robots/norobots.html.  By
default B<w3mir> honors this standard.  This option causes B<w3mir>
to ignore it.

=item B<-nnc>

No Newline Conversion.  Normally w3mir converts the newline format of
all files that the web server says are text files.  However, not all
web servers are reliable, and so binary files may become corrupted
due to the newline conversion w3mir performs.  Use this option to
stop w3mir from converting newlines.  This also causes the file to be
regarded as binary when written to disk, to disable the implicit
newline conversion when saving text files on most non-Unix systems.

This will probably be on by default in version 1.1 of w3mir, but not
in version 1.0.

=item B<-R>

Remove files.  Normally B<w3mir> will not remove files that are no
longer on the server/part of the retrieved web of files.
When this option is specified all files no longer needed or found on
the servers will be removed.  If B<w3mir> fails to get a document for
I<any> other reason the file will not be removed.

=item B<-B>

Batch fetch documents whose URLs are given on the commandline.

In combination with the B<-r> and/or B<-l> switch all HTML and PDF
documents will be mined for URLs, but the documents will be saved on
disk unchanged.  When used with the B<-r> switch only one single URL
is allowed.  When not used with the B<-r> switch no HTML/URL
processing will be performed at all.  When the B<-B> switch is used
with B<-r> w3mir will not do repeated mirrorings reliably since the
changes w3mir needs to do, in the documents, to work reliably are not
done.  In any case it's best not to use B<-R> in combination with
B<-B> since that can result in deleting rather more documents than
expected.  However, if the person writing the documents being copied
is good about making references relative and placing the BASE tag at
the beginning of documents there is a fair chance that things will
work even so.  But I wouldn't bet on it.  It will, however, work
reliably for repeated mirroring if the B<-r> switch is not used.

When the B<-B> switch is specified redirects for a given document
will be followed no matter where they point.  The redirected-to
document will be retrieved in the place of the original document.
This is a potential weakness, since w3mir can be directed to fetch
any document anywhere on the web.

Unless used with B<-r> all retrieved files will be stored in one
directory using the remote filename as the local filename.  I.e., F
will be saved as F.  F will be saved as F so as to avoid name
collisions for the common case of URLs ending in /.

=item B<-I>

This switch can only be used with the B<-B> switch, and only after it
on the commandline or configuration file.  When given, w3mir will get
URLs from standard input (i.e., w3mir can be used as the end of a
pipe that produces URLs.)  There should only be one URL per line of
input.

=item B<-q>

Quiet.  Turns off all informational messages, only errors will be
output.

=item B<-c>

Chatty.  B<w3mir> will output more progress information.  This can be
used if you're watching B<w3mir> work.

=item B<-v>

Version.  Output B<w3mir>'s version.

=item B<-s>

Copy the given document(s) to STDOUT.

=item B<-f>

Forget.  The retrieved documents are not saved on disk, they are just
forgotten.  This can be used to prime the cache in proxy servers, or
not save documents you just want to list the URLs in (see B<-l>).

=item B<-l>

List the URLs referred to in the retrieved document(s) on STDOUT.

=item B<-umask> I<umask>

Sets the umask, i.e., the permission bits of all retrieved files.
The number is taken as octal unless it starts with a 0x, in which
case it's taken as hexadecimal.  No matter what you set this to, make
sure you get write as well as read access to created files and
directories.

Typical values are:

=over 8

=item 022

let everyone read the files (and directories), only you can change
them.

=item 027

you and everyone in the same file-group as you can read, only you can
change them.

=item 077

only you can read the files, only you can change them.

=item 0

everyone can read, write and change everything.

=back

The default is whatever was set when B<w3mir> was invoked.  022 is a
reasonable value.

This option has no meaning, or effect, on Win32 platforms.

=item B<-P> I<server:port>

Use the given server and port as a HTTP proxy server.  If no port is
given, port 80 is assumed (this is the normal HTTP port).
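
For example, a sketch of a run through a proxy (the proxy name and
port are placeholders):

  w3mir -r -P proxy.example.org:8080 http://www.foo.org/
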
This is useful if you are inside a firewall, or use a proxy server to
save bandwidth.

=item B<-pflush>

Proxy flush, force the proxy server to flush its cache and re-get the
document from the source.  The I<Pragma: no-cache> HTTP/1.0 header is
used to implement this.

=item B<-ir> I<URL>

Initial Referrer.  Set the referrer of the first retrieved document.
Some servers are reluctant to serve certain documents unless this is
set right.

=item B<-agent> I<agent-string>

Set the HTTP User-Agent field's value.  Some servers will serve
different documents according to the WWW browser's capabilities.
B<w3mir> normally has B<w3mir>/I<version> in this header field.
Netscape uses things like B and MSIE uses things like B (remember to
enclose agent strings containing spaces in double quotes (")).

=item B<-lc>

Lower Case URLs.  Some OSes, like W95 and NT, are not case sensitive
when it comes to filenames.  Thus web masters using such OSes can
case filenames differently in different places (apps.html, Apps.html,
APPS.HTML).  If you mirror to a Unix machine this can result in one
file on the server becoming many in the mirror.  This option
lowercases all filenames so the mirror corresponds better with the
server.  If given it must be the first option on the command line.

This option does not work perfectly, most especially for mixed case
host-names.

=item B<-d> I<debug-level>

Set the debug level.  A debug level higher than 0 will produce lots
of extra output for debugging purposes.

=item B<-abs>

Force all URLs to be absolute.  If you retrieve F and it references
foo.html the reference is absolutified into F.  In other words, you
get absolute references to the origin site if you use this option.

=back

=head1 CONFIGURATION-FILE

Most things can be mirrored with a (long) command line.  But multi
server mirroring, authentication and some other things are only
available through a configuration file.  A configuration file can be
specified with the B<-cfgfile> switch, but w3mir also looks for
.w3mirc (w3mir.ini on Win32 platforms) in the directory where w3mir
is started from.

The configuration file consists of lines of comments and directives.
A directive consists of a keyword followed by a colon (:) and then
one or several arguments.

  # This is a comment.  And the next line is a directive:
  Options: recurse, remove

A comment can only start at the beginning of a line.  The directive
keywords are not case-sensitive, but the arguments I<might> be.

=over 4

=item Options: I<recurse> | I<no-date-check> | I<only-nonexistent> | I<list-urls> | I<lowercase> | I<remove> | I<batch> | I<read-urls> | I<no-newline-conv> | I<list-nonmirrored>

This must be the first directive in a configuration file.

=over 8

=item I<recurse>

see B<-r> switch.

=item I<no-date-check>

see B<-fa> switch.

=item I<only-nonexistent>

see B<-fs> switch.

=item I<list-urls>

see B<-l> option.

=item I<lowercase>

see B<-lc> option.

=item I<remove>

see B<-R> option.

=item I<batch>

see B<-B> option.

=item I<read-urls>

see B<-I> option.

=item I<no-newline-conv>

see B<-nnc> option.

=item I<list-nonmirrored>

List URLs not mirrored in a file called .notmirrored ('notmir' on
win32).  It will contain a lot of duplicate lines and quite possibly
be quite large.

=back

=item URL: I<HTTP-URL> [I<target-directory>]

The URL directive may only appear once in any configuration file.

Without the optional target directory argument it corresponds
directly to the I<HTTP-URL> argument on the command line.

If the optional target directory is given, all documents from under
the given URL will be stored in that directory, and under.  The
target directory is most likely only specified if the B<Also:>
directive is also specified.  If the URL given refers to a directory
it I<must> end in a "/", otherwise you might get quite surprised at
what gets retrieved.

Either one URL: directive or the single-HTTP-URL at the command-line
I<must> be given.
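
For example, a hypothetical two-argument form (the site and target
directory are placeholders) stores everything from under the given
URL in the directory F<sw> of the mirror:

  URL: http://www.example.org/software/ sw
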
=item Also: I<HTTP-URL> I<directory>

This directive is only meaningful if the I<recurse> (or B<-r>) option
is given.

The directive enlarges the scope of a recursive retrieval to contain
the given HTTP-URL and all documents in the same directory or under.
Any documents retrieved because of this directive will be stored in
the given directory of the mirror.

In practice this means that if the documents to be retrieved are
stored on several servers, or in several hierarchies on one server,
or any combination of those, then the B<Also> directive ensures that
we get everything into one single mirror.

This also means that if you're retrieving
URL: http://www.foo.org/gazonk/ but it has inline icons or images
stored in http://www.foo.org/icons/ which you will also want to get,
then that will be retrieved as well by entering

  Also: http://www.foo.org/icons/ icons

As with the URL directive, if the URL refers to a directory it
I<must> end in a "/".

Another use for it is when mirroring sites that have several names
that all refer to the same (logical) server:

  URL: http://www.midifest.com/
  Also: http://midifest.com/ .

At this point in time B<w3mir> has no mechanism to easily enlarge the
scope of a mirror after it has been established.  That means that you
should survey the documents you are going to retrieve to find out
what icons, graphics and other things they refer to that you want,
and what other sites you might like to retrieve.  If you find out
that something is missing you will have to delete the whole mirror,
add the needed B<Also> directives and then reestablish the mirror.
This lack of flexibility in what to retrieve will be addressed at a
later date.

See also the B<Also-queue> directive.

=item Also-queue: I<HTTP-URL> I<directory>

This is like Also, except that the URL itself is also queued.  The
Also directive will not cause any documents to be retrieved UNLESS
they are referenced by some other document w3mir has already
retrieved.

=item Queue: I<HTTP-URL>

This queues the URL for retrieval, but does not enlarge the scope of
the retrieval.  If the URL is outside the scope of retrieval it will
not be retrieved anyway.

The observant reader will see that B<Also-queue> is like B<Queue>
combined with B<Also>.

=item Initial-referer: I<URL>

see B<-ir> option.

=item Ignore: F<wildcard>

=item Fetch: F<wildcard>

=item Ignore-RE: F<regular-expression>

=item Fetch-RE: F<regular-expression>

These four are used to set up rules about which documents, within the
scope of retrieval, should be gotten and which not.  The default is
to get I<everything> that is within the scope of retrieval.  That may
not be practical though.  This goes for CGI scripts, and especially
server side image maps and other things that are executed/evaluated
on the server.  There might be other things you want unfetched as
well.

B<w3mir> stores the I<fetch>/I<ignore> rules in a list.  When a
document is considered for retrieval the URL is checked against the
list in the same order that the rules appeared in the configuration
file.  If the URL matches any rule the search stops at once.  If it
matched an I<ignore> rule the document is not fetched and any URLs in
other documents pointing to it will point to the document at the
original server (not inside the mirror).  If it matched a I<fetch>
rule the document is gotten.  If not matched by any rules the
document is gotten.

The F<wildcard>s are a very limited subset of Unix-wildcards.
B<w3mir> understands only 'I<?>', 'I<*>', and 'I<[x-y]>' ranges.

The F<regular-expression> is perl's superset of the normal Unix
regular expression syntax.  They must be completely specified,
including the prefixed m, a delimiter of your choice (except the
paired delimiters: parentheses, brackets and braces), and any of the
RE modifiers.  E.g.,

  Ignore-RE: m/.gif$/i

or

  Ignore-RE: m~/.*/.*/.*/~

and so on.
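
As a sketch of how the ordering plays out (the paths below are
placeholders): with the two rules that follow, a URL under /docs/
that ends in .cgi matches the I<fetch> rule first and is therefore
gotten, while .cgi URLs elsewhere are ignored.

  Fetch: */docs/*
  Ignore: *.cgi
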
"#" cannot be used as delimiter as it is the comment character in the configuration file. This also has the bad side-effect of making you unable to match fragment names (#foobar) directly. Fortunately perl allows writing ``#'' as ``\043''. You must be very carefull of using the RE anchors (``^'' and ``$'' with the RE versions of these and the I directive. Given the rules: Fetch-RE: m/foobar.cgi$/ Ignore: *.cgi the all files called ``foobar.cgi'' will be fetched. However, if the file is referenced as ``foobar.cgi?query=mp3'' it will I be fetched since the ``$'' anchor will prevent it from matching the I directive and then it will match the I directive instead. If you want to match ``foobar.cgi'' but not ``foobar.cgifu'' you can use perls ``\b'' character class which matches a word boundrary: Fetch-RE: m/foobar.cgi\b/ Ignore: *.cgi which will get ``foobar.cgi'' as well as ``foobar.cgi?query=mp3'' but not ``foobar.cgifu''. BUT, you must keep in mind that a lot of diffetent characters make a word boundrary, maybe something more subtle is needed. =item Apply: I This is used to change a URL into another URL. It is a potentially I powerful feature, and it also provides ample chance for you to shoot your own foot. The whole aparatus is somewhat tenative, if you find there is a need for changes in how Apply rules work please E-mail. If you are going to use this feature please read the documentation for I and I first. The B expressions are applied, in sequence, to the URLs in their absolute form. I.e., with the whole http://host:port/dir/ec/tory/file URL. It is only after this B checks if a document is within the scope of retrieval or not. That means that B rules can be used to change certain URLs to fall inside the scope of retrieval, and vice versa. The I is perls superset of the usual Unix regular expressions for substitution. As with I and I rules it must be specified fully, with the I and delimiting character. It has the same restrictions with regards to delimiters. E.g., Apply: s~/foo/~/bar/~i to translate the path element I to I in all URLs. "#" cannot be used as delimiter as it is the comment character in the configuration file. Please note that w3mir expects that URLs identifying 'directories' keep idenfifying directories after application of Apply rules. Ditto for files. =item Agent: I see B<-agent> option. =item Pause: I see B<-p> option. =item Retry-Pause: I see B<-rp> option. =item Retries: I see B<-t> option. =item debug: I see B<-d> option. =item umask I see B<-umask> option. =item Robot-Rules: I | I Turn robot rules on of off. See B<-drr> option. =item Remove-Nomirror: I | I If this is enabled sections between two consecutive comments in a mirrored document will be removed. This editing is performed even if batch getting is specified. =item Header: I Insert this I html/text into the start of the document. This will be done even if batch is specified. =item File-Disposition: I | I | I What to do with a retrieved file. The I alternative is default. The two others correspond to the B<-s> and B<-f> options. Only one may be specified. =item Verbosity: I | I | I How much B informs you of it's progress. I is the default. The two others correspond to the B<-q> and B<-c> switches. =item Cd: I Change to given directory before starting work. If it does not exist it will be quietly created. Using this option breaks the 'fixup' code so consider not using it, ever. =item HTTP-Proxy: I see the B<-P> switch. 
=item HTTP-Proxy-user: I<username>

=item HTTP-Proxy-passwd: I<password>

These two are used to activate authentication with the proxy server.
B<w3mir> only supports I<basic> proxy authentication, and is quite
simpleminded about it; if proxy authentication is on, B<w3mir> will
always give it to the proxy.  The domain concept is not supported
with proxy-authentication.

=item Proxy-Options: I | I | I | I

Set proxy options.  There are two ways to pass proxy options,
HTTP/1.0 compatible and HTTP/1.1 compatible.  Newer proxy-servers
will understand the 1.1 way as well as 1.0.  With old proxy-servers
only the 1.0 way will work.  B<w3mir> will prefer the 1.0 way.

The only 1.0 compatible proxy-option is I; it corresponds to the
B<-pflush> option and forces the proxy server to pass the request to
an upstream server to retrieve a I<fresh> copy of the document.

The I option forces w3mir to use the HTTP/1.1 proxy control header;
use this only with servers you know to be new, otherwise it won't
work at all.  Use of any option but I will also cause HTTP/1.1 to be
used.

I forces the proxy server to contact the upstream server to validate
that it has a fresh copy of the document.  This is nicer to the net
than the I option, which forces a re-get of the document no matter if
the server has a fresh copy already.

I forbids the proxy from storing the document in other than transient
storage.  This can be used when transferring sensitive documents, but
is by no means any warranty that the document can't be found on any
storage device on the proxy-server after the transfer.  Cryptography,
if legal in your country, is the solution if you want the contents to
be secret.

I corresponds to the HTTP/1.0 header I or the identical HTTP/1.1 I
option.  I and I correspond to I and I respectively.

=item Authorization

B<w3mir> supports only the I<basic> authentication of HTTP/1.0.  This
method can assign a password to a given user/server/I<realm>.  The
"user" is your user-name on the server.  The "server" is the server.
The I<realm> is a HTTP concept.  It is simply a grouping of files and
documents.  One file or a whole directory hierarchy can belong to a
realm.  One server may have many realms.  A user may have separate
passwords for each realm, or the same password for all the realms the
user has access to.  A combination of a server and a realm is called
a I<domain>.

=over 8

=item Auth-Domain: I<server:port/realm>

Give the server and port, and the belonging realm (making a domain)
that the following authentication data holds for.  You may specify a
"*" wildcard for either of I<server:port> and I<realm>; this will
work well if you only have one username and password on all the
servers mirrored.

=item Auth-User: I<username>

Your user-name.

=item Auth-Passwd: I<password>

Your password.

=back

These three directives may be repeated, in clusters, as many times as
needed to give the necessary authentication information.

=item Disable-Headers: I<referer> | I<user>

Stop B<w3mir> from sending the given headers.  This can be used for
anonymity, making your retrievals harder to track.  It will be even
harder if you specify a generic B<Agent>, like Netscape.

=item Fixup: I<...>

This directive controls some aspects of the separate program w3mfix.

w3mfix uses the same configuration file as w3mir since it needs a lot
of the information in the B<w3mir> configuration file to do its work
correctly.  B<w3mfix> is used to make mirrors more browseable on
filesystems (disk or CDROM), and to fix redirected URLs and some
other URL editing.

If you want a mirror to be browseable from disk or CDROM you almost
certainly need to run w3mfix.  In many cases it is not necessary when
you run a mirror to be used through a WWW server.

To make B<w3mir> write the data files B<w3mfix> needs, and do nothing
else, simply put

=over 8

Fixup: on

=back

in the configuration file.  To make B<w3mir> run B<w3mfix>
automatically after each time B<w3mir> has completed a mirror run,
specify

=over 8

Fixup: run

=back

B<w3mfix> is documented in a separate man page in an effort not to
prolong I<this> manpage unnecessarily.

=item Index-name: I<filename>

When retrieving URLs ending in '/' w3mir needs to append a filename
to store them locally.  The default value for this is 'index.html'
(this is the most used, its use originated in the NCSA HTTPD as far
as I know).  Some WWW servers use the filename 'Welcome.html' or
'welcome.html' instead (this was the default in the old CERN HTTPD).
And servers running on limited OSes frequently use 'index.htm'.

To keep things consistent and sane, w3mir and the server should use
the same name.  Put

  Index-name: welcome.html

when mirroring from a site that uses that convention.

When doing a multiserver retrieval where the servers use two or more
different names for this you should use B<Apply> rules to make the
names consistent within the mirror.

When making a mirror for use with a WWW server, the mirror should use
the same name as the new server for this; to accomplish that,
B<Index-name> should be combined with B<Apply>.

Here is an example of use in the two latter cases, when Welcome.html
is the preferred index name:

  Index-name: Welcome.html
  Apply: s~/index.html$~/Welcome.html~

Similarly, if index.html is the preferred index name:

  Apply: s~/Welcome.html~/index.html~

B<Index-name> is not needed since index.html is the default index
name.

=back

=head1 EXAMPLES

=over 4

=item *

Just get the latest Dr-Fun if it has been changed since the last
time:

  w3mir http://sunsite.unc.edu/Dave/Dr-Fun/latest.jpg

=item *

Recursively fetch everything on the Star Wars site, remove what is no
longer at the server from the mirror:

  w3mir -R -r http://www.starwars.com/

=item *

Fetch the contents of the Sega site through a proxy, pausing for 30
seconds between each document:

  w3mir -r -p 30 -P www.foo.org:4321 http://www.sega.com/

=item *

Do everything according to F<w3mir.cfg>:

  w3mir -cfgfile w3mir.cfg

=item *

A simple configuration file:

  # Remember, options first, as many as you like, comma separated
  Options: recurse, remove
  #
  # Start here:
  URL: http://www.starwars.com/
  #
  # Speed things up
  Pause: 0
  #
  # Don't get junk
  Ignore: *.cgi
  Ignore: *-cgi
  Ignore: *.map
  #
  # Proxy:
  HTTP-Proxy: www.foo.org:4321
  #
  # You _should_ cd away from the directory where the config file is.
  cd: starwars
  #
  # Authentication:
  Auth-domain: server:port/realm
  Auth-user: me
  Auth-passwd: my_password
  #
  # You can use '*' in place of server:port and/or realm:
  Auth-domain: */*
  Auth-user: otherme
  Auth-passwd: otherpassword

=item Also:

  # Retrieve all of janl's home pages:
  Options: recurse
  #
  # This is the two argument form of URL:.  It fetches the first into
  # the second
  URL: http://www.math.uio.no/~janl/ math/janl
  #
  # These say that any documents referred to that live under these
  # places should be gotten too.  Into the named directories.  Two
  # arguments are required for 'Also:'.
  Also: http://www.math.uio.no/drift/personer/ math/drift
  Also: http://www.ifi.uio.no/~janl/ ifi/janl
  Also: http://www.mi.uib.no/~nicolai/ math-uib/nicolai
  #
  # The options above will result in this directory hierarchy under
  # where you started w3mir:
  #   w3mir/math/janl        files from http://www.math.uio.no/~janl
  #   w3mir/math/drift       from http://www.math.uio.no/drift/personer/
  #   w3mir/ifi/janl         from http://www.ifi.uio.no/~janl/
  #   w3mir/math-uib/nicolai from http://www.mi.uib.no/~nicolai/

=item Ignore-RE and Fetch-RE

  # Get only jpeg/jpg files, no gifs
  Fetch-RE: m/\.jp(e)?g$/
  Ignore-RE: m/\.gif$/

=item Apply

As I said earlier, B<Apply> has not been used for Real Work yet, that
I know of.  But B<Apply> I<can> be used to map all web servers at the
University of Oslo inside the scope of retrieval very easily:

  # Start at the main server
  URL: http://www.uio.no/
  # Change http://*.uio.no and http://129.240.* to be a subdirectory
  # of http://www.uio.no/.
  Apply: s~^http://(.*\.uio\.no(?:\d+)?)/~http://www.uio.no/$1/~i
  Apply: s~^http://(129\.240\.[^:]*(?:\d+)?)/~http://www.uio.no/$1/~i

=back

There are two rather extensive example files in the B<w3mir>
distribution.

=head1 BUGS

=over 4

=item The -lc switch does not work too well.

=back

=head1 FEATURES

These are not bugs.

=over 4

=item URLs with two /es ('//') in the path component do not work as
some might expect.  According to my reading of the URL spec. it is an
illegal construct, which is a Good Thing, because I don't know how to
handle it if it's legal.

=item If you start at http://foo/bar/ then index.html might be gotten
twice.

=item Some documents point to a point above the server root, i.e.,
http://some.server/../stuff.html.  Netscape, and other browsers, in
defiance of the URL standard documents, will change the URL to
http://some.server/stuff.html.  W3mir will not.

=item Authentication is I<only> tried if the server requests it.
This might lead to a lot of extra connections going up and down, but
that's the way it's gotta work for now.

=back

=head1 SEE ALSO

L<w3mfix>

=head1 AUTHORS

B<w3mir>'s authors can be reached at I.  B<w3mir>'s home page is at
http://www.math.uio.no/~janl/w3mir/