Loader and this manual are Open Source Software; you may redistribute them and/or modify them under the terms of the GNU General Public License, version 2, as published by the Free Software Foundation.
This is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details.
A copy of the GNU General Public License is available on the World Wide Web at http://www.gnu.org/copyleft/gpl.html. You can also obtain it by writing to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
This manual is still in DRAFT state. Many sections are still missing or inaccurate. If you are out of luck, see the comments in the sample configuration files or consult the sources for more information.
I have tried many programs that copy the web to a local disk, but none has satisfied me. I do not like GUI programs very much, and the non-GUI programs I tried were very primitive (you cannot configure many things). I also have the Smart Cache proxy server, which I use for storing data, so it would be great if a loader could take advantage of it.
I have also tried the 'best of web downloaders', GNU wget, but found it not configurable enough.
This program is called Smart Cache Loader, but it does not depend on Smart Cache. It works fine without it.
Smart Cache Loader is controlled by Chapter 3, Program configuration file and by command-line options. The best way is to define named sites in the configuration file and refer to them by aliases on the command line.
When the loader sees a URL on the command line, it looks into the configuration file and checks whether the URL is part of a known location. If no location matches, it configures the URL with the default options from the configuration file (their names start with Default: DefaultActions, DefaultMask, ...).
You can redefine the default location options with the command-line syntax name=value. The name is the same as used in the configuration file.
Currently implemented names are: scandepth, delay, crawltime, threads, retry, retrypriority, options, priority, alias, starturl, log, upd, referer.
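For example, an override might look like this (a hedged sketch: "loader" stands for however you start the program on your system, and the URL is only a placeholder):
loader scandepth=2 delay=1.5s http://www.example.com/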
Options follow Unix-like syntax: they start with a `-` character, followed by a single character.
File contents can be expanded and used as command-line arguments. This is done with the @<filename> syntax.
The program can use a configuration file other than the default one. This is done with the #<filename> syntax.
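A hedged illustration (the "loader" command name and the file names are placeholders; the # argument is quoted so the shell does not treat it as a comment):
loader '#mysites.cnf' @urls.txt http://www.example.com/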
You can add any URL to the command line. The program searches the known locations in the config file, and if the given URL belongs to a known location it inherits its settings; otherwise it uses the settings of the default location. The program always creates a new location set to the directory of the specified URL.
See subsections for special location setups.
You can add a URL on the command line to the list of already visited URLs. When the loader sees this URL while processing others, it takes no action because the URL is marked as already done.
This is done with the :<URL> syntax. You can use the file-expanding feature and add a file with visited URLs with @:<filename>
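For example (a sketch with placeholder names; "loader" stands for however you invoke the program):
loader :http://www.example.com/old.html @:visited.txt http://www.example.com/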
You can force a URL to use the default configuration instead of searching through the list of known sites. This is done with the %<URL> syntax.
You can force a URL to be used as another starting URL for its parent location. This is done with the ^<URL> syntax.
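A hedged example combining both syntaxes (all names and URLs are placeholders):
loader %http://other.example.com/page.html ^http://www.example.com/extra.html http://www.example.com/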
Smart Cache Loader uses the loader.cnf configuration file by default.
An alternate config file can be given on the command line using the #filename.ext syntax. You can use more than one configuration file.
Statements in the configuration file can be:
Immediate definitions change the behavior of the configuration processing engine. They can be used anywhere in the configuration file and can appear more than once. After they are processed, they affect how the following configuration options are parsed.
Global definitions are applied to all servers in the configuration file or specified on the command line. They can appear at any place in the configuration file.
Global definitions are:
http_proxy <proxy address> <port>
The loader will use this proxy server for servicing requests.
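For example, to send requests through a proxy listening on the local machine (the address and port below are only placeholders):
http_proxy 127.0.0.1 8080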
What to do with downloaded data? The local store can be one of:
none - No data will be stored locally.
directory directory_name - Data will be stored locally in directory_name.
smartcache path_to_scache.cnf - The loader will use Smart Cache for storing data. You must also set the section called “http_proxy” to point to Smart Cache. The loader does not write any data to the Smart Cache directory; it just looks inside to see which data are already there.
These settings apply to all servers or locations processed by the loader unless one of the nodefault server options is set. Default masks are processed after any server-specific masks.
Sets the default server priority (a floating-point number). For correct program operation it should be > 0.
Sets the default delay between requests sent to a site. See the section called “delay”.
Sets the default maximum time spent crawling a site. See the section called “crawltime”.
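A hedged sketch of such global defaults. Chapter 2 mentions that the default names carry the Default prefix (DefaultActions, DefaultMask, ...); the exact spellings DefaultPriority, DefaultDelay and DefaultCrawltime below are assumptions, so check the sample loader.cnf:
DefaultPriority 1.0
DefaultDelay 1s
DefaultCrawltime 1h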
Every server or URL location can have a specific setup. A new location starts with the keyword the section called “Location”.
The keyword Location defines the start of a new per-location section in the config file. The syntax is Location <common URL part>. It is recommended to end the location URL with / to avoid possible problems.
Location http://slashdot.org/
If documents at the section called “Location” can also be accessed by an alternative URL, add this URL as an alias. For example, the server abc.com can also be accessed as www.abc.com. You can have multiple aliases.
Location http://www.abc.com/ Alias http://abc.com/
The URL at which fetching starts. It can be outside the Location, and you can have more than one. If undefined, it defaults to the Location.
Default Referer: header for this location. This option has an effect only for starting URLs, because the real referers are used for all other URLs.
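A hedged per-location sketch (the keyword spelling and capitalization are assumed from the section names and the URLs are placeholders; check the sample loader.cnf):
Location http://www.abc.com/
starturl http://www.abc.com/news/
referer http://www.abc.com/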
The name of the location. Used for Chapter 2, Program command line options.
How many link levels will be scanned. If set to 0, only the current page and its images are loaded. If set to -1, only the starturl is loaded.
Sets the minimum delay between requests sent to the same location. It is recommended to use a delay of at least 0.5 s to avoid flooding the target server. You can use floating-point numbers and units, for example "2.6s".
Supported units are: s/S (seconds), m (minutes), M (months), h/H (hours), d/D (days), w/W (weeks), y/Y (years). There is no space between the number and the unit.
Sets the maximum time spent crawling a location; no more requests are sent or documents loaded from local storage after this timer expires. Setting it to zero means no limit. You can use the same time units as in the section called “delay”.
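A hedged sketch of these per-location limits (keyword spellings are assumed from the section names and the values are only placeholders):
Location http://slashdot.org/
scandepth 1
delay 2.6s
crawltime 30m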
Server options. A comma-separated list of options; you can invert an option by prepending !. The list of options follows:
Turns on auto-fetching for this site. Auto-fetching means that if a config file with active sites is processed, the loader starts loading these sites even with an empty command line.
Remembers all seen URLs to avoid testing duplicate URLs against the masks. It uses more memory than the section called “remembervisited” but saves CPU cycles. This is the default choice for backward-compatibility reasons; its use is not recommended unless you have a slow CPU.
Remembers just the visited URLs, which saves memory at the cost of CPU cycles. It is generally a better choice than the section called “rememberseen”, because it avoids some side effects of URL caching when there are multiple ways to navigate from the starturl to a certain page.
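For example, a location might select the memory-friendlier strategy like this (a sketch; the options keyword spelling and syntax are assumed from the section names):
options remembervisited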
Actions is a whitespace-separated list of name=value pairs. These values are merged with the section called “defaultactions” unless the section called “nodefaultactions” is specified. Actions are used as defaults for the section called “mask”.
Similar to the section called “actions”, but it is merged with actions first, then with defaultactions.
extracturl is a powerful option for extracting links from page text. extracturl has one mandatory and one optional argument: the mandatory first argument is a regular expression and the optional second argument is a replacement string. Extracted links are marked as coming from the CONTENT tag.
For detailed information about the regular expression syntax used in Java, see the Java 2 Platform API, http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
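A minimal hedged example: the regular expression below is only an illustration that pulls absolute http links out of plain page text, and any quoting required by the config parser is an assumption:
extracturl http://[A-Za-z0-9./_?&=%-]+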
Main and most powerful SC Loader config command. It checks a URL against the specified input conditions and performs the specified action (load, reject URL, ...). The mask command is a list of space-separated name=value[,value ...] pairs. The list of names follows:
Matches the URL by regexp. The URL is stripped according to the section called “strip” before comparing. The URL can have the special value 'any' or '*', which matches any given URL. If you specify multiple URLs separated by ',', they are ORed.
Matches filename extensions. Same as the section called “url” except that no stripping is done. Using ext is faster than url.
The name of the HTML tag in which the link was found. The src mask is used mostly for IMG or A. It is in upper case and can contain a regex. The magic words any or * match any SRC tag. You can also prepend ! to the tag, which negates it. The HTML tag must be written in upper case, i.e. IMG, not img. This is a limitation of the parser code.
mask src=!A act=reject
Rejects all non-anchor links.
Changes the fetch depth if the URL is matched. You cannot increase the nesting depth level with this command.
Checks whether the size of the URL is at least xxx bytes. You can also use the magic words 'known', 'unknown', 'any'.
Location of the destination the URL is targeting. Known values are: any, world, known, server, location, directory, subdir or auto. You can have more than one target, separated by ','. Special targets are site (everything on this server), me (everything on this location) or auto (guess the target from the URL mask used).
What to do with a matched URL? Possible values are reject (ignore), load, noparse (load but do not parse HTML), fastclose (close after sending the request), close (close on reply from the server), nosave (do not save it to disk), direct (do not use a proxy for this request).
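For example, a mask that loads images hosted on the same server could look like this (a hedged sketch; the extension list and target value are only illustrative):
mask ext=jpg,gif,png target=server act=load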
This directive controls which parts of URL processing are logged and what is logged. The parts are: queue, load, parse, store, ioerr, fatalerr, reject. You can also use some predefined special names for convenience.
Special names are none, server (use server default from actions command), all (log everything).
Two directives control what will be logged: url logs only the URL, and depth logs the depth of the request.
The update strategy for a URL. It can be ONE of load (load it), norefresh (do not load it if already loaded), reload (force re-loading from the cache), update (check the time difference in hours; example: upd=update,24), forceupdate (force the proxy server to update within xx hours), noreparse (do not reparse already loaded HTML documents).
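For example, a mask that loads everything but skips documents that are already present in the local store might look like this (a hedged sketch):
mask url=any act=load upd=norefresh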