Loader and this manual are Open Source Software; you may redistribute them and/or modify them under the terms of the GNU General Public License, version 2, as published by the Free Software Foundation.
This is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details.
A copy of the GNU General Public License is available on the World Wide Web at http://www.gnu.org/copyleft/gpl.html. You can also obtain it by writing to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
This manual is still in DRAFT state. Many sections are still missing or inaccurate. If you are out of luck, see the comments in the sample configuration files or consult the sources for more information.
I have tried many programs that copy the web to a local disk, but none has satisfied me. I do not like GUI programs very much, and the non-GUI programs I tried were very primitive (you cannot configure many things). I also have the Smart Cache proxy server, which I use for storing data, so it would be great if a loader could take advantage of it.
I have also tried the 'best of web downloaders', GNU wget, but found it not configurable enough.
This program is called Smart Cache Loader, but it does not depend on Smart Cache. It works fine without it.
Smart Cache Loader is controlled by Chapter 3, Program configuration file and by command-line options. The best way is to define named sites in the configuration file and refer to them by aliases on the command line.
When the loader sees a URL on the command line, it looks into the configuration file and checks whether the URL is part of a known location. If no location matches, it configures the URL with the default options from the configuration file (their names start with Default: DefaultActions, DefaultMask, ...).
You can redefine the default location options with the command-line syntax name=value. The name is the same as used in the configuration file.
Currently implemented names are: scandepth, delay, crawltime, threads, retry, retrypriority, options, priority, alias, starturl, log, upd, referer.
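For example, an override might look like this (a hedged sketch: "loader" stands for however you start the program on your system, and the URL is only a placeholder):
loader scandepth=2 delay=1.5s http://www.example.com/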
Options follow Unix-like syntax: they start with a `-` character, followed by a single character.
File contents can be expanded and used as command-line arguments. This is done with the @<filename> syntax.
The program can use a configuration file other than the default one. This is done with the #<filename> syntax.
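A hedged illustration (the "loader" command name and the file names are placeholders; the # argument is quoted so the shell does not treat it as a comment):
loader '#mysites.cnf' @urls.txt http://www.example.com/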
You can add any URL to the command line. The program searches the known locations in the config file, and if the given URL belongs to a known location it inherits its settings; otherwise it uses the settings of the default location. The program always creates a new location set to the directory of the specified URL.
See subsections for special location setups.
You can add a URL on the command line to the list of already visited URLs. When the loader sees this URL while processing others, it takes no action because the URL is marked as already done.
This is done with the :<URL> syntax. You can use the file-expanding feature and add a file with visited URLs with @:<filename>
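For example (a sketch with placeholder names; "loader" stands for however you invoke the program):
loader :http://www.example.com/old.html @:visited.txt http://www.example.com/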
You can force a URL to use the default configuration instead of searching through the list of known sites. This is done with the %<URL> syntax.
You can force a URL to be used as another starting URL for its parent location. This is done with the ^<URL> syntax.
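A hedged example combining both syntaxes (all names and URLs are placeholders):
loader %http://other.example.com/page.html ^http://www.example.com/extra.html http://www.example.com/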
Smart Cache Loader uses the loader.cnf configuration file by default.
An alternate config file can be given on the command line using the #filename.ext syntax. You can use more than one configuration file.
Statements in the configuration file can be:
Immediate definitions change the behavior of the configuration processing engine. They can be used anywhere in the configuration file and can appear more than once. After they are processed, they affect how the following configuration options are parsed.
Global definitions are applied to all servers in the configuration file or specified on the command line. They can appear at any place in the configuration file.
Global definitions are:
http_proxy <proxy address> <port>
The loader will use this proxy server for servicing requests.
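For example, to send requests through a proxy listening on the local machine (the address and port below are only placeholders):
http_proxy 127.0.0.1 8080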
What to do with downloaded data? The local store can be one of:
none - No data will be stored locally.
directory directory_name - Data will be stored locally in directory_name.
smartcache path_to_scache.cnf - The loader will use Smart Cache for storing data. You must also set the section called “http_proxy” to point to Smart Cache. The loader does not write any data to the Smart Cache directory; it just looks inside to see which data are already there.
These settings apply to all servers or locations processed by the loader unless one of the nodefault server options is set. Default masks are processed after any server-specific masks.
Sets the default server priority (a floating-point number). For correct program operation it should be > 0.
Sets the default delay between requests sent to a site. See the section called “delay”.
Sets the default maximum time spent crawling a site. See the section called “crawltime”.
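A hedged sketch of such global defaults. Chapter 2 mentions that the default names carry the Default prefix (DefaultActions, DefaultMask, ...); the exact spellings DefaultPriority, DefaultDelay and DefaultCrawltime below are assumptions, so check the sample loader.cnf:
DefaultPriority 1.0
DefaultDelay 1s
DefaultCrawltime 1h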
Every server or URL location can have a specific setup. A new location starts with the keyword the section called “Location”.
The keyword Location defines the start of a new per-location section in the config file. The syntax is Location <common URL part>. It is recommended to end the location URL with / to avoid possible problems.
Location http://slashdot.org/
If documents at the section called “Location” can also be accessed by an alternative URL, add this URL as an alias. For example, the server abc.com can also be accessed as www.abc.com. You can have multiple aliases.
Location http://www.abc.com/ Alias http://abc.com/
The URL at which fetching starts. It can be outside the Location, and you can have more than one. If undefined, it defaults to the Location.
Default Referer: header for this location. This option has an effect only for starting URLs, because the real referers are used for all other URLs.
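A hedged per-location sketch (the keyword spelling and capitalization are assumed from the section names and the URLs are placeholders; check the sample loader.cnf):
Location http://www.abc.com/
starturl http://www.abc.com/news/
referer http://www.abc.com/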
The name of the location. Used for Chapter 2, Program command line options.
How many link levels will be scanned. If set to 0, only the current page and its images are loaded. If set to -1, only the starturl is loaded.
Sets the minimum delay between requests sent to the same location. It is recommended to use a delay of at least 0.5 s to avoid flooding the target server. You can use floating-point numbers and units, for example "2.6s".
Supported units are: s/S (seconds), m (minutes), M (months), h/H (hours), d/D (days), w/W (weeks), y/Y (years). There is no space between the number and the unit.
Sets the maximum time spent crawling a location; no more requests are sent or documents loaded from local storage after this timer expires. Setting it to zero means no limit. You can use the same time units as in the section called “delay”.
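A hedged sketch of these per-location limits (keyword spellings are assumed from the section names and the values are only placeholders):
Location http://slashdot.org/
scandepth 1
delay 2.6s
crawltime 30m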
Server options. A comma-separated list of options; you can invert an option by prepending !. The list of options follows:
Turns on auto-fetching for this site. Auto-fetching means that if a config file with active sites is processed, the loader starts loading these sites even with an empty command line.
Remembers all seen URLs to avoid testing duplicate URLs against the masks. It uses more memory than the section called “remembervisited” but saves CPU cycles. This is the default choice for backward-compatibility reasons; its use is not recommended unless you have a slow CPU.
Remembers just the visited URLs, which saves memory at the cost of CPU cycles. It is generally a better choice than the section called “rememberseen”, because it avoids some side effects of URL caching when there are multiple ways to navigate from the starturl to a certain page.
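For example, a location might select the memory-friendlier strategy like this (a sketch; the options keyword spelling and syntax are assumed from the section names):
options remembervisited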
Actions is a whitespace-separated list of name=value pairs. These values are merged with the section called “defaultactions” unless the section called “nodefaultactions” is specified. Actions are used as defaults for the section called “mask”.
Similar to the section called “actions”, but it is merged with actions first, then with defaultactions.
extracturl is a powerful option for extracting links from page text. extracturl has one mandatory and one optional argument: the mandatory first argument is a regular expression and the optional second argument is a replacement string. Extracted links are marked as coming from the CONTENT tag.
For detailed information about the regular expression syntax used in Java, see the Java 2 Platform API, http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
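A minimal hedged example: the regular expression below is only an illustration that pulls absolute http links out of plain page text, and any quoting required by the config parser is an assumption:
extracturl http://[A-Za-z0-9./_?&=%-]+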
Main and most powerful SC Loader config command. It checks a URL against the specified input conditions and performs the specified action (load, reject URL, ...). The mask command is a list of space-separated name=value[,value ...] pairs. The list of names follows:
Matches the URL by regexp. The URL is stripped according to the section called “strip” before comparing. The URL can have the special value 'any' or '*', which matches any given URL. If you specify multiple URLs separated by ',', they are ORed.
Matches filename extensions. Same as the section called “url” except that no stripping is done. Using ext is faster than url.
The name of the HTML tag in which the link was found. The src mask is used mostly for IMG or A. It is in upper case and can contain a regex. The magic words any or * match any SRC tag. You can also prepend ! to the tag, which negates it. The HTML tag must be written in upper case, i.e. IMG, not img. This is a limitation of the parser code.
mask src=!A act=reject
Rejects all non-anchor links.
Changes the fetch depth if the URL is matched. You cannot increase the nesting depth level with this command.
Checks whether the size of the URL is at least xxx bytes. You can also use the magic words 'known', 'unknown', 'any'.
Location of the destination the URL is targeting. Known values are: any, world, known, server, location, directory, subdir or auto. You can have more than one target, separated by ','. Special targets are site (everything on this server), me (everything on this location) or auto (guess the target from the URL mask used).
What to do with a matched URL? Possible values are reject (ignore), load, noparse (load but do not parse HTML), fastclose (close after sending the request), close (close on reply from the server), nosave (do not save it to disk), direct (do not use a proxy for this request).
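For example, a mask that loads images hosted on the same server could look like this (a hedged sketch; the extension list and target value are only illustrative):
mask ext=jpg,gif,png target=server act=load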
This directive controls which parts of URL processing are logged and what is logged. The parts are: queue, load, parse, store, ioerr, fatalerr, reject. You can also use some predefined special names for convenience.
Special names are none, server (use server default from actions command), all (log everything).
Two directives control what will be logged: url logs only the URL, and depth logs the depth of the request.
The update strategy for a URL. It can be ONE of load (load it), norefresh (do not load it if already loaded), reload (force re-loading from the cache), update (check the time difference in hours; example: upd=update,24), forceupdate (force the proxy server to update within xx hours), noreparse (do not reparse already loaded HTML documents).
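For example, a mask that loads everything but skips documents that are already present in the local store might look like this (a hedged sketch):
mask url=any act=load upd=norefresh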