NAME
webcheck - website link checker
SYNOPSIS
webcheck [OPTION]... URL
DESCRIPTION
webcheck will check the document at the
specified URL for links to other documents, follow these links
recursively and generate an HTML report.
- -i, --internal=PATTERN
- Mark URLs matching the PATTERN (perl-type regular
expression) as an internal link. Can be used multiple times. Note
that the PATTERN is matched against the full URL. URLs matching
this PATTERN will be considered internal, even if they match one of
the --external PATTERNs.
- -x, --external=PATTERN
- Mark URLs matching the PATTERN (perl-type regular
expression) as an external link. Can be used multiple times. Note
that the PATTERN is matched against the full URL.
- -y, --yank=PATTERN
- Do not check URLs matching the PATTERN (perl-type
regular expression). Like the -x flag, though this option will
cause webcheck to not check the link matched by regex whereas -x
will check the link but not its children. Can be used multiple
times. Note that the PATTERN is matched against the full URL.
- -b, --base-only
- Consider any URL not starting with the base URL to be external.
For example, if you run
webcheck -b
then http://www.example.com/foo/bar
will be considered internal whereas http://www.example.com/ will be
considered external. By default all the pages on the site will be
considered internal.
- -a, --avoid-external
- Avoid external links. Normally if webcheck is examining an HTML
page and it finds a link that points to an external document, it
will check to see if that external document exists. This flag
disables that action.
- -q, --quiet, --silent
- Do not print out progress as webcheck traverses a site.
- -d, --debug
- Print debugging information while crawling the site. This
option is mainly useful for developers.
- -o, --output=DIRECTORY
- Output directory. Use to specify the directory where webcheck
will dump its reports. The default is the current directory or as
specified by config.py. If this directory does not exist it will be
created for you (if possible).
- -c, --continue
- Try to continue from a previous run. When using this option
webcheck will look for a webcheck.dat in the output directory. This
file is read to restore the state from the previous run. This
allows webcheck to continue a previously interrupted run. When this
option is used, the --internal, --external and --yank options will
be ignored as well as any URL arguments. The --base-only and
--avoid-external options should be the same as the previous
run.
Note that this option is experimental and it's semantics may change
with coming releases (especially in relation to other options).
Also note that the stored files are not guaranteed to be compatible
between releases.
- -f, --force
- Overwrite files without asking.
- -r, --redirects=N
- Redirect depth. the number of redirects webcheck should follow
when following a link. 0 implies to follow all redirects.
- -w, --wait=SECONDS
- Wait SECONDS between document retrievals. Usually
webcheck will process a url and immediately move on to the next.
However on some loaded systems it may be desirable to have webcheck
pause between requests. This option can be set to any non-negative
number.
- -v, --version
- Show version of program.
- -h, --help
- Show short summary of options.
URL CLASSES
URLs are divided into two classes:
Internal URLs are retrieved and the retrieved item is
checked for syntax. Also, the retrieved item is searched for links
to other items (of any class) and these links are followed.
External URLs are only retrieved to test whether they are
valid and to gather some basic information from them (title, size,
valid and to gather some basic information from them (title, size,
content-type, etc). The retrieved items are not inspected for links
to other items.
Apart from their class URLs can also be considered yanked
(as specified with the --yank or --avoid-external options). The
URLs can be either internal or external and will not be retrieved
or checked at all. URLs of unsupported schemes are also considered
yanked.
EXAMPLES
Check the site www.example.com but consider any path
with "/webcheck" in it to be external.
webcheck
NOTES
When checking internal URLs webcheck honors the robots.txt file,
identifying itself as user-agent webcheck. Disallowed links will
not be checked at all as if the -y option was specified for that
URL. To allow webcheck to crawl parts of a site that other robots
are disallowed, use something like:
User-agent: *
Disallow: /foo
User-agent: webcheck
Allow: /foo
ENVIRONMENT
- <scheme>_proxy
- Proxy url for <scheme>.
REPORTING BUGS
Bug reports shoult be sent to the current maintainer
<arthur@ch.tudelft.nl>.
More information on reporting bugs can be found on the webcheck
homepage:
http://ch.tudelft.nl/~arthur/webcheck/
COPYRIGHT
Copyright © 1998, 1999 Albert Hopkins
(marduk)
Copyright © 2002 Mike W. Meyer
Copyright © 2005, 2006 Arthur de Jong
webcheck is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
The files produced as output from the software do not automatically
fall under the copyright of the software, unless explicitly stated
otherwise.