NAME
linkchecker - check HTML documents for broken links
SYNOPSIS
linkchecker [options]
[file-or-url]...
DESCRIPTION
LinkChecker features recursive checking, multithreading, output
in colored or normal text, HTML, SQL, CSV or a sitemap graph in GML
or XML, support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:,
Gopher, Telnet and local file links, restriction of link checking
with regular expression filters for URLs, proxy support,
username/password authorization for HTTP and FTP, robots.txt
exclusion protocol support, i18n support, a command line interface
and a (Fast)CGI web interface (requires an HTTP server).
EXAMPLES
The most common use checks the given domain
recursively, plus any URL pointing outside of the domain:
linkchecker
Beware that this checks the whole site, which can have several
hundred thousand URLs. Use the -r option to restrict the
recursion depth.
Don't connect to mailto: hosts, only check their URL syntax.
All other links are checked as usual:
linkchecker --ignore-url=^mailto:
Checking a local HTML file on Unix:
linkchecker ../bla.html
Checking a local HTML file on Windows:
linkchecker c:\temp\test.html
You can skip the http:// URL part if the domain starts with
www.:
linkchecker
You can skip the ftp:// URL part if the domain starts with
ftp.:
linkchecker -r0
Generate a sitemap graph and convert it with the graphviz dot
utility:
linkchecker -odot -v
OPTIONS
General options
- -h, --help
- Help me! Print usage information for this program.
- -fFILENAME, --config=FILENAME
- Use FILENAME as configuration file. By default
LinkChecker first searches /etc/linkchecker/linkcheckerrc
and then ~/.linkchecker/linkcheckerrc.
- -I, --interactive
- Ask for a URL if none is given on the command line.
- -tNUMBER, --threads=NUMBER
- Generate no more than the given number of threads. Default
number of threads is 10. To disable threading specify a
non-positive number.
- --priority
- Run with normal thread scheduling priority. By default
LinkChecker runs with low thread priority to be suitable as a
background job.
- -V, --version
- Print version and exit.
- --allow-root
- Do not drop privileges when running as root user on Unix
systems.
Output options
- -v, --verbose
- Log all checked URLs. Default is to log only errors and
warnings.
- --no-warnings
- Don't log warnings. Default is to log warnings.
- -WREGEX, --warning-regex=REGEX
- Define a regular expression which triggers a warning if it
matches any content of the checked link. This applies only to valid
pages, since only their content can be retrieved.
Use this to check for pages that contain some form of error, for
example "This page has moved" or "Oracle Application Server error".
- --warning-size-bytes=NUMBER
- Print a warning if content size info is available and exceeds
the given number of bytes.
- -q, --quiet
- Quiet operation, an alias for -o none. This is only
useful with -F.
- -oTYPE[/ENCODING], --output=TYPE[/ENCODING]
- Specify output type as text, html, sql,
csv, gml, dot, xml, none or
blacklist. Default type is text. The various output
types are documented below. The ENCODING specifies the
output encoding, the default is that of your locale. Valid
encodings are listed at .
- -FTYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]
- Output to a file linkchecker-out.TYPE,
$HOME/.linkchecker/blacklist for blacklist output, or
FILENAME if specified. The ENCODING specifies the
output encoding, the default is that of your locale. Valid
encodings are listed at .
The FILENAME and ENCODING parts are ignored for the none
output type. For all other types, a file that already exists
will be overwritten. You can specify this option more than once.
Valid file output types are text, html, sql,
csv, gml, dot, xml, none or
blacklist. Default is no file output. The various output
types are documented below. Note that you can suppress all console
output with the option -o none.
- --no-status
- Do not print check status messages.
- -DSTRING, --debug=STRING
- Print debugging output for the given logger. Available loggers
are cmdline, checking, cache, gui,
dns and all. Specifying all is an alias for
specifying all available loggers. The option can be given multiple
times to debug with more than one logger.
For accurate results, threading will be disabled during
debug runs.
- --trace
- Print tracing information.
- --profile
- Write profiling data into a file named linkchecker.prof
in the current working directory. See also --viewprof.
- --viewprof
- Print out previously generated profiling data. See also
--profile.
Checking options
- -rNUMBER, --recursion-level=NUMBER
- Recursively check all links up to the given depth. A negative depth
will enable infinite recursion. Default depth is infinite.
- --no-follow-url=REGEX
- Check but do not recurse into URLs matching the given regular
expression. This option can be given multiple times.
- --ignore-url=REGEX
- Only check syntax of URLs matching the given regular
expression. This option can be given multiple times.
- -C, --cookies
- Accept and send HTTP cookies according to RFC 2109. Only
cookies which are sent back to the originating server are accepted.
Sent and accepted cookies are provided as additional logging
information.
- -a, --anchors
- Check HTTP anchor references. Default is not to check anchors.
- --no-anchor-caching
- Treat url#anchora and url#anchorb as equal on caching. This is
the default browser behaviour, but it's not specified in the URI
specification. Use with care since broken anchors are not
guaranteed to be detected in this mode.
- -uSTRING, --user=STRING
- Try the given username for HTTP and FTP authorization. For FTP
the default username is anonymous. For HTTP there is no
default username. See also -p.
- -pSTRING, --password=STRING
- Try the given password for HTTP and FTP authorization. For FTP
the default password is anonymous@. For HTTP there is no
default password. See also -u.
- --timeout=NUMBER
- Set the timeout for connection attempts in seconds. The default
timeout is 60 seconds.
- -PNUMBER, --pause=NUMBER
- Pause the given number of seconds between two subsequent
connection requests to the same host. Default is no pause between
requests.
- -NSTRING, --nntp-server=STRING
- Specify an NNTP server for news: links. Default is the
environment variable NNTP_SERVER. If no host is given, only
the syntax of the link is checked.
- --no-proxy-for=REGEX
- Contact hosts that match the given regular expression directly
instead of going through a proxy. This option can be given multiple
times.
OUTPUT TYPES
Note that by default only errors and warnings
are logged. You should use the --verbose option to get the
complete URL list, especially when outputting a sitemap graph
format.
- text
- Standard text logger, logging URLs in keyword: argument
fashion.
- html
- Log URLs in keyword: argument fashion, formatted as HTML.
Additionally has links to the referenced pages. Invalid URLs have
HTML and CSS syntax check links appended.
- csv
- Log check result in CSV format with one URL per line.
- gml
- Log parent-child relations between linked URLs as a GML sitemap
graph.
- dot
- Log parent-child relations between linked URLs as a DOT sitemap
graph.
- gxml
- Log check result as a GraphXML sitemap graph.
- xml
- Log check result as machine-readable XML.
- sql
- Log check result as SQL script with INSERT commands. An example
script to create the initial SQL table is included as create.sql.
- blacklist
- Suitable for cron jobs. Logs the check result into a file
~/.linkchecker/blacklist which only contains entries with
invalid URLs and the number of times they have failed.
- none
- Logs nothing. Suitable for debugging or checking the exit
code.
REGULAR EXPRESSIONS
Only Python regular expressions are
accepted by LinkChecker. See
for an introduction in regular expressions.
The only addition is that a leading exclamation mark negates the
regular expression.
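The negation rule can be illustrated with a minimal Python sketch. The helper names compile_filter and url_matches are hypothetical and not part of LinkChecker, and the use of re.search (rather than re.match) is an assumption made for this illustration:

```python
import re


def compile_filter(expr):
    """Split a filter string into (compiled pattern, negate flag).

    A leading exclamation mark negates the expression, as described above.
    """
    negate = expr.startswith("!")
    if negate:
        expr = expr[1:]
    return re.compile(expr), negate


def url_matches(expr, url):
    """Return True if url is selected by the (possibly negated) filter."""
    pattern, negate = compile_filter(expr)
    found = pattern.search(url) is not None  # assumption: search semantics
    return found != negate


print(url_matches(r"^mailto:", "mailto:calvin@users.sourceforge.net"))   # True
print(url_matches(r"!^mailto:", "mailto:calvin@users.sourceforge.net"))  # False
print(url_matches(r"!^mailto:", "http://www.example.org/"))              # True
```

With such a rule, an expression like !^mailto: selects every URL that is not a mailto: link.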
COOKIE FILES
A cookie file contains standard RFC 822 header
data with the following possible names:
- Scheme (optional)
- Sets the scheme the cookies are valid for; default scheme is
http.
- Host (required)
- Sets the domain the cookies are valid for.
- Path (optional)
- Gives the path the cookies are valid for; default path is
/.
- Set-cookie (optional)
- Set cookie name/value. Can be given more than once.
Multiple entries are separated by a blank line. The example
below will send two cookies to all URLs on host imadoofus.org whose
path starts with /hello, and one to all https URLs on host imaweevil.org:
Host: imadoofus.org
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"

Scheme: https
Host: imaweevil.org
Set-cookie: baggage="elitist"; comment="hologram"
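To make the file layout concrete, here is a rough Python sketch that reads such a file with the standard email header parser. It is not LinkChecker's own parser; the function name parse_cookie_file and the dictionary keys are assumptions made for this illustration, and the defaults follow the field descriptions above:

```python
import re
from email import message_from_string

COOKIE_FILE = """\
Host: imadoofus.org
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"

Scheme: https
Host: imaweevil.org
Set-cookie: baggage="elitist"; comment="hologram"
"""


def parse_cookie_file(text):
    """Parse blank-line separated header blocks into a list of dicts."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        msg = message_from_string(block + "\n")
        if msg["Host"] is None:
            raise ValueError("cookie entry is missing the required Host header")
        entries.append({
            "scheme": msg.get("Scheme", "http"),        # optional, default http
            "host": msg["Host"],                        # required
            "path": msg.get("Path", "/"),               # optional, default /
            "cookies": msg.get_all("Set-cookie", []),   # may occur repeatedly
        })
    return entries


for entry in parse_cookie_file(COOKIE_FILE):
    print(entry["scheme"], entry["host"], entry["path"], len(entry["cookies"]))
```

The first entry falls back to the default scheme http, and the second one to the default path /.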
PROXY SUPPORT
To use a proxy, set $http_proxy, $https_proxy,
$ftp_proxy or $gopher_proxy on Unix or Windows to the proxy URL. The
URL should be of the form http://[user:pass@]host[:port],
for example http://localhost:8080, or
http://joe:test@proxy.domain.
On a Mac use the Internet Config to select a proxy.
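The parts of a proxy URL of this form can be inspected with Python's standard urllib.parse module. This is only a sketch of the URL layout, not of how LinkChecker itself reads the variables; the helper names describe_proxy and proxy_from_env are hypothetical:

```python
import os
from urllib.parse import urlsplit


def describe_proxy(proxy_url):
    """Split a proxy URL of the form http://[user:pass@]host[:port]."""
    parts = urlsplit(proxy_url)
    return {
        "user": parts.username,      # None when no credentials are given
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,          # None when no port is given
    }


def proxy_from_env(scheme, environ=os.environ):
    """Look up <scheme>_proxy (for example http_proxy) in the environment."""
    value = environ.get(scheme + "_proxy")
    return describe_proxy(value) if value else None


print(describe_proxy("http://localhost:8080"))
print(describe_proxy("http://joe:test@proxy.domain"))
```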
NOTES
URLs on the command line starting with ftp. are
treated like ftp://ftp., URLs starting with www. are
treated like http://www.. You can also give local files as
arguments.
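The prefix rule above amounts to something like the following Python sketch. The function name guess_url is hypothetical, and the real program may apply further normalization that is not shown here:

```python
def guess_url(arg):
    """Add a scheme to command line arguments starting with ftp. or www."""
    if arg.startswith("ftp."):
        return "ftp://" + arg
    if arg.startswith("www."):
        return "http://" + arg
    # Anything else (full URLs, local file names) is passed through unchanged.
    return arg


print(guess_url("www.example.org"))   # http://www.example.org
print(guess_url("ftp.example.org"))   # ftp://ftp.example.org
print(guess_url("../bla.html"))       # ../bla.html
```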
If you have your system configured to automatically establish a
connection to the internet (e.g. with diald), it will connect when
checking links not pointing to your local host. Use the -s
and -i options to prevent this.
JavaScript links are currently ignored.
If your platform does not support threading, LinkChecker
disables it automatically.
You can supply multiple user/password pairs in a configuration
file.
When checking news: links the given NNTP host doesn't
need to be the same as the host of the user browsing your pages.
ENVIRONMENT
NNTP_SERVER - specifies default NNTP server
http_proxy - specifies default HTTP proxy server
ftp_proxy - specifies default FTP proxy server
LC_MESSAGES, LANG, LANGUAGE - specify output language
RETURN VALUE
The return value is non-zero when
- invalid links were found, or
- link warnings were found and warnings are enabled, or
- a program error occurred.
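A script or cron job can act on this return value. The following Python sketch wraps an arbitrary command and reports whether it exited with status zero; the helper name links_ok is hypothetical:

```python
import subprocess
import sys


def links_ok(command):
    """Run command (a list of arguments) and return True on exit status 0."""
    completed = subprocess.run(command)
    if completed.returncode != 0:
        print("check failed with status", completed.returncode, file=sys.stderr)
    return completed.returncode == 0
```

Assuming linkchecker is installed and on the PATH, a call such as links_ok(["linkchecker", "--no-warnings", "-r1", "www.example.org"]) would then return False only for invalid links or a program error, since warnings are disabled. The host name here is a placeholder.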
FILES
/etc/linkchecker/linkcheckerrc,
~/.linkchecker/linkcheckerrc - default configuration
files
~/.linkchecker/blacklist - default blacklist logger output
filename
linkchecker-out.TYPE - default logger file output
name
- valid output encodings
- regular expression documentation
AUTHOR
Bastian Kleineidam <calvin@users.sourceforge.net>
COPYRIGHT
Copyright © 2000-2005 Bastian Kleineidam