NAME
AutoSearch -- a web-search tracking application
SYNOPSIS
AutoSearch [--stats]
[--verbose] -n ``Query Name'' -s ``query string'' --engine engine
[--mail you@where.com]
[--options ``opt=val'']... [--filter ``filter''] [--host host]
[--port port] [--userid bbunny --password c4rr0t5]
[--ignore_channels KABC,KCBS,KNBC] qid
AutoSearch --VERSION
AutoSearch --help
AutoSearch --man
DESCRIPTION
AutoSearch performs
a web-based search and puts the results set in
qid/index.html. On subsequent searches (i.e., the second form
above), AutoSearch determines what changes (if any) occurred to
the results set since the last run. These incremental changes are
recorded in qid/YYYYMMDD.html.
AutoSearch is well suited to running as a cron job
because all the input parameters are saved in the web pages.
AutoSearch can act as an automated query agent for a
particular search. The output files are designed to be a set of web
pages that display the results set in a web browser.
Example:
AutoSearch -n 'LSAM Replication'
-s '"lsam replication"'
-e AltaVista
replication_query
This query (which should be all on one line) creates a directory
replication_query and fills it with the fascinating output of the
AltaVista query on "lsam replication", with pages titled
``LSAM Replication''. (Note the quoting: the
single quotes in '"lsam replication"' are for the shell,
the double quotes are for AltaVista to search for the phrase rather
than the separate words.)
A more complicated example:
AutoSearch -n 'External Links to LSAM'
-s '(link:www.isi.edu/~lsam) -url:isi.edu'
-e AltaVista::AdvancedWeb
-o coolness=hot
This query does an advanced AltaVista search and specifies the
(hypothetical) ``coolness'' option to the search engine.
OPTIONS
- qid
- The query identifier specifies the
directory in which all the files that relate to this query and
search results will live. It can be an absolute path, or a relative
path from cwd. If the directory does not exist, it will be created
and a new search started.
- --stats
- Show search statistics: the query string,
number of hits, number of filtered hits, filter string, number of
suspended (deleted) hits, previous set size, current set size, etc.
- -v or --verbose
- Verbose: output additional messages
and warnings.
- -n or --qn or --queryname
- Specify the query name. The query name
is used as a heading in the web pages, therefore it should be a
'nice' looking version of the query string.
- -s or --qs or --querystring
- Specify the query string. The query
string is the character string which will be submitted to the
search engine. You may include special characters to group or to
qualify the search.
- -e or --engine
- Specify the search engine. The query
string will be submitted to the user-specified search engine.
In many cases there are specialized versions of search
engines. For example, AltaVista::AdvancedWeb allows more powerful
searches and AltaVista::News searches Usenet. See
AltaVista or the man page for your search engine for details about
specialized variations.
- --listnewurls
- In addition to all the normal file
maintenance, print all new URLs to STDOUT,
one per line.
- -o or --options
- Specify the query options. The query
options will be submitted to the user-specified search engine with the query
string. This feature permits modification of the query string for a
specific search engine or option. More than one query option may be
specified.
Example: "-o what=news" causes AltaVista to search
Usenet. Although this works, the preferred mechanism in this case
would be "-e AltaVista::News" or "-e
AltaVista::AdvancedNews". Options are intended for internal or
expert use.
- -f or --uf or --urlfilter
- This option specifies a regular
expression which will be compared against the URLs of any results;
if they match the case-insensitive regular expression, they will be
removed from the hit set.
Example: "-f '.*\.isi\.edu'" avoids all of ISI's web pages.
- --cleanup i
- Delete all traces of query results from
more than i days ago. If --cleanup is given, all options
other than the qid will be ignored.
- --cmdline
- Reconstruct the complete command line
(AutoSearch and all its arguments) that was used to create the
query results. The command line will be shown on STDERR. If --cmdline is given, all options other
than the qid will be ignored.
- --mail user@address or -m
user@address
- After the search is complete, send email
to that user, listing the new
results. The email is in HTML
format. Requires the Email::Send and related modules. If you
send email through an SMTP
server, you must set the environment variable SMTPSERVER to your server name or
IP address. If your
SMTP server requires a password,
you must set the environment variables SMTPUSERNAME and SMTPPASSWORD. If you send email via
sendmail, you should set the environment variable SENDMAIL if the sendmail executable is not
in the path.
- --emailfrom user@address
- If your outgoing mail server rejects
email from certain users, you can use this argument to set the
From: header.
- --userid bbunny
- If the search engine requires a
login/password (e.g. Ebay::Completed), use this.
- --password Carr0t5
- If the search engine requires a
login/password (e.g. Ebay::Mature), use this.
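The effect of the --urlfilter option described above can be approximated outside AutoSearch with standard tools. The following is only a sketch: the function name filter_urls is hypothetical, and it uses grep's extended regular expressions as a stand-in for Perl regular expressions, which differ in some details.

```shell
#!/bin/sh
# Hypothetical stand-in for the URL filter: read URLs on stdin, one per
# line, and drop every URL that matches the pattern in $1, ignoring case.
filter_urls() {
    # -E: extended regex, -i: case-insensitive, -v: keep non-matching lines.
    # grep exits non-zero when nothing survives, so mask that with true.
    grep -Eiv -e "$1" || true
}
```

For example, piping a list of URLs through filter_urls '.*\.isi\.edu' keeps only the URLs outside isi.edu, regardless of letter case.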
DESCRIPTION
AutoSearch submits
a query to a search engine, produces HTML
pages that reflect the set of 'hits' (filtered search results)
returned by the search engine, and tracks these results over time.
The URL and title are displayed in
qid/index.html; the URL, title,
and description are displayed in the 'weekly' files.
To organize these results, each search result is placed in a
query information directory (qid). The directory becomes the search
results 'handle', an easy way to track a set of results. Thus a qid
of "/usr/local/htdocs/lsam/autosearch/load_balancing"
might locate the results on your web server at ".
Inside the qid directory you will find files relating to this
query. The primary file is index.html, which reflects the
latest search results. Every unfiltered hit for every search is
stored in index.html. When a hit is no longer found by the
search engine, it is removed from index.html. As new results
for a search are returned from the search engine, they are placed in
index.html.
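The bookkeeping described above amounts to a set difference between the previous and current hit lists. The sketch below illustrates the idea only; it is not how AutoSearch is implemented, and the function name new_urls and the one-URL-per-line input files are assumptions.

```shell
#!/bin/sh
# Hypothetical helper: print URLs present in the current run ($2) but
# absent from the previous run ($1). Each file holds one URL per line.
new_urls() {
    prev_sorted=$(mktemp)
    curr_sorted=$(mktemp)
    sort -u "$1" > "$prev_sorted"
    sort -u "$2" > "$curr_sorted"
    # comm -13 hides lines unique to the first file and lines common to
    # both, leaving only lines unique to the second file (the new hits).
    comm -13 "$prev_sorted" "$curr_sorted"
    rm -f "$prev_sorted" "$curr_sorted"
}
```

Calling the same function with the arguments swapped lists the hits that have disappeared since the previous run.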
At the bottom of index.html, there is a heading ``Weekly
Search Results'', which is updated each time the search is
submitted (see ``AUTOMATED SEARCHING''). The list of search runs is stored in
reverse chronological order. Runs which provide no new information
are identified with
No Unique Results found for search on <date>
Runs which contain changes are identified by
Web search results for search on <date>
which will be linked to a page detailing the changes from that run.
Detailed search results are noted in weekly files. These files
are named YYYYMMDD.html and
are stored in the qid directory. The weekly files include
the URL, title, and
description (if available). The title is a link to the original
web page.
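The per-run file name follows directly from the run date. A small sketch, assuming the date is taken from the local clock at run time; the function name run_file is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: build today's per-run file name inside the given
# qid directory, following the YYYYMMDD.html naming convention.
run_file() {
    printf '%s/%s.html\n' "$1" "$(date +%Y%m%d)"
}
```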
AUTOMATED SEARCHING
On UNIX-like
systems, cron(1) may
be used to establish periodic searches and the web pages will be
maintained by AutoSearch. To establish the first search, use
the first example under SYNOPSIS. You must
specify the qid, query name and query string. If any of the items
are missing, you will be interactively prompted for the missing
item(s).
Once the first search is complete you can re-run the search with
the second form under SYNOPSIS.
A cron entry like:
0 3 * * 1 /nfs/u1/wls/AutoSearch.pl /www/div7/lsam/autosearch/caching
might be used to run the search each Monday at 3:00 AM. The query name and query string may be repeated,
but they will not be used. This means that with a cron line like:
0 3 * * 1 /nfs/u1/wls/AutoSearch.pl /www/div7/lsam/autosearch/caching -n caching -s caching
a whole new search series can be originated by
rm -r /www/div7/lsam/autosearch/caching
However, the only reason to start a new search series would be
to throw away the old weekly files.
We don't recommend running searches more than once per day, but
if you do, the per-run files will be updated in place. Any additions are
marked on the page with the comment ``Recently Added:'', and
deletions are marked with ``Recently Suspended:''.
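For cron runs that also send mail, a small wrapper script can keep the environment setup in one place. This is a sketch under stated assumptions: the server name smtp.example.com, the function name run_search, and the AUTOSEARCH override variable are made up for illustration, and it assumes that --mail may be given on a re-run together with the qid. Only SMTPSERVER and --mail are names documented under OPTIONS.

```shell
#!/bin/sh
# Hypothetical cron wrapper around AutoSearch.
# AUTOSEARCH points at the installed script (placeholder default).
AUTOSEARCH=${AUTOSEARCH:-AutoSearch}

run_search() {
    qid=$1     # query directory created by the first run
    rcpt=$2    # address that should receive the new hits
    # SMTPSERVER must be set when mailing through an SMTP server; the
    # value here is a placeholder used only if it is not already set.
    SMTPSERVER=${SMTPSERVER:-smtp.example.com}
    export SMTPSERVER
    "$AUTOSEARCH" --mail "$rcpt" "$qid"
}

# Example call, left commented out so sourcing this file has no side effects:
# run_search /www/div7/lsam/autosearch/caching you@where.com
```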
CHANGING THE LOOK OF THE PAGES
The
basic format of these two pages is simple and customizable. One
requirement is that the basic structure remain unchanged.
HTML comments are used to identify sections
of the document. Almost everything can be changed except for the
strings which identify the section starts and ends.
Noteworthy tags and their meaning:
- <!--Top-->.*<!--/Top-->
- The text contained within this tag is
placed at the top of the output page. If the text contains
AutoSearch WEB
Searching, then the query name will replace it. If the text
does not contain this magic string and it is the first ever search,
the user will be asked for a query name.
- <!--Query{.*}/Query-->
- The text contained between the braces is
the query string. This is how AutoSearch maintains the query
string. You may edit this string to change the query string; but
only in qid/index.html. The text ask user is special
and will force AutoSearch to request the search string from
the user.
- <!--SearchEngine{.*}/SearchEngine-->
- The text contained between the braces is
the search engine. Other engines supported are HotBot and Lycos.
You may edit this string to change the engine used; but only in
qid/index.html. The text ask user is special and will
force AutoSearch to request the search engine from the
user.
- <!--QueryOptions{.*}/QueryOptions-->
- The text contained between the braces
specifies a query option. Multiple occurrences of this tag are
allowed to specify multiple options.
- <!--URLFilter{.*}/URLFilter-->
- The text contained between the braces is
the URL filter. This is how
AutoSearch maintains the filter. Again, you may edit this
string to change the filter, but only in
qid/index.html. The text ask user is special and will
force AutoSearch to ask the user (STDIN) for the filter. When setting up the first
search, you must edit first_index.html, not
qid/index.html. The URL filter is a
standard perl5 regular expression. URLs which match are removed; URLs which do not match are
kept.
- <!--Bottom-->.*<!--/Bottom-->
- The text contained within this tag is
placed at the bottom of the output page. This is a good place to
put navigation, page owner information, etc.
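Because the saved parameters live in HTML comments of the form shown in the list above, they can be read back with ordinary text tools. A minimal sketch, assuming each marker sits on a single line; the function name tag_value is hypothetical and this is not AutoSearch's own parser.

```shell
#!/bin/sh
# Hypothetical helper: print the text between the braces of every
# <!--NAME{...}/NAME--> marker in a file, one value per line.
# $1 = tag name (e.g. Query, SearchEngine, URLFilter), $2 = file.
tag_value() {
    # "|" is used as the sed delimiter because the marker contains "/".
    sed -n 's|.*<!--'"$1"'{\(.*\)}/'"$1"'-->.*|\1|p' "$2"
}
```

For instance, tag_value Query qid/index.html would print the stored query string for an existing search.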
The remainder of the tags fall into triplets of
~Heading, ~Template, and ~, where ~ is one of
Summary, Weekly, Appended, or Suspended. The sub-sections appear
in the order given. To produce a section, AutoSearch outputs
the heading, the template, the section tag, n copies of the formatted
data, and a closing /section tag. The tags and their function are:
- ~Heading
- The heading tag identifies the heading for
a section of the output file. The SummaryHeading is for the summary
portion, etc. The section may be empty (e.g., Suspended) and thus
no heading is output.
- ~Template
- The template tag identifies how each item
is to be formatted. Simple text replacement is used to change the
template into the actual output text. The text to be replaced is
noted in ALLCAPS.
- ~
- This tag is used to locate the section (Summary, Weekly, etc.).
This section represents the actual n-items of data.
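The simple text replacement applied to a template can be sketched as follows. The placeholder names ITEMURL and ITEMTITLE and the function names are invented for illustration; the placeholders AutoSearch really uses are the ALLCAPS words in its own template files.

```shell
#!/bin/sh
# Escape characters that are special in a sed replacement: backslash,
# ampersand, and the "|" delimiter used below.
sed_escape() {
    printf '%s' "$1" | sed 's/[\\&|]/\\&/g'
}

# Hypothetical template filler: $1 = template line, $2 = URL, $3 = title.
# Substitutions run in sequence, so a URL that itself contains the text
# ITEMTITLE would be rewritten by the second substitution.
fill_template() {
    url=$(sed_escape "$2")
    title=$(sed_escape "$3")
    printf '%s\n' "$1" | sed "s|ITEMURL|$url|g; s|ITEMTITLE|$title|g"
}
```

The escaping step matters because URLs commonly contain "&", which sed would otherwise expand to the matched text.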
You can edit these values in the qid/index.html page of
an existing search. The file first_index.html (in the
directory above qid) will be used as a default template for
new queries.
Examples of these files can be seen in the pages under
",
or in the output generated by a new AutoSearch.
FILES
- first_index.html
- optional file to determine the default
format of the index.html file of a new query.
- first_date.html
- optional file to determine the default
format of the YYYYMMDD.html
file for a new query.
- qid/index.html
- (automatically created) latest search
results, and reverse chronological list of periodic searches.
- qid/date.html
- file used as a template for the
YYYYMMDD.html files.
- qid/YYYYMMDD.html
- (automatically created) summary of changes
for a particular date (AKA 'Weekly'
file).
Optional files first_index.html and
first_date.html are used for the initial search as a
template for qid/index.html and date.html,
respectively. If either of these files does not exist, a
built-in default template stored within the AutoSearch
source is used instead. The intention of these two files is to permit a user to
establish a framework for a group of search sets which have a
common format. If you leave the default query name and query string
alone, they will be overridden by command-line inputs.
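The template lookup for a new query can be pictured as a simple fallback. This sketch only mirrors the rule stated above; the function name choose_index_template and the "builtin" marker are hypothetical, and the location of first_index.html in the directory above qid follows the description in CHANGING THE LOOK OF THE PAGES.

```shell
#!/bin/sh
# Hypothetical helper: decide which template a new query's index.html
# starts from. $1 = path of the qid directory.
choose_index_template() {
    parent=$(dirname "$1")
    if [ -f "$parent/first_index.html" ]; then
        # A user-supplied template exists in the directory above qid.
        printf '%s\n' "$parent/first_index.html"
    else
        # Otherwise the default stored in the AutoSearch source applies.
        printf '%s\n' 'builtin'
    fi
}
```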
SEE ALSO
For the library, see
WWW::Search; for Perl regular expressions, see perlre.
AUTHORS
Wm. L. Scheding
AutoSearch is a re-implementation of an earlier version
written by Kedar Jog.
COPYRIGHT
Copyright (C) 1996-1997
University of Southern California. All rights reserved.
Redistribution and use in source and binary forms are permitted
provided that the above copyright notice and this paragraph are
duplicated in all such forms and that any documentation,
advertising materials, and other materials related to such
distribution and use acknowledge that the software was developed by
the University of Southern California, Information Sciences
Institute. The name of the University may not be used to endorse or
promote products derived from this software without specific prior
written permission.
THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
DESIRED FEATURES
These are good ideas
that people have suggested.
- URL validation.
- Validate the status of each URL (with HTTP HEAD requests) and indicate this status in the output.
- Multi-search.
- It should be possible to merge the results
of searches from two search-engines. If this merger were done as a
new search engine, this operation would be transparent to
AutoSearch.
BUGS
None known at this time; please
inform the maintainer mthurn@cpan.org if any crop up.