NAME
langident - identifies the language files are written in
SYNOPSIS
langident [OPTIONS] file1 [file2 ...]
DESCRIPTION
Identifies the language each file is written in, using the Perl module Lingua::Identify.
OPTIONS
-a
Show all results (not just the most probable language).
-c
Show the confidence level for the most probable language (it is printed as the first value right after that language).
-d
Debug (development only).
-e METHODS
Select the method(s) to use. There are three ways of doing this:
# simply using a method
langident -e ngrams3 file
# using several methods (separate them with a comma)
langident -e prefixes3,suffixes3 file
# using several methods and assigning a different weight to each of them
langident -e smallwords=2,prefixes=1,ngrams3=1.3 file
The available methods are the following: smallwords, prefixes1, prefixes2, prefixes3, prefixes4, suffixes1, suffixes2, suffixes3, suffixes4, ngrams1, ngrams2, ngrams3 and ngrams4.
-h
Display help message and exit.
-l
List all available languages and exit.
-m NUMBER
Set the maximum number of results (languages) to display: shows the NUMBER most probable languages, in descending order of probability. Overrides the -a switch.
-o LANGUAGES
Only work with specified languages.
# identify between Portuguese and English only
langident -o pt,en *
-p
Also show percentages.
-s SIZE
Maximum size to examine.
-v
Show version and exit.
EXAMPLES
Use methods ngrams2 and ngrams1, giving ngrams2 twice the weight of ngrams1 (-e switch); the output includes the three most probable languages (-m switch) with their percentages (-p switch) and also the confidence level (-c switch) of the first result.
$ langident -e ngrams2=2,ngrams1 -c -p -m 3 README
README:en 65.7209505939491 7.8971987481393 ga 4.11905889385895 tr 4.08487011400505
$
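The output line above can be split into its parts with standard tools. The following is only a sketch: it assumes the layout described under -c and -p (file name and best language joined by a colon, then the confidence level, then the percentage, then "language percentage" pairs) and a file name without a colon. The helper name parse_langident_line is hypothetical and not part of langident.

```shell
# Sample line copied from the example output above (-c -p -m 3).
line='README:en 65.7209505939491 7.8971987481393 ga 4.11905889385895 tr 4.08487011400505'

# Hypothetical helper, not part of langident: prints one labelled
# line per value found in a single langident output line.
parse_langident_line() {
    printf '%s\n' "$1" | awk '{
        # Split "file:lang" at the first colon (assumes no colon in the file name).
        sep  = index($1, ":")
        file = substr($1, 1, sep - 1)
        lang = substr($1, sep + 1)
        printf "file=%s\n", file
        # With -c and -p, the confidence level comes before the percentage.
        printf "best=%s confidence=%.2f percentage=%.2f\n", lang, $2, $3
        # The remaining fields come in "language percentage" pairs.
        for (i = 4; i < NF; i += 2)
            printf "other=%s percentage=%.2f\n", $i, $(i + 1)
    }'
}

parse_langident_line "$line"
```

For the sample line this prints the file name, then "best=en confidence=65.72 percentage=7.90", then one line each for ga and tr. If -c or -p is left out, the field positions change and the sketch would need adjusting.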
TO DO
* Add a switch to ignore HTML tags (and maybe other formats too)
SEE ALSO
Lingua::Identify(3), Text::ExtractWords(3), Text::Ngram(3), Text::Affixes(3).
A linguist and/or a shrink.
The latest CVS version of "Lingua::Identify" (which includes langident) can be obtained at http://natura.di.uminho.pt/natura/viewcvs.cgi/Lingua/Identify/
ISO 639 Language Codes, at http://www.w3.org/WAI/ER/IG/ert/iso639.htm
AUTHOR
Jose Alves de Castro, <cog@cpan.org>
COPYRIGHT AND LICENSE
Copyright 2004 by Jose Alves de Castro
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.