NAME
utf8trans - Transliterate UTF-8 characters according
to a table
SYNOPSIS
utf8trans charmap [file]...
DESCRIPTION
utf8trans transliterates characters in
the specified files (or standard input, if none are specified)
and writes the output to standard output. All input and output is
in the UTF-8 encoding.
This program is usually used to render characters in Unicode
text files as markup escapes or ASCII transliterations. (It is
not intended for general charset conversions.) It provides
functionality similar to the character maps in XSLT 2.0 (XML
Stylesheet Language - Transformations, version 2.0).
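The effect of a character map can be pictured as a per-character lookup: a character that has an entry is replaced by its translation string, and every other character is copied through unchanged. The following Python fragment is only a sketch of that idea, not the utf8trans implementation; the function name transliterate and the entries in example_map are made up for illustration.

```python
def transliterate(text, charmap):
    """Replace each character that has an entry in charmap; copy the rest."""
    # str.translate accepts a dict keyed by code point (int) whose values
    # are replacement strings of any length.
    return text.translate(charmap)


# Hypothetical map: render a few non-ASCII characters as ASCII
# transliterations or markup escapes.
example_map = {
    0x00E9: "e",        # LATIN SMALL LETTER E WITH ACUTE -> plain "e"
    0x2014: "--",       # EM DASH -> two hyphens
    0x00A9: "&copy;",   # COPYRIGHT SIGN -> markup escape
}

result = transliterate("caf\u00e9 \u2014 \u00a9 2004", example_map)
print(result)  # cafe -- &copy; 2004
```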
OPTIONS
- -m, --modify
- Modifies the given files in place, replacing their contents with the
transliterated output instead of sending it to standard output.
This option is useful for efficient transliteration of many
files at once.
- --help
- Show brief usage information and exit.
- --version
- Show version and exit.
USAGE
The translation is done according to the rules in the
"character map" stored in the file charmap. It has the
following format:
- 1.
- Each line represents a translation entry, except for blank
lines and comment lines, which are ignored.
- 2.
- Any amount of whitespace (space or tab) may precede the start
of an entry.
- 3.
- Comment lines begin with #. Everything on the same line is
ignored.
- 4.
- Each entry consists of the Unicode codepoint of the character
to translate, in hexadecimal, followed by one space or tab,
followed by the translation string, up to the end of the line.
- 5.
- The translation string is taken literally, including any
leading and trailing spaces (except the delimiter between the
codepoint and the translation string), and all types of characters.
The newline at the end is not included.
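The five rules above can be sketched as a small parser. This is an illustrative reading of the format in Python, not the code utf8trans uses; the name parse_charmap is hypothetical. Two points are assumptions rather than documented behaviour: that a comment line may be preceded by whitespace, and that an entry with nothing after the codepoint maps to the empty string.

```python
def parse_charmap(lines):
    """Build {code point: translation string} from charmap lines."""
    table = {}
    for raw in lines:
        line = raw.rstrip("\n")        # rule 5: the newline is not included
        entry = line.lstrip(" \t")     # rule 2: leading spaces/tabs skipped
        if not entry or entry.startswith("#"):
            continue                   # rules 1 and 3: blank/comment lines
        # Rule 4: hexadecimal code point, exactly one delimiter character
        # (space or tab), then the translation string.
        end = 0
        while end < len(entry) and entry[end] not in " \t":
            end += 1
        codepoint = int(entry[:end], 16)   # raises ValueError if not hex
        # Rule 5: everything after the single delimiter is taken literally.
        # Assumption: a bare code point maps to the empty string.
        table[codepoint] = entry[end + 1:]
    return table


sample = [
    "# Latin-1 letters\n",
    "\n",
    "E9 e\n",
    "\t2014 --\n",
    "A0  \n",   # delimiter followed by one literal space
]
table = parse_charmap(sample)
print(table)
```

In the last sample entry only the first space acts as the delimiter, so U+00A0 is mapped to a single space character.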
The above format is intended to be restrictive, to keep
utf8trans simple. If an XML-based format is desired,
the docbook2X distribution includes an xmlcharmap2utf8trans
script that converts character maps in XSLT 2.0
format to the utf8trans format.
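To show how the two formats relate, the fragment below converts xsl:output-character elements from an XSLT 2.0 stylesheet into utf8trans entries. It is a rough Python sketch written for this page, not the xmlcharmap2utf8trans script itself, and it makes no claim about how that script works; the function name xslt_charmap_to_utf8trans and the sample stylesheet are hypothetical. It also ignores which xsl:character-map an element belongs to.

```python
import xml.etree.ElementTree as ET

# Namespace of XSLT elements, in ElementTree's {uri}name notation.
XSL_NS = "{http://www.w3.org/1999/XSL/Transform}"


def xslt_charmap_to_utf8trans(xml_text):
    """Return utf8trans entry lines for every xsl:output-character found."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter(XSL_NS + "output-character"):
        char = elem.get("character")       # a single character in XSLT 2.0
        string = elem.get("string", "")    # its replacement string
        # Hexadecimal code point, one space as delimiter, then the string.
        lines.append("%X %s" % (ord(char), string))
    return lines


example = """<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:character-map name="ascii">
    <xsl:output-character character="&#xE9;" string="e"/>
    <xsl:output-character character="&#x2014;" string="--"/>
  </xsl:character-map>
</xsl:stylesheet>"""

entries = xslt_charmap_to_utf8trans(example)
print("\n".join(entries))
```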
LIMITATIONS
- *
- utf8trans does not work with binary files, because
malformed UTF-8 sequences in the input are substituted with U+FFFD
characters. However, null characters in the input are handled
correctly. This limitation may be removed in the future.
- *
- There is no way to include a newline or null character in the
translation string.
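The first limitation can be illustrated with any UTF-8 decoder that substitutes U+FFFD for malformed input. The Python fragment below uses the standard decoder with errors="replace" as a stand-in; it is an analogy, and utf8trans may group malformed bytes differently. It shows that a null byte survives the round trip while an invalid byte does not, which is why binary data is altered.

```python
# Bytes that are not valid UTF-8: 0xFF can never appear in a UTF-8 stream.
data = b"abc\x00\xffdef"

# Decode the way a substituting decoder would: malformed input becomes
# U+FFFD (REPLACEMENT CHARACTER), while the null character is kept.
decoded = data.decode("utf-8", errors="replace")

# Writing the text back out as UTF-8 does not restore the original bytes.
roundtrip = decoded.encode("utf-8")

print(repr(decoded))
print(roundtrip == data)  # False
```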
AUTHOR
Steve Cheng <stevecheng@users.sourceforge.net>.