NAME
msort - sort records in complex ways
SYNOPSIS
msort <options> [<input file>]
DESCRIPTION
msort is a program for sorting text files in
sophisticated ways. It was developed initially for alphabetizing
dictionaries of languages in which the ordering may be quite
different from that of English, but it has many other uses.
msort allows you to sort blocks of text delimited in a
number of ways rather than just lines, and to specify particular
fields of a record as sort keys, either by their position,
counted from either end, or by matching regular expressions against
their tags.
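The idea of sorting blocks rather than lines, with a key field counted from either end, can be sketched in Python. This is only an illustration of the concept, not msort's implementation; the helper names split_blocks and field are hypothetical, and fields are assumed to be separated by whitespace.

```python
def split_blocks(text):
    """Split text into records separated by one or more blank lines."""
    blocks = []
    current = []
    for line in text.splitlines():
        if line.strip() == "":
            # A blank line closes the record collected so far, if any.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def field(record, number):
    """Return a whitespace-delimited field by 1-based number.

    Negative numbers count from the end of the record (-1 is the last field).
    """
    if number == 0:
        raise ValueError("field numbers start at 1 or -1")
    fields = record.split()
    index = number - 1 if number > 0 else number
    return fields[index]


text = "pear 3 green\n\napple 7 red\n\nfig 1 purple\n"
records = split_blocks(text)

# Sort on the first field, then on the last field.
by_first_field = sorted(records, key=lambda r: field(r, 1))
by_last_field = sorted(records, key=lambda r: field(r, -1))
```

Here by_first_field begins with the "apple" record, while by_last_field begins with the "pear" record because "green" sorts first among the final fields.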
msort is capable of sorting on multiple keys, so that
when two records tie on one key, the tie may be broken on another.
Any or all keys may be optional. How absent optional keys are
ordered with respect to present keys may be set separately for each
key.
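Tie-breaking on a second key, with an optional key that may be absent, can be pictured as comparing tuples. The following Python sketch is an assumption-laden illustration rather than msort's actual logic: the function name sort_records and the dictionary record format are hypothetical, and only the "absent sorts before" and "absent sorts after" cases are modelled.

```python
def sort_records(records, key_names, absent_first):
    """Sort dict records on several keys in turn.

    absent_first maps each key name to True if records lacking that key
    should sort before records that have it, or False if after.
    """
    def make_key(record):
        parts = []
        for name in key_names:
            value = record.get(name)
            if value is None:
                # Rank 0 sorts before present keys (rank 1), rank 2 after.
                rank = 0 if absent_first[name] else 2
                parts.append((rank, ""))
            else:
                parts.append((1, value))
        return tuple(parts)

    return sorted(records, key=make_key)


records = [
    {"headword": "bank", "pos": "verb"},
    {"headword": "bank"},
    {"headword": "apple", "pos": "noun"},
    {"headword": "bank", "pos": "noun"},
]

# Ties on "headword" are broken on "pos"; a missing "pos" sorts last here.
absent_last = sort_records(
    records, ["headword", "pos"], {"headword": True, "pos": False}
)
# The same sort with a missing "pos" ordered before present values.
absent_early = sort_records(
    records, ["headword", "pos"], {"headword": True, "pos": True}
)
```

The three "bank" records tie on the first key, so their relative order is decided entirely by the second key and by the per-key setting for absent values.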
msort allows you to specify arbitrary sort orders and to
define virtually unlimited numbers of multigraphs of effectively
unlimited length. The sort order and multigraphs are defined
separately for each key. If your system has locale support, you can
also use locale collation rules instead of specifying your own sort
order.
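To illustrate what a sort order with multigraphs means, the Python sketch below treats "ch" and "ll" as single collating elements, as in traditional Spanish alphabetization. It does not use msort's sort order file format; the function make_collation_key and the list-based order definition are hypothetical, and the treatment of characters missing from the order is an assumption made for the example.

```python
def make_collation_key(order):
    """Build a key function from a list of collating elements.

    Elements are listed in ascending order and may be longer than one
    character (multigraphs).
    """
    rank = {element: i for i, element in enumerate(order)}
    longest = max(len(element) for element in order)

    def key(word):
        ranks = []
        i = 0
        while i < len(word):
            # Try the longest possible element first so that "ch" wins over "c".
            for size in range(min(longest, len(word) - i), 0, -1):
                chunk = word[i:i + size]
                if chunk in rank:
                    ranks.append(rank[chunk])
                    i += size
                    break
            else:
                # Assumption: undefined characters sort after all defined ones.
                ranks.append(len(order) + ord(word[i]))
                i += 1
        return ranks

    return key


order = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
key = make_collation_key(order)
words = ["dama", "chico", "cuna", "llama", "luz", "cama"]
ordered = sorted(words, key=key)
# ordered == ["cama", "cuna", "chico", "dama", "luz", "llama"]
```

Because "ch" ranks after "c" as a unit, "chico" follows "cuna", which a plain character-by-character comparison would not produce.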
msort provides twelve types of key comparison:
lexicographic, numeric, numeric string, hybrid, by string length,
by angle, by date, by domain name, by time, by ISO8601 date/time
stamp, by month name, and random.
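The practical difference between comparison types shows up when the same values are compared in different ways. The Python sketch below contrasts three of them (lexicographic, numeric, and by string length) using Python's own sorting; it says nothing about how msort implements these comparisons.

```python
values = ["10", "9", "100", "2"]

# Lexicographic: compared character by character, so "10" precedes "9".
lexicographic = sorted(values)

# Numeric: compared by numeric value.
numeric = sorted(values, key=float)

# By string length: shorter strings first; equal lengths keep input order
# because Python's sort is stable.
by_length = sorted(values, key=len)
```

The three results are ['10', '100', '2', '9'], ['2', '9', '10', '100'], and ['9', '2', '10', '100'] respectively.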
Which month names are recognized depends on several factors. If the -s
flag is used on the same key and its argument is the name of a
file, the month names are read from the file, which should be in
the same format as a sort order definition file. If the -s
flag is used and its argument is a locale name, the month names
recognized will be the month names and abbreviations associated
with the specified locale. If the -s flag is not used the
month names recognized will be the month names and abbreviations
associated with the current locale. If your system does not have
locale support and you do not use the -s flag to read the
month names from a file, the month names recognized will be the
English month names and abbreviations.
msort can reverse the characters in a key, allowing it to
be used to generate reverse dictionaries.
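A reverse dictionary groups words by their endings, which amounts to sorting on each key read backwards. A minimal Python sketch of the idea follows; it reverses by code point, which is an assumption that holds for this plain ASCII sample but would need more care for text containing combining characters.

```python
words = ["walking", "nation", "talking", "station", "ring"]

# Sort on the reversed spelling, but keep the words themselves unchanged.
reverse_sorted = sorted(words, key=lambda word: word[::-1])
# reverse_sorted == ["talking", "walking", "ring", "nation", "station"]
```

Words ending in "-ing" and in "-ation" end up adjacent, which is what makes such a list useful for studying suffixes or rhymes.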
A choice of sorting algorithms is provided.
msort fully supports Unicode. The text to be sorted, and
all specifications, should be in UTF-8 Unicode. (If you have plain
ASCII text, this is not a problem as ASCII is a subset of Unicode.)
Full Unicode case-folding is available.
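Why normalization and case folding matter for sorting can be shown with Python's standard unicodedata module. This is a sketch of the general technique, not of msort's internals; the function name comparison_form is hypothetical.

```python
import unicodedata


def comparison_form(text, form="NFC", fold_case=False):
    """Normalize text, and optionally case-fold it, before comparison."""
    text = unicodedata.normalize(form, text)
    if fold_case:
        # Case folding can change the character sequence, so normalize again.
        text = unicodedata.normalize(form, text.casefold())
    return text


composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # letter e followed by a combining acute accent

# The raw strings differ even though they display identically.
raw_equal = composed == decomposed
# After normalization they compare equal.
normalized_equal = comparison_form(composed) == comparison_form(decomposed)
# Full case folding maps the German sharp s to "ss".
folded_equal = (
    comparison_form("Straße", fold_case=True)
    == comparison_form("STRASSE", fold_case=True)
)
```

Without normalization, two visually identical records could be placed far apart in the output; without full case folding, "Straße" and "STRASSE" would not be treated as the same key.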
For usage information, execute msort with no arguments.
Full information about msort is currently to be found in
the reference manual, which is distributed as a PDF (Portable
Document Format) file. If a copy is not available locally, you can
download it from msort's home page:
http://billposer.org/Software/msort.html
OPTIONS
Informational options
- -h,--help
- Print usage message
- -v,--version
- Print version message
- -D,--defaults
- List defaults
- -F,--general-options
- List general command line options
- -G,--gnu-equivalences
- List equivalents for GNU sort command line options.
- -H,--informational-options
- List informational command line options
- -K,--key-specific-options
- List key-specific command line options
- -L,--limits
- List limits
- -N,--number-systems
- List the supported number systems.
General options
- -b,--block
- A record is terminated by two or more newlines
- -l,--line
- A record consists of a single line
- -r,--record-separator <separator>
- A record is terminated by the specified separator character
- -O,--fixed-size-record <bytes>
- A record consists of the specified number of bytes.
- -d,--field-separators <character>+
- Fields are delimited by the named character(s)
- -w,--whole
- Sort on the entire text of the record
- -a,--algorithm <algorithm>
- Use the specified sort algorithm. The choices are:
I(nsertionSort), M(ergeSort), Q(uickSort), and S(hellSort). Note
that InsertionSort and MergeSort are stable, while QuickSort and
ShellSort are unstable. The default is QuickSort.
- -M,--initial-maximum-records <records>
- Set initial maximum number of records
- -m,--line-end-carriage-return
- End-of-line in the input data is marked by Carriage Return
(0x0D) as on the Macintosh rather than by Line Feed (0x0A) as on
Unix systems.
- -I,--invert-globally
- Invert sense of comparisons globally
- -B,--BMP
- No characters fall outside the Basic Multilingual Plane (that is,
no characters have values greater than 0xFFFF).
- -p,--reserve-private-use-area
- Do not make internal use of the Private Use areas. By default,
multigraphs are assigned internally to codepoints in the
Supplementary Private Use areas if full Unicode is in use or to
codepoints in the Private Use area if input is restricted to the
Basic Multilingual Plane by means of the -B option. If your
input makes use of the Private Use areas, this option prevents
interference with your input. In this case, multigraphs will be
assigned to the Low and High Surrogate areas (0xD800-0xDFFF). Note
that this limits the number of multigraphs to 2,048.
- -Q,--check-only
- Check whether the input is already sorted. Do not generate any
output. Exit status is 0 if input is already sorted, non-zero if
not sorted.
- -q,--quiet
- Be quiet - do not chat while working
- -u,--unicode-normalization <mode>
- Select Unicode normalization mode. The choices of mode are:
c for normalization form C (NFC), d for normalization
form D (NFD), and n for no normalization. The default is
NFC.
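The distinction between stable and unstable algorithms noted under the -a option matters when records tie on every key: a stable sort leaves tied records in their input order. The Python sketch below implements a plain insertion sort to show the property. It is an illustration of the general algorithm only and is not taken from msort's source.

```python
def insertion_sort(items, key):
    """Return a sorted copy of items using a stable insertion sort."""
    result = list(items)
    for i in range(1, len(result)):
        current = result[i]
        j = i - 1
        # The strict ">" means an item never moves past an equal key,
        # which is what makes the sort stable.
        while j >= 0 and key(result[j]) > key(current):
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = current
    return result


records = [("pear", 2), ("apple", 1), ("fig", 2), ("kiwi", 1)]
by_number = insertion_sort(records, key=lambda record: record[1])
# by_number == [("apple", 1), ("kiwi", 1), ("pear", 2), ("fig", 2)]
```

Among the records with key 2, "pear" still precedes "fig" as in the input. An unstable algorithm gives no such guarantee, which can matter when the input is already ordered on some other criterion.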
Key specific options
- -e,--character-range <m,n>
- Sort on characters m through n. Positive indices start from
one. Negative indices indicate position with respect to the end of
the record. For example, the range 3,-2 consists of the
third character through the next-to-last character.
- -n,--position <POS>(,<POS>)
- Sort on the specified POS or contiguous range of POSs, where a
POS is of the form <field number>(.<character number>).
Both counts begin at one. Field numbers but not character numbers
may be negative, in which case they are counted from the right.
Thus, 1.2 is the second character of the first field; -2.1 is the
first character of the next to last field.
- -t,--tag <tag regexp>
- Sort on the field with the specified tag
- -o,--optional <comparison>
- The key is optional. The argument (<, =, or >) specifies how a
record in which the key is absent compares to records in which it
is present.
- -C,--fold-case
- Fold case
- -z,--fold-case-turkic
- Fold case with additional Turkic conversions.
- -c,--comparison-type <comparison type>
- a(ngle), l(exicographic), i(so8601 date/time), t(ime), D(omain
name/email address), d(ate), m(onth name), n(umeric), N(umeric
string), s(ize), h(ybrid), r(andom)
- -y,--number-system <number system>
- Specifies the number system expected for this key. This affects
only numeric and numeric string keys. There are two special values.
If the number system is "all", records may contain any number
system that msort can interpret. Different records may contain
different number systems. If the number system is "any", records
may contain any number system that msort can interpret, but all
records must make use of the same number system. msort sets
the number system on the basis of the first record.
- -f,--date-format <date format>
- Permutation of ymd with separators, e.g. y-m-d for
international date format, m/d/y for American date format, or a
permutation of yd with separators, e.g. y-d, for day-of-year dates.
All three components may be numbers in any available number system.
The month field may also be a month name, determined by the same
devices as independent month name fields.
- -W,--sort-order-file-separators <file name>
- Read the list of characters to be treated as separators in the
sort order definition file.
- -S,--substitutions <file name>
- Read substitutions from named file
- -s,--sort-order <file name>|<locale name>|"locale"
- If the argument is a file name, it is taken to be a sort order
file and the sort order for the key is read from the file. If the
argument is a locale name, the collation rules for that locale are
used. If the argument is "locale", the collation rules for the
current locale are used.
- -T,--transformations <(d)(e)(s)>
- Apply the specified transformations. d specifies that
diacritics are to be stripped. Separately encoded combining
diacritics are removed. Characters with diacritics represented by
single codepoints are replaced with the corresponding ASCII
character without the
diacritics, if there is one. e specifies that enclosed
characters, that is, characters within circles or parentheses, are
to be replaced with the corresponding plain ASCII character if
there is one. s specifies that characters in special styles
are to be replaced with the corresponding plain ASCII character if
there is one. Stylistic equivalents include: small capitals (e.g.
U+1D04), script forms (e.g. U+212C), black letter forms (e.g.
U+212D), Arabic presentation forms (e.g. U+FE81), Hebrew
presentation forms (e.g. U+FB1D), fullwidth forms (e.g. U+FF01),
halfwidth forms (e.g. U+FF7B), and the mathematical alphanumeric
symbols (e.g. U+1D400).
- -x,--exclusion-file <file name>
- Read exclusions from named file
- -X,--exclude-characters <exclusions>
- Exclude specified characters
- -i,--invert-locally
- Invert sense of comparisons
- -R,--reverse-key
- Reverse characters of key
Note: long options may not be available on your system.
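The indexing conventions described for the -e and -n options can be checked against the examples given there with a short Python sketch. The function names character_range and locate are hypothetical, and the code models only the index arithmetic stated above, not msort's parsing or error handling.

```python
def character_range(text, first, last):
    """Return characters first through last, 1-based and inclusive.

    Negative positions count from the end (-1 is the last character).
    """
    def to_index(position):
        if position == 0:
            raise ValueError("positions start at 1 or -1")
        return position - 1 if position > 0 else len(text) + position

    return text[to_index(first):to_index(last) + 1]


def locate(fields, pos):
    """Resolve a POS of the form '<field number>(.<character number>)'.

    Field numbers may be negative (counted from the right); character
    numbers may not. Without a character number the whole field is returned.
    """
    field_part, _, char_part = pos.partition(".")
    field_number = int(field_part)
    if field_number == 0:
        raise ValueError("field numbers start at 1 or -1")
    index = field_number - 1 if field_number > 0 else field_number
    selected = fields[index]
    if not char_part:
        return selected
    char_number = int(char_part)
    if char_number < 1:
        raise ValueError("character numbers must be positive")
    return selected[char_number - 1]


# The range 3,-2: third character through the next-to-last character.
middle = character_range("abcdef", 3, -2)       # "cde"

fields = ["alpha", "beta", "gamma", "delta"]
second_of_first = locate(fields, "1.2")          # "l"
first_of_next_to_last = locate(fields, "-2.1")   # "g"
```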
SEE ALSO
sort(1),
uninum(3)
AUTHOR
Bill Poser (billposer@alum.mit.edu)
LICENSE
GNU General Public License (http://www.gnu.org/licenses/gpl.txt),
version 2.