🚧 work in progress 👷
This directory contains code for canonicalizing and validating
UniMorph feature bundles according to the official documentation.
The included command-line tool um_canonicalize converts UniMorph files
(three-column TSV files consisting of lemma, inflection, and feature bundle) so
that the features are placed in a canonical order (see below). This tool is
written in pure Python and has no external dependencies except setuptools and
a YAML library.
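As a rough illustration of the three-column format described above, the following sketch reads UniMorph-style lines into (lemma, inflection, tags) triples. The function name `read_unimorph` and the sample line are hypothetical and are not taken from this package; the sketch also assumes that tags within a bundle are separated by semicolons, as is conventional in UniMorph data.

```python
import csv
import io
from typing import Iterable, Iterator, List, Tuple


def read_unimorph(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[str]]]:
    """Yields (lemma, inflection, tags) from three-column TSV lines.

    Hypothetical helper for illustration only; not part of um_canonicalize.
    """
    # QUOTE_NONE keeps quote characters in word forms as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # Skips blank lines.
        if len(row) != 3:
            raise ValueError(
                "line {}: expected 3 columns, got {}".format(
                    reader.line_num, len(row)
                )
            )
        lemma, inflection, bundle = row
        # Assumption: tags in a bundle are joined with ";".
        yield lemma, inflection, bundle.split(";")


# Made-up sample line, for illustration only.
sample = io.StringIO("walk\twalked\tV;PST\n")
entries = list(read_unimorph(sample))
```

Any file object opened in text mode can be passed in place of the `io.StringIO` sample.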
This package requires Python 3.6+.
You can install the package from GitHub directly using the following command:
pip install git+https://github.com/unimorph/um-canonicalize.git
After installation, the terminal command um_canonicalize will be available.
For instance, if you have the Modern Greek UniMorph file ell in your working
directory, you can issue:
um_canonicalize ell
This should enforce the canonicalization (described below) on the file, failing if any conflicts or inconsistencies are detected. Running the tool on its own (successful) output should print "0 feature bundles canonicalized".
We propose a canonical ordering of tags within a feature bundle as follows.
- The first tag is the "Part of Speech" tag.
- The remaining universal tags are then placed in lexicographic order of the
  category they represent; e.g., SEMEL ("semelfactive"), an Aktionsart feature,
  comes before ANIM ("animate"), an Animacy feature.
- Any language-specific (e.g., LGSPEC) tags are then placed at the end, in
  their lexicographic order.
This ordering is enforced by um_canonicalize; running this tool on its own
output should print "0 feature bundles canonicalized" to STDERR.
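The ordering rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used by um_canonicalize: the `CATEGORY` table is a tiny hypothetical stand-in for the contents of tags.yaml (whose layout is not shown here), the function name `canonicalize` is invented, language-specific tags are assumed to be recognizable by an `LGSPEC` prefix, and ties within a category are assumed to be broken by tag name.

```python
# Hypothetical tag-to-category table; the real data lives in tags.yaml.
CATEGORY = {
    "N": "Part of Speech",
    "V": "Part of Speech",
    "SEMEL": "Aktionsart",
    "ANIM": "Animacy",
    "PL": "Number",
    "PST": "Tense",
}


def canonicalize(bundle: str) -> str:
    """Reorders a semicolon-separated feature bundle (illustrative sketch)."""
    pos, universal, specific = [], [], []
    for tag in bundle.split(";"):
        if tag.startswith("LGSPEC"):
            # Assumption: language-specific tags carry an LGSPEC prefix.
            specific.append(tag)
        elif tag not in CATEGORY:
            raise ValueError("unknown tag: {}".format(tag))
        elif CATEGORY[tag] == "Part of Speech":
            pos.append(tag)
        else:
            universal.append(tag)
    if len(pos) != 1:
        raise ValueError("expected exactly one part-of-speech tag")
    # Sorts by category name first; the tag name is an assumed tie-breaker.
    universal.sort(key=lambda tag: (CATEGORY[tag], tag))
    specific.sort()
    return ";".join(pos + universal + specific)


result = canonicalize("ANIM;SEMEL;V;LGSPEC2;LGSPEC1")
```

Here `result` places `V` first, then `SEMEL` before `ANIM` (because "Aktionsart" sorts before "Animacy"), then the language-specific tags. Applying `canonicalize` to its own output leaves the bundle unchanged, which mirrors the "0 feature bundles canonicalized" behavior described above.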
See CHANGES.md for changes to the schema since UniMorph 2.0.
The list of features in tags.yaml has some known gaps. See CONTRIBUTING.md
for information about how to submit improvements to this file.
These tools were created by Kyle Gorman.