-
Notifications
You must be signed in to change notification settings - Fork 42
Subject vocabulary formats
A subject vocabulary defines the subjects available for automated indexing or classification. Typically this will be a thesaurus, a classification or a list of subject headings. Annif doesn't care much about the internal structure of a subject vocabulary, it just needs to know the URIs and preferred labels (a.k.a. terms or descriptors) of each subject/class/concept. If the vocabulary includes also notion codes, e.g. as in UDC, also they can be given.
The simple TSV subject vocabulary format only specifies URIs and labels for concepts. The vocabulary file is UTF-8 encoded TSV (tab separated values) file with the file extension .tsv
, where the first column contains a subject URI and the second column its label (and the optional third column the notation code). The format is the same as the extended subject file format for documents, specified below. For example:
<http://example.org/thesaurus/subj1> networking
<http://example.org/thesaurus/subj2> computer science
<http://example.org/thesaurus/subj3> Internet Protocol 42.42
A subject vocabulary can also be given as a SKOS/RDF file. All common RDF serializations (i.e. those supported by rdflib) are supported, including RDF/XML, Turtle and N-Triples.
- Home
- Getting started
- System requirements
- Optional features and dependencies
- Usage with Docker
- Architecture
- Commands
- Web user interface
- REST API
- Corpus formats
- Project configuration
- Analyzers
- Transforms
- Language detection
- Hugging Face Hub integration
- Achieving good results
- Reusing preprocessed training data
- Running as a WSGI service
- Backward compatibility between Annif releases
- Backends
- Development flow, branches and tags
- Release process
- Creating a new backend