Specification of the value-to-form processing in Lexibank datasets:
The value-to-form processing is divided into two steps, implemented as methods:
FormSpec.split
: Splits a string into individual form chunks.

FormSpec.clean
: Normalizes a form chunk.
These methods use the attributes of a FormSpec instance to configure their behaviour.
brackets
: {'(': ')'}
Pairs of strings that should be recognized as brackets, specified as a dict mapping each opening string to its closing string

separators
: ,
Iterable of single-character tokens that should be recognized as word separators

missing_data
: ('?', '-')
Iterable of strings that are used to mark missing data

strip_inside_brackets
: True
Flag signaling whether to strip content in brackets (and strip leading and trailing whitespace)

replacements
: [('†', ''), ('*', ''), ('[', ''), (']', ''), ('~', ''), ('?', ''), (';', ''), ('+', ''), ('-', ''), ('á', 'a'), ('à', 'a'), ('ā', 'a'), ('ï', 'ɪ'), ('í', 'ɪ'), ('ì', 'ɪ'), ('ī', 'ɪ'), ('ē', 'e'), ('ō', 'o'), ('ū', 'u'), (' ', '_'), (',_', ', ')]
List of pairs (source, target) used to replace occurrences of source in forms with target (before stripping content in brackets)

first_form_only
: True
Flag signaling whether at most one form should be returned from split, effectively ignoring any spelling variants, etc.

normalize_whitespace
: True
Flag signaling whether to normalize whitespace, i.e. stripping leading and trailing whitespace and collapsing multi-character whitespace to single spaces

normalize_unicode
: None
Unicode normalization form to use for the input of split (None, 'NFD' or 'NFC')
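
To make the interplay of these attributes concrete, here is a minimal, self-contained sketch of the two steps. It is not the library's implementation: the class name SketchFormSpec and the helper _split_outside_brackets are hypothetical, the method signatures are simplified, and the sketch assumes single-character brackets, an empty replacements list by default, and that split passes each chunk through clean.

```python
import re
import unicodedata
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchFormSpec:
    """Illustrative stand-in for the attributes described above (hypothetical)."""
    brackets: dict = field(default_factory=lambda: {"(": ")"})
    separators: str = ","
    missing_data: tuple = ("?", "-")
    strip_inside_brackets: bool = True
    replacements: list = field(default_factory=list)  # assumption: empty here
    first_form_only: bool = True
    normalize_whitespace: bool = True
    normalize_unicode: Optional[str] = None

    def clean(self, form):
        """Normalize one form chunk; return None if nothing usable remains."""
        if form in self.missing_data:
            return None
        # Replacements are applied before bracketed content is stripped.
        for source, target in self.replacements:
            form = form.replace(source, target)
        if self.strip_inside_brackets:
            for opening, closing in self.brackets.items():
                pattern = re.escape(opening) + ".*?" + re.escape(closing)
                form = re.sub(pattern, "", form)
            form = form.strip()
        if self.normalize_whitespace:
            form = " ".join(form.split())
        return form or None

    def _split_outside_brackets(self, value):
        """Split on separators, ignoring separators inside brackets."""
        chunks, current, closers = [], [], []
        for char in value:
            if closers and char == closers[-1]:
                closers.pop()
            elif char in self.brackets:
                closers.append(self.brackets[char])
            if char in self.separators and not closers:
                chunks.append("".join(current))
                current = []
            else:
                current.append(char)
        chunks.append("".join(current))
        return chunks

    def split(self, value):
        """Split a raw value into cleaned forms."""
        if self.normalize_unicode:
            value = unicodedata.normalize(self.normalize_unicode, value)
        if self.normalize_whitespace:
            value = " ".join(value.split())
        if value in self.missing_data:
            return []
        chunks = self._split_outside_brackets(value)
        forms = [f for f in (self.clean(c.strip()) for c in chunks) if f]
        return forms[:1] if self.first_form_only else forms


spec = SketchFormSpec(first_form_only=False)
print(spec.split("hand (of a person), arm , ?"))  # ['hand', 'arm']
print(SketchFormSpec().split("hand (arm, lower), foot"))  # ['hand']
```

In this sketch the comma inside the brackets does not start a new form, the missing-data marker ? is dropped, and first_form_only=True truncates the result to the first surviving form.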