Specification of form manipulation

Specification of the value-to-form processing in Lexibank datasets:

The value-to-form processing is divided into two steps, implemented as methods:

FormSpec.split: Splits a string into individual form chunks.
FormSpec.clean: Normalizes a form.

These methods use the attributes of a FormSpec instance to configure their behaviour.

brackets: {'(': ')'} Pairs of strings that should be recognized as brackets, specified as dict mapping opening string to closing string
separators: , Iterable of single-character tokens that should be recognized as word separators
missing_data: ('?', '-') Iterable of strings that are used to mark missing data
strip_inside_brackets: True Flag signaling whether to strip content in brackets (and strip leading and trailing whitespace)
replacements: [('†', ''), ('*', ''), ('[', ''), (']', ''), ('~', ''), ('?', ''), (';', ''), ('+', ''), ('-', ''), ('á', 'a'), ('à', 'a'), ('ā', 'a'), ('ï', 'ɪ'), ('í', 'ɪ'), ('ì', 'ɪ'), ('ī', 'ɪ'), ('ē', 'e'), ('ō', 'o'), ('ū', 'u'), (' ', '_'), (',_', ', ')] List of pairs (source, target) used to replace occurrences of source in forms with target (before stripping content in brackets)
first_form_only: True Flag signaling whether at most one form should be returned from split - effectively ignoring any spelling variants, etc.
normalize_whitespace: True Flag signaling whether to normalize whitespace - stripping leading and trailing whitespace and collapsing multi-character whitespace to single spaces
normalize_unicode: None Unicode normalization form to apply to the input of split (None, 'NFD' or 'NFC')
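
To make the interplay of these attributes more concrete, the following is a minimal sketch in plain Python of how a split step and a clean step could use them. It is not the Lexibank implementation: the names split_value, clean_form and strip_brackets are hypothetical, the replacements list is a shortened subset of the one above, and the split ignores the case where a separator occurs inside brackets.

```python
import re
import unicodedata

# Illustrative subset of the settings listed above (assumption: shortened
# replacements list, chosen only to keep the example readable).
BRACKETS = {"(": ")"}
SEPARATORS = ","
MISSING_DATA = ("?", "-")
REPLACEMENTS = [("*", ""), ("á", "a"), ("ā", "a")]


def strip_brackets(text, brackets=BRACKETS):
    """Remove (non-nested) bracketed content and trim surrounding whitespace."""
    for opening, closing in brackets.items():
        pattern = re.escape(opening) + r".*?" + re.escape(closing)
        text = re.sub(pattern, "", text)
    return text.strip()


def clean_form(form, strip_inside_brackets=True):
    """Apply the replacements first, then optionally strip bracketed content."""
    for source, target in REPLACEMENTS:
        form = form.replace(source, target)
    if strip_inside_brackets:
        form = strip_brackets(form)
    return form.strip()


def split_value(value, first_form_only=True, normalize_unicode=None,
                normalize_whitespace=True):
    """Turn a raw value into a list of cleaned forms."""
    if normalize_unicode:
        value = unicodedata.normalize(normalize_unicode, value)
    if normalize_whitespace:
        # Collapse runs of whitespace and trim both ends.
        value = re.sub(r"\s+", " ", value).strip()
    if value in MISSING_DATA:
        return []
    # Simplification: split on every separator, even inside brackets.
    chunks = re.split("[" + re.escape(SEPARATORS) + "]", value)
    forms = [clean_form(chunk) for chunk in chunks]
    forms = [form for form in forms if form]
    return forms[:1] if first_form_only else forms


print(split_value("  *ába  (dial.),  kita "))                         # ['aba']
print(split_value("  *ába  (dial.),  kita ", first_form_only=False))  # ['aba', 'kita']
print(split_value("?"))                                               # []
```

In this sketch a value that equals one of the missing-data markers yields no forms at all, and first_form_only simply truncates the list of cleaned forms to its first element.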
