This module adds syllabification and stress labeling to phonetic transcriptions of Icelandic. You can use it to enrich existing dictionaries, i.e. produce dictionaries with syllable and stress labeling, or as an element in a TTS pipeline, to add those labelings to transcribed input text.
Regardless of the use case, you have to know your phone set. The module provides the necessary data for the SAMPA and IPA phonetic alphabets; if you are using something else, you have to create your own data/cons_clusters_your_alphabet.txt and data/vowels_your_alphabet.txt. All filenames are defined in src/dictionaries.py; adjust them according to your setup.
To produce a pronunciation dictionary, call syllabify_and_label_dict(your_dictionary), where your_dictionary is the filename of a pronunciation dictionary in plain text format: one entry per line, with word and transcription separated by a tab (\t) and phones separated by single spaces:
aaron a: r O n
abbadísin a p a t i s I n
abbas a p a s
...
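The input format described above can be checked before handing a file to the module. The following is a minimal sketch, not part of the module: read_pron_dict is a hypothetical helper name, and it assumes only what the format description states (one entry per line, a tab between word and transcription, spaces between phones).

```python
from typing import Iterable, List, Tuple


def read_pron_dict(lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Parse 'word<TAB>phone phone ...' lines into (word, phones) pairs.

    Hypothetical helper for illustration; it is not the module's own reader.
    """
    entries = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        try:
            # Split on the first tab only: word on the left, phones on the right.
            word, transcription = line.split("\t", 1)
        except ValueError:
            raise ValueError(
                f"line {line_no}: expected word<TAB>transcription"
            ) from None
        entries.append((word.strip(), transcription.split()))
    return entries


sample = [
    "aaron\ta: r O n\n",
    "abbadísin\ta p a t i s I n\n",
    "abbas\ta p a s\n",
]
entries = read_pron_dict(sample)
```

For a real file, the same function could be fed an open file handle, for example `read_pron_dict(open(path, encoding="utf-8"))`.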
Two output formats are implemented; the underlying syllable structure, however, makes it easy to adapt the output to your needs.
Syllables with stress labels:
("aaron" nil (((a: ) 1) ((r O n ) 0)))
("abbadísin" nil (((a ) 1) ((p a ) 0) ((t i ) 1) ((s I n ) 0)))
("abbas" nil (((a ) 1) ((p a s ) 0)))
...
Plain syllable format (no stress labels):
aaron a:.r O n
abbadísin a.p a.t i.s I n
abbas a.p a s
To label phonetic transcriptions in a TTS pipeline, the module needs two lists: a list of words and a list of their transcripts, where the indices in both lists correspond to each other. That is, the transcript for the word at word_list[n] is found at transcriptions_list[n].
Example:
# Input:
['hvernig', 'hefur', 'þú', 'það']
['k_h v E r t n I G', 'h E: v Y r', 'T u:', 'T a: D']
# Output, syllables only:
['k_h v E r t.n I G', 'h E:.v Y r', 'T u:', 'T a: D']
# Output, syllables and stress:
['k_h v E1 r t.n I0 G', 'h E:1.v Y0 r', 'T u:1', 'T a:1 D']
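Because the two lists are matched by index, a length mismatch silently shifts every following transcript. The sketch below shows one way a caller might guard against that and then break the syllables-only output into per-syllable phone lists. Both functions, pair_words_with_transcripts and split_syllables, are hypothetical helpers written for this example; they rely only on the list layout and the '.' syllable separator shown above.

```python
from typing import List, Tuple


def pair_words_with_transcripts(
    word_list: List[str], transcriptions_list: List[str]
) -> List[Tuple[str, str]]:
    """Pair word_list[n] with transcriptions_list[n], rejecting unequal lengths."""
    if len(word_list) != len(transcriptions_list):
        raise ValueError(
            f"{len(word_list)} words but {len(transcriptions_list)} transcripts"
        )
    return list(zip(word_list, transcriptions_list))


def split_syllables(syllabified: str) -> List[List[str]]:
    """Split a syllables-only transcript into one phone list per syllable."""
    return [syllable.split() for syllable in syllabified.split(".")]


words = ['hvernig', 'hefur', 'þú', 'það']
syllabified = ['k_h v E r t.n I G', 'h E:.v Y r', 'T u:', 'T a: D']

pairs = pair_words_with_transcripts(words, syllabified)
structure = {word: split_syllables(transcript) for word, transcript in pairs}
```

The stress-labeled output is not handled here: a trailing digit cannot be stripped blindly, since a phone symbol in your phone set may itself end in a digit.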
This application is still in development. If you encounter any errors, feel free to open an issue in the issue tracker. You can also contact us via email.
You can contribute to this project by forking it, creating a private branch and opening a new pull request.
Copyright © 2020, 2021 Grammatek ehf.
This software is developed under the auspices of the Icelandic Government 5-Year Language Technology Program, described here and here (English).
This software is licensed under the Apache License