This dataset was created as part of the Blizzard Challenge 2023, with the goal of improving French homograph disambiguation for text-to-speech models. Homographs are words that share a spelling but differ in meaning or pronunciation depending on their context in a sentence. They are prevalent in French, and incorrect disambiguation can lead to intelligibility issues.
To disambiguate the pronunciation of homographs, we propose a system that relies on part-of-speech (POS) tagging and suffix-family identification. For POS extraction, we use the pre-trained POS tagger qanastek/pos-french-camembert-flair.
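To illustrate the POS-based part of this idea, here is a minimal sketch of a lookup keyed on (homograph, POS tag). It is not the system described above: it omits suffix families and the tagger itself, and the names `ANNOTATIONS`, `build_lookup` and `disambiguate` are hypothetical. The first triple is the example row documented further down in this README; the second uses a placeholder `VERB` tag that is an assumption, not a value confirmed by the tagger's tagset.

```python
from collections import Counter, defaultdict

# (homograph, POS tag, pronunciation) triples.
# Row 1 is the example row documented in this README.
# Row 2 is hypothetical: the "VERB" tag is a placeholder assumption.
ANNOTATIONS = [
    ("adoptions", "NFP", "a d ɔ p s j ɔ̃"),
    ("adoptions", "VERB", "a d ɔ p t j ɔ̃"),
]


def build_lookup(annotations):
    """Map each (homograph, pos) pair to its most frequent pronunciation."""
    votes = defaultdict(Counter)
    for homograph, pos, pronunciation in annotations:
        votes[(homograph, pos)][pronunciation] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


def disambiguate(homograph, pos, lookup):
    """Return the pronunciation for a tagged homograph, or None if unseen."""
    return lookup.get((homograph.lower(), pos))


lookup = build_lookup(ANNOTATIONS)
print(disambiguate("adoptions", "NFP", lookup))
```

Returning `None` for unseen pairs leaves the fallback policy (for example, a default pronunciation per homograph) to the caller.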
The file homograph_disambiguation_dataset.csv labels the pronunciation of 35 homographs given their POS tag in a sentence. A total of 4,091 examples were manually annotated by a native French speaker. Sentences were extracted from the French data used to train FlauBERT, a BERT-based language model for French.
column name | type | description | example |
sentence_id | int | The unique identifier of the sentence in the dataset | 1188907 |
homograph | string | The target homograph to disambiguate | adoptions |
homograph_index | int | The index position of the token corresponding to the target homograph. Tokens can be words, punctuation marks, or symbols. Indexing starts at 0. | 6 |
pos | string | The POS tag of the target homograph in the sentence. For additional information regarding POS tags, please refer to the nomenclature provided in the POS Tagger Documentation | NFP |
sentence | string | The sentence containing the homograph to disambiguate. Words, punctuation marks, and symbols are separated by a space, so the sentence can be tokenized by splitting on spaces. | Je disais que 80 % des adoptions par les familles françaises sont faites à l ' étranger . |
homograph_pronunciation | string | The pronunciation of the target homograph in the sentence. Phonemes are separated by spaces. We use IPA symbols, as in this open-source phonemizer | a d ɔ p s j ɔ̃ |
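The column layout above can be exercised with a short sketch. It assumes the CSV is comma-delimited with a header row using the column names listed in the table; neither detail is confirmed here. The inline sample contains only the example row from the table, and `load_rows` and `target_token` are hypothetical helper names.

```python
import csv
import io

# Inline sample with the documented example row. For the real file, pass
# open("homograph_disambiguation_dataset.csv", encoding="utf-8", newline="")
# instead (assumed: comma-delimited, header row with these column names).
SAMPLE_CSV = """sentence_id,homograph,homograph_index,pos,sentence,homograph_pronunciation
1188907,adoptions,6,NFP,Je disais que 80 % des adoptions par les familles françaises sont faites à l ' étranger .,a d ɔ p s j ɔ̃
"""


def load_rows(handle):
    """Parse CSV rows and convert the two integer columns."""
    rows = []
    for row in csv.DictReader(handle):
        row["sentence_id"] = int(row["sentence_id"])
        row["homograph_index"] = int(row["homograph_index"])
        rows.append(row)
    return rows


def target_token(row):
    """Return the token at homograph_index after splitting on spaces."""
    tokens = row["sentence"].split(" ")
    return tokens[row["homograph_index"]]


rows = load_rows(io.StringIO(SAMPLE_CSV))
first = rows[0]
phonemes = first["homograph_pronunciation"].split(" ")

# The 0-based index should point at the homograph itself.
assert target_token(first) == first["homograph"]
print(target_token(first), phonemes)
```

Splitting on a single space mirrors how the sentences are stored, so the index lines up without a separate tokenizer.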
Below is the number of annotated examples for each pronunciation of each homograph:
homograph | pronunciation | count |
actions | a k s j ɔ̃ | 136 |
actions | a k t j ɔ̃ | 6 |
adoptions | a d ɔ p s j ɔ̃ | 49 |
adoptions | a d ɔ p t j ɔ̃ | 20 |
affections | a f ɛ k s j ɔ̃ | 90 |
affections | a f ɛ k t j ɔ̃ | 19 |
affluent | a f l y ɑ̃ | 66 |
affluent | a f l y | 21 |
as | a s | 64 |
as | a | 48 |
bus | b y s | 166 |
bus | b y | 99 |
but | b y t | 48 |
but | b y | 6 |
cacher | k a ʃ ɛ ʁ | 86 |
cacher | k a ʃ e | 68 |
collections | k ɔ l ɛ k s j ɔ̃ | 87 |
collections | k ɔ l ɛ k t j ɔ̃ | 21 |
content | k ɔ̃ t ɑ̃ | 99 |
content | k ɔ̃ t | 15 |
convient | k ɔ̃ v j ɛ̃ | 43 |
convient | k ɔ̃ v i | 1 |
couvent | k u v ɑ̃ | 62 |
couvent | k u v | 20 |
détections | d e t ɛ k s j ɔ̃ | 41 |
détections | d e t ɛ k t j ɔ̃ | 20 |
est | ɛ s t | 87 |
est | ɛ | 40 |
excellent | ɛ k s ɛ l ɑ̃ | 138 |
excellent | ɛ k s ɛ l | 21 |
ferment | f ɛ ʁ m ɑ̃ | 86 |
ferment | f ɛ ʁ m | 45 |
fier | f j ɛ ʁ | 130 |
fier | f j e | 37 |
fils | f i s | 103 |
fils | f i l | 8 |
intentions | ɛ̃ t ɑ̃ s j ɔ̃ | 66 |
intentions | ɛ̃ t ɑ̃ t j ɔ̃ | 17 |
minerai | m i n ə ʁ ɛ | 49 |
minerai | m i n ə ʁ e | 5 |
négligent | n e ɡ l i ʒ ɑ̃ | 114 |
négligent | n e ɡ l i ʒ | 21 |
options | ɔ p s j ɔ̃ | 114 |
options | ɔ p t j ɔ̃ | 20 |
os | ɔ s | 34 |
os | o | 18 |
parent | p a ʁ ɑ̃ | 142 |
parent | p a ʁ | 20 |
plus | p l y s | 100 |
plus | p l y | 27 |
portions | p ɔ ʁ s j ɔ̃ | 65 |
portions | p ɔ ʁ t j ɔ̃ | 22 |
pressent | p ʁ ɛ s ɑ̃ | 28 |
pressent | p ʁ ɛ s | 14 |
reporter | ʁ ə p ɔ ʁ t ɛ ʁ | 137 |
reporter | ʁ ə p ɔ ʁ t e | 37 |
résident | ʁ e z i d ɑ̃ | 124 |
résident | ʁ e z i d | 21 |
sens | s ɑ̃ s | 91 |
sens | s ɑ̃ | 20 |
somnolent | s ɔ m n ɔ l ɑ̃ | 108 |
somnolent | s ɔ m n ɔ l | 16 |
supporter | s y p ɔ ʁ t ɛ ʁ | 114 |
supporter | s y p ɔ ʁ t e | 36 |
urgent | y ʁ ʒ ɑ̃ | 127 |
urgent | y ʁ ʒ | 18 |
violent | v j ɔ l ɑ̃ | 84 |
violent | v j ɔ l | 21 |
vis | v i s | 128 |
vis | v i | 37 |
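As a sanity check on the statistics above, the sketch below sums the counts and measures how skewed each homograph is toward one pronunciation. The counts are copied from the table; the `COUNTS` layout and the `minority_share` helper are hypothetical names introduced for illustration.

```python
# Per-homograph counts copied from the statistics table, in listed order:
# (count for first pronunciation, count for second pronunciation).
COUNTS = {
    "actions": (136, 6), "adoptions": (49, 20), "affections": (90, 19),
    "affluent": (66, 21), "as": (64, 48), "bus": (166, 99),
    "but": (48, 6), "cacher": (86, 68), "collections": (87, 21),
    "content": (99, 15), "convient": (43, 1), "couvent": (62, 20),
    "détections": (41, 20), "est": (87, 40), "excellent": (138, 21),
    "ferment": (86, 45), "fier": (130, 37), "fils": (103, 8),
    "intentions": (66, 17), "minerai": (49, 5), "négligent": (114, 21),
    "options": (114, 20), "os": (34, 18), "parent": (142, 20),
    "plus": (100, 27), "portions": (65, 22), "pressent": (28, 14),
    "reporter": (137, 37), "résident": (124, 21), "sens": (91, 20),
    "somnolent": (108, 16), "supporter": (114, 36), "urgent": (127, 18),
    "violent": (84, 21), "vis": (128, 37),
}


def minority_share(counts):
    """Fraction of examples that belong to the rarer pronunciation."""
    return min(counts) / sum(counts)


total_examples = sum(sum(counts) for counts in COUNTS.values())
shares = {word: minority_share(counts) for word, counts in COUNTS.items()}
most_imbalanced = min(shares, key=shares.get)
most_balanced = max(shares, key=shares.get)

print(len(COUNTS), total_examples)
print(most_imbalanced, round(shares[most_imbalanced], 3))
print(most_balanced, round(shares[most_balanced], 3))
```

The total matches the 4,091 annotated examples across 35 homographs. The minority share may help when deciding whether to stratify a train/test split or to report per-pronunciation accuracy rather than a single overall score.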
Any contribution to this repository is more than welcome.
If you have any feedback, please send it to [email protected].
© [2023] Ubisoft Entertainment. All Rights Reserved