French Homograph Disambiguation Dataset

This dataset was created as part of the Blizzard Challenge 2023, with the goal of improving French homograph disambiguation for text-to-speech models. Homographs are words that share a spelling but differ in meaning or pronunciation depending on their context in a sentence. They are prevalent in French, and pronouncing them incorrectly can make synthesized speech hard to understand.

To disambiguate the pronunciation of homographs, we propose a system that relies on part-of-speech (POS) tagging and the identification of suffix families. For POS extraction, we use the pre-trained POS tagger qanastek/pos-french-camembert-flair.
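
As a minimal sketch of the POS-based idea, the pronunciation can be looked up from the pair (homograph, POS tag). The names `PRONUNCIATIONS` and `disambiguate` are hypothetical and not part of this repository. The `NFP` entry comes from the example row documented in this README, while the `VERB` entry is an illustrative assumption about how the imperfect verb form would be tagged and pronounced. In practice the tag would come from the POS tagger rather than being passed in by hand.

```python
from typing import Dict, Optional, Tuple

# (homograph, POS tag) -> space-separated IPA phonemes.
# The NFP entry is the example row from this README. The VERB entry is an
# illustrative assumption (imperfect verb form), not copied from the dataset.
PRONUNCIATIONS: Dict[Tuple[str, str], str] = {
    ("adoptions", "NFP"): "a d ɔ p s j ɔ̃",
    ("adoptions", "VERB"): "a d ɔ p t j ɔ̃",
}


def disambiguate(token: str, pos_tag: str) -> Optional[str]:
    """Return the pronunciation for a (token, POS tag) pair, or None if unknown."""
    return PRONUNCIATIONS.get((token.lower(), pos_tag))


print(disambiguate("adoptions", "NFP"))
print(disambiguate("adoptions", "VERB"))
print(disambiguate("maisons", "NFP"))  # not a listed homograph
```

Returning `None` for unknown pairs lets a caller fall back to its default grapheme-to-phoneme conversion for words that are not homographs.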

Dataset

The file homograph_disambiguation_dataset.csv labels the pronunciation of 35 homographs given their POS tag in a sentence. A total of 4,091 examples were manually annotated by a native French speaker. Sentences were extracted from the French data used to train FlauBERT, a state-of-the-art BERT for French.

| column name | type | description | example |
|---|---|---|---|
| sentence_id | int | The unique identifier of the sentence in the dataset | 1188907 |
| homograph | string | The target homograph to disambiguate | adoptions |
| homograph_index | int | The index position of the token corresponding to the target homograph. Tokens can be words, punctuation marks or symbols. Indexing starts at 0. | 6 |
| pos | string | The POS tag of the target homograph in the sentence. For additional information regarding POS tags, please refer to the nomenclature provided in the POS Tagger Documentation | NFP |
| sentence | string | The sentence containing the homograph to disambiguate. Each word, punctuation mark or symbol is separated by a space, making it easier to tokenize the sentence. | Je disais que 80 % des adoptions par les familles françaises sont faites à l ' étranger . |
| homograph_pronunciation | string | The pronunciation of the target homograph in the sentence. Phonemes are separated by spaces. We use IPA symbols, as in this open-source phonemizer | a d ɔ p s j ɔ̃ |
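
The columns described above can be read with Python's standard `csv` module. The sketch below is only illustrative: `SAMPLE_CSV` is an in-memory stand-in containing the single example row from this README, and the comma delimiter, the header row and the helper name `load_rows` are assumptions rather than confirmed properties of homograph_disambiguation_dataset.csv.

```python
import csv
import io

# Stand-in for homograph_disambiguation_dataset.csv, holding only the example
# row documented in this README. Comma delimiter and header row are assumptions.
SAMPLE_CSV = (
    "sentence_id,homograph,homograph_index,pos,sentence,homograph_pronunciation\n"
    "1188907,adoptions,6,NFP,"
    "Je disais que 80 % des adoptions par les familles françaises "
    "sont faites à l ' étranger .,a d ɔ p s j ɔ̃\n"
)


def load_rows(handle):
    """Parse dataset rows, splitting the sentence and pronunciation on spaces."""
    rows = []
    for record in csv.DictReader(handle):
        tokens = record["sentence"].split(" ")
        index = int(record["homograph_index"])
        # Sanity check: the 0-based index must point at the target homograph
        # (compared case-insensitively, in case a sentence starts with it).
        if tokens[index].lower() != record["homograph"].lower():
            raise ValueError(f"index mismatch in sentence {record['sentence_id']}")
        rows.append({
            "sentence_id": int(record["sentence_id"]),
            "homograph": record["homograph"],
            "pos": record["pos"],
            "tokens": tokens,
            "index": index,
            "phonemes": record["homograph_pronunciation"].split(" "),
        })
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(rows[0]["tokens"][rows[0]["index"]], rows[0]["phonemes"])
```

To read the real file, an open file handle (for example `open(path, encoding="utf-8", newline="")`) could be passed in place of the `io.StringIO` object, provided the delimiter assumption holds.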

The table below gives the number of annotated examples for each pronunciation of each homograph:

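| homograph | pronunciation | count |
|---|---|---|
| actions | a k s j ɔ̃ | 136 |
| actions | a k t j ɔ̃ | 6 |
| adoptions | a d ɔ p s j ɔ̃ | 49 |
| adoptions | a d ɔ p t j ɔ̃ | 20 |
| affections | a f ɛ k s j ɔ̃ | 90 |
| affections | a f ɛ k t j ɔ̃ | 19 |
| affluent | a f l y ɑ̃ | 66 |
| affluent | a f l y | 21 |
| as | a s | 64 |
| as | a | 48 |
| bus | b y s | 166 |
| bus | b y | 99 |
| but | b y t | 48 |
| but | b y | 6 |
| cacher | k a ʃ ɛ ʁ | 86 |
| cacher | k a ʃ e | 68 |
| collections | k ɔ l ɛ k s j ɔ̃ | 87 |
| collections | k ɔ l ɛ k t j ɔ̃ | 21 |
| content | k ɔ̃ t ɑ̃ | 99 |
| content | k ɔ̃ t | 15 |
| convient | k ɔ̃ v j ɛ̃ | 43 |
| convient | k ɔ̃ v i | 1 |
| couvent | k u v ɑ̃ | 62 |
| couvent | k u v | 20 |
| détections | d e t ɛ k s j ɔ̃ | 41 |
| détections | d e t ɛ k t j ɔ̃ | 20 |
| est | ɛ s t | 87 |
| est | ɛ | 40 |
| excellent | ɛ k s ɛ l ɑ̃ | 138 |
| excellent | ɛ k s ɛ l | 21 |
| ferment | f ɛ ʁ m ɑ̃ | 86 |
| ferment | f ɛ ʁ m | 45 |
| fier | f j ɛ ʁ | 130 |
| fier | f j e | 37 |
| fils | f i s | 103 |
| fils | f i l | 8 |
| intentions | ɛ̃ t ɑ̃ s j ɔ̃ | 66 |
| intentions | ɛ̃ t ɑ̃ t j ɔ̃ | 17 |
| minerai | m i n ə ʁ ɛ | 49 |
| minerai | m i n ə ʁ e | 5 |
| négligent | n e ɡ l i ʒ ɑ̃ | 114 |
| négligent | n e ɡ l i ʒ | 21 |
| options | ɔ p s j ɔ̃ | 114 |
| options | ɔ p t j ɔ̃ | 20 |
| os | ɔ s | 34 |
| os | o | 18 |
| parent | p a ʁ ɑ̃ | 142 |
| parent | p a ʁ | 20 |
| plus | p l y s | 100 |
| plus | p l y | 27 |
| portions | p ɔ ʁ s j ɔ̃ | 65 |
| portions | p ɔ ʁ t j ɔ̃ | 22 |
| pressent | p ʁ ɛ s ɑ̃ | 28 |
| pressent | p ʁ ɛ s | 14 |
| reporter | ʁ ə p ɔ ʁ t ɛ ʁ | 137 |
| reporter | ʁ ə p ɔ ʁ t e | 37 |
| résident | ʁ e z i d ɑ̃ | 124 |
| résident | ʁ e z i d | 21 |
| sens | s ɑ̃ s | 91 |
| sens | s ɑ̃ | 20 |
| somnolent | s ɔ m n ɔ l ɑ̃ | 108 |
| somnolent | s ɔ m n ɔ l | 16 |
| supporter | s y p ɔ ʁ t ɛ ʁ | 114 |
| supporter | s y p ɔ ʁ t e | 36 |
| urgent | y ʁ ʒ ɑ̃ | 127 |
| urgent | y ʁ ʒ | 18 |
| violent | v j ɔ l ɑ̃ | 84 |
| violent | v j ɔ l | 21 |
| vis | v i s | 128 |
| vis | v i | 37 |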
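
Because the two pronunciations of a homograph are rarely balanced, a useful reference point when evaluating a disambiguator is the accuracy obtained by always predicting the more frequent pronunciation. The sketch below computes that majority-class baseline for three homographs whose counts are copied from the statistics above. The names `COUNTS` and `majority_baseline` are hypothetical.

```python
# Counts copied from the statistics above for three homographs:
# (count of first listed pronunciation, count of second listed pronunciation).
COUNTS = {
    "actions": (136, 6),
    "bus": (166, 99),
    "convient": (43, 1),
}


def majority_baseline(counts):
    """Accuracy of always predicting the most frequent pronunciation."""
    return {word: max(pair) / sum(pair) for word, pair in counts.items()}


baseline = majority_baseline(COUNTS)
for word, accuracy in baseline.items():
    print(f"{word}: {accuracy:.3f}")
```

A system that does not beat this baseline on a skewed homograph such as actions or convient is not using the context in any useful way, even if its raw accuracy looks high.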

Contributing

Any contribution to this repository is more than welcome.
If you have any feedback, please send it to [email protected].

© [2023] Ubisoft Entertainment. All Rights Reserved