French Homograph Disambiguation Dataset

This dataset was created as part of the Blizzard Challenge 2023, with the goal of improving French homograph disambiguation for text-to-speech models. Homographs are words that share the same spelling but differ in meaning or pronunciation depending on their context in a sentence. They are prevalent in French, and incorrect disambiguation can make synthesized speech harder to understand.

To disambiguate the pronunciation of homographs, we propose a system that relies on part-of-speech (POS) tagging and the identification of suffix families. For POS extraction, we use the pre-trained POS tagger qanastek/pos-french-camembert-flair.
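
As a rough illustration of the POS-based idea, the minimal sketch below maps a (homograph, POS tag) pair to a pronunciation. It is not the system's actual implementation: the `PRONUNCIATIONS` table and the `disambiguate` function are hypothetical names, the `VERB` tag name is an assumption about the tagger's tagset, and the POS tag is assumed to have already been produced by the tagger. Only the `NFP` entry is taken directly from the dataset example.

```python
from typing import Optional

# Hypothetical lookup table: (homograph, POS tag) -> space-separated IPA phonemes.
# The NFP entry mirrors the dataset example; the "VERB" tag name is an assumption.
PRONUNCIATIONS = {
    ("adoptions", "NFP"): "a d ɔ p s j ɔ̃",   # noun: "les adoptions"
    ("adoptions", "VERB"): "a d ɔ p t j ɔ̃",  # verb: "nous adoptions"
}


def disambiguate(homograph: str, pos: str) -> Optional[str]:
    """Return the pronunciation for a homograph given its POS tag, or None if unknown."""
    return PRONUNCIATIONS.get((homograph.lower(), pos))


print(disambiguate("adoptions", "NFP"))
```

A lookup that returns `None` for unknown pairs leaves room for a fallback, such as a default pronunciation, without hiding the fact that the pair was not covered.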

Dataset

The file homograph_disambiguation_dataset.csv gives the pronunciation of 35 homographs, together with their POS tag, in the context of a sentence. A total of 4,091 examples were manually annotated by a native French speaker. Sentences were extracted from the French data used to train FlauBERT, a BERT-based language model for French.

column name	type	description	example
sentence_id	int	The unique identifier of the sentence in the dataset	1188907
homograph	string	The target homograph to disambiguate	adoptions
homograph_index	int	The index position of the token corresponding to the target homograph. Tokens can be words, punctuation marks, or symbols. Indexing starts at 0.	6
pos	string	The POS tag of the target homograph in the sentence. For additional information regarding POS tags, please refer to the nomenclature provided in the POS Tagger Documentation	NFP
sentence	string	The sentence containing the homograph to disambiguate. Each word/punctuation/symbol is separated by a space, making it easier to tokenize the sentence.	Je disais que 80 % des adoptions par les familles françaises sont faites à l ' étranger .
homograph_pronunciation	string	The pronunciation of the target homograph in the sentence. Phonemes are separated by spaces. We use IPA symbols, as in this open-source phonemizer	a d ɔ p s j ɔ̃
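
The columns above can be read with Python's standard `csv` module. The sketch below is only an illustration: it assumes the file is comma-delimited with a header row matching the column names, and it parses an in-memory copy of the example row rather than the real file. The names `SAMPLE_CSV`, `load_rows`, and `target_token` are hypothetical.

```python
import csv
import io

# In-memory stand-in for the CSV file, built from the example row in the table above.
# Assumption: comma-delimited, with a header row matching the column names.
SAMPLE_CSV = (
    "sentence_id,homograph,homograph_index,pos,sentence,homograph_pronunciation\n"
    "1188907,adoptions,6,NFP,"
    "Je disais que 80 % des adoptions par les familles françaises sont faites à l ' étranger .,"
    "a d ɔ p s j ɔ̃\n"
)


def load_rows(handle):
    """Parse rows, converting the integer columns and splitting phonemes on spaces."""
    rows = []
    for row in csv.DictReader(handle):
        row["sentence_id"] = int(row["sentence_id"])
        row["homograph_index"] = int(row["homograph_index"])
        row["phonemes"] = row["homograph_pronunciation"].split(" ")
        rows.append(row)
    return rows


def target_token(row):
    """Return the token at homograph_index (0-based) in the space-tokenized sentence."""
    return row["sentence"].split(" ")[row["homograph_index"]]


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(target_token(rows[0]), rows[0]["pos"], rows[0]["phonemes"])
```

To read the actual file, the same `load_rows` function could be given a handle from `open("homograph_disambiguation_dataset.csv", encoding="utf-8", newline="")`, provided the delimiter assumption holds.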

Below are the statistics for the number of examples per pronunciation of each homograph:

homograph	pronunciation	count
actions	a k s j ɔ̃	136
actions	a k t j ɔ̃	6
adoptions	a d ɔ p s j ɔ̃	49
adoptions	a d ɔ p t j ɔ̃	20
affections	a f ɛ k s j ɔ̃	90
affections	a f ɛ k t j ɔ̃	19
affluent	a f l y ɑ̃	66
affluent	a f l y	21
as	a s	64
as	a	48
bus	b y s	166
bus	b y	99
but	b y t	48
but	b y	6
cacher	k a ʃ ɛ ʁ	86
cacher	k a ʃ e	68
collections	k ɔ l ɛ k s j ɔ̃	87
collections	k ɔ l ɛ k t j ɔ̃	21
content	k ɔ̃ t ɑ̃	99
content	k ɔ̃ t	15
convient	k ɔ̃ v j ɛ̃	43
convient	k ɔ̃ v i	1
couvent	k u v ɑ̃	62
couvent	k u v	20
détections	d e t ɛ k s j ɔ̃	41
détections	d e t ɛ k t j ɔ̃	20
est	ɛ s t	87
est	ɛ	40
excellent	ɛ k s ɛ l ɑ̃	138
excellent	ɛ k s ɛ l	21
ferment	f ɛ ʁ m ɑ̃	86
ferment	f ɛ ʁ m	45
fier	f j ɛ ʁ	130
fier	f j e	37
fils	f i s	103
fils	f i l	8
intentions	ɛ̃ t ɑ̃ s j ɔ̃	66
intentions	ɛ̃ t ɑ̃ t j ɔ̃	17
minerai	m i n ə ʁ ɛ	49
minerai	m i n ə ʁ e	5
négligent	n e ɡ l i ʒ ɑ̃	114
négligent	n e ɡ l i ʒ	21
options	ɔ p s j ɔ̃	114
options	ɔ p t j ɔ̃	20
os	ɔ s	34
os	o	18
parent	p a ʁ ɑ̃	142
parent	p a ʁ	20
plus	p l y s	100
plus	p l y	27
portions	p ɔ ʁ s j ɔ̃	65
portions	p ɔ ʁ t j ɔ̃	22
pressent	p ʁ ɛ s ɑ̃	28
pressent	p ʁ ɛ s	14
reporter	ʁ ə p ɔ ʁ t ɛ ʁ	137
reporter	ʁ ə p ɔ ʁ t e	37
résident	ʁ e z i d ɑ̃	124
résident	ʁ e z i d	21
sens	s ɑ̃ s	91
sens	s ɑ̃	20
somnolent	s ɔ m n ɔ l ɑ̃	108
somnolent	s ɔ m n ɔ l	16
supporter	s y p ɔ ʁ t ɛ ʁ	114
supporter	s y p ɔ ʁ t e	36
urgent	y ʁ ʒ ɑ̃	127
urgent	y ʁ ʒ	18
violent	v j ɔ l ɑ̃	84
violent	v j ɔ l	21
vis	v i s	128
vis	v i	37
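
The counts in this table are uneven for several homographs, which may matter when splitting the data or training a classifier. The sketch below shows one way such counts could be recomputed from rows and how the share of the rarest pronunciation could be measured. The function names `pronunciation_counts` and `minority_share` are hypothetical, and the toy rows are for illustration only, not taken from the dataset.

```python
from collections import Counter


def pronunciation_counts(rows):
    """Count examples per (homograph, pronunciation) pair."""
    return Counter((row["homograph"], row["homograph_pronunciation"]) for row in rows)


def minority_share(counts, homograph):
    """Fraction of a homograph's examples that carry its rarest pronunciation."""
    per_pronunciation = [n for (word, _), n in counts.items() if word == homograph]
    if not per_pronunciation:
        raise KeyError(homograph)
    return min(per_pronunciation) / sum(per_pronunciation)


# Toy rows for illustration only; real rows would come from the CSV file.
toy_rows = [
    {"homograph": "fils", "homograph_pronunciation": "f i s"},
    {"homograph": "fils", "homograph_pronunciation": "f i s"},
    {"homograph": "fils", "homograph_pronunciation": "f i l"},
]
toy_counts = pronunciation_counts(toy_rows)

# Counts copied from the statistics table for "convient" (43 and 1 examples).
table_counts = Counter({("convient", "k ɔ̃ v j ɛ̃"): 43, ("convient", "k ɔ̃ v i"): 1})
print(round(minority_share(table_counts, "convient"), 3))  # 0.023
```

With a share this small, a random train/test split can easily leave the rare pronunciation out of one side, so a stratified split per homograph may be worth considering.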

Contributing

Any contribution to this repository is more than welcome.
If you have any feedback, please send it to [email protected].
