This repository contains data and code for the 2020 African Masters in Machine Intelligence (AMMI) Speech Recognition Course. It contains 132 minutes of diacritized Yoruba read speech.
A corpus of diacritized Yoruba text containing 1001 sentences was collected from two open-source text corpora available at [https://github.com/Niger-Volta-LTI/yoruba-text]. The first source is a collection of Yoruba proverbs and the second is JW300, a parallel corpus covering over 300 languages. The collected text was then processed into a format suitable for LIG-Aikuma [https://lig-aikuma.imag.fr/], an Android application developed specifically for collecting speech data. The corpus was read aloud using the “elicitation from text” feature of LIG-Aikuma, producing 132 minutes (2 hours and 12 minutes) of speech. The resulting recordings form the dataset for this project.
The data consist of one audio file (in .wav format) for each sentence in the corpus. The recordings were made in 20 different sessions, and each session has a file linking each audio file to the corresponding line of text. The text was preprocessed to remove digits and punctuation marks and then tokenized. A separate file was generated that maps each audio filename to the text that produced it. The preprocessed dataset was split by session into training, test, and validation sets of 8, 8, and 4 sessions respectively, equivalent to 50, 55, and 27 minutes of speech.
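The text cleaning and the session-level split described above can be sketched as follows. This is only an illustration, not the repository's actual script: the function names `preprocess` and `split_sessions`, the whitespace tokenization, the Unicode-category test for punctuation and digits, and the rule that assigns sessions to splits in sorted order are all assumptions.

```python
import unicodedata
from typing import Dict, List


def preprocess(line: str) -> List[str]:
    """Drop digits and punctuation, then tokenize on whitespace.

    Hypothetical helper. Combining marks (tone marks, underdots) belong to
    Unicode category M, so they are kept and the diacritics survive.
    """
    text = unicodedata.normalize("NFC", line)
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # P* = punctuation, N* = numeric characters; everything else is kept.
        if category.startswith("P") or category.startswith("N"):
            continue
        kept.append(ch)
    return "".join(kept).split()


def split_sessions(session_ids: List[str],
                   n_train: int = 8,
                   n_test: int = 8,
                   n_valid: int = 4) -> Dict[str, List[str]]:
    """Assign whole sessions to train/test/valid (assumed: sorted order)."""
    if len(session_ids) != n_train + n_test + n_valid:
        raise ValueError("number of sessions does not match the split sizes")
    ordered = sorted(session_ids)
    return {
        "train": ordered[:n_train],
        "test": ordered[n_train:n_train + n_test],
        "valid": ordered[n_train + n_test:],
    }


if __name__ == "__main__":
    print(preprocess("Ọmọ 3 dé, ó sì lọ!"))
    sessions = ["session_%02d" % i for i in range(1, 21)]  # made-up ids
    print({name: len(ids) for name, ids in split_sessions(sessions).items()})
```

Because punctuation is deleted rather than replaced by a space, words joined only by a punctuation mark would merge; whether that matches the original preprocessing is not stated here.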
The model training code relies heavily on the implementation of Contrastive Predictive Coding (CPC) available at https://github.com/facebookresearch/CPC_audio. Because the dataset is very small and unaligned, CPC was used to learn speech representations from the data in a self-supervised manner. Two models were trained, one in each of two settings. In the first setting, a CPC model pre-trained on an English corpus was fine-tuned, and the resulting model was used to train a character classifier. In the second setting, a CPC model was trained from randomly initialized weights for 110 epochs, and the resulting model was then used to train another character classifier (checkpoint is available at data/model). In both cases, the model was fine-tuned with its character classifier for 30 epochs, and the two were then evaluated by computing the character error rate (CER) on the held-out test set.
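For reference, CER is commonly computed as the total character-level edit distance between the reference and predicted transcripts, divided by the total number of reference characters. The sketch below shows that definition only; it is not the evaluation code from CPC_audio, and the function names `edit_distance` and `cer` are hypothetical. It counts Unicode code points after NFC normalization, which is an assumption: a letter carrying a combining tone mark may count as more than one character.

```python
import unicodedata
from typing import List


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level character error rate over paired transcripts."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    refs = [unicodedata.normalize("NFC", r) for r in references]
    hyps = [unicodedata.normalize("NFC", h) for h in hypotheses]
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars


if __name__ == "__main__":
    # Toy strings, not results from this project.
    print(cer(["abc"], ["abd"]))
```

Summing edits and reference lengths over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the score.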
The major challenge arose during data collection, when I had to read diacritized Yoruba text aloud. Although I am a fluent native speaker, reading diacritized Yoruba text is still challenging.