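This repository contains fully diacritized Yorùbá text converted to Unicode Normalization Form C (NFC, canonical composition), in which each base letter and its diacritics are composed into a single code point wherever a precomposed form exists. The conversion uses the following code:

    import unicodedata

    def convert_to_NFC(filename, outfilename):
        # Read as UTF-8 and compose base letters with their combining marks
        with open(filename, encoding="utf-8") as f:
            text = unicodedata.normalize("NFC", f.read())
        with open(outfilename, "w", encoding="utf-8") as f:
            f.write(text)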
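
To illustrate what NFC composition does to diacritized text, here is a minimal sketch using only Python's standard `unicodedata` module. The sample strings and the `is_nfc` helper are illustrative assumptions and are not part of this repository.

```python
import unicodedata

# Decomposed spelling: base letters followed by combining marks
# (U+0300 combining grave, U+0301 combining acute).
decomposed = "Yoru\u0300ba\u0301"
composed = unicodedata.normalize("NFC", decomposed)


def is_nfc(text):
    """Hypothetical helper: True if text is already in NFC."""
    return unicodedata.normalize("NFC", text) == text


print(len(decomposed), len(composed))        # 8 6
print(is_nfc(decomposed), is_nfc(composed))  # False True

# Not every Yorùbá letter has a fully precomposed form. For "ọ́",
# NFC yields U+1ECD (o with dot below) plus a combining acute accent.
o_dot_acute = unicodedata.normalize("NFC", "o\u0323\u0301")
print([hex(ord(c)) for c in o_dot_acute])    # ['0x1ecd', '0x301']
```

A check like `is_nfc` can be handy before submitting corrections, since visually identical strings in different normalization forms compare as unequal.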

The text is drawn from the following sources:

- Lagos-NWU conversational corpus
- Bíbélì Mímọ́ ní Èdè Yorùbá Òde-Òní
- The Yorùbá blog
- Asubiaro, T., Adegbola, T. et al. (2018). A Word-Level Language Identification Strategy for Resource-Scarce Languages
- Òwe Yorùbá
- Ìwé Ti Mọ́mọ́nì
- Kùránì (Qur'an) Mímọ́
- BBC Yorùbá
- Yorùbá for Academic Purpose
- Yobá mọ oduá
- Àwa Ẹlẹ́rìí Jèhófà
- Orí Kìíní
- Iwé ti Nicé
- Alákọ̀wé
- Èdè Yorùbá Rẹwà
- Ìmọ̀_Ẹ̀rọ
- ọ̀rọ̀yorùbá
- Wikipedia
- Poetry of Ọláńrewájú Adépọ̀jù
- https://twitter.com/yobamoodua
- https://twitter.com/yoruba_proverbs
- https://www.facebook.com/oweyoruba

The text was gathered with permission from online sources and lightly preprocessed for use in NLP, TTS, and ASR applications. Some sentences may contain errors; please submit a pull request if you have corrections.

To cite this repository in your work, please use:

    @misc{Orife_yoruba-text_2018,
      author = {Orife, Iroro and Fasubaa, Timilehin and Wahab, Olamilekan},
      month = {1},
      title = {{yoruba-text}},
      url = {https://github.com/Niger-Volta-LTI/yoruba-text},
      year = {2018}
    }