GitHub: https://github.com/dsfsi/vukuzenzele-nlp/
Give Feedback 📑: DSFSI Resource Feedback Form
The dataset contains editions of the South African government magazine Vuk'uzenzele, created by the Government Communication and Information System (GCIS). The data was scraped from PDFs, which are stored in the data/raw folder. The PDFs were obtained from the Vuk'uzenzele website.
The datasets contain government magazine editions in 11 languages, namely:
Language | Code | Language | Code |
---|---|---|---|
English | (eng) | Sepedi | (sep) |
Afrikaans | (afr) | Setswana | (tsn) |
isiNdebele | (nbl) | Siswati | (ssw) |
isiXhosa | (xho) | Tshivenda | (ven) |
isiZulu | (zul) | Xitsonga | (tso) |
Sesotho | (nso) | | |
src_lang | trg_lang | num_aligned_pairs |
---|---|---|
ssw | xho | 2202 |
ssw | zul | 2183 |
xho | zul | 2102 |
nso | xho | 2081 |
nso | tso | 2071 |
ssw | tso | 2034 |
nso | ssw | 2021 |
tsn | tso | 2020 |
tsn | xho | 2009 |
tso | xho | 2009 |
nso | tsn | 2002 |
ssw | tsn | 1987 |
tso | zul | 1957 |
nso | zul | 1953 |
tsn | zul | 1933 |
eng | zul | 1923 |
eng | tso | 1923 |
eng | nso | 1867 |
eng | ssw | 1821 |
afr | xho | 1816 |
eng | xho | 1801 |
nbl | sep | 1795 |
sep | ven | 1794 |
afr | ssw | 1783 |
eng | tsn | 1772 |
afr | zul | 1769 |
afr | nso | 1746 |
nbl | ven | 1699 |
afr | eng | 1661 |
afr | tsn | 1631 |
afr | tso | 1617 |
afr | sep | 551 |
afr | ven | 498 |
afr | nbl | 491 |
nso | sep | 410 |
nso | ven | 352 |
sep | tso | 326 |
sep | tsn | 319 |
tso | ven | 307 |
sep | ssw | 305 |
sep | xho | 300 |
ssw | ven | 290 |
tsn | ven | 285 |
nbl | ssw | 282 |
nbl | nso | 266 |
ven | xho | 260 |
eng | sep | 258 |
nbl | xho | 250 |
sep | zul | 249 |
nbl | tso | 238 |
eng | ven | 234 |
nbl | tsn | 230 |
nbl | zul | 226 |
ven | zul | 225 |
eng | nbl | 184 |
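With 11 languages there are 11 × 10 / 2 = 55 unordered language pairs, which matches the number of rows in the table above. The sketch below checks that arithmetic and shows a lookup that works regardless of argument order. Only three rows are copied from the table for brevity. The helper name `aligned_pairs` and the dictionary layout are illustrative assumptions, not part of the repository.

```python
from itertools import combinations
from typing import Optional

# Language codes exactly as they appear in this README.
LANG_CODES = ["afr", "eng", "nbl", "nso", "sep", "ssw", "tsn", "tso", "ven", "xho", "zul"]

# Every unordered pair of distinct languages, codes in alphabetical order.
ALL_PAIRS = list(combinations(sorted(LANG_CODES), 2))

# A few rows copied from the table above (not the full table).
PAIR_COUNTS = {
    ("ssw", "xho"): 2202,
    ("afr", "eng"): 1661,
    ("eng", "nbl"): 184,
}

def aligned_pairs(lang_a: str, lang_b: str) -> Optional[int]:
    """Return num_aligned_pairs for two languages, in either order.

    Hypothetical helper: returns None for pairs not copied into PAIR_COUNTS.
    """
    key = tuple(sorted((lang_a, lang_b)))
    return PAIR_COUNTS.get(key)

print(len(ALL_PAIRS))               # 55
print(aligned_pairs("xho", "ssw"))  # 2202
```

Sorting the two codes before the lookup mirrors the table, where each pair is listed once with `src_lang` alphabetically before `trg_lang`.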
The dataset is available in several forms in the repo.
The dataset is generally split by edition, e.g. 2020-01-ed1.
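Edition identifiers such as 2020-01-ed1 appear to follow a year-month-edition pattern. A minimal sketch for parsing and sorting them, assuming every identifier has exactly that shape (the function name `parse_edition` is hypothetical, and the identifiers other than 2020-01-ed1 are illustrative):

```python
import re
from typing import Tuple

# Assumed shape: four-digit year, two-digit month, "ed" plus an edition number.
EDITION_RE = re.compile(r"(\d{4})-(\d{2})-ed(\d+)")

def parse_edition(name: str) -> Tuple[int, int, int]:
    """Split an edition identifier into (year, month, edition number)."""
    match = EDITION_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not an edition identifier: {name!r}")
    year, month, edition = match.groups()
    return int(year), int(month), int(edition)

# Illustrative identifiers; only the first form is quoted in this README.
editions = ["2020-02-ed1", "2020-01-ed2", "2020-01-ed1"]
print(sorted(editions, key=parse_edition))
# ['2020-01-ed1', '2020-01-ed2', '2020-02-ed1']
```

Sorting on the parsed tuple rather than the raw string keeps editions in order even if an edition number ever reaches two digits.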
The data directory is broken down as follows:
./data
├── external # Data external to this repo
├── interim # Likely intermediate data produced on the way to processed (purpose not confirmed)
├── processed # The data from scraping the raw PDFs
├── raw # The raw PDFs of the Vuk'uzenzele magazine
├── sentence_align_output # The output (CSV) of the sentence alignment with LASER language encoders
└── simple_align_output # The output (CSV) of a simple one-to-one sentence alignment
The dataset is split by edition in the data/processed folder.
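The alignment outputs are described above as CSV files. The sketch below walks a folder and reads every CSV it finds with Python's standard `csv` module. The layout inside `data/sentence_align_output` and the column names are not documented here, so the code assumes only that each file has a header row. `load_alignment_rows` and the added `source_file` key are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, Iterator

def load_alignment_rows(folder: str) -> Iterator[Dict[str, str]]:
    """Yield each CSV row under `folder` as a dict keyed by the header row."""
    for csv_path in sorted(Path(folder).rglob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Record which file the row came from (hypothetical extra key).
                row["source_file"] = csv_path.name
                yield row

if __name__ == "__main__":
    # Path taken from the directory tree above; adjust to your checkout.
    rows = list(load_alignment_rows("data/sentence_align_output"))
    print(f"loaded {len(rows)} aligned rows")
    if rows:
        print("columns:", list(rows[0].keys()))
```

Printing the column names first is a cheap way to discover the real schema before writing any code that depends on it. The same function can be pointed at `data/simple_align_output`.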
This dataset contains machine-readable data extracted from PDF documents, from https://www.vukuzenzele.gov.za/, provided by the Government Communication and Information System (GCIS). While efforts were made to ensure the accuracy and completeness of this data, there may be errors or discrepancies between the original publications and this dataset. No warranties, guarantees or representations are given in relation to the information contained in the dataset. The members of the Data Science for Societal Impact Research Group bear no responsibility and/or liability for any such errors or discrepancies in this dataset. The Government Communication and Information System (GCIS) bears no responsibility and/or liability for any such errors or discrepancies in this dataset. It is recommended that users verify all information contained herein before making decisions based upon this information.
- Vukosi Marivate - @vukosi
- Andani Madodonga
- Daniel Njini
- Richard Lastrucci
- Isheanesu Dzingirai
- Jenalea Rajab
Paper
Preparing the Vuk'uzenzele and ZA-gov-multilingual South African multilingual corpora
@inproceedings{lastrucci-etal-2023-preparing,
  title = "Preparing the Vuk{'}uzenzele and {ZA}-gov-multilingual {S}outh {A}frican multilingual corpora",
  author = "Richard Lastrucci and Isheanesu Dzingirai and Jenalea Rajab and Andani Madodonga and Matimba Shingange and Daniel Njini and Vukosi Marivate",
  booktitle = "Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)",
  month = may,
  year = "2023",
  address = "Dubrovnik, Croatia",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.rail-1.3",
  pages = "18--25"
}
Dataset
Vukosi Marivate, Andani Madodonga, Daniel Njini, Richard Lastrucci, Isheanesu Dzingirai, Jenalea Rajab. The Vuk'uzenzele South African Multilingual Corpus, 2023
@dataset{marivate_vukosi_2023_7598540,
  author = {Marivate, Vukosi and Njini, Daniel and Madodonga, Andani and Lastrucci, Richard and Dzingirai, Isheanesu and Rajab, Jenalea},
  title = {The Vuk'uzenzele South African Multilingual Corpus},
  month = feb,
  year = 2023,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.7598539},
  url = {https://doi.org/10.5281/zenodo.7598539}
}
- License for Data - CC BY 4.0
- License for Code - MIT License