Code for creating clustering benchmarks for arbitrary languages using Wikipedia.
The project uses Poetry for dependency management - make sure you have it installed (see this guide for instructions). To install the dependencies, run:
poetry install
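The scripts described below can then be run inside the resulting environment with poetry run.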
Adding a new language has two steps: a) downloading the right files from the Wikipedia dump, and b) writing a configuration file called {prefix}-config.json and storing it in language_configs/. The structure of the config file can be found in src/config.py.
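As a minimal sketch of working with such a config (assuming the file is plain JSON and that "da" is an existing prefix; the actual required fields are defined in src/config.py), it can be loaded and inspected like this:

```python
import json
from pathlib import Path

# Hypothetical example: load the config for a language prefix, e.g. Danish ("da").
# The exact set of fields a config must contain is defined in src/config.py.
prefix = "da"
config_path = Path("language_configs") / f"{prefix}-config.json"
with config_path.open(encoding="utf-8") as f:
    config = json.load(f)

print(sorted(config))  # print the top-level keys of the config
```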
There are a bunch of scripts for running the different parts of the pipeline (a sketch of a manual run follows the list). The main ones are:
- parse_articles.py: Parses the articles to create a JSON file with the first paragraphs and the categories for the first 300,000 articles of the wiki dump.
- parse_sql_gz.py: Parses the SQL dump of the wiki to get the categories of the articles as well as their IDs. This includes the top-level categories.
- join_categories.py: Joins the categories from the SQL dump with the parsed articles. Specifically, this maps each article's categories to the top-level categories, as defined by the corresponding language's article for Main topic classifications.
- create_categories.py: Creates the actual dataset by sampling from the articles and the corresponding categories.
- upload_hf.py: Uploads the dataset to Hugging Face. NB: currently this can only be done by the author (me!).
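A manual end-to-end run for a single language calls these scripts in the order listed above, roughly like this (arguments are omitted and the paths may need adjusting to wherever the scripts live in the repo; check each script for the inputs it expects):
poetry run python parse_articles.py ...
poetry run python parse_sql_gz.py ...
poetry run python join_categories.py ...
poetry run python create_categories.py ...
poetry run python upload_hf.py ...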
For convenience, there are two helper scripts for running the pipeline: run_for_lang.sh and run_all.sh. The former runs the pipeline for a single language, while the latter runs the pipeline for all languages in the language_configs/ directory.
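For example (assuming run_for_lang.sh takes the language prefix as its argument - check the script before relying on this):
./run_for_lang.sh da
./run_all.sh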
TODO:
- Create a README-like file on HF a la this one
- Simple documentation on how the data was created
Status codes for the language list below:
- x: all done
- c: config file written
- d: downloaded
- r: run
- e: evaluated
- h: uploaded and updated on HF
- da
- lv
- gv
- sq
- [d] ku
- [d] sco
- [d] mt
- [d] bs
- [d] ca
- [d] eu
- [d] wa
- [d] cs
- [d] ilo
- [d] min