- Install xsv
- Install GNU parallel
- Optionally install fzf, jq, and bat, too.
pip install -r requirements.txt
For Uzbek, use scripts/download_til.py.
After extracting, create a detokenized version of all of {train,dev,test}.{eng,fin,deu,uzb} using sacremoses:
sacremoses detokenize < input_file > output_file
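To cover every split and language in one pass, the command above can be wrapped in a loop. The following is a minimal sketch, not part of the repository: the function name `detokenize_all`, the data directory argument, and the `.detok` output suffix are all assumptions made for illustration, so adjust them to whatever naming the rest of the pipeline expects.

```shell
# Sketch: detokenize every {train,dev,test}.{eng,fin,deu,uzb} file in a directory.
# ASSUMPTIONS: the function name, the directory argument, and the ".detok"
# output suffix are hypothetical and do not come from this repository.
detokenize_all() {
    data_dir="${1:-.}"
    for split in train dev test; do
        for lang in eng fin deu uzb; do
            src="${data_dir}/${split}.${lang}"
            # Skip split/language combinations that are not present.
            [ -f "$src" ] || continue
            sacremoses detokenize < "$src" > "${src}.detok"
        done
    done
}

# Example call (path is a placeholder):
# detokenize_all /path/to/extracted/data
```

Writing to a separate file leaves the original tokenized data untouched, so the step can be rerun safely.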
The general workflow for running an experiment is the same regardless of language or segmentation method. Here is an example for English-Finnish translation using regular BPE with 32k merge operations.
- Create the experiments and eng_{fin,deu,est,uzb}_bin directories in the root folder.
mkdir experiments eng_{fin,deu,est,uzb}_bin
- Set the randseg_experiment_name environment variable.
export randseg_experiment_name=english2finnish_vanillabpe
- Set variables for the experiment config file (randseg_cfg_file) and hyperparameter folder (randseg_hparams_folder).
export randseg_cfg_file=$(realpath config/english2finnish_sweep_vanillabpe_cfg.sh)
export randseg_hparams_folder=$(realpath config/sweep_confitions_32k_1worker)
You can also customize your config file if you're running a custom experiment. See ./config/english2*_cfg.sh for inspiration.
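Before submitting the job, it may help to confirm that the three variables set above are present and point at real paths. The sketch below is an optional pre-flight check and not something the repository provides. The function name `check_randseg_env` is hypothetical; only the variable names come from the steps above.

```shell
# Sketch: optional sanity check for the environment variables used by the sweep.
# ASSUMPTION: check_randseg_env is a hypothetical helper, not part of this repo.
check_randseg_env() {
    status=0
    if [ -z "${randseg_experiment_name:-}" ]; then
        echo "randseg_experiment_name is not set" >&2
        status=1
    fi
    if [ ! -f "${randseg_cfg_file:-}" ]; then
        echo "randseg_cfg_file is not a file: ${randseg_cfg_file:-<unset>}" >&2
        status=1
    fi
    if [ ! -d "${randseg_hparams_folder:-}" ]; then
        echo "randseg_hparams_folder is not a directory: ${randseg_hparams_folder:-<unset>}" >&2
        status=1
    fi
    # Returns 0 only when all three checks passed.
    return "$status"
}

# Example: check_randseg_env && echo "environment looks ready"
```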
- Run the experiment using SLURM with 10 parallel jobs, one for each seed
sbatch -J your_job_name sweep_experiment.sh
- Analyze results
After the experiments finish, the scores can be found in test.eval.score.{bleu,chrf} in each experiment folder.
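To view all scores at once, the score files can be gathered with `find`. This is a rough sketch under assumptions: the helper name `collect_scores` is hypothetical, the default `experiments` root is taken from the directory created earlier, and the exact nesting and contents of the score files are not specified here, so the function simply searches recursively and prints each file's raw contents next to its path.

```shell
# Sketch: list every BLEU/chrF score file under an experiments directory.
# ASSUMPTIONS: collect_scores is a hypothetical helper; the folder depth and
# the format of the score files are unknown, so contents are printed as-is.
collect_scores() {
    root="${1:-experiments}"
    find "$root" -type f \
        \( -name 'test.eval.score.bleu' -o -name 'test.eval.score.chrf' \) \
        | sort \
        | while IFS= read -r score_file; do
            # Print the path, a tab, then the file contents with newlines
            # replaced by spaces so each score file takes one output line.
            printf '%s\t%s\n' "$score_file" "$(tr '\n' ' ' < "$score_file")"
        done
}

# Example: collect_scores experiments
```

Since the output is tab-separated, it can be passed to other command-line tools for sorting or filtering.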