bltlab/random-bpe

Randomized BPE for machine translation

Install required packages

External dependencies

  1. Install xsv
  2. Install GNU parallel
  3. Optionally install fzf, jq and bat, too.
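Before going further, it can help to confirm that these tools are actually on your PATH. The following is a minimal sketch, not part of the repository: `check_tools` is a hypothetical helper, and it assumes GNU parallel is installed under its usual executable name `parallel`.

```shell
#!/bin/sh
# Report which command-line tools are missing from PATH.
# Prints one "missing: <tool>" line per absent tool and
# returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Required tools.
check_tools xsv parallel || echo "install the required tools before continuing"
# Optional tools: a missing one is only reported.
check_tools fzf jq bat || true
```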

Python dependencies

pip install -r requirements.txt

Download the data

For Uzbek, use scripts/download_til.py.

After extracting the data, create a detokenized version of each of the {train,dev,test}.{eng,fin,deu,uzb} files using sacremoses:

sacremoses detokenize < input_file > output_file
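To apply the command above to every split and language in one go, a loop along these lines may be useful. This is a sketch under assumptions: the input naming `<split>.<lang>`, the output naming `<split>.detok.<lang>`, the helper name `detok_all`, and the `DETOK_CMD` override are all illustrative and do not come from the repository.

```shell
# Detokenize every {train,dev,test}.{eng,fin,deu,uzb} file found in a directory.
# Assumptions (not from the repository): inputs are named <split>.<lang> and
# outputs are written next to them as <split>.detok.<lang>.
# DETOK_CMD is a hypothetical override hook, handy for dry runs.
detok_all() {
    data_dir="$1"
    for split in train dev test; do
        for lang in eng fin deu uzb; do
            in_file="$data_dir/$split.$lang"
            # Skip files that have not been downloaded.
            [ -f "$in_file" ] || continue
            ${DETOK_CMD:-sacremoses detokenize} < "$in_file" > "$data_dir/$split.detok.$lang"
        done
    done
}
```

Call it as `detok_all path/to/data`, adjusting the output naming to whatever your config files expect.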

Run experiments

The general workflow for running an experiment is the same regardless of language or segmentation method. Here is an example for English-Finnish translation using regular BPE with 32k merge operations.

  1. Create the experiments and eng_{fin,deu,est,uzb}_bin directories in the repository root.
mkdir experiments eng_{fin,deu,est,uzb}_bin
  2. Set the randseg_experiment_name environment variable.
export randseg_experiment_name=english2finnish_vanillabpe
  3. Set the variables for the experiment config file (randseg_cfg_file) and the hyperparameter folder (randseg_hparams_folder).
export randseg_cfg_file=$(realpath config/english2finnish_sweep_vanillabpe_cfg.sh)
export randseg_hparams_folder=$(realpath config/sweep_confitions_32k_1worker)

You can also customize your config file if you're running a custom experiment. See ./config/english2*_cfg.sh for inspiration.
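Since the job reads its settings from these environment variables, a quick check before submitting can catch a mistyped path. The sketch below is not part of the repository: `check_randseg_env` is a hypothetical helper, and it only assumes what the steps above state, namely that randseg_cfg_file points to a file and randseg_hparams_folder points to a directory.

```shell
# Verify the three randseg_* environment variables before submitting a job.
# Prints one line per problem and returns non-zero if anything is wrong.
check_randseg_env() {
    status=0
    if [ -z "${randseg_experiment_name:-}" ]; then
        echo "randseg_experiment_name is not set"
        status=1
    fi
    if [ ! -f "${randseg_cfg_file:-}" ]; then
        echo "randseg_cfg_file does not point to a file: ${randseg_cfg_file:-(unset)}"
        status=1
    fi
    if [ ! -d "${randseg_hparams_folder:-}" ]; then
        echo "randseg_hparams_folder does not point to a directory: ${randseg_hparams_folder:-(unset)}"
        status=1
    fi
    return "$status"
}
```

Running `check_randseg_env && sbatch -J your_job_name sweep_experiment.sh` submits the job only when all three checks pass.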

  4. Run the experiment using SLURM with 10 parallel jobs, one for each seed.
sbatch -J your_job_name sweep_experiment.sh
  5. Analyze the results.

After the experiments finish, the scores can be found in test.eval.score.{bleu,chrf} in each experiment folder.
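To look at all scores at once, something like the following sketch can walk the experiment folders and print each score file. It assumes the experiment folders live under the experiments directory created earlier; `collect_scores` is a hypothetical helper, and because the format of the score files is not described here, the files are printed verbatim rather than parsed.

```shell
# Print every test.eval.score.{bleu,chrf} file found under a directory,
# each preceded by a "== <path>" header line.
# Assumption: experiment folders live under ./experiments unless a
# different root directory is passed as the first argument.
collect_scores() {
    root="${1:-experiments}"
    find "$root" -type f \
        \( -name 'test.eval.score.bleu' -o -name 'test.eval.score.chrf' \) |
    sort |
    while IFS= read -r score_file; do
        printf '== %s\n' "$score_file"
        cat "$score_file"
    done
}
```

Call it as `collect_scores` from the repository root, or pass another directory as the first argument.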
