Unstable training with OpusTrainer #314

Closed · Tracked by #369 · Fixed by #352
eu9ene opened this issue Dec 18, 2023 · 6 comments
eu9ene (Collaborator) commented Dec 18, 2023

It seems one of the training stages in OpusTrainer (the one that includes back-translations) reduces the performance of the model. This is probably because we now start training on the original corpus and then switch to the mixed one. That may not be the best approach, because training will likely stop via early stopping, and we can't yet control the training parameters separately per stage. Related to #293

For now we can either disable back-translations completely until an experiment proves that they help, or change the stages to something like a small mix of back-translations first followed by fine-tuning on the original corpus, while also increasing early stopping from the default 20 to 30 or 40 (a sketch of that alternative follows the current config below).

Current config:

datasets:
  original: <dataset0> # Original parallel corpus
  backtranslated: <dataset1> # Back-translated data

stages:
  - start
  - mid
  - end
  - finetune

# One epoch of only original high-quality data to warm up the model
start:
  - original 1.0
  - until original 1

# Gradually add back-translations to the mix
# Back-translated corpus can vary a lot in size, so we can try using original to count epochs
mid:
  - original 0.7
  - backtranslated 0.3
  - until original 1

# Expand back-translations
end:
  - original 0.6
  - backtranslated 0.4
  - until original 1

# Fine-tuning only on original clean corpus until the early stopping
finetune:
  - original 1.0
  - until original inf
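
For comparison, here's a minimal sketch of the alternative staging proposed above (a small share of back-translations first, then fine-tuning on the original corpus only). The stage names and mixing ratios are illustrative assumptions, not a tested configuration; the early-stopping increase would be set on the Marian side (its --early-stopping option), not in this file:

datasets:
  original: <dataset0> # Original parallel corpus
  backtranslated: <dataset1> # Back-translated data

stages:
  - pretrain
  - finetune

# A small share of back-translations up front
pretrain:
  - original 0.8
  - backtranslated 0.2
  - until original 1

# Then only the original clean corpus until early stopping
finetune:
  - original 1.0
  - until original inf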

Graphs for en-hu:
[screenshot: training graphs for en-hu, 2023-12-18]

eu9ene added the labels bug (Something is broken or not correct) and quality (Improving robustness and translation quality) Dec 18, 2023
eu9ene self-assigned this Dec 22, 2023
eu9ene (Collaborator) commented Dec 22, 2023

I've rerun it with a simpler config and reduced back-translations, but one of the teachers still didn't train properly.

datasets:
  original: <dataset0> # Original parallel corpus
  backtranslated: <dataset1> # Back-translated data

stages:
  - pretrain
  - finetune


pretrain:
  - original 0.7
  - backtranslated 0.3
  - until original 2

# Fine-tuning only on original clean corpus until the early stopping
finetune:
  - original 1.0
  - until original inf
[screenshot: training graphs, 2023-12-22]

The back-translated corpus produced by the script looks normal (mono.lten.tsv, 200M sentences); the sentences are parallel.

It looks like there's a bug in the mixing in OpusTrainer. I think the negative slope and lack of training happen because it feeds noise to Marian. I'll try to produce a corpus with OpusTrainer for further inspection.

eu9ene (Collaborator) commented Dec 22, 2023

I redirected the output of OpusTrainer's mixing (original 0.7, backtranslated 0.3) to a file instead of Marian, and it looks quite normal.

https://firefox-ci-tc.services.mozilla.com/tasks/EPMUZ0nSTJe3awZ2II_zmg/runs/0

https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/EPMUZ0nSTJe3awZ2II_zmg/runs/0/artifacts/public%2Fbuild%2Fopus_trainer_corpus.tsv

eu9ene (Collaborator) commented Jan 2, 2024

After disabling back-translations completely and training the teachers only on the original corpus, we still see the same behavior. It might be a bug in OpusTrainer. The student trained properly this time (due to the fixed splitting), even with augmentations enabled.

https://firefox-ci-tc.services.mozilla.com/tasks/groups/K1iHndFUSxSEDRLg_H9l1A

[screenshot: training graphs, 2024-01-02]
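
For reference, the augmentations mentioned here are OpusTrainer modifiers. A hedged sketch with illustrative probabilities (the exact modifier set used for the student isn't shown in this issue):

modifiers:
  - UpperCase: 0.05
  - TitleCase: 0.05
  - Typos: 0.05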

eu9ene changed the title from "Backtranslations negatively impact teacher training" to "Unstable training with OpusTrainer" Jan 2, 2024
eu9ene (Collaborator) commented Jan 10, 2024

More pictures!

Old en-ru run (no OpusTrainer, pre-training on back-translations separately):
[screenshot: training graphs, 2024-01-10]

en-hu (no OpusTrainer, pre-training on back-translations separately):
[screenshot: training graphs, 2024-01-10]

en-ca (with OpusTrainer and back-translations mixed in a dedicated stage):
[screenshot: training graphs, 2024-01-10]

It seems we have always had a similar issue when pre-training on noisier data. It's just that we used to run pre-training for a fixed number of epochs and then fine-tune until early stopping, so it didn't affect the overall training run.

Two ways of fixing this:

  1. Adjust hyperparameters like the learning rate to prevent this behavior and, at the same time, increase early stopping to a larger value like 40 (see the sketch after this list)
  2. Implement per-stage arguments and pre-train for a fixed number of epochs like we did before (this requires fixes in OpusTrainer, see "Restart trainer between stages" hplt-project/OpusTrainer#45)
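
For option 1, a hedged sketch of the Marian-side settings involved (values are illustrative placeholders, not tuned; the keys mirror Marian's --learn-rate, --lr-warmup and --early-stopping options):

# Marian training config (excerpt)
learn-rate: 0.0003   # lower this if the loss spikes when noisier data enters the mix
lr-warmup: 16000     # warm-up updates before reaching the full learning rate
early-stopping: 40   # more patience, so a temporary dip doesn't end the whole run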

gregtatum (Member) commented:

After this investigation I plan on re-starting Train en-ca (#284) from scratch. I feel like this is the last blocker in my previous attempts.

gregtatum mentioned this issue Jan 11, 2024
eu9ene (Collaborator) commented Jan 12, 2024

The issue appears to be caused mostly by an incorrect optimizer-delay setting. With it fixed, training looks stable.

[screenshot: training graphs, 2024-01-12]
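
For context, Marian's optimizer-delay accumulates gradients over N batches before applying an optimizer update, effectively multiplying the batch size by N, so an incorrect value changes the effective batch size and the learning dynamics. A hedged excerpt with illustrative values (the values actually used in the pipeline aren't shown here):

# Marian training config (excerpt)
optimizer-delay: 2    # accumulate gradients over 2 batches per optimizer step
mini-batch-fit: true  # size mini-batches to fill the available workspace
workspace: 12000      # GPU memory (MiB) reserved for batches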
