Unstable training with OpusTrainer #314

Closed · Tracked by #369 · Fixed by #352
eu9ene opened this issue Dec 18, 2023 · 6 comments
eu9ene (Collaborator) commented Dec 18, 2023

It seems one of the training stages in OpusTrainer (the one that includes back-translations) reduces the performance of the model. This is probably because we now start training on the original corpus and then switch to the mixed one. That may not be the best approach, because training will likely stop via early stopping, and we can't yet control the training parameters separately per stage. Related to #293

For now we can either disable back-translations completely until an experiment proves that they help, or change the stages to something like a small mix of back-translations first followed by fine-tuning on the original corpus, while also increasing early stopping from the default 20 to 30 or 40 (a sketch of that alternative follows the current config below).

Current config:

datasets:
  original: <dataset0> # Original parallel corpus
  backtranslated: <dataset1> # Back-translated data

stages:
  - start
  - mid
  - end
  - finetune

# One epoch of only original high-quality data to warm up the model
start:
  - original 1.0
  - until original 1

# Gradually add back-translations to the mix
# Back-translated corpus can vary a lot in size, so we can try using original to count epochs
mid:
  - original 0.7
  - backtranslated 0.3
  - until original 1

# Expand back-translations
end:
  - original 0.6
  - backtranslated 0.4
  - until original 1

# Fine-tuning only on original clean corpus until the early stopping
finetune:
  - original 1.0
  - until original inf
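
For comparison, here's a minimal sketch of the alternative staging proposed above (a small share of back-translations first, then fine-tuning on the original corpus only). The stage names and mixing ratios are illustrative assumptions, not a tested configuration; the early-stopping increase would be set on the Marian side (its --early-stopping option), not in this file:

datasets:
  original: <dataset0> # Original parallel corpus
  backtranslated: <dataset1> # Back-translated data

stages:
  - pretrain
  - finetune

# A small share of back-translations up front
pretrain:
  - original 0.8
  - backtranslated 0.2
  - until original 1

# Then only the original clean corpus until early stopping
finetune:
  - original 1.0
  - until original inf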

Graphs for en-hu:
[screenshot: training graphs for en-hu, 2023-12-18]

eu9ene added the labels bug (Something is broken or not correct) and quality (Improving robustness and translation quality) Dec 18, 2023
eu9ene self-assigned this Dec 22, 2023
eu9ene (Collaborator) commented Dec 22, 2023

I've rerun it with a simpler config and reduced back-translations, but one of the teachers still didn't train properly.

datasets:
  original: <dataset0> # Original parallel corpus
  backtranslated: <dataset1> # Back-translated data

stages:
  - pretrain
  - finetune


pretrain:
  - original 0.7
  - backtranslated 0.3
  - until original 2

# Fine-tuning only on original clean corpus until the early stopping
finetune:
  - original 1.0
  - until original inf
[screenshot: training graphs, 2023-12-22]

The back-translated corpus produced by the script looks normal (mono.lten.tsv, 200M sentences); the sentences are parallel.

It looks like there's a bug in the mixing in OpusTrainer. I think the negative slope and lack of training happen because it feeds noise to Marian. I'll try to produce a corpus with OpusTrainer for further inspection.

eu9ene (Collaborator) commented Dec 22, 2023

I redirected the output of OpusTrainer's mixing (original 0.7, backtranslated 0.3) to a file instead of Marian, and it looks quite normal.

https://firefox-ci-tc.services.mozilla.com/tasks/EPMUZ0nSTJe3awZ2II_zmg/runs/0

https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/EPMUZ0nSTJe3awZ2II_zmg/runs/0/artifacts/public%2Fbuild%2Fopus_trainer_corpus.tsv

eu9ene (Collaborator) commented Jan 2, 2024

After disabling back-translations completely and training the teachers only on the original corpus, we still see the same behavior. It might be a bug in OpusTrainer. The student trained properly this time (due to the fixed splitting), even with augmentations enabled.

https://firefox-ci-tc.services.mozilla.com/tasks/groups/K1iHndFUSxSEDRLg_H9l1A

[screenshot: training graphs, 2024-01-02]
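
For reference, the augmentations mentioned here are OpusTrainer modifiers. A hedged sketch with illustrative probabilities (the exact modifier set used for the student isn't shown in this issue):

modifiers:
  - UpperCase: 0.05
  - TitleCase: 0.05
  - Typos: 0.05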

eu9ene changed the title from "Backtranslations negatively impact teacher training" to "Unstable training with OpusTrainer" Jan 2, 2024
eu9ene (Collaborator) commented Jan 10, 2024

More pictures!

Old en-ru run (no OpusTrainer, pre-training on back-translations separately):
[screenshot: training graphs, 2024-01-10]

en-hu (no OpusTrainer, pre-training on back-translations separately):
[screenshot: training graphs, 2024-01-10]

en-ca (with OpusTrainer and back-translations mixed in a dedicated stage):
[screenshot: training graphs, 2024-01-10]

It seems we have always had a similar issue when pre-training on noisier data. It's just that we used to run pre-training for a fixed number of epochs and then fine-tune until early stopping, so it didn't affect the overall training run.

Two ways of fixing this:

  1. Adjust hyperparameters like the learning rate to prevent this behavior and, at the same time, increase early stopping to a larger value like 40 (see the sketch after this list)
  2. Implement per-stage arguments and pre-train for a fixed number of epochs like we did before (this requires fixes in OpusTrainer, see "Restart trainer between stages" hplt-project/OpusTrainer#45)
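
For option 1, a hedged sketch of the Marian-side settings involved (values are illustrative placeholders, not tuned; the keys mirror Marian's --learn-rate, --lr-warmup and --early-stopping options):

# Marian training config (excerpt)
learn-rate: 0.0003   # lower this if the loss spikes when noisier data enters the mix
lr-warmup: 16000     # warm-up updates before reaching the full learning rate
early-stopping: 40   # more patience, so a temporary dip doesn't end the whole run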

gregtatum (Member) commented:

After this investigation I plan on re-starting Train en-ca (#284) from scratch. I feel like this is the last blocker in my previous attempts.

gregtatum mentioned this issue Jan 11, 2024
eu9ene (Collaborator) commented Jan 12, 2024

The issue appears to be caused mostly by an incorrect optimizer-delay setting. With it fixed, training looks stable.

[screenshot: training graphs, 2024-01-12]
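
For context, Marian's optimizer-delay accumulates gradients over N batches before applying an optimizer update, effectively multiplying the batch size by N, so an incorrect value changes the effective batch size and the learning dynamics. A hedged excerpt with illustrative values (the values actually used in the pipeline aren't shown here):

# Marian training config (excerpt)
optimizer-delay: 2    # accumulate gradients over 2 batches per optimizer step
mini-batch-fit: true  # size mini-batches to fill the available workspace
workspace: 12000      # GPU memory (MiB) reserved for batches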
