Fix unstable training #352
Conversation
# Expand back-translations
end:
# Back-translated corpus can vary a lot in size, so we can try using original one to count epochs
pretrain:
Instability was not caused by stages, so it's fine to use a simpler schedule.
Ah, I like this as being simpler.
Looks good as long as there are a few more docs before merging. Thanks for looking into this.
  learn-rate: 0.0003 # Turn this down if you get a diverged model, maybe 0.0001
- optimizer-delay: 1 # Roughly GPU devices * optimizer-delay = 8, but keep as an integer
+ optimizer-delay: 2 # Roughly GPU devices * optimizer-delay = 8, but keep as an integer
Question: Can you explain this change?
The docs say:
--optimizer-delay FLOAT=1 SGD update delay (#batches between updates). 1 = no delay. Can be fractional, e.g. 0.1 to use only 10% of each batch
I don't have a mental model of what this is changing and why it affects things.
It looks like 2 and 4 are used in the student recipes.
https://github.com/search?q=repo%3Abrowsermt%2Fstudents%20optimizer-delay&type=code
I guess this is matching the `2` value from the `pretrain` step? If so, please document this both here and in the pretrain step.
I did not find any resources where this recommendation comes from, but it seems to increase the effective update batch size, which makes training more stable.
Me:
Why is it recommended for Marian to set optimizer-delay so that GPU devices * optimizer-delay = 8?
ChatGPT:
The recommendation to set optimizer-delay so that GPU devices * optimizer-delay = 8 in Marian NMT is a heuristic that aims to balance the computational efficiency and the stability of the optimization process. This heuristic is derived from empirical observations that suggest this setting helps to achieve a good trade-off between the speed of computation and the stability of convergence, particularly when using multiple GPUs.
This setting effectively increases the batch size without requiring more memory, which can lead to more stable and reliable gradient estimates. It is a way to utilize the parallelism offered by multiple GPUs while also ensuring that each update is significant enough to provide stable learning, without being so large that it might cause instability due to the accumulation of too much gradient information before an update is applied.
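In other words, optimizer-delay behaves like gradient accumulation: gradients from several mini-batches are summed before each optimizer update, so the effective update batch is roughly GPU devices * optimizer-delay mini-batches. A minimal sketch of what this looks like in Marian settings (illustrative values, not this repo's exact config):

```yaml
# Illustrative Marian training settings (assumed values for this sketch).
# With 4 GPUs and optimizer-delay 2, gradients from 4 * 2 = 8 mini-batches
# are accumulated before each optimizer update, satisfying the
# "GPU devices * optimizer-delay = 8" heuristic without extra memory.
devices: [0, 1, 2, 3]
optimizer-delay: 2
learn-rate: 0.0003
```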
make test-dry-run
# See issue: https://github.com/mozilla/firefox-translations-training/issues/363
# snakemake-dry-run:
# # Ensure that the snakemake workflow is still executing correctly, even though
Question: Is this related to your changes?
I just want CI to pass here, so I disabled this since it's related only to Snakemake.
# See issue: https://github.com/mozilla/firefox-translations-training/issues/363
# snakemake-dry-run:
# # Ensure that the snakemake workflow is still executing correctly, even though
# # taskcluster is the preferred execution environment.
Thought: This kind of change would be better in a separate PR. I'd prefer to merge with a dirty CI and fast-follow with another PR, which makes it clearer when things go wrong. For now this change is fine.
early-stopping: 20
I spent a bit of time researching these various options to fully understand what you changed and what was going on. Rather than doing that, I have a suggestion that should be quick:
Suggestion (docs): It would be nice to include a short message explaining why you chose certain values here, and to document decisions whenever we change hyperparameters. This will make it easier to share our knowledge with each other and remember things for our future selves.
-  - UpperCase: 0.1 # Apply randomly to 10% of sentences
-  - TitleCase: 0.1
+  - UpperCase: 0.07 # Apply randomly to 7% of sentences
+  - TitleCase: 0.05
Question: Why are you changing these here?
This is based on the paper https://arxiv.org/pdf/2311.14838.pdf. They set them to 0.05, but I noticed that title case performs better than upper case, so I boosted it a bit. I also ran an experiment and got satisfactory results, and I added a link to the docs.
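For reference, these probabilities sit in the OpusTrainer `modifiers` list; a minimal sketch with the new values (the surrounding config keys are omitted here):

```yaml
# OpusTrainer sentence-level augmentation: each number is the probability
# of applying that modifier to a given training sentence.
modifiers:
  - UpperCase: 0.07  # ~7% of sentences fully upper-cased
  - TitleCase: 0.05  # ~5% of sentences title-cased
```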
Training looks stable with a fixed optimizer-delay.
Reducing the learning rate didn't affect much.
Adding back-translations gives a significant boost (top pink curve), but I used a lower early-stopping value for this one to finish faster (20 vs 30 for the other experiments).
Fixes #314
[skip ci]