Fix unstable training #352

Merged: 14 commits, Jan 17, 2024
pipeline/train/configs/opustrainer/backward.yml (2 additions, 2 deletions)
@@ -9,8 +9,8 @@ train:
- until original 10 # Train for 10 epochs. Only OpusTrainer can control epochs, it's all one big epoch for Marian

modifiers:
- UpperCase: 0.1 # Apply randomly to 5% of sentences
- TitleCase: 0.1
- UpperCase: 0.07 # Apply randomly to 7% of sentences
- TitleCase: 0.05
#- Typos: 0.05

seed: 1111
pipeline/train/configs/opustrainer/student.yml (2 additions, 2 deletions)
@@ -11,8 +11,8 @@ train:
# TODO: augment corpus before decoding or reduce augmentation rate
# TODO: https://github.com/mozilla/firefox-translations-training/issues/272
#modifiers:
#- UpperCase: 0.1 # Apply randomly to 5% of sentences
#- TitleCase: 0.1
- UpperCase: 0.07 # Apply randomly to 7% of sentences
- TitleCase: 0.05
# TODO: enable typos, issue https://github.com/mozilla/firefox-translations-training/issues/262
#- Typos: 0.05
# TODO: enable tags, currently doesn't work because of the issue with tokenization
pipeline/train/configs/opustrainer/teacher.yml (6 additions, 20 deletions)
@@ -3,28 +3,14 @@ datasets:
backtranslated: <dataset1> # Back-translated data

stages:
- start
- mid
- end
- pretrain
- finetune

# One epoch of only original high-quality data to warm up the model
start:
- original 1.0
- until original 1

# Gradually add back-translations to the mix
# Back-translated corpus can vary a lot in size, so we can try using original to count epochs
mid:
- original 0.7
- backtranslated 0.3
- until original 1

# Expand back-translations
end:
# Back-translated corpus can vary a lot in size, so we can try using original one to count epochs
pretrain:
Collaborator Author: Instability was not caused by the stages, so it's fine to use a simpler schedule.

Member: Ah, I like this as being simpler.

- original 0.6
- backtranslated 0.4
- until original 1
- until original 2

# Fine-tuning only on original clean corpus until the early stopping
finetune:
@@ -33,8 +19,8 @@ finetune:


modifiers:
- UpperCase: 0.1 # Apply randomly to 10% of sentences
- TitleCase: 0.1
- UpperCase: 0.07 # Apply randomly to 7% of sentences
- TitleCase: 0.05
Member: Question: Why are you changing these here?

Collaborator Author: Based on the paper https://arxiv.org/pdf/2311.14838.pdf. They set them to 0.05, but I noticed that title case performs better than upper case, so I boosted it a bit. I also ran an experiment and got satisfactory results. I added a link to the docs.
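
To make the rates concrete, here is a minimal sketch of what such augmentation amounts to: each sentence pair is randomly uppercased with probability 0.07 or title-cased with probability 0.05, and left unchanged otherwise. This is only an illustration of the idea, not OpusTrainer's actual implementation (which also handles details such as alignments); the example sentences are made up.

```python
import random

UPPERCASE_RATE = 0.07  # matches the UpperCase modifier above
TITLECASE_RATE = 0.05  # matches the TitleCase modifier above

def augment_case(src, trg, rng):
    """Randomly change the casing of a sentence pair at the configured rates."""
    r = rng.random()
    if r < UPPERCASE_RATE:
        return src.upper(), trg.upper()
    if r < UPPERCASE_RATE + TITLECASE_RATE:
        return src.title(), trg.title()
    return src, trg  # the remaining ~88% of pairs are left unchanged

rng = random.Random(1111)  # same seed as in the configs
print(augment_case("this is an example sentence", "dies ist ein beispielsatz", rng))
```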

# TODO: enable typos, issue https://github.com/mozilla/firefox-translations-training/issues/262
#- Typos: 0.05

pipeline/train/configs/training/teacher.train.yml (5 additions, 4 deletions)
@@ -1,9 +1,10 @@
# https://discourse.translatelocally.com/t/marian-configuration-to-use/24
disp-freq: 1000
# default learning rate for transformer-big is 0.0002 https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp
learn-rate: 0.0003 # Turn this down if you get a diverged model, maybe 0.0001
optimizer-delay: 1 # Roughly GPU devices * optimizer-delay = 8, but keep as an integer
optimizer-delay: 2 # Roughly GPU devices * optimizer-delay = 8, but keep as an integer
Member:
Question: Can you explain this change?

The docs say:

--optimizer-delay FLOAT=1 SGD update delay (#batches between updates). 1 = no delay. Can be fractional, e.g. 0.1 to use only 10% of each batch

I don't have a mental model of what this is changing and why it affects things.

Member: I guess this is matching the 2 value from the pretrain step? If so, please document this both here and in the pretrain step.

Collaborator Author: I did not find any resources where this recommendation comes from, but it seems to increase the effective update batch size, which makes training more stable.

Me:
Why is it recommended for Marian to set optimizer-delay so that GPU devices * optimizer-delay = 8?

ChatGPT:
The recommendation to set optimizer-delay so that GPU devices * optimizer-delay = 8 in Marian NMT is a heuristic that balances computational efficiency against the stability of the optimization process. It comes from empirical observations that this setting gives a good trade-off between training speed and stable convergence, particularly when using multiple GPUs.

This setting effectively increases the batch size without requiring more memory, which can lead to more stable and reliable gradient estimates. It is a way to utilize the parallelism offered by multiple GPUs while also ensuring that each update is significant enough to provide stable learning, without being so large that it might cause instability due to the accumulation of too much gradient information before an update is applied.
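
In other words, optimizer-delay acts like gradient accumulation: gradients from several mini-batches are combined before a single parameter update, so the effective batch per update is roughly per-device batch * number of GPUs * optimizer-delay. With a delay of 2, the rule of thumb in the comment (devices * delay = 8) suggests a 4-GPU setup. Below is a minimal sketch of the accumulation idea with a toy one-parameter model; it is not Marian's implementation, and the data values are made up.

```python
# Sketch of gradient accumulation, which is what optimizer-delay does in spirit.
# Toy model y = w * x trained with squared error; data and values are made up.
OPTIMIZER_DELAY = 2     # accumulate gradients over 2 batches per update
LEARNING_RATE = 0.0003  # same value as learn-rate above

def gradient(w, batch):
    """d/dw of the mean squared error over one mini-batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)],
           [(0.5, 1.0), (1.5, 3.0)], [(2.5, 5.0), (3.5, 7.0)]]

w, accumulated, pending = 0.0, 0.0, 0
for batch in batches:
    accumulated += gradient(w, batch)  # sum gradients instead of stepping now
    pending += 1
    if pending == OPTIMIZER_DELAY:
        # One update per OPTIMIZER_DELAY batches: the update is based on a larger
        # effective batch, so the gradient estimate is less noisy.
        w -= LEARNING_RATE * accumulated / pending
        accumulated, pending = 0.0, 0

print(f"w after accumulated updates: {w:.6f}")
```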

lr-report: True
save-freq: 5000
valid-freq: 3000
valid-freq: 5000
valid-max-length: 300
valid-mini-batch: 8
early-stopping: 20
early-stopping: 20
Member: I spent a bit of time researching these options to understand exactly what you changed. Rather than doing that, I have a suggestion that should be quick:

Suggestion (docs): It would be nice to include a short note explaining why you chose certain values here. Documenting decisions when we change hyperparameters will make it easier to share our knowledge with each other and remember things for our future selves.
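
For reference, a quick way to read the validation settings in this file (assuming early-stopping counts consecutive validations without improvement, which is my reading of the Marian flag):

```python
# Back-of-the-envelope check of how valid-freq and early-stopping interact.
valid_freq = 5000     # validate every 5000 updates (raised from 3000)
early_stopping = 20   # stop after 20 validations with no improvement

patience_in_updates = valid_freq * early_stopping
print(f"training stops about {patience_in_updates} updates after the last improvement")
# -> training stops about 100000 updates after the last improvement
```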