Fix unstable training #352
Conversation
# Expand back-translations
end:
# Back-translated corpus can vary a lot in size, so we can try using original one to count epochs
pretrain:
Instability was not caused by stages, so it's fine to use a simpler schedule.
Ah, I like this as being simpler.
Looks good as long as there are a few more docs before merging. Thanks for looking into this.
  learn-rate: 0.0003 # Turn this down if you get a diverged model, maybe 0.0001
- optimizer-delay: 1 # Roughly GPU devices * optimizer-delay = 8, but keep as an integer
+ optimizer-delay: 2 # Roughly GPU devices * optimizer-delay = 8, but keep as an integer
Question: Can you explain this change?
The docs say:
--optimizer-delay FLOAT=1 SGD update delay (#batches between updates). 1 = no delay. Can be fractional, e.g. 0.1 to use only 10% of each batch
I don't have a mental model of what this is changing and why it affects things.
It looks like 2 and 4 are used in the student recipes.
https://github.com/search?q=repo%3Abrowsermt%2Fstudents%20optimizer-delay&type=code
I guess this is matching the `2` value from the `pretrain` step? If so, please document this both here and in the pretrain step.
I did not find any resources where this recommendation comes from, but it seems to increase the effective update batch size, which makes training more stable.
Me:
Why is it recommended for Marian to set optimizer-delay so that GPU devices * optimizer-delay = 8?
ChatGPT:
The recommendation to set optimizer-delay so that GPU devices * optimizer-delay = 8 in Marian NMT is a heuristic that aims to balance the computational efficiency and the stability of the optimization process. This heuristic is derived from empirical observations that suggest this setting helps to achieve a good trade-off between the speed of computation and the stability of convergence, particularly when using multiple GPUs.
This setting effectively increases the batch size without requiring more memory, which can lead to more stable and reliable gradient estimates. It is a way to utilize the parallelism offered by multiple GPUs while also ensuring that each update is significant enough to provide stable learning, without being so large that it might cause instability due to the accumulation of too much gradient information before an update is applied.
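In other words, optimizer-delay behaves like gradient accumulation: gradients from several mini-batches are summed before each optimizer update, so the effective update batch is roughly GPU devices * optimizer-delay mini-batches. A minimal sketch of what this looks like in Marian settings (illustrative values, not this repo's exact config):

```yaml
# Illustrative Marian training settings (assumed values for this sketch).
# With 4 GPUs and optimizer-delay 2, gradients from 4 * 2 = 8 mini-batches
# are accumulated before each optimizer update, satisfying the
# "GPU devices * optimizer-delay = 8" heuristic without extra memory.
devices: [0, 1, 2, 3]
optimizer-delay: 2
learn-rate: 0.0003
```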
make test-dry-run
# See issue: https://github.com/mozilla/firefox-translations-training/issues/363
# snakemake-dry-run:
# # Ensure that the snakemake workflow is still executing correctly, even though
Question: Is this related to your changes?
I just want CI to pass here, so I disabled this since it's related only to Snakemake.
# See issue: https://github.com/mozilla/firefox-translations-training/issues/363
# snakemake-dry-run:
# # Ensure that the snakemake workflow is still executing correctly, even though
# # taskcluster is the preferred execution environment.
Thought: This kind of change would be better in a separate PR. I'd prefer to merge with a dirty CI and fast-follow with another PR, which makes it clearer when things go wrong. For now this change is fine.
early-stopping: 20
I spent a bit of time researching these various options to fully understand what you changed and what was going on. Rather than doing that, I have a suggestion that should be quick:
Suggestion (docs): It would be nice to include a short message explaining why you chose certain values here, and to document decisions whenever we change hyperparameters. This will make it easier to share our knowledge with each other and remember things for our future selves.
-  - UpperCase: 0.1 # Apply randomly to 10% of sentences
-  - TitleCase: 0.1
+  - UpperCase: 0.07 # Apply randomly to 7% of sentences
+  - TitleCase: 0.05
Question: Why are you changing these here?
This is based on the paper https://arxiv.org/pdf/2311.14838.pdf. They set them to 0.05, but I noticed that title case performs better than upper case, so I boosted it a bit. I also ran an experiment and got satisfactory results, and I added a link to the docs.
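For reference, these probabilities sit in the OpusTrainer `modifiers` list; a minimal sketch with the new values (the surrounding config keys are omitted here):

```yaml
# OpusTrainer sentence-level augmentation: each number is the probability
# of applying that modifier to a given training sentence.
modifiers:
  - UpperCase: 0.07  # ~7% of sentences fully upper-cased
  - TitleCase: 0.05  # ~5% of sentences title-cased
```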
Training looks stable with a fixed optimizer-delay.
Reducing the learning rate didn't affect much.
Adding back-translations gives a significant boost (top pink curve), but I used a lower early-stopping value for this one to finish faster (20 vs 30 for the other experiments).
Fixes #314
[skip ci]