Fix unstable training #352
File 1: data-mixing stage configuration

@@ -3,28 +3,14 @@ datasets:
 backtranslated: <dataset1> # Back-translated data

 stages:
-- start
-- mid
-- end
+- pretrain
 - finetune

-# One epoch of only original high-quality data to warm up the model
-start:
-- original 1.0
-- until original 1
-
-# Gradually add back-translations to the mix
-# Back-translated corpus can vary a lot in size, so we can try using original to count epochs
-mid:
-- original 0.7
-- backtranslated 0.3
-- until original 1
-
-# Expand back-translations
-end:
+# Back-translated corpus can vary a lot in size, so we can try using original one to count epochs
+pretrain:
 - original 0.6
 - backtranslated 0.4
-- until original 1
+- until original 2

 # Fine-tuning only on original clean corpus until the early stopping
 finetune:
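For context on what the new pretrain stage expresses, here is a minimal sketch of a weighted data mixer with epoch-based stopping. It assumes the usual semantics of this kind of stage definition (draw lines according to the listed ratios and stop once the named corpus has been consumed the given number of times); the function and dataset names are hypothetical, not the project's actual scheduler.

```python
import random

def run_stage(datasets, weights, until_name, until_epochs):
    """Toy data mixer for a stage such as:
         pretrain:
         - original 0.6
         - backtranslated 0.4
         - until original 2
    Lines are drawn from each corpus in proportion to its weight; the stage
    ends once the corpus named in `until` has been consumed `until_epochs` times."""
    names = list(datasets)
    iters = {name: iter(datasets[name]) for name in names}
    epochs_seen = {name: 0 for name in names}

    while epochs_seen[until_name] < until_epochs:
        # Pick a corpus according to the mixing ratios (0.6 / 0.4 above).
        name = random.choices(names, weights=[weights[n] for n in names])[0]
        try:
            line = next(iters[name])
        except StopIteration:
            # Corpus exhausted: that counts as one epoch; restart it.
            epochs_seen[name] += 1
            if epochs_seen[until_name] >= until_epochs:
                break
            iters[name] = iter(datasets[name])
            line = next(iters[name])
        yield name, line

# Hypothetical usage with tiny in-memory corpora standing in for the datasets:
original = [f"original sentence {i}" for i in range(5)]
backtranslated = [f"back-translated sentence {i}" for i in range(50)]
mixed = list(run_stage(
    {"original": original, "backtranslated": backtranslated},
    weights={"original": 0.6, "backtranslated": 0.4},
    until_name="original",
    until_epochs=2,
))
```

Counting epochs on original rather than on the much larger back-translated corpus keeps the stage length predictable, which matches the comment kept in the config.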
@@ -33,8 +19,8 @@ finetune:

 modifiers:
-- UpperCase: 0.1 # Apply randomly to 10% of sentences
-- TitleCase: 0.1
+- UpperCase: 0.07 # Apply randomly to 7% of sentences
+- TitleCase: 0.05
 # TODO: enable typos, issue https://github.com/mozilla/firefox-translations-training/issues/262
 #- Typos: 0.05

Review thread on the modifier rates:

Comment: Question: Why are you changing these here?

Reply: Based on the paper https://arxiv.org/pdf/2311.14838.pdf. They set them to 0.05, but I noticed that title case performs better than upper case, so I boosted it a bit. I also ran an experiment and got satisfactory results. I added a link to the docs.
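As a rough illustration of what these modifier rates mean, the sketch below applies the two case modifiers to a fraction of training pairs. It is a simplified stand-in that assumes the modifiers are applied per sentence pair and are mutually exclusive; the real implementation may differ.

```python
import random

def apply_case_modifiers(src, trg, p_upper=0.07, p_title=0.05):
    """Upper-case roughly 7% of sentence pairs and title-case roughly 5% of
    them, so the model also learns to translate shouting/headline-style input."""
    r = random.random()
    if r < p_upper:
        return src.upper(), trg.upper()
    if r < p_upper + p_title:
        return src.title(), trg.title()
    return src, trg

print(apply_case_modifiers("the quick brown fox", "der schnelle braune fuchs"))
```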
File 2: Marian training configuration

@@ -1,9 +1,10 @@
 # https://discourse.translatelocally.com/t/marian-configuration-to-use/24
 disp-freq: 1000
+# default learning rate for transformer-big is 0.0002 https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp
 learn-rate: 0.0003 # Turn this down if you get a diverged model, maybe 0.0001
-optimizer-delay: 1 # Roughly GPU devices * optimizer-delay = 8, but keep as an integer
+optimizer-delay: 2 # Roughly GPU devices * optimizer-delay = 8, but keep as an integer
 lr-report: True
 save-freq: 5000
-valid-freq: 3000
+valid-freq: 5000
 valid-max-length: 300
 valid-mini-batch: 8
 early-stopping: 20

Review thread on optimizer-delay:

Comment: Question: Can you explain this change? The docs say:

I don't have a mental model of what this is changing and why it affects things.

Comment: It looks like 2 and 4 are used in the student recipes. https://github.com/search?q=repo%3Abrowsermt%2Fstudents%20optimizer-delay&type=code

Comment: I guess this is matching the

Reply: I did not find any resources where this recommendation comes from, but it seems it increases the effective update batch size, which makes training more stable. ChatGPT's explanation: this setting effectively increases the batch size without requiring more memory, which can lead to more stable and reliable gradient estimates. It is a way to utilize the parallelism offered by multiple GPUs while also ensuring that each update is significant enough to provide stable learning, without being so large that it might cause instability from accumulating too much gradient information before an update is applied.
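To make the effect of optimizer-delay concrete, here is a back-of-the-envelope sketch, assuming it works as gradient accumulation (gradients from several mini-batches are summed before one optimizer step is applied); the per-GPU batch size of 1000 sentences is a made-up number for illustration only.

```python
def effective_update_size(per_gpu_batch, gpus, optimizer_delay):
    """Approximate number of sentences contributing to a single parameter
    update when gradients are accumulated across GPUs and delayed steps."""
    return per_gpu_batch * gpus * optimizer_delay

# "GPU devices * optimizer-delay = 8": a 4-GPU run with optimizer-delay 2
# accumulates roughly as much per update as an 8-GPU run with delay 1.
print(effective_update_size(1000, gpus=4, optimizer_delay=2))  # 8000
print(effective_update_size(1000, gpus=8, optimizer_delay=1))  # 8000
```

Larger effective updates average out noisy gradients, which is the stability argument made in the thread above. Note also that with valid-freq 5000 and early-stopping 20, training only stops after roughly 100,000 updates without improvement, since early stopping counts consecutive validations with no gain on the validation metric.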
Review thread on the training settings:

Comment: I spent a bit of time researching these various options to see what you changed and to fully understand what was going on. Rather than doing that, I have a quick suggestion. Suggestion (docs): include a short message explaining why you chose certain values here. It would be nice to document decisions when we change hyperparameters; this will make it easier to share our knowledge with each other and remember things for our future selves.
Review thread on the stage simplification:

Comment: The instability was not caused by the stages, so it's fine to use a simpler schedule.

Comment: Ah, I like this as being simpler.