Fix unstable training #352

Merged: eu9ene merged 14 commits into main on Jan 17, 2024

Conversation

@eu9ene (Collaborator) commented on Jan 10, 2024:

  • Fix hyperparameters
  • Adjust OpusTrainer settings

Training looks stable with the fixed optimizer-delay.
Reducing the learning rate didn't have much effect.
Adding back-translations gives a significant boost (top pink curve), but I used a lower early-stopping value for that run so it would finish faster (20 vs. 30 for the other experiments).

[Screenshot: training curves for the experiments, 2024-01-12]

Fixes #314

[skip ci]

# Expand back-translations
end:
# Back-translated corpus can vary a lot in size, so we can try using original one to count epochs
pretrain:
@eu9ene (Collaborator, Author):

The instability was not caused by the stages, so it's fine to use a simpler schedule.

Member:

Ah, I like this as being simpler.

eu9ene marked this pull request as ready for review and requested reviews from gregtatum, jcristau, and a team (as code owners) on January 12, 2024.

@gregtatum (Member) left a comment:

Looks good, as long as a few more docs are added before merging. Thanks for looking into this.

learn-rate: 0.0003 # Turn this down if you get a diverged model, maybe 0.0001
-optimizer-delay: 1 # Roughly GPU devices * optimizer-delay = 8, but keep as an integer
+optimizer-delay: 2 # Roughly GPU devices * optimizer-delay = 8, but keep as an integer
Member:

Question: Can you explain this change?

The docs say:

--optimizer-delay FLOAT=1 SGD update delay (#batches between updates). 1 = no delay. Can be fractional, e.g. 0.1 to use only 10% of each batch

I don't have a mental model of what this is changing and why it affects things.

Member:

I guess this is matching the 2 value from the pretrain step? If so please document this both here and in the pretrain step.

@eu9ene (Collaborator, Author):

I did not find any resources explaining where this recommendation comes from, but it seems to increase the effective batch size per update, which makes training more stable.

Me:
Why is it recommended for marian to set optimizer-delay so that GPU devices * optimizer-delay = 8?

ChatGPT:
The recommendation to set optimizer-delay so that GPU devices * optimizer-delay = 8 in Marian NMT is a heuristic that aims to balance the computational efficiency and the stability of the optimization process. This heuristic is derived from empirical observations that suggest this setting helps to achieve a good trade-off between the speed of computation and the stability of convergence, particularly when using multiple GPUs.

This setting effectively increases the batch size without requiring more memory, which can lead to more stable and reliable gradient estimates. It is a way to utilize the parallelism offered by multiple GPUs while also ensuring that each update is significant enough to provide stable learning, without being so large that it might cause instability due to the accumulation of too much gradient information before an update is applied.
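
In effect, optimizer-delay works like gradient accumulation: gradients from several consecutive batches are summed and applied in a single optimizer step, so the effective batch per update is roughly per-device batch size * number of GPUs * optimizer-delay (hence aiming for GPUs * delay = 8). A minimal single-device PyTorch sketch of the idea, with made-up model and batch sizes; this is an illustration, not Marian's actual implementation:

```python
import torch
from torch import nn

model = nn.Linear(16, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

optimizer_delay = 2                 # with 4 GPUs this mirrors 4 * 2 = 8 batches per update
batches = torch.randn(8, 256, 16)   # 8 dummy batches of 256 "sentences" each

for step, batch in enumerate(batches):
    loss = model(batch).pow(2).mean()
    (loss / optimizer_delay).backward()    # accumulate scaled gradients
    if (step + 1) % optimizer_delay == 0:
        optimizer.step()                   # one update per optimizer_delay batches
        optimizer.zero_grad()
```

Larger effective batches give smoother gradient estimates without using extra GPU memory, which is the stabilizing effect described above.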

make test-dry-run
# See issue: https://github.com/mozilla/firefox-translations-training/issues/363
# snakemake-dry-run:
# # Ensure that the snakemake workflow is still executing correctly, even though
Member:

Question: Is this related to your changes?

@eu9ene (Collaborator, Author):

I just want CI to pass here, so I disabled this since it's related only to snakemake.

# See issue: https://github.com/mozilla/firefox-translations-training/issues/363
# snakemake-dry-run:
# # Ensure that the snakemake workflow is still executing correctly, even though
# # taskcluster is the preferred execution environment.
Member:

Thought: This kind of change would be better in a separate PR. I'd prefer to merge with a dirty CI and fast-follow with another PR. This makes it clearer when things go wrong. For now this change is fine.

early-stopping: 20
Member:

I spent a bit of time researching these various options to see what you changed and to fully understand what was going on. Rather than doing that, I have a suggestion that should be quick:

Suggestion (docs): It would be nice to include a short comment explaining why you chose certain values here. Documenting decisions when we change hyperparameters makes it easier to share our knowledge with each other and remember things for our future selves.
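
For context, early-stopping here is a patience value: training stops once the validation metric has not improved for that many consecutive validation runs. A rough, simplified sketch of the rule (hypothetical code, assuming a higher-is-better metric such as chrF; Marian implements this internally):

```python
def should_stop(val_scores: list[float], patience: int = 20) -> bool:
    """Stop when the best score is `patience` or more validations old."""
    best_index = val_scores.index(max(val_scores))
    return len(val_scores) - 1 - best_index >= patience

history = [20.1, 22.4, 23.0] + [22.9] * 20   # no improvement for 20 validations
print(should_stop(history, patience=20))     # True
```

Lowering the value from 30 to 20 simply ends the run earlier, which is why the back-translation experiment finished faster.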

- UpperCase: 0.1 # Apply randomly to 10% of sentences
- TitleCase: 0.1
- UpperCase: 0.07 # Apply randomly to 7% of sentences
- TitleCase: 0.05
Member:

Question: Why are you changing these here?

@eu9ene (Collaborator, Author):

Based on the paper https://arxiv.org/pdf/2311.14838.pdf. They set them to 0.05, but I noticed that title case performs better than upper case, so I boosted it a bit. I also ran an experiment and got satisfactory results. I added a link to the docs.
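
For readers unfamiliar with these modifiers: they randomly change the casing of a fraction of training sentence pairs so the model learns to handle all-caps and title-case input. A simplified, hypothetical illustration of the effect (the real implementation is in OpusTrainer and may differ in details; the function name and example sentences are made up, and the rates are the ones set in this PR):

```python
import random

def apply_casing_noise(src: str, trg: str,
                       p_upper: float = 0.07, p_title: float = 0.05):
    """Randomly upper-case or title-case a sentence pair."""
    r = random.random()
    if r < p_upper:
        return src.upper(), trg.upper()
    if r < p_upper + p_title:
        return src.title(), trg.title()
    return src, trg

print(apply_casing_noise("the quick brown fox", "der schnelle braune fuchs"))
```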

@eu9ene merged commit 61f5ec2 into main on Jan 17, 2024 (5 checks passed).

Successfully merging this pull request may close these issues: Unstable training with OpusTrainer.