Fix unstable training #352
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,28 +3,14 @@ datasets: | |
backtranslated: <dataset1> # Back-translated data | ||
|
||
stages: | ||
- start | ||
- mid | ||
- end | ||
- pretrain | ||
- finetune | ||
|
||
# One epoch of only original high-quality data to warm up the model | ||
start: | ||
- original 1.0 | ||
- until original 1 | ||
|
||
# Gradually add back-translations to the mix | ||
# Back-translated corpus can vary a lot in size, so we can try using original to count epochs | ||
mid: | ||
- original 0.7 | ||
- backtranslated 0.3 | ||
- until original 1 | ||
|
||
# Expand back-translations | ||
end: | ||
# Back-translated corpus can vary a lot in size, so we can try using original one to count epochs | ||
pretrain: | ||
- original 0.6 | ||
- backtranslated 0.4 | ||
- until original 1 | ||
- until original 2 | ||
|
||
# Fine-tuning only on original clean corpus until the early stopping | ||
finetune: | ||
|
@@ -33,8 +19,8 @@ finetune: | |
|
||
|
||
modifiers: | ||
- UpperCase: 0.1 # Apply randomly to 10% of sentences | ||
- TitleCase: 0.1 | ||
- UpperCase: 0.07 # Apply randomly to 7% of sentences | ||
- TitleCase: 0.05 | ||
Comment: Question: Why are you changing these here?
Reply: Based on the paper https://arxiv.org/pdf/2311.14838.pdf. They set them to 0.05, but I noticed that title case performs better than upper case, so I boosted the upper-case rate a bit. I also ran an experiment and got satisfactory results. I added a link to the docs.
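To make the two rates concrete, here is a minimal Python sketch of applying each modifier independently with a fixed per-sentence probability. It is an illustration only, not the training tool's implementation: `MODIFIERS`, `apply_modifiers`, and `modified_share` are hypothetical names, and letting both modifiers fire on the same sentence is an assumption.

```python
import random

# Rates taken from the new config; everything else here is hypothetical.
MODIFIERS = [
    (str.upper, 0.07),  # UpperCase
    (str.title, 0.05),  # TitleCase
]


def apply_modifiers(sentence, rng):
    """Apply each modifier independently with its configured probability."""
    for transform, probability in MODIFIERS:
        if rng.random() < probability:
            sentence = transform(sentence)
    return sentence


def modified_share(n_sentences, seed=0):
    """Estimate the fraction of sentences changed by at least one modifier."""
    rng = random.Random(seed)
    original = "hello world"
    changed = sum(
        apply_modifiers(original, rng) != original for _ in range(n_sentences)
    )
    return changed / n_sentences
```

Under these assumptions, the share of sentences touched by at least one modifier is about 1 - 0.93 * 0.95, or roughly 12%, compared with roughly 19% for the previous 0.1 / 0.1 setting.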
||
# TODO: enable typos, issue https://github.com/mozilla/firefox-translations-training/issues/262 | ||
#- Typos: 0.05 | ||
|
||
|
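The `pretrain` stage above mixes the two corpora 0.6 / 0.4 and ends with `until original 2`. The sketch below shows one way to read that schedule: batches are filled by weight, and the stage stops once the `original` corpus has been read twice. This is a simplified illustration, not the trainer's actual implementation; `run_stage` and its parameters are hypothetical names.

```python
import random


def run_stage(datasets, weights, until, batch_size, seed=0):
    """Yield batches mixed by weight until one dataset has been read N times.

    datasets: name -> list of sentences
    weights:  name -> share of each batch drawn from that dataset
    until:    (dataset name, number of epochs) that ends the stage
    """
    rng = random.Random(seed)
    until_name, until_epochs = until
    position = {name: 0 for name in datasets}  # read cursor per dataset
    epochs = {name: 0 for name in datasets}    # completed passes per dataset
    while epochs[until_name] < until_epochs:
        batch = []
        for name, weight in weights.items():
            data = datasets[name]
            for _ in range(round(batch_size * weight)):
                batch.append(data[position[name]])
                position[name] += 1
                if position[name] == len(data):  # wrapped around: one more epoch
                    position[name] = 0
                    epochs[name] += 1
        rng.shuffle(batch)
        yield batch
```

Counting epochs on `original` keeps the stage length independent of the back-translated corpus size, which is the reason given in the config comment.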
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,10 @@ | ||
# https://discourse.translatelocally.com/t/marian-configuration-to-use/24 | ||
disp-freq: 1000 | ||
# default learning rate for transformer-big is 0.0002 https://github.com/marian-nmt/marian-dev/blob/master/src/common/aliases.cpp | ||
learn-rate: 0.0003 # Turn this down if you get a diverged model, maybe 0.0001 | ||
optimizer-delay: 1 # Roughly GPU devices * optimizer-delay = 8, but keep as an integer | ||
optimizer-delay: 2 # Roughly GPU devices * optimizer-delay = 8, but keep as an integer | ||
Comment: Question: Can you explain this change? The docs say:
I don't have a mental model of what this is changing and why it affects things.
Reply: It looks like 2 and 4 are used in the student recipes. https://github.com/search?q=repo%3Abrowsermt%2Fstudents%20optimizer-delay&type=code
Reply: I guess this is matching the
Reply: I did not find any resources showing where this recommendation comes from, but it seems to increase the update batch size, which makes training more stable.
Me:
ChatGPT: This setting effectively increases the batch size without requiring more memory, which can lead to more stable and reliable gradient estimates. It is a way to utilize the parallelism offered by multiple GPUs while also ensuring that each update is significant enough to provide stable learning, without being so large that it might cause instability due to the accumulation of too much gradient information before an update is applied.
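As a rough mental model for `optimizer-delay`, the sketch below treats it as gradient accumulation: gradients from several mini-batches are combined before a single parameter update, so the update sees a larger effective batch. This is a toy illustration on a one-parameter model, not Marian's code. The function names and the `per_gpu_batch` parameter are hypothetical, and averaging (rather than summing) the accumulated gradients is an assumption.

```python
def effective_batch(gpus, optimizer_delay, per_gpu_batch):
    """Number of sentences contributing to one parameter update."""
    return gpus * optimizer_delay * per_gpu_batch


def grad(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, delay):
    """Average the gradients of `delay` equal-sized micro-batches."""
    size = len(batch) // delay
    micro = [batch[i * size:(i + 1) * size] for i in range(delay)]
    return sum(grad(w, m) for m in micro) / delay
```

With equal-sized micro-batches, `accumulated_grad(w, batch, 2)` matches `grad(w, batch)` up to floating-point error, which is why a delay of 2 behaves like doubling the batch size without needing more memory per device. By the rule of thumb in the config comment (GPU devices * optimizer-delay = 8), a delay of 2 corresponds to 4 devices.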
||
lr-report: True | ||
save-freq: 5000 | ||
valid-freq: 3000 | ||
valid-freq: 5000 | ||
valid-max-length: 300 | ||
valid-mini-batch: 8 | ||
early-stopping: 20 | ||
early-stopping: 20 | ||
Comment: I spent a bit of time researching these options to fully understand what you changed. Rather than everyone having to do that, I have a suggestion that should be quick: Suggestion (docs): It would be nice to include a short message explaining why you chose certain values here, and to document decisions whenever we change hyperparameters. This will make it easier to share our knowledge with each other and to remember things for our future selves.
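One interaction worth documenting is that `valid-freq` and `early-stopping` together determine how long training continues without improvement. The sketch below assumes that early stopping triggers after a given number of consecutive validations without a new best score and that `valid-freq` is measured in updates. The function names are hypothetical and the logic is a simplification, not Marian's implementation.

```python
def patience_in_updates(valid_freq, early_stopping):
    """Updates that may pass without improvement before training stops."""
    return valid_freq * early_stopping


def stopping_step(scores, valid_freq, early_stopping):
    """Return the update at which training stops, or None if it never does.

    scores: validation metric after each validation run (lower is better).
    """
    best = float("inf")
    stalled = 0  # consecutive validations without a new best score
    for run, score in enumerate(scores, start=1):
        if score < best:
            best, stalled = score, 0
        else:
            stalled += 1
        if stalled >= early_stopping:
            return run * valid_freq
    return None
```

Under these assumptions, raising `valid-freq` from 3000 to 5000 while keeping `early-stopping: 20` extends the patience window from 60,000 to 100,000 updates.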
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,38 +22,39 @@ task-defaults: | |
cwd: '{checkout}' | ||
|
||
tasks: | ||
snakemake-dry-run: | ||
# Ensure that the snakemake workflow is still executing correctly, even though | ||
# taskcluster is the preferred execution environment. | ||
worker-type: b-cpu | ||
worker: | ||
max-run-time: 3600 | ||
docker-image: {in-tree: test} | ||
run-on-tasks-for: ["github-push", "github-pull-request"] | ||
optimization: | ||
skip-unless-changed: | ||
- pipeline/** | ||
- envs/** | ||
- configs/** | ||
run: | ||
command: | ||
- bash | ||
- -c | ||
- >- | ||
echo "Setting environment variables" && | ||
export CONDA_PATH=/builds/worker/artifacts/mambaforge && | ||
export SNAKEMAKE_OUTPUT_CACHE=/builds/worker/artifacts/mambaforge && | ||
export REPORTS=/builds/worker/artifacts/reports && | ||
export MODELS=/builds/worker/artifacts/models && | ||
|
||
echo "Install necessary dependencies" && | ||
make conda && | ||
make snakemake && | ||
make git-modules && | ||
|
||
echo "Start the dry run" && | ||
make dry-run && | ||
make test-dry-run | ||
# See issue: https://github.com/mozilla/firefox-translations-training/issues/363 | ||
# snakemake-dry-run: | ||
# # Ensure that the snakemake workflow is still executing correctly, even though | ||
Comment: Question: Is this related to your changes?
Reply: I just want CI to pass here, so I disabled this since it's related only to snakemake.
||
# # taskcluster is the preferred execution environment. | ||
Comment: Thought: This kind of change would be better in a separate PR. I'd prefer to merge with a dirty CI and fast-follow with another PR. This makes it clearer when things go wrong. For now this change is fine.
||
# worker-type: b-cpu | ||
# worker: | ||
# max-run-time: 3600 | ||
# docker-image: {in-tree: test} | ||
# run-on-tasks-for: ["github-push", "github-pull-request"] | ||
# optimization: | ||
# skip-unless-changed: | ||
# - pipeline/** | ||
# - envs/** | ||
# - configs/** | ||
# run: | ||
# command: | ||
# - bash | ||
# - -c | ||
# - >- | ||
# echo "Setting environment variables" && | ||
# export CONDA_PATH=/builds/worker/artifacts/mambaforge && | ||
# export SNAKEMAKE_OUTPUT_CACHE=/builds/worker/artifacts/mambaforge && | ||
# export REPORTS=/builds/worker/artifacts/reports && | ||
# export MODELS=/builds/worker/artifacts/models && | ||
# | ||
# echo "Install necessary dependencies" && | ||
# make conda && | ||
# make snakemake && | ||
# make git-modules && | ||
# | ||
# echo "Start the dry run" && | ||
# make dry-run && | ||
# make test-dry-run | ||
|
||
black: | ||
# Run python's black formatter, which formats python files. | ||
|
Comment: Instability was not caused by the stages, so it's fine to use a simpler schedule.
Reply: Ah, I like this as being simpler.