OOM error fixing #58
Conversation
Testing:
I think this would be simpler if we added this as a feature of the HuggingFaceNmtEngine class. I think it is something that would be generally useful. We could pass in a parameter to the constructor of the class to turn it on. The parameter could be a number between 0 and 1 with a default value of 1. The value would be multiplied by the current batch size to get the new batch size. We could call it oom_batch_size_backoff_factor or something like that. No other code would need to be changed. If it is unclear what I am suggesting, I can implement it.
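A minimal sketch of what the suggested constructor parameter might look like. The class and method names here are illustrative stand-ins, not the actual HuggingFaceNmtEngine API:

```python
# Hypothetical sketch of the proposed oom_batch_size_backoff_factor parameter.
# NmtEngineSketch is a stand-in name, not the real engine class.
class NmtEngineSketch:
    def __init__(self, batch_size: int = 16, oom_batch_size_backoff_factor: float = 1.0) -> None:
        if not 0.0 < oom_batch_size_backoff_factor <= 1.0:
            raise ValueError("oom_batch_size_backoff_factor must be in (0, 1]")
        self._batch_size = batch_size
        self._backoff_factor = oom_batch_size_backoff_factor

    def reduce_batch_size(self) -> int:
        # Multiply the current batch size by the factor, never dropping below 1.
        # With the default factor of 1, the batch size stays unchanged.
        self._batch_size = max(int(round(self._batch_size * self._backoff_factor)), 1)
        return self._batch_size
```

With a factor of 0.5, a batch size of 16 would step down to 8, then 4, and so on, each time an out-of-memory error is caught.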
Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)
If you would like to redo the implementation, go ahead. What I have now works and you should be able to pull a lot from it.
Please let me take a last review of the changes.
Reviewable status: 0 of 7 files reviewed, all discussions resolved (waiting on @ddaspit)
Reviewed 9 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @johnml1135)
machine/jobs/settings.yaml
line 38 at r2 (raw file):
parent_model_name: facebook/nllb-200-distilled-600M
train_params:
  group_by_length: false
Why did you disable grouping by length?
machine/translation/nmt_translation_engine.py
line 9 at r2 (raw file):
class NmtTranslationEngine(TranslationEngine, ContextManager["NmtTranslationEngine"]):
This interface isn't needed. You can remove it.
machine/translation/huggingface/hugging_face_nmt_engine.py
line 68 at r2 (raw file):
batch_size = self._pipeline_kwargs.pop("batch_size")
if batch_size is not None:
    self._batch_size = int(batch_size)  # type: ignore[assignment]
What is causing the type error? You shouldn't need to ignore the error. Use cast if necessary.
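A sketch of the cast-based alternative the reviewer suggests. The helper function and its parameters are hypothetical; only the pop/int pattern comes from the diff above:

```python
from typing import Optional, cast

# Hypothetical helper illustrating typing.cast instead of "type: ignore".
# In the real engine this logic lives in the constructor and reads from
# self._pipeline_kwargs; here it takes the kwargs dict as a parameter.
def resolve_batch_size(pipeline_kwargs: dict) -> int:
    batch_size = cast(Optional[int], pipeline_kwargs.pop("batch_size", None))
    if batch_size is not None:
        return int(batch_size)
    return 1  # default of 1, as suggested in the review below
```

The cast documents the expected type for the type checker without suppressing all checking on the assignment.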
machine/translation/huggingface/hugging_face_nmt_engine.py
line 70 at r2 (raw file):
    self._batch_size = int(batch_size)  # type: ignore[assignment]
else:
    self._batch_size = 16
The default batch size should be 1.
machine/translation/huggingface/hugging_face_nmt_engine.py
line 108 at r2 (raw file):
        all_results.extend(self._try_translate_n_batch(n, segments[step : step + self._batch_size]))
    return all_results
except Exception as e:
This will catch any exception and treat it like an OOM error. We need to be more specific here.
machine/translation/huggingface/hugging_face_nmt_engine.py
line 109 at r2 (raw file):
    return all_results
except Exception as e:
    if self._oom_batch_size_backoff_multiplier >= 0.9999:
We should also check if the current batch size is 1.
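Both review points can be sketched together: catch only the OOM error class and re-raise when no further reduction is possible. Since torch may not be available here, OutOfMemoryError below is a stand-in class for torch.cuda.OutOfMemoryError, and translate_with_backoff is a hypothetical standalone version of the engine's retry loop:

```python
# Stand-in for torch.cuda.OutOfMemoryError (which subclasses RuntimeError).
class OutOfMemoryError(RuntimeError):
    pass

def translate_with_backoff(translate_batch, segments, batch_size, backoff):
    """Translate in batches, shrinking the batch size on OOM errors only."""
    while True:
        try:
            results = []
            for step in range(0, len(segments), batch_size):
                results.extend(translate_batch(segments[step : step + batch_size]))
            return results
        except OutOfMemoryError:
            # Re-raise if backing off is effectively disabled, or if the
            # batch size is already 1 and cannot be reduced further.
            if backoff >= 0.9999 or batch_size == 1:
                raise
            batch_size = max(int(round(batch_size * backoff)), 1)
```

Other exception types propagate normally instead of being mistaken for OOM conditions.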
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @johnml1135)
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 137 at r2 (raw file):
set_seed(self._training_args.seed)
logger.info("Initializing tokenizer.")
This should be moved to the line that actually initializes the tokenizer.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 197 at r2 (raw file):
return AutoTokenizer.from_pretrained(str(tokenizer_dir), use_fast=True)
logger.info("Checking for missing tokens.")
Move this into the if check.
machine/translation/huggingface/hugging_face_nmt_model_trainer.py
line 238 at r2 (raw file):
tokenizer.id_to_lang_token[lang_id] = lang_code
logger.info("Add new language codes as tokens.")
Move this into the if check.
Previously, ddaspit (Damien Daspit) wrote…
I'll undo it - I was trying to reduce the startup time, but it didn't help.
Previously, ddaspit (Damien Daspit) wrote…
Forgot to remove them; all done.
Previously, ddaspit (Damien Daspit) wrote…
Resolved.
Previously, ddaspit (Damien Daspit) wrote…
Ok. The settings.yaml still has 16, which should be an appropriate default for what we need.
Previously, ddaspit (Damien Daspit) wrote…
This is a known issue: pytorch/pytorch#109961. It will be addressed here: #67. Made updates for now in case there is a different type of error.
Previously, ddaspit (Damien Daspit) wrote…
Done. |
Previously, ddaspit (Damien Daspit) wrote…
Done |
Previously, ddaspit (Damien Daspit) wrote…
Done |
Previously, ddaspit (Damien Daspit) wrote…
Done |
Respond to reviewer comments.
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## main #58 +/- ##
==========================================
- Coverage 86.61% 86.59% -0.03%
==========================================
Files 223 223
Lines 13366 13395 +29
==========================================
+ Hits 11577 11599 +22
- Misses 1789 1796 +7
View full report in Codecov by Sentry.
Reviewed 7 of 8 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)
machine/translation/huggingface/hugging_face_nmt_engine.py
line 108 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
This is a known issue: pytorch/pytorch#109961. It will be addressed here: #67. Made updates for now incase there is a different type of error.
I took a look at the issues in the PyTorch repo. There is a class (torch.cuda.OutOfMemoryError) that gets thrown for OOM errors that we can catch. It just doesn't inherit from Exception, so Pylance displays an error stating: "OutOfMemoryError" is not a valid exception class. We should be able to use torch.cuda.OutOfMemoryError and safely ignore the Pylance error.
Also, I don't think we should raise a new error. We should just rethrow the existing OutOfMemoryError.
Previously, ddaspit (Damien Daspit) wrote…
Ok. Will fix and merge.
Previously, johnml1135 (John Lambert) wrote…
Actually, it won't work. We can't rethrow an OutOfMemoryError because it is not actually an exception. I think we will likely have to leave it as is.
Previously, johnml1135 (John Lambert) wrote…
I figured it out - fixing it.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)
machine/translation/huggingface/hugging_face_nmt_engine.py
line 108 at r2 (raw file):
Previously, johnml1135 (John Lambert) wrote…
I figured it out - fixing it.
This worked on my machine:

```python
try:
    for step in range(0, outer_batch_size, self._batch_size):
        all_results.extend(self._try_translate_n_batch(n, segments[step : step + self._batch_size]))
    return all_results
except torch.cuda.OutOfMemoryError:  # type: ignore[reportGeneralTypeIssues]
    if self._oom_batch_size_backoff_multiplier >= 0.9999 or self._batch_size == 1:
        raise
    self._batch_size = max(int(round(self._batch_size * self._oom_batch_size_backoff_multiplier)), 1)
    logger.warning(f"Out of memory error caught, reducing batch size to {self._batch_size} and retrying.")
    self._pipeline = _TranslationPipeline(
        model=self._model,
        tokenizer=self._tokenizer,
        batch_size=self._batch_size,
        **self._pipeline_kwargs,
    )
```
Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)
Previously, ddaspit (Damien Daspit) wrote…
I checked it again - switching to ```
Change to real error - and suppress warnings
change name to oom_batch_size_backoff_mult
…into batch_size_1
#52
Fix out of memory errors by gradually backing off of engine batch size.
This change is