OOM error fixing #58

Merged: johnml1135 merged 9 commits into main from batch_size_1 on Nov 22, 2023

Conversation

@johnml1135 (Collaborator) commented Nov 9, 2023:

#52
Fix out-of-memory errors by gradually backing off the engine batch size.

@johnml1135 johnml1135 requested a review from ddaspit November 9, 2023 20:30
@johnml1135 (Collaborator, Author) commented:

Testing:


[WARNING|text2text_generation.py:311] 2023-11-09 15:34:53,499 >> Your input_length: 937 is bigger than 0.9 * max_length: 200. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
2023-11-09 15:34:59
2023-11-09 15:34:57,105 - machine.jobs.nmt_engine_build_job - INFO - Out of memory error, reducing batch size to 8
[INFO|tokenization_utils_base.py:2041] 2023-11-09 15:34:57,107 >> loading file sentencepiece.bpe.model
[INFO|tokenization_utils_base.py:2041] 2023-11-09 15:34:57,107 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2041] 2023-11-09 15:34:57,107 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2041] 2023-11-09 15:34:57,107 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2041] 2023-11-09 15:34:57,107 >> loading file tokenizer_config.json
[WARNING|logging.py:290] 2023-11-09 15:34:57,553 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|text2text_generation.py:311] 2023-11-09 15:34:57,566 >> Your input_length: 529 is bigger than 0.9 * max_length: 200. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
2023-11-09 15:35:05
[WARNING|text2text_generation.py:311] 2023-11-09 15:35:00,315 >> Your input_length: 937 is bigger than 0.9 * max_length: 200. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
2023-11-09 15:35:03,667 - machine.jobs.build_nmt_engine - INFO - Finished
2023-11-09 15:35:16
Process completed successfully

@ddaspit (Contributor) left a comment:

I think this would be simpler if we added this as a feature of the HuggingFaceNmtEngine class. I think it is something that would be generally useful. We could pass in a parameter to the constructor of the class to turn it on. The parameter could be a number between 0 and 1 with a default value of 1. The value would be multiplied by the current batch size to get the new batch size. We could call it oom_batch_size_backoff_factor or something like that. No other code would need to be changed. If it is unclear what I am suggesting, I can implement it.
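
A minimal sketch of the suggestion, assuming hypothetical names (the real HuggingFaceNmtEngine constructor and internals differ):

class HuggingFaceNmtEngine:
    def __init__(self, batch_size: int, oom_batch_size_backoff_factor: float = 1.0) -> None:
        # A factor of 1 (the default) disables backoff; a value between 0 and 1
        # shrinks the batch size each time an out-of-memory error is caught.
        self._batch_size = batch_size
        self._oom_batch_size_backoff_factor = oom_batch_size_backoff_factor

    def _back_off_batch_size(self) -> None:
        # Multiply the current batch size by the factor, never dropping below 1.
        self._batch_size = max(int(round(self._batch_size * self._oom_batch_size_backoff_factor)), 1)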

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@johnml1135 (Collaborator, Author) commented:

If you would like to redo the implementation, go ahead. What I have now works and you should be able to pull a lot from it.

@johnml1135 (Collaborator, Author) left a comment:

Please let me do one last review of the changes.

Reviewable status: 0 of 7 files reviewed, all discussions resolved (waiting on @ddaspit)

@ddaspit (Contributor) left a comment:

Reviewed 9 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @johnml1135)


machine/jobs/settings.yaml line 38 at r2 (raw file):

    parent_model_name: facebook/nllb-200-distilled-600M
    train_params:
      group_by_length: false

Why did you disable grouping by length?


machine/translation/nmt_translation_engine.py line 9 at r2 (raw file):

class NmtTranslationEngine(TranslationEngine, ContextManager["NmtTranslationEngine"]):

This interface isn't needed. You can remove it.


machine/translation/huggingface/hugging_face_nmt_engine.py line 68 at r2 (raw file):

        batch_size = self._pipeline_kwargs.pop("batch_size")
        if batch_size is not None:
            self._batch_size = int(batch_size)  # type: ignore[assignment]

What is causing the type error? You shouldn't need to ignore the error. Use cast if necessary.
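
For reference, a minimal sketch of the cast approach (the kwargs dict and default value here are illustrative assumptions, not the actual engine code):

from typing import Any, Dict, Optional, cast

pipeline_kwargs: Dict[str, Any] = {"batch_size": 16}
# Cast the loosely typed kwarg instead of suppressing the checker with type: ignore.
batch_size = cast(Optional[int], pipeline_kwargs.pop("batch_size", None))
effective_batch_size = int(batch_size) if batch_size is not None else 1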


machine/translation/huggingface/hugging_face_nmt_engine.py line 70 at r2 (raw file):

            self._batch_size = int(batch_size)  # type: ignore[assignment]
        else:
            self._batch_size = 16

The default batch size should be 1.


machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

                    all_results.extend(self._try_translate_n_batch(n, segments[step : step + self._batch_size]))
                return all_results
            except Exception as e:

This will catch any exception and treat it like an OOM error. We need to be more specific here.


machine/translation/huggingface/hugging_face_nmt_engine.py line 109 at r2 (raw file):

                return all_results
            except Exception as e:
                if self._oom_batch_size_backoff_multiplier >= 0.9999:

We should also check if the current batch size is 1.

@ddaspit (Contributor) left a comment:

Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @johnml1135)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 137 at r2 (raw file):

        set_seed(self._training_args.seed)

        logger.info("Initializing tokenizer.")

This should be moved to the line that actually initializes the tokenizer.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 197 at r2 (raw file):

            return AutoTokenizer.from_pretrained(str(tokenizer_dir), use_fast=True)

        logger.info("Checking for missing tokens.")

Move this into the if check.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 238 at r2 (raw file):

                tokenizer.id_to_lang_token[lang_id] = lang_code

        logger.info("Add new language codes as tokens.")

Move this into the if check.

@johnml1135 (Collaborator, Author) commented:

machine/jobs/settings.yaml line 38 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Why did you disable grouping by length?

I'll undo it - I was trying to reduce the startup time, but it didn't help.

@johnml1135 (Collaborator, Author) commented:

machine/translation/nmt_translation_engine.py line 9 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This interface isn't needed. You can remove it.

Forgot to remove them all. Done.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 68 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

What is causing the type error? You shouldn't need to ignore the error. Use cast if necessary.

Resolved.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 70 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The default batch size should be 1.

OK. The settings.yaml still has 16, which should be an appropriate default for what we need.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This will catch any exception and treat it like an OOM error. We need to be more specific here.

This is a known issue: pytorch/pytorch#109961. It will be addressed in #67. Made updates for now in case there is a different type of error.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 109 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We should also check if the current batch size is 1.

Done.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 137 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be moved to the line that actually initializes the tokenizer.

Done

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 197 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Move this into the if check.

Done

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 238 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Move this into the if check.

Done

Respond to reviewer comments.
@codecov-commenter commented Nov 21, 2023:

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (2174c7d) 86.61% compared to head (c675bbc) 86.59%.
Report is 6 commits behind head on main.

File                                                     Patch %   Lines
...translation/huggingface/hugging_face_nmt_engine.py   70.96%    9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #58      +/-   ##
==========================================
- Coverage   86.61%   86.59%   -0.03%     
==========================================
  Files         223      223              
  Lines       13366    13395      +29     
==========================================
+ Hits        11577    11599      +22     
- Misses       1789     1796       +7     


@ddaspit (Contributor) left a comment:

Reviewed 7 of 8 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)


machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This is a known issue: pytorch/pytorch#109961. It will be addressed in #67. Made updates for now in case there is a different type of error.

I took a look at the issues in the Pytorch repo. There is a class (torch.cuda.OutOfMemoryError) that gets thrown for OOM errors that we can catch. It just doesn't inherit from Exception, so Pylance displays an error stating: "OutOfMemoryError" is not a valid exception class. We should be able to use torch.cuda.OutOfMemoryError and safely ignore the Pylance error.

Also, I don't think we should raise a new error. We should just rethrow the existing OutOfMemoryError.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I took a look at the issues in the Pytorch repo. There is a class (torch.cuda.OutOfMemoryError) that gets thrown for OOM errors that we can catch. It just doesn't inherit from Exception, so Pylance displays an error stating: "OutOfMemoryError" is not a valid exception class. We should be able to use torch.cuda.OutOfMemoryError and safely ignore the Pylance error.

Also, I don't think we should raise a new error. We should just rethrow the existing OutOfMemoryError.

Ok. Will fix and merge.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ok. Will fix and merge.

Actually, it won't work. We can't rethrow an OutOfMemoryError because it is not actually an exception. I think we will likely have to leave it as is.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Actually, it won't work. We can't rethrow an OutOfMemoryError because it is not actually an exception. I think we will likely have to leave it as is.

I figured it out - fixing it.

@ddaspit (Contributor) left a comment:

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)


machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I figured it out - fixing it.

This worked on my machine:

try:
    for step in range(0, outer_batch_size, self._batch_size):
        all_results.extend(self._try_translate_n_batch(n, segments[step : step + self._batch_size]))
    return all_results
except torch.cuda.OutOfMemoryError:  # type: ignore[reportGeneralTypeIssues]
    if self._oom_batch_size_backoff_multiplier >= 0.9999 or self._batch_size == 1:
        raise
    self._batch_size = max(int(round(self._batch_size * self._oom_batch_size_backoff_multiplier)), 1)
    logger.warn(f"Out of memory error caught, reducing batch size to {self._batch_size} and retrying.")
    self._pipeline = _TranslationPipeline(
        model=self._model,
        tokenizer=self._tokenizer,
        batch_size=self._batch_size,
        **self._pipeline_kwargs,
    )

@ddaspit (Contributor) left a comment:

:lgtm:

Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This worked on my machine:

try:
    for step in range(0, outer_batch_size, self._batch_size):
        all_results.extend(self._try_translate_n_batch(n, segments[step : step + self._batch_size]))
    return all_results
except torch.cuda.OutOfMemoryError:  # type: ignore[reportGeneralTypeIssues]
    if self._oom_batch_size_backoff_multiplier >= 0.9999 or self._batch_size == 1:
        raise
    self._batch_size = max(int(round(self._batch_size * self._oom_batch_size_backoff_multiplier)), 1)
    logger.warn(f"Out of memory error caught, reducing batch size to {self._batch_size} and retrying.")
    self._pipeline = _TranslationPipeline(
        model=self._model,
        tokenizer=self._tokenizer,
        batch_size=self._batch_size,
        **self._pipeline_kwargs,
    )

I checked it again - switching to torch.cuda.OutOfMemoryError.

Change to real error - and suppress warnings
change name to oom_batch_size_backoff_mult
@johnml1135 johnml1135 merged commit 4faa596 into main Nov 22, 2023
13 of 14 checks passed
@ddaspit ddaspit deleted the batch_size_1 branch November 22, 2023 15:24