OOM error fixing #58

Merged: johnml1135 merged 9 commits into main from batch_size_1 on Nov 22, 2023

Conversation

@johnml1135 (Collaborator) commented Nov 9, 2023:

#52
Fix out-of-memory errors by gradually backing off the engine batch size.

@johnml1135 johnml1135 requested a review from ddaspit November 9, 2023 20:30
@johnml1135 (Collaborator, Author) commented:

Testing:


[WARNING|text2text_generation.py:311] 2023-11-09 15:34:53,499 >> Your input_length: 937 is bigger than 0.9 * max_length: 200. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
2023-11-09 15:34:59
2023-11-09 15:34:57,105 - machine.jobs.nmt_engine_build_job - INFO - Out of memory error, reducing batch size to 8
[INFO|tokenization_utils_base.py:2041] 2023-11-09 15:34:57,107 >> loading file sentencepiece.bpe.model
[INFO|tokenization_utils_base.py:2041] 2023-11-09 15:34:57,107 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2041] 2023-11-09 15:34:57,107 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2041] 2023-11-09 15:34:57,107 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2041] 2023-11-09 15:34:57,107 >> loading file tokenizer_config.json
[WARNING|logging.py:290] 2023-11-09 15:34:57,553 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|text2text_generation.py:311] 2023-11-09 15:34:57,566 >> Your input_length: 529 is bigger than 0.9 * max_length: 200. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
2023-11-09 15:35:05
[WARNING|text2text_generation.py:311] 2023-11-09 15:35:00,315 >> Your input_length: 937 is bigger than 0.9 * max_length: 200. You might consider increasing your max_length manually, e.g. translator('...', max_length=400)
2023-11-09 15:35:03,667 - machine.jobs.build_nmt_engine - INFO - Finished
2023-11-09 15:35:16
Process completed successfully

@ddaspit (Contributor) left a comment:

I think this would be simpler if we added this as a feature of the HuggingFaceNmtEngine class. I think it is something that would be generally useful. We could pass in a parameter to the constructor of the class to turn it on. The parameter could be a number between 0 and 1 with a default value of 1. The value would be multiplied by the current batch size to get the new batch size. We could call it oom_batch_size_backoff_factor or something like that. No other code would need to be changed. If it is unclear what I am suggesting, I can implement it.
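
A minimal sketch of the suggestion, assuming hypothetical names (the real HuggingFaceNmtEngine constructor and internals differ):

class HuggingFaceNmtEngine:
    def __init__(self, batch_size: int, oom_batch_size_backoff_factor: float = 1.0) -> None:
        # A factor of 1 (the default) disables backoff; a value between 0 and 1
        # shrinks the batch size each time an out-of-memory error is caught.
        self._batch_size = batch_size
        self._oom_batch_size_backoff_factor = oom_batch_size_backoff_factor

    def _back_off_batch_size(self) -> None:
        # Multiply the current batch size by the factor, never dropping below 1.
        self._batch_size = max(int(round(self._batch_size * self._oom_batch_size_backoff_factor)), 1)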

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@johnml1135 (Collaborator, Author) commented:

If you would like to redo the implementation, go ahead. What I have now works and you should be able to pull a lot from it.

@johnml1135 (Collaborator, Author) left a comment:

Please let me do one last review of the changes.

Reviewable status: 0 of 7 files reviewed, all discussions resolved (waiting on @ddaspit)

@ddaspit (Contributor) left a comment:

Reviewed 9 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @johnml1135)


machine/jobs/settings.yaml line 38 at r2 (raw file):

    parent_model_name: facebook/nllb-200-distilled-600M
    train_params:
      group_by_length: false

Why did you disable grouping by length?


machine/translation/nmt_translation_engine.py line 9 at r2 (raw file):

class NmtTranslationEngine(TranslationEngine, ContextManager["NmtTranslationEngine"]):

This interface isn't needed. You can remove it.


machine/translation/huggingface/hugging_face_nmt_engine.py line 68 at r2 (raw file):

        batch_size = self._pipeline_kwargs.pop("batch_size")
        if batch_size is not None:
            self._batch_size = int(batch_size)  # type: ignore[assignment]

What is causing the type error? You shouldn't need to ignore the error. Use cast if necessary.
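
For reference, a minimal sketch of the cast approach (the kwargs dict and default value here are illustrative assumptions, not the actual engine code):

from typing import Any, Dict, Optional, cast

pipeline_kwargs: Dict[str, Any] = {"batch_size": 16}
# Cast the loosely typed kwarg instead of suppressing the checker with type: ignore.
batch_size = cast(Optional[int], pipeline_kwargs.pop("batch_size", None))
effective_batch_size = int(batch_size) if batch_size is not None else 1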


machine/translation/huggingface/hugging_face_nmt_engine.py line 70 at r2 (raw file):

            self._batch_size = int(batch_size)  # type: ignore[assignment]
        else:
            self._batch_size = 16

The default batch size should be 1.


machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

                    all_results.extend(self._try_translate_n_batch(n, segments[step : step + self._batch_size]))
                return all_results
            except Exception as e:

This will catch any exception and treat it like an OOM error. We need to be more specific here.


machine/translation/huggingface/hugging_face_nmt_engine.py line 109 at r2 (raw file):

                return all_results
            except Exception as e:
                if self._oom_batch_size_backoff_multiplier >= 0.9999:

We should also check if the current batch size is 1.

@ddaspit (Contributor) left a comment:

Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @johnml1135)


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 137 at r2 (raw file):

        set_seed(self._training_args.seed)

        logger.info("Initializing tokenizer.")

This should be moved to the line that actually initializes the tokenizer.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 197 at r2 (raw file):

            return AutoTokenizer.from_pretrained(str(tokenizer_dir), use_fast=True)

        logger.info("Checking for missing tokens.")

Move this into the if check.


machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 238 at r2 (raw file):

                tokenizer.id_to_lang_token[lang_id] = lang_code

        logger.info("Add new language codes as tokens.")

Move this into the if check.

@johnml1135 (Collaborator, Author) commented:

machine/jobs/settings.yaml line 38 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Why did you disable grouping by length?

I'll undo it - I was trying to reduce the startup time, but it didn't help.

@johnml1135 (Collaborator, Author) commented:

machine/translation/nmt_translation_engine.py line 9 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This interface isn't needed. You can remove it.

Forgot to remove them all. Done.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 68 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

What is causing the type error? You shouldn't need to ignore the error. Use cast if necessary.

Resolved.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 70 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The default batch size should be 1.

OK. The settings.yaml still has 16, which should be an appropriate default for what we need.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This will catch any exception and treat it like an OOM error. We need to be more specific here.

This is a known issue: pytorch/pytorch#109961. It will be addressed in #67. Made updates for now in case there is a different type of error.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 109 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We should also check if the current batch size is 1.

Done.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 137 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This should be moved to the line that actually initializes the tokenizer.

Done

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 197 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Move this into the if check.

Done

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_model_trainer.py line 238 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Move this into the if check.

Done

Respond to reviewer comments.
@codecov-commenter commented Nov 21, 2023:

Codecov Report

Attention: 9 lines in your changes are missing coverage. Please review.

Comparison is base (2174c7d) 86.61% compared to head (c675bbc) 86.59%.
Report is 6 commits behind head on main.

File                                                     Patch %   Lines
...translation/huggingface/hugging_face_nmt_engine.py   70.96%    9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #58      +/-   ##
==========================================
- Coverage   86.61%   86.59%   -0.03%     
==========================================
  Files         223      223              
  Lines       13366    13395      +29     
==========================================
+ Hits        11577    11599      +22     
- Misses       1789     1796       +7     


@ddaspit (Contributor) left a comment:

Reviewed 7 of 8 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)


machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

This is a known issue: pytorch/pytorch#109961. It will be addressed in #67. Made updates for now in case there is a different type of error.

I took a look at the issues in the Pytorch repo. There is a class (torch.cuda.OutOfMemoryError) that gets thrown for OOM errors that we can catch. It just doesn't inherit from Exception, so Pylance displays an error stating: "OutOfMemoryError" is not a valid exception class. We should be able to use torch.cuda.OutOfMemoryError and safely ignore the Pylance error.

Also, I don't think we should raise a new error. We should just rethrow the existing OutOfMemoryError.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I took a look at the issues in the Pytorch repo. There is a class (torch.cuda.OutOfMemoryError) that gets thrown for OOM errors that we can catch. It just doesn't inherit from Exception, so Pylance displays an error stating: "OutOfMemoryError" is not a valid exception class. We should be able to use torch.cuda.OutOfMemoryError and safely ignore the Pylance error.

Also, I don't think we should raise a new error. We should just rethrow the existing OutOfMemoryError.

Ok. Will fix and merge.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ok. Will fix and merge.

Actually, it won't work. We can't rethrow an OutOfMemoryError because it is not actually an exception. I think we will likely have to leave it as is.

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Actually, it won't work. We can't rethrow an OutOfMemoryError because it is not actually an exception. I think we will likely have to leave it as is.

I figured it out - fixing it.

@ddaspit (Contributor) left a comment:

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)


machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I figured it out - fixing it.

This worked on my machine:

try:
    for step in range(0, outer_batch_size, self._batch_size):
        all_results.extend(self._try_translate_n_batch(n, segments[step : step + self._batch_size]))
    return all_results
except torch.cuda.OutOfMemoryError:  # type: ignore[reportGeneralTypeIssues]
    if self._oom_batch_size_backoff_multiplier >= 0.9999 or self._batch_size == 1:
        raise
    self._batch_size = max(int(round(self._batch_size * self._oom_batch_size_backoff_multiplier)), 1)
    logger.warn(f"Out of memory error caught, reducing batch size to {self._batch_size} and retrying.")
    self._pipeline = _TranslationPipeline(
        model=self._model,
        tokenizer=self._tokenizer,
        batch_size=self._batch_size,
        **self._pipeline_kwargs,
    )

@ddaspit (Contributor) left a comment:

:lgtm:

Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@johnml1135 (Collaborator, Author) commented:

machine/translation/huggingface/hugging_face_nmt_engine.py line 108 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This worked on my machine:

try:
    for step in range(0, outer_batch_size, self._batch_size):
        all_results.extend(self._try_translate_n_batch(n, segments[step : step + self._batch_size]))
    return all_results
except torch.cuda.OutOfMemoryError:  # type: ignore[reportGeneralTypeIssues]
    if self._oom_batch_size_backoff_multiplier >= 0.9999 or self._batch_size == 1:
        raise
    self._batch_size = max(int(round(self._batch_size * self._oom_batch_size_backoff_multiplier)), 1)
    logger.warn(f"Out of memory error caught, reducing batch size to {self._batch_size} and retrying.")
    self._pipeline = _TranslationPipeline(
        model=self._model,
        tokenizer=self._tokenizer,
        batch_size=self._batch_size,
        **self._pipeline_kwargs,
    )

I checked it again - switching to torch.cuda.OutOfMemoryError.

Change to real error - and suppress warnings
change name to oom_batch_size_backoff_mult
@johnml1135 johnml1135 merged commit 4faa596 into main Nov 22, 2023
13 of 14 checks passed
@ddaspit ddaspit deleted the batch_size_1 branch November 22, 2023 15:24