
Skip module clone for preparing large model export #18663

Merged: 6 commits merged into main from pengwa/disable_model_clone_for_large_model on Dec 5, 2023

Conversation

@pengwa (Contributor) commented on Dec 1, 2023


For LLaMA2 13B running with LoRA and DeepSpeed stage 2 on 8 GPUs, training failed while preparing the outputs used for torch.onnx.export. The reason: we deep-copy all of the parameters, including both the large frozen weights and the small set of LoRA trainable weights.

This PR first checks whether there is enough GPU memory to hold a cloned module; if there is not, it skips the copy.

The module is cloned to guard against the forward-pass run changing the weights, but that case should be rare. For now, not-able-to-run is worse than runnable-with-slightly-different-initial-weights, especially for large models.
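
A minimal sketch of the idea (not the actual ORT code; the function name, memory check, and warning text below are illustrative): estimate the footprint of a second copy of the module's parameters and buffers, deep-copy only when the device has that much free memory, and otherwise warn and reuse the original module.

```python
import copy
import warnings

import torch


def clone_module_if_memory_allows(module: torch.nn.Module,
                                  device: torch.device) -> torch.nn.Module:
    """Deep-copy `module` when GPU memory permits; otherwise return it as-is."""
    # Bytes required for one extra copy of all parameters and buffers.
    needed = sum(p.numel() * p.element_size() for p in module.parameters())
    needed += sum(b.numel() * b.element_size() for b in module.buffers())

    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    if needed < free_bytes:
        return copy.deepcopy(module)

    warnings.warn(
        "Skipping module clone before ONNX export: not enough GPU memory for "
        "a second copy of the weights. The forward run used to prepare the "
        "export outputs may slightly perturb the initial weights."
    )
    return module
```

As the review below asks, cloning stays the default path; the skip only applies when the memory check fails, and a warning tells the user about the trade-off.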

@pengwa added the "training" label (issues related to ONNX Runtime training; typically submitted using template) on Dec 1, 2023
@pengwa changed the title from "Skip module clone for large model" to "Skip module clone for preparing large model export" on Dec 1, 2023
@thiagocrepaldi (Contributor) left a comment:

LGTM, assuming the current behavior is not changed by default and that a nice warning message is presented to the user!

@justinchuby (Contributor) commented:

Readability (opinionated): avoid weak verbs like "is" or "do" when naming parameters, options, or variables. They tend not to be Pythonic and do not help readability.
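
A hypothetical before/after for that naming suggestion (the names are invented for illustration, not taken from this PR):

```python
# Weak-verb prefixes:
do_clone = True
is_clone_skipped = False

# Naming the behavior or state directly reads better:
clone_module = True
skip_clone = False
```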

@askhade (Contributor) left a comment:

LGTM

@askhade merged commit 4bfa844 into main on Dec 5, 2023. 96 checks passed.
@askhade deleted the pengwa/disable_model_clone_for_large_model branch on December 5, 2023 at 20:41.