-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip module clone for preparing large model export #18663
Conversation
…pengwa/disable_model_clone_for_large_model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, assuming the current behavior is not changed by default and that a nice warning message is presented to the user!
Readability: (opinionated) avoid weak verbs like is/do in naming parameters, options or variables. They tend not to be pythonic and don’t help with readability. |
orttraining/orttraining/python/training/ortmodule/_graph_execution_manager.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Skip module clone for preparing large model export
For LLAMA2 13B, when running with Lora, DeepSpeed stage2 on 8 GPUs . It failed during preparing outputs which will be used for torch.onnx.export. The reason, we deep copy all the params including both big sizes of frozen weights, + a little bit of Lora trainable weight.
This PR will firstly check whether the GPU memmory is enough for a cloned module, if not, skip the copy.
Copying the module is to guarantee the fw path run may change the weight, while this case should be rare. But for now, Not-Able-To-Run is worse than Runnable-with-A-little-bit-different-initial-weight, especially for large models.