
Skip module clone for preparing large model export #18663

Merged: 6 commits merged into main from pengwa/disable_model_clone_for_large_model on Dec 5, 2023

Conversation

@pengwa (Contributor) commented on Dec 1, 2023


For LLaMA2 13B running with LoRA and DeepSpeed stage 2 on 8 GPUs, training failed while preparing the outputs used for torch.onnx.export. The reason: we deep-copy all of the parameters, including both the large frozen weights and the small set of LoRA trainable weights.

This PR first checks whether there is enough GPU memory to hold a cloned module; if there is not, it skips the copy.

The module is cloned to guard against the forward-pass run changing the weights, but that case should be rare. For now, not-able-to-run is worse than runnable-with-slightly-different-initial-weights, especially for large models.
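
A minimal sketch of the idea (not the actual ORT code; the function name, memory check, and warning text below are illustrative): estimate the footprint of a second copy of the module's parameters and buffers, deep-copy only when the device has that much free memory, and otherwise warn and reuse the original module.

```python
import copy
import warnings

import torch


def clone_module_if_memory_allows(module: torch.nn.Module,
                                  device: torch.device) -> torch.nn.Module:
    """Deep-copy `module` when GPU memory permits; otherwise return it as-is."""
    # Bytes required for one extra copy of all parameters and buffers.
    needed = sum(p.numel() * p.element_size() for p in module.parameters())
    needed += sum(b.numel() * b.element_size() for b in module.buffers())

    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    if needed < free_bytes:
        return copy.deepcopy(module)

    warnings.warn(
        "Skipping module clone before ONNX export: not enough GPU memory for "
        "a second copy of the weights. The forward run used to prepare the "
        "export outputs may slightly perturb the initial weights."
    )
    return module
```

As the review below asks, cloning stays the default path; the skip only applies when the memory check fails, and a warning tells the user about the trade-off.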

@pengwa added the "training" label (issues related to ONNX Runtime training; typically submitted using template) on Dec 1, 2023
@pengwa changed the title from "Skip module clone for large model" to "Skip module clone for preparing large model export" on Dec 1, 2023
@thiagocrepaldi (Contributor) left a comment:

LGTM, assuming the current behavior is not changed by default and that a nice warning message is presented to the user!

@justinchuby (Contributor) commented:

Readability (opinionated): avoid weak verbs like "is" or "do" when naming parameters, options, or variables. They tend not to be Pythonic and do not help readability.
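
A hypothetical before/after for that naming suggestion (the names are invented for illustration, not taken from this PR):

```python
# Weak-verb prefixes:
do_clone = True
is_clone_skipped = False

# Naming the behavior or state directly reads better:
clone_module = True
skip_clone = False
```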

@askhade (Contributor) left a comment:

LGTM

@askhade merged commit 4bfa844 into main on Dec 5, 2023. 96 checks passed.
@askhade deleted the pengwa/disable_model_clone_for_large_model branch on December 5, 2023 at 20:41.