ORTModule memory efficient gradient management #18907

pengwa · 2023-12-21T17:45:07Z

ORTModule memory efficient gradient management

Previously I tried to solve the coarse-grained gradient accumulation/update problem in ORTModule with #8979, but that solution was not fully validated with DDP or with user hooks registered on gradient accumulation for torch parameters.

This PR addresses the problem with an approach similar to PR #8979, i.e. it triggers gradient accumulation as soon as ORT has computed the grad. Instead of using an AccumulateGrad op, this time it uses an ONNX operator, PythonOp, which internally calls param.backward(grad) so that all related hooks are handled correctly.
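To illustrate why calling param.backward(grad) is enough to get hooks handled, here is a minimal sketch in plain PyTorch. It does not use ORTModule or PythonOp. The gradients are hard-coded stand-ins for values ORT would compute, and the names `accumulate`, `log_grad`, and `hook_calls` are hypothetical, chosen only for this example.

```python
import torch

# Hypothetical stand-in for a parameter owned by the wrapped torch module.
param = torch.nn.Parameter(torch.zeros(3))

hook_calls = []


def log_grad(grad):
    # User hook on the parameter; autograd calls it each time a gradient
    # with respect to `param` is computed.
    hook_calls.append(grad.clone())


param.register_hook(log_grad)


def accumulate(param, grad):
    """Hand an externally computed gradient to torch autograd.

    Assumption: `grad` plays the role of a gradient produced outside of
    torch (e.g. by ORT). Calling backward on the leaf parameter runs its
    hooks and accumulates the result into `param.grad`.
    """
    param.backward(grad)


# Two "micro-steps": the second call adds to the first instead of replacing it.
accumulate(param, torch.tensor([1.0, 2.0, 3.0]))
accumulate(param, torch.tensor([0.5, 0.5, 0.5]))

print(param.grad)       # accumulated gradient
print(len(hook_calls))  # number of times the user hook fired
```

Because accumulation goes through torch autograd rather than a custom accumulation op, anything attached to the parameter's gradient path can observe each gradient as soon as it is handed over.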

Replaced by #18924

Motivation and Context

orttraining/orttraining/python/training/ortmodule/_mem_efficent_training.py

pengwa added the training label (issues related to ONNX Runtime training; typically submitted using template) Dec 21, 2023

pengwa marked this pull request as draft December 21, 2023 17:49

github-advanced-security bot found potential problems Dec 21, 2023


orttraining/orttraining/python/training/ortmodule/_mem_efficent_training.py (Fixed)


pengwa changed the title ~~ORTModule memory efficient gradient mangement~~ ORTModule memory efficient gradient management Dec 21, 2023

pengwa added 13 commits December 24, 2023 22:44

update compute

2cc132a

fix

24e7503

fix

7d21142

debug info

c681eae

dump memory

ed8826c

revert some

81739c7

minor

c5c97f7

refinement

ae74921

lint

4ac5eb0

save

76640be

decouple pythonop creation

cd607d5

lint

333f235

remove stage3 related change

be122e3

pengwa force-pushed the pengwa/mem_efficient_grad_mgr branch from c585bc8 to be122e3 Compare December 25, 2023 06:44

pengwa changed the base branch from pengwa/update_recompute to main December 25, 2023 06:44

pengwa marked this pull request as ready for review December 25, 2023 06:45

pengwa closed this Dec 25, 2023

pengwa deleted the pengwa/mem_efficient_grad_mgr branch May 10, 2024 10:58
