
Improve perf for mem efficient grad mgmt #20480

Merged 4 commits into main on May 10, 2024

Conversation

pengwa
Contributor

@pengwa pengwa commented Apr 26, 2024

Improve perf for mem efficient grad mgmt

When the memory-efficient gradient management feature is enabled, the weight-retrieval PythonOp for every layer is launched at the beginning of the forward pass, which leaves the GPU stream idle for a few milliseconds. The root cause is that the reversed-DFS ordering cannot always handle such input branching well, so we introduce a distance-to-input-leaf concept into the reversed DFS. This moves not only the problematic PythonOp to the place where it is needed, but also the Cast ops that follow the weight retrieval.
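The idea can be illustrated with a small standalone sketch (not the ONNX Runtime source; all node names and helper functions here are hypothetical): order a toy DAG with a reversed DFS from the outputs, and among a node's predecessors visit the one farthest from the graph inputs first. Nodes sitting right next to the inputs, like the weight-retrieval ops, then land late in the resulting order, close to their consumers, instead of all being scheduled up front.

```python
def leaf_distance(node, preds, memo=None):
    """Longest path from `node` back to a graph input (a node with no predecessors)."""
    if memo is None:
        memo = {}
    if node in memo:
        return memo[node]
    ps = preds.get(node, [])
    d = 0 if not ps else 1 + max(leaf_distance(p, preds, memo) for p in ps)
    memo[node] = d
    return d

def reversed_dfs_order(outputs, preds):
    """Post-order DFS from the outputs toward the inputs.

    Among a node's predecessors, the one farthest from the inputs is
    visited (and thus emitted) first, so near-input ops such as
    weight-retrieval PythonOps are emitted last among their siblings,
    i.e. right before the node that consumes them.
    """
    memo = {}
    order, seen = [], set()

    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for p in sorted(preds.get(n, []),
                        key=lambda p: -leaf_distance(p, preds, memo)):
            visit(p)
        order.append(n)

    for out in outputs:
        visit(out)
    return order

# Toy graph: node -> list of predecessor nodes. "w*_retrieve" stand in for
# the weight-retrieval PythonOps; the names are illustrative only.
preds = {
    "matmul1": ["input", "w1_retrieve"],
    "matmul2": ["matmul1", "w2_retrieve"],
    "out": ["matmul2"],
}
order = reversed_dfs_order(["out"], preds)
print(order)  # "w2_retrieve" is emitted just before "matmul2", not at the start
```

With a plain DFS that ignores distance, both retrieval ops could be emitted at the very beginning of the order; the distance-based tiebreak is what pushes each one next to its consumer.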

Main branch: 102.19s - 26.35s = 75.84s for 260 steps (4,627 samples), 61.04 samples/second
This PR: 100.28s - 25.10s = 75.18s for 260 steps, 61.54 samples/second (+0.8% gain)

Main branch:

![image](https://github.com/microsoft/onnxruntime/assets/10530022/75c4131e-dade-49b0-aa8b-ee1c637ad9a8)

This PR:

![image](https://github.com/microsoft/onnxruntime/assets/10530022/e590a536-3b80-4f51-b89f-f25a55ddd7e2)

Motivation and Context

@pengwa pengwa added the training issues related to ONNX Runtime training; typically submitted using template label Apr 26, 2024
Contributor

@AdamLouly AdamLouly left a comment


LGTM

@pengwa
Contributor Author

pengwa commented May 10, 2024

Thanks @AdamLouly !!

@pengwa pengwa merged commit 56f7035 into main May 10, 2024
95 checks passed
@pengwa pengwa deleted the pengwa/perf_efficient_mem branch May 10, 2024 00:09
poweiw pushed a commit to poweiw/onnxruntime that referenced this pull request Jun 25, 2024