Optimize cuComputePartGradGammaBeta kernel for MI100 #10475
Conversation
@weixingzhang Please review. Thanks.
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline, Windows CPU CI Pipeline
/azp run Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, centos7_cpu, centos7_cpu (linux_centos_ci Debug), centos7_cpu (linux_centos_ci Release), orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-amd-gpu-ci-pipeline, Linux Nuphar CI Pipeline, orttraining-distributed
Azure Pipelines successfully started running 7 pipeline(s).
/azp run orttraining-ortmodule, orttraining-ortmodule-distributed, onnxruntime-python-checks-ci-pipeline, onnxruntime-binary-size-checks-ci-pipeline, ONNX Runtime Web CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Though the HIP_PLATFORM symbol is indeed defined, the precedent throughout most of the sources is to use the USE_ROCM symbol. There are some other places outside of this PR where the platform symbol was incorrectly used instead of USE_ROCM.
That said, if-not-else is harder to read than if-else. I would also suggest reordering your if/else branches so the ROCm case comes first:
#ifdef USE_ROCM
// Optimization for ROCm MI100
#else
// no comment needed, just the original code here
#endif
Co-authored-by: Jeff Daily <[email protected]>
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline, Windows CPU CI Pipeline
/azp run Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, centos7_cpu, centos7_cpu (linux_centos_ci Debug), centos7_cpu (linux_centos_ci Release), orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-amd-gpu-ci-pipeline, Linux Nuphar CI Pipeline, orttraining-distributed
/azp run orttraining-ortmodule, orttraining-ortmodule-distributed, onnxruntime-python-checks-ci-pipeline, onnxruntime-binary-size-checks-ci-pipeline, ONNX Runtime Web CI Pipeline
Azure Pipelines successfully started running 7 pipeline(s).
Azure Pipelines successfully started running 4 pipeline(s).
Description:
Optimized the "part_size" parameter on MI100 for the LayerNorm implementation (specifically for the cuComputePartGradGammaBeta and cuComputeGradGammaBeta kernels).
Motivation and Context
Why is this change required? What problem does it solve?
https://ontrack.amd.com/browse/MSRCHA-161 : the layer normalization forward and backward kernels are slower on MI100 than on V100. This change is ported from an Apex PR: Optimize layer normalization for AMD GPUs ROCm/apex#66.