Optimize cuComputePartGradGammaBeta kernel for MI100 #10475
Conversation
@weixingzhang Please review. Thanks.
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline, Windows CPU CI Pipeline
/azp run Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, centos7_cpu, centos7_cpu (linux_centos_ci Debug), centos7_cpu (linux_centos_ci Release), orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-amd-gpu-ci-pipeline, Linux Nuphar CI Pipeline, orttraining-distributed
Azure Pipelines successfully started running 7 pipeline(s).
/azp run orttraining-ortmodule, orttraining-ortmodule-distributed, onnxruntime-python-checks-ci-pipeline, onnxruntime-binary-size-checks-ci-pipeline, ONNX Runtime Web CI Pipeline
Azure Pipelines successfully started running 4 pipeline(s).
Though the HIP_PLATFORM symbol is indeed defined, the precedent throughout most of the sources is to use the USE_ROCM symbol. There are some other places outside of this PR where the platform symbol was incorrectly used instead of USE_ROCM.
That said, if-not-else is harder to read than if-else. I would also suggest reordering your if/else branches so the ROCm case comes first:
#ifdef USE_ROCM
// Optimization for ROCm MI100
#else
// no comment needed, just the original code here
#endif
Co-authored-by: Jeff Daily <[email protected]>
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux CPU x64 NoContribops CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, MacOS NoContribops CI Pipeline, Windows CPU CI Pipeline
/azp run Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, centos7_cpu, centos7_cpu (linux_centos_ci Debug), centos7_cpu (linux_centos_ci Release), orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-amd-gpu-ci-pipeline, Linux Nuphar CI Pipeline, orttraining-distributed
/azp run orttraining-ortmodule, orttraining-ortmodule-distributed, onnxruntime-python-checks-ci-pipeline, onnxruntime-binary-size-checks-ci-pipeline, ONNX Runtime Web CI Pipeline
Azure Pipelines successfully started running 7 pipeline(s).
Azure Pipelines successfully started running 4 pipeline(s).
Description:
Optimized the "part_size" parameter on MI100 for the LayerNorm implementation (specifically for the cuComputePartGradGammaBeta and cuComputeGradGammaBeta kernels).
Motivation and Context
Why is this change required? What problem does it solve?
https://ontrack.amd.com/browse/MSRCHA-161 : the layer normalization forward and backward kernels are slower on MI100 than on V100. This change is ported from an Apex PR: Optimize layer normalization for AMD GPUs ROCm/apex#66.