Leverage vectorized load/write for SkipLayerNorm #11803
Conversation
Thinking about this a little, the use of aligned_vector looks straightforward: we use aligned vectors only for the global-memory reads and writes (much like what we did for FastGelu), while the reduction part stays unchanged. Tail handling should be similar to FastGelu's as well. (The reduction part is unchanged/not affected.)
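The pattern described above can be sketched on the host side. This is a hypothetical illustration (names like `aligned_vector` and `vectorized_row_sum` follow the comment, not the actual kernel source): the row pointer is reinterpreted as a wide vector type so each load moves several elements at once, and the remainder is handled with a scalar tail loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the aligned_vector pattern: a struct whose alignment lets
// one memory transaction move VecSize elements at once.
template <typename T, int VecSize>
struct alignas(sizeof(T) * VecSize) aligned_vector {
  T val[VecSize];
};

// Sums one row: vectorized loads over the aligned prefix, then a
// scalar loop over the tail (mirroring the FastGelu tail handling).
float vectorized_row_sum(const float* row, std::size_t n) {
  constexpr int kVec = 4;
  using Vec = aligned_vector<float, kVec>;
  float sum = 0.0f;
  std::size_t vec_end = (n / kVec) * kVec;
  const Vec* vrow = reinterpret_cast<const Vec*>(row);
  for (std::size_t i = 0; i < vec_end / kVec; ++i) {
    Vec v = vrow[i];  // one wide load instead of kVec scalar loads
    for (int j = 0; j < kVec; ++j) sum += v.val[j];
  }
  for (std::size_t i = vec_end; i < n; ++i) sum += row[i];  // tail
  return sum;
}
```

In the real CUDA kernel the wide loads map to 64- or 128-bit transactions; the reduction over the per-thread partial sums is unaffected by this change.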
…NormKernelSmall with the vectorized kernel
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline
/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed
Azure Pipelines successfully started running 6 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
Fix a bug in #11803: when the hidden size does not exactly match the next supported size (for example, ld=320 in Stable Diffusion), the current vectorized kernel might read out of bounds, which can cause a CUDA failure. Also resolved another issue: for the first and last sizes, the current macro generates dead code (some branches can never run); we change it to avoid those branches at the boundary sizes. Performance tests with Stable Diffusion show that performance is on par before and after this fix.
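One way to guarantee the wide loads never cross the end of a row is to dispatch only to vector widths that evenly divide the hidden size. The helper below is a hypothetical sketch of that dispatch rule (the function name and the width list are illustrative, not the actual macro from the PR); a width that does not divide `ld` would make the last wide load read past the row, which is the out-of-bounds failure described above.

```cpp
#include <cassert>
#include <initializer_list>

// Pick the widest vector width that evenly divides the hidden size,
// falling back to scalar (width 1). Any width w with ld % w != 0
// would read past the end of the row on the last wide load.
int pick_vec_size(int ld) {
  for (int w : {8, 4, 2}) {
    if (ld % w == 0) return w;
  }
  return 1;
}
```

For ld=320 this selects width 8 and stays in bounds, whereas a dispatch keyed only on "the next supported size" could pair the row with a width that does not divide it.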
Description
Optimized SkipLayerNormKernel with half2 vectorized load/write.
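For reference, the computation the kernel vectorizes is (as commonly defined for this contrib op, stated here as an assumption): out = LayerNorm(input + skip + bias) per row. A scalar C++ reference of that semantics:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference for SkipLayerNorm on one row:
// out = LayerNorm(input + skip + bias) * gamma + beta.
// The PR's half2 kernel vectorizes the loads/stores of this computation.
std::vector<float> skip_layer_norm(const std::vector<float>& input,
                                   const std::vector<float>& skip,
                                   const std::vector<float>& bias,
                                   const std::vector<float>& gamma,
                                   const std::vector<float>& beta,
                                   float epsilon = 1e-5f) {
  std::size_t n = input.size();
  std::vector<float> x(n), out(n);
  float mean = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    x[i] = input[i] + skip[i] + bias[i];  // fused skip/bias add
    mean += x[i];
  }
  mean /= static_cast<float>(n);
  float var = 0.0f;
  for (std::size_t i = 0; i < n; ++i) var += (x[i] - mean) * (x[i] - mean);
  var /= static_cast<float>(n);
  float inv_std = 1.0f / std::sqrt(var + epsilon);
  for (std::size_t i = 0; i < n; ++i)
    out[i] = (x[i] - mean) * inv_std * gamma[i] + beta[i];
  return out;
}
```

With half2, each load/store moves two fp16 values per transaction, halving the number of global-memory operations relative to scalar half access.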
Motivation and Context
The perf numbers in the table below were collected on MI200 with various input tensor sizes.