
Leverage vectorized load/write for SkipLayerNorm #11803

Merged: 16 commits merged into microsoft:master on Jul 6, 2022

Conversation

@hubertlu-tw (Contributor) commented on Jun 9, 2022

Description
Optimized SkipLayerNormKernel with half2 vectorized load/write.

Motivation and Context
The perf numbers in the table below were collected on an MI200 GPU with various input tensor sizes.

| batch size | seq len | hidden size | original total time (us, 110 runs) | OneFlow (us) | half2 (us) | OneFlow vs. original (%) | half2 vs. original (%) |
|---|---|---|---|---|---|---|---|
| 1 | 128 | 768 | 1164 | 1151 | 950 | 1.116838 | 18.38488 |
| 1 | 128 | 1024 | 1250 | 1244 | 924 | 0.48 | 26.08 |
| 1 | 384 | 768 | 1241 | 1223 | 1030 | 1.450443 | 17.00242 |
| 1 | 384 | 1024 | 1356 | 1345 | 1035 | 0.811209 | 23.67257 |
| 1 | 512 | 768 | 1312 | 1288 | 1055 | 1.829268 | 19.58841 |
| 1 | 512 | 1024 | 1433 | 1413 | 1093 | 1.395673 | 23.72645 |
| 4 | 128 | 768 | 1315 | 1286 | 1055 | 2.205323 | 19.77186 |
| 4 | 128 | 1024 | 1428 | 1371 | 1088 | 3.991597 | 23.80952 |
| 4 | 384 | 768 | 2080 | 2558 | 1475 | -22.9808 | 29.08654 |
| 4 | 384 | 1024 | 2385 | 2930 | 1663 | -22.8512 | 30.27254 |
| 4 | 512 | 768 | 2527 | 2815 | 1746 | -11.3969 | 30.90621 |
| 4 | 512 | 1024 | 2938 | 3220 | 2295 | -9.59837 | 21.88564 |
| 8 | 128 | 768 | 1630 | 1607 | 1206 | 1.411043 | 26.01227 |
| 8 | 128 | 1024 | 1815 | 1858 | 1365 | -2.36915 | 24.79339 |
| 8 | 384 | 768 | 3409 | 3392 | 2314 | 0.49868 | 32.12086 |
| 8 | 384 | 1024 | 4051 | 4388 | 2708 | -8.31893 | 33.15231 |
| 8 | 512 | 768 | 4471 | 4249 | 2859 | 4.965332 | 36.05457 |
| 8 | 512 | 1024 | 5844 | 5186 | 3337 | 11.25941 | 42.8987 |
| 32 | 128 | 768 | 4396 | 4256 | 2865 | 3.184713 | 34.82712 |
| 32 | 128 | 1024 | 5819 | 5174 | 3327 | 11.08438 | 42.82523 |
| 32 | 384 | 768 | 11247 | 9714 | 7302 | 13.6303 | 35.07602 |
| 32 | 384 | 1024 | 14672 | 12296 | 8729 | 16.19411 | 40.50573 |
| 32 | 512 | 768 | 14697 | 12617 | 9631 | 14.15255 | 34.46962 |
| 32 | 512 | 1024 | 19909 | 16063 | 11425 | 19.3179 | 42.61389 |
| 64 | 128 | 768 | 7937 | 7057 | 5071 | 11.08731 | 36.10936 |
| 64 | 128 | 1024 | 10192 | 9043 | 5969 | 11.27355 | 41.43446 |
| 64 | 384 | 768 | 21720 | 17711 | 14039 | 18.45764 | 35.36372 |
| 64 | 384 | 1024 | 28763 | 22835 | 16891 | 20.60981 | 41.27525 |
| 64 | 512 | 768 | 28756 | 22853 | 18560 | 20.52789 | 35.45695 |
| 64 | 512 | 1024 | 41267 | 30417 | 22819 | 26.29219 | 44.704 |
| 128 | 128 | 768 | 14873 | 12631 | 9623 | 15.0743 | 35.29886 |
| 128 | 128 | 1024 | 19855 | 16052 | 11423 | 19.15387 | 42.46789 |
| 128 | 384 | 768 | 42633 | 32947 | 27537 | 22.71949 | 35.40919 |
| 128 | 384 | 1024 | 57361 | 42337 | 33028 | 26.19201 | 42.42081 |
| 128 | 512 | 768 | 56512 | 43079 | 36681 | 23.77017 | 35.09166 |
| 128 | 512 | 1024 | 81459 | 57703 | 44867 | 29.16314 | 44.92076 |

@zhangyaobit (Contributor):

Thinking about this a little bit, it looks like the use of aligned_vector could be straightforward. We would use the aligned vector only for the global-memory read/write parts (much like what we did for FastGelu); the reduction part could stay unchanged.

The tail handling should be similar to that of FastGelu as well. (The reduction part is unchanged/not affected.)

@hubertlu-tw changed the title from "Leverage half2 vectorized load/write for SkipLayerNorm" to "Leverage vectorized load/write for SkipLayerNorm" on Jun 30, 2022

@zhangyaobit previously approved these changes on Jul 5, 2022
@zhangyaobit (Contributor):

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline

@zhangyaobit (Contributor):

/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@azure-pipelines:

Azure Pipelines successfully started running 6 pipeline(s).

@azure-pipelines:

Azure Pipelines successfully started running 10 pipeline(s).


@zhangyaobit zhangyaobit merged commit 835ecb2 into microsoft:master Jul 6, 2022
@centwang centwang mentioned this pull request Aug 11, 2022
tianleiwu added a commit that referenced this pull request Oct 16, 2023
Fix a bug in #11803:
When the hidden size is not exactly the same as the next supported size (for example, ld=320 in stable diffusion), the current vectorized kernel might read out of bounds and cause a CUDA failure.

Also resolved another issue: for the first and last sizes, the current macro generates some dead code (branches that can never run). Here we change it to avoid those branches for the boundary sizes.

Performance tests with stable diffusion show that performance is on par before and after this fix.
jchen351 pushed a commit that referenced this pull request Oct 18, 2023
tianleiwu added a commit that referenced this pull request Oct 31, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024
…oft#17943)
