
Leverage vectorized load/write for SkipLayerNorm #11803

Merged: 16 commits merged into microsoft:master on Jul 6, 2022

Conversation

@hubertlu-tw (Contributor) commented on Jun 9, 2022

Description
Optimized SkipLayerNormKernel with half2 vectorized load/write.

Motivation and Context
The perf numbers in the table below were collected on an MI200 GPU with various input tensor sizes.

| batch size | seq len | hidden size | original total time (us, 110 runs) | OneFlow (us) | half2 (us) | OneFlow vs. original (%) | half2 vs. original (%) |
|---|---|---|---|---|---|---|---|
| 1 | 128 | 768 | 1164 | 1151 | 950 | 1.116838 | 18.38488 |
| 1 | 128 | 1024 | 1250 | 1244 | 924 | 0.48 | 26.08 |
| 1 | 384 | 768 | 1241 | 1223 | 1030 | 1.450443 | 17.00242 |
| 1 | 384 | 1024 | 1356 | 1345 | 1035 | 0.811209 | 23.67257 |
| 1 | 512 | 768 | 1312 | 1288 | 1055 | 1.829268 | 19.58841 |
| 1 | 512 | 1024 | 1433 | 1413 | 1093 | 1.395673 | 23.72645 |
| 4 | 128 | 768 | 1315 | 1286 | 1055 | 2.205323 | 19.77186 |
| 4 | 128 | 1024 | 1428 | 1371 | 1088 | 3.991597 | 23.80952 |
| 4 | 384 | 768 | 2080 | 2558 | 1475 | -22.9808 | 29.08654 |
| 4 | 384 | 1024 | 2385 | 2930 | 1663 | -22.8512 | 30.27254 |
| 4 | 512 | 768 | 2527 | 2815 | 1746 | -11.3969 | 30.90621 |
| 4 | 512 | 1024 | 2938 | 3220 | 2295 | -9.59837 | 21.88564 |
| 8 | 128 | 768 | 1630 | 1607 | 1206 | 1.411043 | 26.01227 |
| 8 | 128 | 1024 | 1815 | 1858 | 1365 | -2.36915 | 24.79339 |
| 8 | 384 | 768 | 3409 | 3392 | 2314 | 0.49868 | 32.12086 |
| 8 | 384 | 1024 | 4051 | 4388 | 2708 | -8.31893 | 33.15231 |
| 8 | 512 | 768 | 4471 | 4249 | 2859 | 4.965332 | 36.05457 |
| 8 | 512 | 1024 | 5844 | 5186 | 3337 | 11.25941 | 42.8987 |
| 32 | 128 | 768 | 4396 | 4256 | 2865 | 3.184713 | 34.82712 |
| 32 | 128 | 1024 | 5819 | 5174 | 3327 | 11.08438 | 42.82523 |
| 32 | 384 | 768 | 11247 | 9714 | 7302 | 13.6303 | 35.07602 |
| 32 | 384 | 1024 | 14672 | 12296 | 8729 | 16.19411 | 40.50573 |
| 32 | 512 | 768 | 14697 | 12617 | 9631 | 14.15255 | 34.46962 |
| 32 | 512 | 1024 | 19909 | 16063 | 11425 | 19.3179 | 42.61389 |
| 64 | 128 | 768 | 7937 | 7057 | 5071 | 11.08731 | 36.10936 |
| 64 | 128 | 1024 | 10192 | 9043 | 5969 | 11.27355 | 41.43446 |
| 64 | 384 | 768 | 21720 | 17711 | 14039 | 18.45764 | 35.36372 |
| 64 | 384 | 1024 | 28763 | 22835 | 16891 | 20.60981 | 41.27525 |
| 64 | 512 | 768 | 28756 | 22853 | 18560 | 20.52789 | 35.45695 |
| 64 | 512 | 1024 | 41267 | 30417 | 22819 | 26.29219 | 44.704 |
| 128 | 128 | 768 | 14873 | 12631 | 9623 | 15.0743 | 35.29886 |
| 128 | 128 | 1024 | 19855 | 16052 | 11423 | 19.15387 | 42.46789 |
| 128 | 384 | 768 | 42633 | 32947 | 27537 | 22.71949 | 35.40919 |
| 128 | 384 | 1024 | 57361 | 42337 | 33028 | 26.19201 | 42.42081 |
| 128 | 512 | 768 | 56512 | 43079 | 36681 | 23.77017 | 35.09166 |
| 128 | 512 | 1024 | 81459 | 57703 | 44867 | 29.16314 | 44.92076 |

@zhangyaobit (Contributor):

Thinking about this a little bit, it looks like the use of aligned_vector could be straightforward. We would use the aligned vector only for the global-memory read/write parts (much like what we did for FastGelu); the reduction part could stay unchanged.

The tail handling should be similar to that of FastGelu as well. (The reduction part is unchanged/not affected.)

@hubertlu-tw changed the title from "Leverage half2 vectorized load/write for SkipLayerNorm" to "Leverage vectorized load/write for SkipLayerNorm" on Jun 30, 2022

@zhangyaobit previously approved these changes on Jul 5, 2022
@zhangyaobit (Contributor):

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline

@zhangyaobit (Contributor):

/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@azure-pipelines:

Azure Pipelines successfully started running 6 pipeline(s).

@azure-pipelines:

Azure Pipelines successfully started running 10 pipeline(s).


@zhangyaobit zhangyaobit merged commit 835ecb2 into microsoft:master Jul 6, 2022
@centwang centwang mentioned this pull request Aug 11, 2022
tianleiwu added a commit that referenced this pull request Oct 16, 2023
Fix a bug in #11803:
When the hidden size is not exactly the same as the next supported size (for example, ld=320 in stable diffusion), the current vectorized kernel might read out of bounds and cause a CUDA failure.

Also resolved another issue: for the first and last sizes, the current macro generates some dead code (branches that can never run). Here we change it to avoid those branches for the boundary sizes.

Performance tests with stable diffusion show that performance is on par before and after this fix.
jchen351 pushed a commit that referenced this pull request Oct 18, 2023
tianleiwu added a commit that referenced this pull request Oct 31, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024
…oft#17943)
