Replacing CudaAsyncBuffer with TArray to improve perf #3303
Conversation
" On BERT-L training, we saw about 3-5% perf improvement." I assume it's for non-fused model right? The change does not seem to touch contrib ops |
onnxruntime/core/providers/cuda/object_detection/non_max_suppression.cc
In general, I have concerns about regressions in this change that are not caught by the existing tests. I have added comments on some ops where the fixed-size TArray may overflow.
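For illustration, here is a minimal sketch of the overflow concern; this is a hypothetical definition, not the actual onnxruntime TArray. Because the storage is fixed at compile time, any op whose rank or metadata size can exceed the capacity needs a guard or a fallback path:

```cuda
#include <cassert>
#include <cstdint>

// Hypothetical fixed-capacity array meant to be passed by value as a
// CUDA kernel argument. Capacity is fixed at compile time, so a size
// larger than Capacity cannot be represented -- that is the overflow risk.
template <typename T, int32_t Capacity = 8>
struct FixedTArray {
  void SetSize(int32_t size) {
    assert(size <= Capacity);  // would overflow the fixed storage otherwise
    size_ = size;
  }
  __host__ __device__ T& operator[](int32_t i) { return data_[i]; }
  T data_[Capacity];
  int32_t size_ = 0;
};
```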
Yes. In BERT-L, we only fused Layernorm, GELU, and Transpose+Matmul.
Agreed. I just updated the PR to keep CudaAsyncBuffer for non_max_suppression, cudnn_rnn_base, concat, and split. Thanks!
Description
Use TArray instead of CudaAsyncBuffer to pass tensor metadata (mostly tensor strides/pitches) to CUDA kernels, to improve performance.
Motivation and Context
Basically, with CudaAsyncBuffer, the sequence on the host side is roughly:
1. Prepare the tensor metadata in a pinned host buffer.
2. Queue an async copy of that buffer from host to device on the compute stream.
3. Launch the kernel, which reads the metadata from device memory.
On the device (GPU) side, the copy in #2 won't actually execute until the kernel in #3 is about to run. The host side usually runs far ahead of the device side, so it is better to upload the data early (at kernel launch time) from the host rather than waiting until kernel execution, which saves time on the GPU. To achieve this, we introduced TArray, which passes the data to CUDA kernels by value. On BERT-L training, we saw about a 3-5% perf improvement.
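For concreteness, here is a hedged sketch contrasting the two approaches; the kernel and launcher names and the `StridesTArray` struct are illustrative assumptions, not the actual onnxruntime signatures. The key point is that a by-value kernel argument travels inside the launch command itself, while a device pointer requires a separate host-to-device copy that the GPU executes just before the kernel:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

constexpr int32_t kMaxDims = 8;

// Fixed-size metadata embedded directly in the kernel's argument buffer.
struct StridesTArray {
  int64_t data[kMaxDims];
  int32_t size;
};

// (a) CudaAsyncBuffer style: the kernel reads metadata through a device
//     pointer, so the H2D copy (step #2 above) must run on the GPU first.
__global__ void KernelWithDevicePtr(const int64_t* strides, int32_t rank) {
  // ... use strides[0..rank) ...
}

// (b) TArray style: metadata is filled in on the host and shipped to the
//     GPU as part of the launch command itself -- no separate copy.
__global__ void KernelWithTArray(StridesTArray strides) {
  // ... use strides.data[0..strides.size) ...
}

void Launch(cudaStream_t stream, const int64_t* host_strides, int32_t rank,
            int64_t* device_scratch /* preallocated device memory */) {
  // (a) queue the metadata copy, then the kernel, on the same stream:
  cudaMemcpyAsync(device_scratch, host_strides, rank * sizeof(int64_t),
                  cudaMemcpyHostToDevice, stream);
  KernelWithDevicePtr<<<1, 256, 0, stream>>>(device_scratch, rank);

  // (b) pass by value: the upload happens at launch time on the host side.
  StridesTArray t{};
  t.size = rank;
  for (int32_t i = 0; i < rank; ++i) t.data[i] = host_strides[i];
  KernelWithTArray<<<1, 256, 0, stream>>>(t);
}
```

The trade-off, as noted in the review above, is that the by-value struct has a fixed capacity, which is why some ops were kept on CudaAsyncBuffer.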