Replacing CudaAsyncBuffer with TArray to improve perf #3303
Conversation
" On BERT-L training, we saw about 3-5% perf improvement." I assume it's for non-fused model right? The change does not seem to touch contrib ops |
onnxruntime/core/providers/cuda/object_detection/non_max_suppression.cc
In general, I have concerns about regressions in this change that are not caught by the existing tests. I have added comments on some ops where the fixed-size TArray may overflow.
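For illustration, here is a minimal sketch of the overflow concern; this is a hypothetical definition, not the actual onnxruntime TArray. Because the storage is fixed at compile time, any op whose rank or metadata size can exceed the capacity needs a guard or a fallback path:

```cuda
#include <cassert>
#include <cstdint>

// Hypothetical fixed-capacity array meant to be passed by value as a
// CUDA kernel argument. Capacity is fixed at compile time, so a size
// larger than Capacity cannot be represented -- that is the overflow risk.
template <typename T, int32_t Capacity = 8>
struct FixedTArray {
  void SetSize(int32_t size) {
    assert(size <= Capacity);  // would overflow the fixed storage otherwise
    size_ = size;
  }
  __host__ __device__ T& operator[](int32_t i) { return data_[i]; }
  T data_[Capacity];
  int32_t size_ = 0;
};
```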
Yes. In BERT-L, we only fused Layernorm, GELU, and Transpose+Matmul.
Agreed. I just updated the PR to keep CudaAsyncBuffer for non_max_suppression, cudnn_rnn_base, concat, and split. Thanks!
Description
Use TArray instead of CudaAsyncBuffer to pass tensor metadata (mostly tensor strides/pitches) to CUDA kernels, to improve performance.
Motivation and Context
Basically, with CudaAsyncBuffer, the sequence on the host side is roughly:
1. Prepare the tensor metadata in a pinned host buffer.
2. Queue an async copy of that buffer from host to device on the compute stream.
3. Launch the kernel, which reads the metadata from device memory.
On the device (GPU) side, the copy in #2 won't actually execute until the kernel in #3 is about to run. The host side usually runs far ahead of the device side, so it is better to upload the data early (at kernel launch time) from the host rather than waiting until kernel execution, which saves time on the GPU. To achieve this, we introduced TArray, which passes the data to CUDA kernels by value. On BERT-L training, we saw about a 3-5% perf improvement.
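For concreteness, here is a hedged sketch contrasting the two approaches; the kernel and launcher names and the `StridesTArray` struct are illustrative assumptions, not the actual onnxruntime signatures. The key point is that a by-value kernel argument travels inside the launch command itself, while a device pointer requires a separate host-to-device copy that the GPU executes just before the kernel:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

constexpr int32_t kMaxDims = 8;

// Fixed-size metadata embedded directly in the kernel's argument buffer.
struct StridesTArray {
  int64_t data[kMaxDims];
  int32_t size;
};

// (a) CudaAsyncBuffer style: the kernel reads metadata through a device
//     pointer, so the H2D copy (step #2 above) must run on the GPU first.
__global__ void KernelWithDevicePtr(const int64_t* strides, int32_t rank) {
  // ... use strides[0..rank) ...
}

// (b) TArray style: metadata is filled in on the host and shipped to the
//     GPU as part of the launch command itself -- no separate copy.
__global__ void KernelWithTArray(StridesTArray strides) {
  // ... use strides.data[0..strides.size) ...
}

void Launch(cudaStream_t stream, const int64_t* host_strides, int32_t rank,
            int64_t* device_scratch /* preallocated device memory */) {
  // (a) queue the metadata copy, then the kernel, on the same stream:
  cudaMemcpyAsync(device_scratch, host_strides, rank * sizeof(int64_t),
                  cudaMemcpyHostToDevice, stream);
  KernelWithDevicePtr<<<1, 256, 0, stream>>>(device_scratch, rank);

  // (b) pass by value: the upload happens at launch time on the host side.
  StridesTArray t{};
  t.size = rank;
  for (int32_t i = 0; i < rank; ++i) t.data[i] = host_strides[i];
  KernelWithTArray<<<1, 256, 0, stream>>>(t);
}
```

The trade-off, as noted in the review above, is that the by-value struct has a fixed capacity, which is why some ops were kept on CudaAsyncBuffer.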