[SYCL] Poor Performance - Very Low bandwidth of SYCL kernel for LRN #8292
Comments
I have a question: what is the performance of the CUDA kernel for LRN?
CUDA Performance
For N = 2, C = 15, D = 10, H = 16, W = 16, size = 5, ndims = 5,
Result,
For N = 6, C = 150, D = 100, H = 160, W = 160, size = 5, ndims = 5,
Result,
For N = 2, C = 150, D = 100, H = 160, W = 160, size = 5, ndims = 5,
Result,
For N = 5, C = 150, D = 100, H = 160, W = 160, size = 5, ndims = 5,
Result,
To compile,
We are attaching the source code below:
I rewrote parts of your host programs to change the data types of some variables and the grid size for the CUDA kernel. I then commented out all the code related to lrn_fwd and evaluated only the performance of lrn_bwd. The optimization option is "-O3" for both compilers. For the large problem size you mentioned, the global work size is larger than the largest value an int can represent, so I think the option "-fno-sycl-id-queries-fit-in-int" is needed. I observe that the SYCL kernel "lrn_bwd_kernel" takes 9.5 s and the CUDA kernel takes 0.6 s on an NV100 GPU. The performance gap is significant. The CUDA kernel uses 64 registers and the SYCL kernel uses 80 registers.
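To illustrate the point about the global work size, here is a minimal sketch that checks whether a range fits in a 32-bit int. It assumes one work-item per tensor element, which is a guess about the reproducer (the real global size may also be rounded up to a multiple of the work-group size). The helper names `element_count` and `fits_in_int` are hypothetical and do not come from lrn.cpp.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Number of tensor elements, computed in 64 bits so the product cannot wrap.
// Assumption: the kernel launches one work-item per element.
std::int64_t element_count(std::int64_t n, std::int64_t c, std::int64_t d,
                           std::int64_t h, std::int64_t w) {
    assert(n > 0 && c > 0 && d > 0 && h > 0 && w > 0);
    return n * c * d * h * w;
}

// True when a range of this many work-items can still be indexed with int.
bool fits_in_int(std::int64_t work_items) {
    return work_items <= INT_MAX;
}
```

Under that assumption, the N = 6 case has 2,304,000,000 elements, which is above INT_MAX (2,147,483,647), while the N = 2 and N = 5 cases (768,000,000 and 1,920,000,000 elements) still fit.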
As you noted, the performance gap between the SYCL and CUDA kernels is significant. Do you have any suggestions for reducing the execution time of the SYCL kernel to match the CUDA version?
Is a license needed for your original example? Having observed a performance gap between the CUDA and HIP kernels, I would like to report the issue to ROCm too.
As far as I understand, no license is needed for the SYCL reproducer. Currently we are focusing only on improving the performance of the SYCL kernel to achieve higher bandwidth.
I have a question about "lrn_bwd_kernel". When the value of "across_channel" is one, is the value of "channel" also expected to be one? The kernel consumes many registers, so it could be split into two kernels, with the boolean value of "channel" selecting which one is launched.
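One possible way to do such a split, sketched under assumptions: make the flag a template parameter so that each instantiation contains only one code path, and pick the instantiation on the host. The names `lrn_bwd_variant` and `launch_lrn_bwd` are hypothetical, the returned labels stand in for the real kernel bodies in lrn_kernels.hpp, and reading the flag as "across-channel versus within-channel" is an assumption rather than something confirmed in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the body of lrn_bwd_kernel. A real version would
// hold the kernel code; here each variant only reports which path it contains.
template <bool AcrossChannels>
std::string lrn_bwd_variant() {
    // AcrossChannels is a compile-time constant, so the compiler can drop the
    // untaken branch from each instantiation.
    if (AcrossChannels) {
        return "across-channel";
    }
    return "within-channel";
}

// Host-side dispatch: the runtime flag selects one of two specialised variants.
std::string launch_lrn_bwd(int channel_flag) {
    assert(channel_flag == 0 || channel_flag == 1);
    return channel_flag ? lrn_bwd_variant<true>() : lrn_bwd_variant<false>();
}
```

Whether this actually lowers register usage would need to be checked against the compiler's resource report for each variant.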
Describe the bug
This reproducer was created to help improve the performance, that is, to achieve higher memory bandwidth, of the SYCL implementation of the LRN primitive on Nvidia GPUs.
It runs the LRN algorithm for forward propagation and calculates the memory bandwidth, then does the same for backward propagation.
Reproduce
For the reproducer code, refer to the attachments setup.sh, lrn.cpp and lrn_kernels.hpp.
Go to the directory containing the reproducer files and run the script below to set up the environment:
source setup.sh
To compile, run:
clang++ -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda lrn.cpp
The command above generates the output file a.out. To see the measured bandwidth, run:
./a.out
Observed behaviour
For, N = 2,C = 15,D = 10,H = 16,W = 16, size = 5, ndims = 5,
Propagation : Forward
Alg : LRN
Result,
Total time = 0.000588sec
Total bandwidth = 1.044898 Gb/s
For, N = 2,C = 150,D = 100,H = 160,W = 160, size = 5, ndims = 5,
Propagation : Backward
Alg : LRN
Result,
Total time = 0.688375 sec
Total bandwidth = 8.925368 Gb/s
Expected behavior
Ideally, the kernel should attain the maximum bandwidth (the CLPeak value for Float16 is 251.94 GB/s) for any input size.
For the current reproducer, a higher bandwidth is expected: at least 100 GB/s.
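As a rough cross-check of the reported figures, the sketch below computes effective bandwidth as bytes moved divided by elapsed time. The value of 8 bytes moved per tensor element is inferred from the numbers under "Observed behaviour" and not taken from lrn.cpp, and the printed "Gb/s" is treated as 10^9 bytes per second; both are assumptions. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Effective bandwidth in GB/s (10^9 bytes per second).
// bytes_per_element is an assumption about how much data the kernel moves
// per tensor element; 8 reproduces the figures printed by the reproducer.
double effective_bandwidth_gbs(std::int64_t elements, double bytes_per_element,
                               double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(elements) * bytes_per_element / seconds / 1e9;
}

// Achieved bandwidth as a fraction of a reference value (target or peak).
double fraction_of(double achieved_gbs, double reference_gbs) {
    assert(reference_gbs > 0.0);
    return achieved_gbs / reference_gbs;
}
```

With these assumptions, the forward case (76,800 elements, 0.000588 s) gives about 1.0449 GB/s and the backward case (768,000,000 elements, 0.688375 s) gives about 8.9254 GB/s, matching the printed values. The backward result is then roughly 9% of the 100 GB/s target and roughly 3.5% of the 251.94 GB/s peak.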
Environment
- OS: Ubuntu 22.04.1 LTS
- Target device and vendor: Nvidia, Tesla T4
- DPC++ version: clang version 15.0.0 (https://github.com/intel/llvm.git 0c7a1e1)
- Dependencies version: Driver Version: 495.29.05, CUDA Version: 11.5
Additional context
Currently N, C, D, H, W, size and ndims are hard-coded; they can be changed as needed.
The source code of the reproducer is attached below:
LRN_Reproducer.zip