Very slow 4x4 convolutions on gfx803 #134

Closed
huanzhang12 opened this issue Mar 30, 2020 · 4 comments

Comments

@huanzhang12

Since the ASM kernels were disabled on gfx803 in commit ce51a4c, 4x4 convolutions on gfx803 fall back to the very slow GEMM algorithm:

./bin/MIOpenDriver conv -n 256 -c 16 -H 28 -W 28 -k 16 -y 4 -x 4 -p 1 -q 1 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver: conv -n 256 -c 16 -H 28 -W 28 -k 16 -y 4 -x 4 -p 1 -q 1 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 0, Solution: 33/gemm
GPU Kernel Time Forward Conv. Elapsed: 28.185415 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv4x4u2, 256, 16, 14, 14, 4, 4, 16,  411041792, 12861440, 3211264, 15, 1, 28.185415
Forward Convolution Verifies on CPU and GPU (7.39356e-08)

Before ASM kernels were disabled, it was much faster:

MIOpenDriver: conv -n 256 -c 16 -H 28 -W 28 -k 16 -y 4 -x 4 -p 1 -q 1 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 3, Solution: 15/ConvBinWinogradRxS
GPU Kernel Time Forward Conv. Elapsed: 0.592643 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv4x4u2, 256, 16, 14, 14, 4, 4, 16,  411041792, 12861440, 3211264, 694, 27, 0.592643
Forward Convolution Verifies on CPU and GPU (4.98272e-08)

Performance drops from 694 GFLOPs to 15 GFLOPs, a slowdown of roughly 48x (0.59 ms vs. 28.19 ms per forward pass).
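As a sanity check on the two `stats:` lines above, the sketch below recomputes their fields from the problem size. It assumes the driver flags mean what their names suggest (`-p`/`-q` padding, `-u`/`-v` stride, `-l`/`-j` dilation), that tensors hold 4-byte floats, and that GFLOPs and GB/s are rounded to the nearest integer. None of this is taken from the MIOpen source; it is only inferred because it reproduces the figures in this thread.

```python
def conv_out(size, pad, stride, dilation, filt):
    """Output extent of a convolution along one spatial dimension."""
    return (size + 2 * pad - dilation * (filt - 1) - 1) // stride + 1

# Problem size from the MIOpenDriver command line (flag meanings assumed).
n, c, h, w = 256, 16, 28, 28      # -n -c -H -W
k, y, x = 16, 4, 4                # -k -y -x
pad, stride, dil = 1, 2, 1        # -p/-q, -u/-v, -l/-j

ho = conv_out(h, pad, stride, dil, y)
wo = conv_out(w, pad, stride, dil, x)

# One multiply and one add per filter tap, per output element.
flop_cnt = 2 * n * c * ho * wo * y * x * k
# Assumed fp32: input tensor plus weights are read, output tensor is written.
bytes_read = 4 * (n * c * h * w + k * c * y * x)
bytes_written = 4 * n * k * ho * wo

def rates(time_ms):
    """Return (GFLOPs, GB/s) for a given kernel time in milliseconds."""
    seconds = time_ms / 1e3
    return (flop_cnt / seconds / 1e9,
            (bytes_read + bytes_written) / seconds / 1e9)

for label, time_ms in [("gemm", 28.185415), ("ConvBinWinogradRxS", 0.592643)]:
    gflops, gbps = rates(time_ms)
    print(label, ho, wo, flop_cnt, bytes_read, bytes_written,
          round(gflops), round(gbps))
```

Under these assumptions the computed `flopCnt`, `bytesRead`, and `bytesWritten` match the driver output exactly, so the 15 vs. 694 GFLOPs gap comes entirely from the kernel time, not from a difference in how the work is counted.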

Why were all ASM kernels disabled for gfx803, rather than only the individual problematic ones?

Also, even without an ASM implementation, could a general OpenCL implementation be used in this case rather than relying on the extremely slow GEMM? (It seems conv_ocl_dir2Dfwd.cpp is not enabled for most 4x4 convolutions.)

@atamazov
Contributor

Which version of MIOpen are you using?

@huanzhang12
Author

@atamazov Thanks for asking! I am using the current master of this repository (commit 2fe2ae9).
I also tried the pre-compiled MIOpen shipped with ROCm 3.1.
To obtain the performance of the ASM kernels, I manually reverted commit ce51a4c on the current master.

@atamazov
Contributor

@huanzhang12 AFAICS you are using 2.2.0. Version 2.3.0 was just released. It includes c58488b, which should restore gfx8 performance. Please close this if the issue is resolved.

@huanzhang12
Author

@atamazov I tried the just-released version 2.3.0 and it is amazing! It is great news that ASM kernels are re-enabled on gfx803. The same 4x4 convolution now runs at 1684 GFLOPs:

./bin/MIOpenDriver conv -n 256 -c 16 -H 28 -W 28 -k 16 -y 4 -x 4 -p 1 -q 1 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 256 -c 16 -H 28 -W 28 -k 16 -y 4 -x 4 -p 1 -q 1 -u 2 -v 2 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 26/ConvHipImplicitGemmV4R1Fwd
GPU Kernel Time Forward Conv. Elapsed: 0.244142 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv4x4u2, 256, 16, 14, 14, 4, 4, 16,  411041792, 12861440, 3211264, 1684, 66, 0.244142
Forward Convolution Verifies on CPU and GPU (7.70828e-08)

My workload, which involves some 4x4 convolutions, runs 10 times faster on v2.3.0. Thank you so much for the hard work; I am closing this issue.
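For comparing runs like the three in this thread, the driver's `stats:` lines can be parsed directly. This is a minimal sketch, assuming the output format shown here (a `stats:` header row followed by value rows); `parse_stats` is a hypothetical helper written for illustration, not part of MIOpen.

```python
def parse_stats(text):
    """Return one dict per value row that follows a 'stats:' header row."""
    header, records = None, []
    for line in text.splitlines():
        if not line.startswith("stats:"):
            continue
        fields = [f.strip() for f in line[len("stats:"):].split(",")]
        if fields[0] == "name":        # header row: remember the column names
            header = fields
        elif header is not None:       # value row: pair values with the header
            records.append(dict(zip(header, fields)))
    return records

# The three value rows reported in this thread, in order of appearance.
log = """\
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv4x4u2, 256, 16, 14, 14, 4, 4, 16,  411041792, 12861440, 3211264, 15, 1, 28.185415
stats: fwd-conv4x4u2, 256, 16, 14, 14, 4, 4, 16,  411041792, 12861440, 3211264, 694, 27, 0.592643
stats: fwd-conv4x4u2, 256, 16, 14, 14, 4, 4, 16,  411041792, 12861440, 3211264, 1684, 66, 0.244142
"""

solutions = ["gemm", "ConvBinWinogradRxS", "ConvHipImplicitGemmV4R1Fwd"]
times = {name: float(rec["timeMs"])
         for name, rec in zip(solutions, parse_stats(log))}

baseline = times["ConvHipImplicitGemmV4R1Fwd"]
for name, time_ms in times.items():
    print(f"{name}: {time_ms:.3f} ms, {time_ms / baseline:.1f}x the v2.3.0 time")
```

These ratios describe this single convolution only; the 10x figure above refers to a whole workload in which 4x4 convolutions are just one part.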
