3x3 convolutions performance problem #19
Comments
@Hex000 Thanks for contacting us. While I go and run your test code and understand the issue, can you please tell me about your system configuration?
How was MIOpen installed? Was it built from source?
@Hex000 I am fairly certain something is incorrect with your environment. On my Vega GPU I see times quite similar to an NVIDIA 1080 Ti for the particular layer you mentioned above. Can you please let me know the exact model of your GPU as well?
@dagamayank, thank you! The GPU is an AMD RX Vega 64.
hipconfig output:
MIOpen installation info:
Driver info:
@dagamayank We tried to run with the driver provided with kernel 4.11.0-kfd-compute-rocm-rel-1.6-148 from http://repo.radeon.com/rocm and got around 120 µs.
@Hex000 @prostowww Please consider the performance numbers that I shared as "dev-preview". MIOpen requires some improvements in the base software stack, which are planned to be released within the next two weeks as part of ROCm 1.6.4. Your systems are currently configured with ROCm 1.6.3 (the last public release), which may be the cause of the poor performance you are seeing. I highly recommend not mixing the AMDGPU-PRO and ROCm software stacks on the same system. As an experiment, can you please try setting this env. variable and check the performance again? I will ping you once ROCm 1.6.4 is public.
@Hex000 just curious, do you have both CUDA and ROCm installed on the same machine? Have you experienced any issues?
Dear friends, thank you for the answers! No, CUDA and ROCm are on different machines, so this was not tested. One more complaint from me as a user: warping nine-dimensional space for 30 minutes is too tough.
Just wanted to let you know that we released ROCm 1.6.4 last week along with MIOpen v1.1.4, if you are not already aware. Can you please try updating your systems and running the performance experiment again? For performance measurements, I would first like you to set an additional parameter for now.
The above is currently a workaround for an issue with page migration in ROCm; this will be fixed soon.
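For illustration only, such workarounds are applied through an environment variable that has to be in place before the ROCm runtime initializes. The variable name below is a placeholder, since the actual name is not reproduced in this thread; it can equally be exported in the shell before launching the benchmark.

```cpp
#include <cstdlib>           // setenv (POSIX)
#include <miopen/miopen.h>

int main() {
    // Placeholder name only; the actual ROCm workaround variable referred
    // to above is not reproduced in this thread. The override must be set
    // before the first HIP/MIOpen call so the runtime sees it at startup.
    setenv("ROCM_PAGE_MIGRATION_WORKAROUND", "0", /*overwrite=*/1);

    miopenHandle_t handle;
    miopenCreate(&handle);   // ROCm/MIOpen initialize after the override
    // ... run the convolution benchmark as usual ...
    miopenDestroy(handle);
    return 0;
}
```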
dagamayank, thank you!
@Hex000 FYI - there is a new release of both ROCm and MIOpen, so you may want to benchmark again. For now, I am closing this issue. Please create a new issue if you have questions or notice discrepancies.
Hi, dear friends!
We are investigating MIOpen's potential for use in our deep learning applications, but ResNet-52 shows strangely long times for layers with 3x3 convolutions (batch size 64).
For 1x1 convolutions the AMD Vega often performs on a par with a GTX 1080 Ti, but for 3x3 convolutions the time is up to 10 times longer.
For example,
{28, 28, 128} -> {28, 28, 128} by filter {3, 3} with stride {1, 1}:
NVIDIA 1080 Ti: 13.6 µs per image (73470 ips)
AMD RX Vega: 127.2 µs per image (7862 ips)
Performance for all layers (TSV): https://gist.github.com/hex000/de2aebf622d2120fb6a57c42a0c7d90e
Here the time is given per image (the reciprocal of the ips figure). The test runs for a full second before hipDeviceSynchronize(), so the synchronization time is amortized.
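For reference, the timing loop has roughly the following shape. This is a simplified sketch rather than the exact gist code; `enqueue_conv` stands in for whatever callable launches one MIOpen forward convolution for the whole batch on the default HIP stream.

```cpp
#include <chrono>
#include <cstdio>
#include <hip/hip_runtime.h>

// Measures amortized per-image time for an asynchronously enqueued op.
template <typename EnqueueConv>
void measure_per_image_time(EnqueueConv&& enqueue_conv, int batch_size) {
    enqueue_conv();                  // warm-up: keep Find/compile cost out
    (void)hipDeviceSynchronize();

    // Keep enqueuing work for roughly one second, then synchronize once,
    // so the cost of hipDeviceSynchronize() is amortized over many batches.
    int iterations = 0;
    const auto start = std::chrono::steady_clock::now();
    do {
        enqueue_conv();
        ++iterations;
    } while (std::chrono::steady_clock::now() - start < std::chrono::seconds(1));
    (void)hipDeviceSynchronize();

    const double elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("%.1f us/image (%.0f ips)\n",
                1e6 * elapsed / (iterations * batch_size),
                iterations * batch_size / elapsed);
}
```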
What can be wrong here? Maybe I haven't set some secret flag? The algorithm chosen by FindConvolution is always "miopenConvolutionFwdAlgoWinograd", and for such convolutions it offers no other options.
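For context, the algorithm query looks roughly like the sketch below (simplified, not the exact gist code; the tensor/convolution descriptors, device buffers, and workspace are assumed to be created elsewhere).

```cpp
#include <cstdio>
#include <miopen/miopen.h>

// Ask MIOpen which forward algorithms it offers for an already-configured
// convolution, and print the measured time of each candidate.
void list_forward_algorithms(miopenHandle_t handle,
                             miopenTensorDescriptor_t xDesc, const void* x,
                             miopenTensorDescriptor_t wDesc, const void* w,
                             miopenConvolutionDescriptor_t convDesc,
                             miopenTensorDescriptor_t yDesc, void* y,
                             void* workspace, size_t workspaceSize) {
    const int requested = 4;
    int returned = 0;
    miopenConvAlgoPerf_t perf[requested];

    // exhaustiveSearch = true makes MIOpen benchmark all applicable kernels
    // instead of relying on its heuristics / perf-db entry.
    miopenStatus_t status = miopenFindConvolutionForwardAlgorithm(
        handle, xDesc, x, wDesc, w, convDesc, yDesc, y,
        requested, &returned, perf, workspace, workspaceSize,
        /*exhaustiveSearch=*/true);
    if (status != miopenStatusSuccess) {
        std::fprintf(stderr, "Find failed: %d\n", status);
        return;
    }

    for (int i = 0; i < returned; ++i) {
        std::printf("algo %d: %.3f ms, workspace %zu bytes\n",
                    static_cast<int>(perf[i].fwd_algo), perf[i].time,
                    perf[i].memory);
    }
}
```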
The test code is at https://gist.github.com/hex000/81fe99ae8c10f4384e64960275e91554; it compiles for both AMD and NVIDIA.