3x3 convolutions performance problem #19
Comments
@Hex000 Thanks for contacting us. While I go and run your test code and understand the issue, can you please tell me about your system configuration?
How was MIOpen installed? Was it built from source?
@Hex000 I am fairly certain something is incorrect with your environment. On my Vega GPU I see times quite similar to an NVIDIA 1080 Ti for the particular layer you mentioned above. Can you please let me know the exact model of your GPU as well?
@dagamayank, thank you! The GPU is an AMD RX Vega 64.
hipconfig output:
MIOpen installation info:
Driver info:
@dagamayank We tried to run with the driver provided with kernel 4.11.0-kfd-compute-rocm-rel-1.6-148 from http://repo.radeon.com/rocm and got around 120 µs.
@Hex000 @prostowww Please consider the performance numbers that I shared as "dev-preview". MIOpen requires some improvements in the base software stack, which are planned to be released within the next two weeks as part of ROCm 1.6.4. Your systems are currently configured with ROCm 1.6.3 (the last public release), which may be the cause of the poor performance you are seeing. I highly recommend not mixing the AMDGPU-PRO and ROCm software stacks on the same system. As an experiment, can you please try setting this env. variable and check the performance again? I will ping you once ROCm 1.6.4 is public.
@Hex000 just curious, do you have both CUDA and ROCm installed on the same machine? Have you experienced any issues?
Dear friends, thank you for the answers! No, CUDA and ROCm are on different machines, so this was not tested. One more complaint from me as a user: warping nine-dimensional space for 30 minutes is too tough.
Just wanted to let you know that we released ROCm 1.6.4 last week along with MIOpen v1.1.4, if you are not already aware. Can you please try updating your systems and running the performance experiment again? For performance measurements, I would first like you to set an additional parameter for now.
The above is currently a workaround for an issue with page migration in ROCm; this will be fixed soon.
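For illustration only, such workarounds are applied through an environment variable that has to be in place before the ROCm runtime initializes. The variable name below is a placeholder, since the actual name is not reproduced in this thread; it can equally be exported in the shell before launching the benchmark.

```cpp
#include <cstdlib>           // setenv (POSIX)
#include <miopen/miopen.h>

int main() {
    // Placeholder name only; the actual ROCm workaround variable referred
    // to above is not reproduced in this thread. The override must be set
    // before the first HIP/MIOpen call so the runtime sees it at startup.
    setenv("ROCM_PAGE_MIGRATION_WORKAROUND", "0", /*overwrite=*/1);

    miopenHandle_t handle;
    miopenCreate(&handle);   // ROCm/MIOpen initialize after the override
    // ... run the convolution benchmark as usual ...
    miopenDestroy(handle);
    return 0;
}
```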
dagamayank, thank you!
@Hex000 FYI - there is a new release of both ROCm and MIOpen, so you may want to benchmark again. For now, I am closing this issue. Please create a new issue if you have questions or notice discrepancies.
Hi, dear friends!
We are investigating MIOpen's potential for use in our deep learning applications, but ResNet-52 shows strangely long times for layers with 3x3 convolutions (batch size 64).
For 1x1 convolutions the AMD Vega often performs on a par with a GTX 1080 Ti, but for 3x3 convolutions the time is up to 10 times longer.
For example,
{28, 28, 128} -> {28, 28, 128} by filter {3, 3} with stride {1, 1}:
NVIDIA 1080 Ti: 13.6 µs per image (73470 ips)
AMD RX Vega: 127.2 µs per image (7862 ips)
Performance for all layers (TSV): https://gist.github.com/hex000/de2aebf622d2120fb6a57c42a0c7d90e
Here the time is given per image (the reciprocal of the ips figure). The test runs for a full second before hipDeviceSynchronize(), so the synchronization time is amortized.
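For reference, the timing loop has roughly the following shape. This is a simplified sketch rather than the exact gist code; `enqueue_conv` stands in for whatever callable launches one MIOpen forward convolution for the whole batch on the default HIP stream.

```cpp
#include <chrono>
#include <cstdio>
#include <hip/hip_runtime.h>

// Measures amortized per-image time for an asynchronously enqueued op.
template <typename EnqueueConv>
void measure_per_image_time(EnqueueConv&& enqueue_conv, int batch_size) {
    enqueue_conv();                  // warm-up: keep Find/compile cost out
    (void)hipDeviceSynchronize();

    // Keep enqueuing work for roughly one second, then synchronize once,
    // so the cost of hipDeviceSynchronize() is amortized over many batches.
    int iterations = 0;
    const auto start = std::chrono::steady_clock::now();
    do {
        enqueue_conv();
        ++iterations;
    } while (std::chrono::steady_clock::now() - start < std::chrono::seconds(1));
    (void)hipDeviceSynchronize();

    const double elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("%.1f us/image (%.0f ips)\n",
                1e6 * elapsed / (iterations * batch_size),
                iterations * batch_size / elapsed);
}
```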
What can be wrong here? Maybe I haven't set some secret flag? The algorithm chosen by FindConvolution is always "miopenConvolutionFwdAlgoWinograd", and for such convolutions it offers no other options.
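For context, the algorithm query looks roughly like the sketch below (simplified, not the exact gist code; the tensor/convolution descriptors, device buffers, and workspace are assumed to be created elsewhere).

```cpp
#include <cstdio>
#include <miopen/miopen.h>

// Ask MIOpen which forward algorithms it offers for an already-configured
// convolution, and print the measured time of each candidate.
void list_forward_algorithms(miopenHandle_t handle,
                             miopenTensorDescriptor_t xDesc, const void* x,
                             miopenTensorDescriptor_t wDesc, const void* w,
                             miopenConvolutionDescriptor_t convDesc,
                             miopenTensorDescriptor_t yDesc, void* y,
                             void* workspace, size_t workspaceSize) {
    const int requested = 4;
    int returned = 0;
    miopenConvAlgoPerf_t perf[requested];

    // exhaustiveSearch = true makes MIOpen benchmark all applicable kernels
    // instead of relying on its heuristics / perf-db entry.
    miopenStatus_t status = miopenFindConvolutionForwardAlgorithm(
        handle, xDesc, x, wDesc, w, convDesc, yDesc, y,
        requested, &returned, perf, workspace, workspaceSize,
        /*exhaustiveSearch=*/true);
    if (status != miopenStatusSuccess) {
        std::fprintf(stderr, "Find failed: %d\n", status);
        return;
    }

    for (int i = 0; i < returned; ++i) {
        std::printf("algo %d: %.3f ms, workspace %zu bytes\n",
                    static_cast<int>(perf[i].fwd_algo), perf[i].time,
                    perf[i].memory);
    }
}
```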
The test code is at https://gist.github.com/hex000/81fe99ae8c10f4384e64960275e91554; it compiles for both AMD and NVIDIA.