
Workaround for the failure of ConvHipImplicitGemmV4R4GenWrWXdlops #409

Merged: 1 commit merged into develop from workaround_issue_2532 on Sep 2, 2020

Conversation

@zjing14 (Contributor) commented Aug 31, 2020

Resolves issue #406

@zjing14 (Contributor, Author) commented Aug 31, 2020

Tested with the configs from the issue. No performance drop after the workaround, and ConvHipImplicitGemmV4R4GenWrWXdlops is still the fastest solver.

@atamazov (Contributor) left a comment

LGTM

@@ -23,6 +23,8 @@ MIOPEN_DECLARE_ENV_VAR(MIOPEN_DEBUG_CONV_IMPLICIT_GEMM_BLOCK_SYNC_LDS_WITHOUT_SY
// LLVM xdlops intrinsic will do unnecessary VGPR <--> AGPR movement and result in
// register spill, for the bfloat16 datatype, when doing wave-wise GEMM larger than 64x64
#define WORKAROUND_SWDEV_240356 1
// workaround failure of ConvHipImplicitGemmV4R4GenWrWXdlops with vector load
#define WORKAROUND_ISSUE_2532 1
A review comment from a Contributor on the added lines:

[Recommendation] Used only once in a .cpp? Define it right there.
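
A minimal sketch of what this recommendation could look like, assuming the macro is only referenced from the WrW xdlops solver's .cpp and that it guards the vector-load width there (the file and function names below are illustrative, not the actual MIOpen identifiers):

// conv_hip_implicit_gemm_v4r4_gen_xdlops_wrw.cpp (hypothetical file name)
// Workaround for the vector-load failure of ConvHipImplicitGemmV4R4GenWrWXdlops.
#define WORKAROUND_ISSUE_2532 1

// Pick the per-thread read width for the block copy (illustrative helper).
static int ChooseSrcDataPerRead(int tuned_width)
{
#if WORKAROUND_ISSUE_2532
    (void)tuned_width;
    return 1; // force scalar loads until the vector-load path is fixed
#else
    return tuned_width;
#endif
}

Defining the macro next to its only use keeps the workaround self-contained and makes it easy to remove once the solver is deprecated.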

@atamazov (Contributor) commented

Topmost comment & priority fixed.

@ekuznetsov139 commented

"No performance drop after workaround"

That sounds wrong to me. In my tests, I've seen performance drops as high as 50%.

@atamazov (Contributor) commented Sep 1, 2020

@ekuznetsov139 Perhaps re-tuning is required. @zjing14 What do you think?

@zjing14 (Contributor, Author) commented Sep 1, 2020

@ekuznetsov139 Could you post the config with the regression here? @atamazov No, I did not retune.

@ekuznetsov139 commented Sep 1, 2020

MIOpenDriver convfp16 -n 256 -c 512 -H 28 -W 28 -k 128 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1

Without the fix:
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv1x1u1, 256, 512, 28, 28, 1, 1, 128, 26306674688, 0, 0, 29434, 0, 0.893762
Backward Convolution Weights Failed: 0.241877 > 0.082

With the fix:
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv1x1u1, 256, 512, 28, 28, 1, 1, 128, 26306674688, 0, 0, 14505, 0, 1.813586
Backward Convolution Weights Verifies on CPU and GPU (4.11355e-05)

The exact algorithm is
runcl gridwise_convolution_implicit_gemm_v4r4_gen_xdlops_nchw_kcyx_nkhw_lds_double_buffer.cpp -k gridwise_convolution_implicit_gemm_v4r4_gen_xdlops_nchw_kcyx_nkhw_lds_double_buffer -dumpilisa -r 10 if#0: if#0: if#0: iv#0 65536,1,1/256,1,1 -DCK_PARAM_PROBLEM_K=128 -DCK_PARAM_PROBLEM_C=512 -DCK_PARAM_PROBLEM_HI=28 -DCK_PARAM_PROBLEM_WI=28 -DCK_PARAM_PROBLEM_HO=28 -DCK_PARAM_PROBLEM_WO=28 -DCK_PARAM_PROBLEM_CONV_DIRECTION_FORWARD=0 -DCK_PARAM_PROBLEM_CONV_DIRECTION_BACKWARD_DATA=0 -DCK_PARAM_PROBLEM_CONV_DIRECTION_BACKWARD_WEIGHT=1 -std=c++14 -DCK_PARAM_PROBLEM_DIRECTION=2 -DCK_PARAM_PROBLEM_N=256 -DCK_PARAM_PROBLEM_Y=1 -DCK_PARAM_PROBLEM_X=1 -DCK_PARAM_PROBLEM_CONV_STRIDE_H=1 -DCK_PARAM_PROBLEM_CONV_STRIDE_W=1 -DCK_PARAM_PROBLEM_CONV_DILATION_H=1 -DCK_PARAM_PROBLEM_CONV_DILATION_W=1 -DCK_PARAM_PROBLEM_LEFT_PAD_H=0 -DCK_PARAM_PROBLEM_LEFT_PAD_W=0 -DCK_PARAM_PROBLEM_RIGHT_PAD_H=0 -DCK_PARAM_PROBLEM_RIGHT_PAD_W=0 -DCK_PARAM_PROBLEM_CONV_GROUP_COUNTS=1 -DCK_PARAM_TUNABLE_BLOCK_SIZE=256 -DCK_PARAM_TUNABLE_GEMM_N_PER_BLOCK=128 -DCK_PARAM_TUNABLE_GEMM_M_PER_BLOCK=128 -DCK_PARAM_TUNABLE_GEMM_K_PER_BLOCK=8 -DCK_PARAM_TUNABLE_GEMM_K_BLOCKS=64 -DCK_PARAM_DEPENDENT_GRID_SIZE=256 -DCK_PARAM_GEMM_M_PER_WAVE=64 -DCK_PARAM_GEMM_N_PER_WAVE=64 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_K=8 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_N=32 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_K=2 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_M=128 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_SRC_DATA_PER_READ_GEMM_N=1 -DCK_PARAM_GEMM_KPACK_LENGTH=4 -DCK_USE_AMD_XDLOPS=1 -DCK_USE_AMD_XDLOPS_INLINE_ASM=0 -DCK_USE_AMD_XDLOPS_EMULATE=0 -DMIOPEN_USE_FP16=1 -DMIOPEN_USE_FP32=0 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_DST_DATA_PER_WRITE_GEMM_KPACK=4 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_DST_DATA_PER_WRITE_GEMM_KPACK=4 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_SRC_DATA_PER_READ_GEMM_K=4
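
As a rough cross-check of these numbers, assuming the GFLOPs column is simply flopCnt divided by elapsed time:

26306674688 FLOP / 0.893762 ms ≈ 29434 GFLOP/s (without the fix)
26306674688 FLOP / 1.813586 ms ≈ 14505 GFLOP/s (with the fix)

So the workaround roughly halves throughput for this config, consistent with the ~50% drop mentioned above; note, however, that the run without the fix fails verification.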

@zjing14 (Contributor, Author) commented Sep 2, 2020

@ekuznetsov139 Thanks. Sorry, I did not compare the performance of the failed configs, since I did not think it would be a fair comparison. Yes, that is a huge performance degradation.

@daniellowell (Contributor) commented

@zjing14 So, this is a WIP again?

@zjing14 (Contributor, Author) commented Sep 2, 2020

@zjing14 So, this is a WIP again?

No, the PR is ready.

@daniellowell (Contributor) commented

@zjing14 What is the plan to reduce the impact of the performance regression? How widespread is this regression?

@zjing14 (Contributor, Author) commented Sep 2, 2020

@daniellowell This solver will be deprecated after the new wrw solver is merged in, so the impact is temporary.

@daniellowell merged commit c825731 into develop on Sep 2, 2020
@atamazov (Contributor) commented Sep 3, 2020

@zjing14

@ekuznetsov139 Thanks. Sorry, I did not compare the performance of the failed configs, since I did not think it would be a fair comparison. Yes, that is a huge performance degradation.

You were correct. The notion of performance is not applicable to the kernels that produce wrong outputs.

However, the find-db currently contains overly optimistic information about ConvHipImplicitGemmV4R4GenWrWXdlops, especially where its records say that this solver wins. That might lead to performance drops, unless ConvHipImplicitGemmV4R4GenWrWXdlops remains the winner (or at least stays on par with the real winner) even after the workaround is applied. Does it?

@zjing14 deleted the workaround_issue_2532 branch on September 17, 2020 at 15:40
@junliume mentioned this pull request on Nov 18, 2021