
Workaround for the failure of ConvHipImplicitGemmV4R4GenWrWXdlops #409

Merged: 1 commit merged into develop from workaround_issue_2532 on Sep 2, 2020

Conversation

@zjing14 (Contributor) commented Aug 31, 2020

Resolves issue #406

@zjing14 (Contributor, Author) commented Aug 31, 2020

Tested with the configs from the issue. No performance drop after the workaround, and ConvHipImplicitGemmV4R4GenWrWXdlops is still the fastest solver.

@atamazov (Contributor) left a comment

LGTM

@@ -23,6 +23,8 @@ MIOPEN_DECLARE_ENV_VAR(MIOPEN_DEBUG_CONV_IMPLICIT_GEMM_BLOCK_SYNC_LDS_WITHOUT_SY
// LLVM xdlops intrinsic will do unnecessary VGPR <--> AGPR movement and result in
// register spill, for the bfloat16 datatype, when doing wave-wise GEMM larger than 64x64
#define WORKAROUND_SWDEV_240356 1
// workaround failure of ConvHipImplicitGemmV4R4GenWrWXdlops with vector load
#define WORKAROUND_ISSUE_2532 1
A review comment from a Contributor on the added lines:

[Recommendation] Used only once in a .cpp? Define it right there.
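
A minimal sketch of what this recommendation could look like, assuming the macro is only referenced from the WrW xdlops solver's .cpp and that it guards the vector-load width there (the file and function names below are illustrative, not the actual MIOpen identifiers):

// conv_hip_implicit_gemm_v4r4_gen_xdlops_wrw.cpp (hypothetical file name)
// Workaround for the vector-load failure of ConvHipImplicitGemmV4R4GenWrWXdlops.
#define WORKAROUND_ISSUE_2532 1

// Pick the per-thread read width for the block copy (illustrative helper).
static int ChooseSrcDataPerRead(int tuned_width)
{
#if WORKAROUND_ISSUE_2532
    (void)tuned_width;
    return 1; // force scalar loads until the vector-load path is fixed
#else
    return tuned_width;
#endif
}

Defining the macro next to its only use keeps the workaround self-contained and makes it easy to remove once the solver is deprecated.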

@atamazov (Contributor) commented

Topmost comment & priority fixed.

@ekuznetsov139 commented

"No performance drop after workaround"

That sounds wrong to me. In my tests, I've seen performance drops as high as 50%.

@atamazov (Contributor) commented Sep 1, 2020

@ekuznetsov139 Perhaps re-tuning is required. @zjing14 What do you think?

@zjing14 (Contributor, Author) commented Sep 1, 2020

@ekuznetsov139 Could you post the config with the regression here? @atamazov No, I did not retune.

@ekuznetsov139 commented Sep 1, 2020

MIOpenDriver convfp16 -n 256 -c 512 -H 28 -W 28 -k 128 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1

Without the fix:
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv1x1u1, 256, 512, 28, 28, 1, 1, 128, 26306674688, 0, 0, 29434, 0, 0.893762
Backward Convolution Weights Failed: 0.241877 > 0.082

With the fix:
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv1x1u1, 256, 512, 28, 28, 1, 1, 128, 26306674688, 0, 0, 14505, 0, 1.813586
Backward Convolution Weights Verifies on CPU and GPU (4.11355e-05)

The exact algorithm is
runcl gridwise_convolution_implicit_gemm_v4r4_gen_xdlops_nchw_kcyx_nkhw_lds_double_buffer.cpp -k gridwise_convolution_implicit_gemm_v4r4_gen_xdlops_nchw_kcyx_nkhw_lds_double_buffer -dumpilisa -r 10 if#0: if#0: if#0: iv#0 65536,1,1/256,1,1 -DCK_PARAM_PROBLEM_K=128 -DCK_PARAM_PROBLEM_C=512 -DCK_PARAM_PROBLEM_HI=28 -DCK_PARAM_PROBLEM_WI=28 -DCK_PARAM_PROBLEM_HO=28 -DCK_PARAM_PROBLEM_WO=28 -DCK_PARAM_PROBLEM_CONV_DIRECTION_FORWARD=0 -DCK_PARAM_PROBLEM_CONV_DIRECTION_BACKWARD_DATA=0 -DCK_PARAM_PROBLEM_CONV_DIRECTION_BACKWARD_WEIGHT=1 -std=c++14 -DCK_PARAM_PROBLEM_DIRECTION=2 -DCK_PARAM_PROBLEM_N=256 -DCK_PARAM_PROBLEM_Y=1 -DCK_PARAM_PROBLEM_X=1 -DCK_PARAM_PROBLEM_CONV_STRIDE_H=1 -DCK_PARAM_PROBLEM_CONV_STRIDE_W=1 -DCK_PARAM_PROBLEM_CONV_DILATION_H=1 -DCK_PARAM_PROBLEM_CONV_DILATION_W=1 -DCK_PARAM_PROBLEM_LEFT_PAD_H=0 -DCK_PARAM_PROBLEM_LEFT_PAD_W=0 -DCK_PARAM_PROBLEM_RIGHT_PAD_H=0 -DCK_PARAM_PROBLEM_RIGHT_PAD_W=0 -DCK_PARAM_PROBLEM_CONV_GROUP_COUNTS=1 -DCK_PARAM_TUNABLE_BLOCK_SIZE=256 -DCK_PARAM_TUNABLE_GEMM_N_PER_BLOCK=128 -DCK_PARAM_TUNABLE_GEMM_M_PER_BLOCK=128 -DCK_PARAM_TUNABLE_GEMM_K_PER_BLOCK=8 -DCK_PARAM_TUNABLE_GEMM_K_BLOCKS=64 -DCK_PARAM_DEPENDENT_GRID_SIZE=256 -DCK_PARAM_GEMM_M_PER_WAVE=64 -DCK_PARAM_GEMM_N_PER_WAVE=64 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_K=8 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_N=32 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_K=2 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_M=128 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_SRC_DATA_PER_READ_GEMM_N=1 -DCK_PARAM_GEMM_KPACK_LENGTH=4 -DCK_USE_AMD_XDLOPS=1 -DCK_USE_AMD_XDLOPS_INLINE_ASM=0 -DCK_USE_AMD_XDLOPS_EMULATE=0 -DMIOPEN_USE_FP16=1 -DMIOPEN_USE_FP32=0 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_DST_DATA_PER_WRITE_GEMM_KPACK=4 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_DST_DATA_PER_WRITE_GEMM_KPACK=4 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_SRC_DATA_PER_READ_GEMM_K=4
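
As a rough cross-check of these numbers, assuming the GFLOPs column is simply flopCnt divided by elapsed time:

26306674688 FLOP / 0.893762 ms ≈ 29434 GFLOP/s (without the fix)
26306674688 FLOP / 1.813586 ms ≈ 14505 GFLOP/s (with the fix)

So the workaround roughly halves throughput for this config, consistent with the ~50% drop mentioned above; note, however, that the run without the fix fails verification.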

@zjing14 (Contributor, Author) commented Sep 2, 2020

@ekuznetsov139 Thanks. Sorry, I did not compare the performance of the failed configs, since I did not think it would be a fair comparison. Yes, that is a huge performance degradation.

@daniellowell (Contributor) commented

@zjing14 So, this is a WIP again?

@zjing14 (Contributor, Author) commented Sep 2, 2020

@zjing14 So, this is a WIP again?

No, the PR is ready.

@daniellowell (Contributor) commented

@zjing14 What is the plan to reduce the impact of the performance regression? How widespread is this regression?

@zjing14 (Contributor, Author) commented Sep 2, 2020

@daniellowell This solver will be deprecated after the new wrw solver is merged in, so the impact is temporary.

@daniellowell merged commit c825731 into develop on Sep 2, 2020
@atamazov (Contributor) commented Sep 3, 2020

@zjing14

@ekuznetsov139 Thanks. Sorry, I did not compare the performance of the failed configs, since I did not think it would be a fair comparison. Yes, that is a huge performance degradation.

You were correct. The notion of performance is not applicable to the kernels that produce wrong outputs.

However, the find-db currently contains overly optimistic information about ConvHipImplicitGemmV4R4GenWrWXdlops, especially where its records say that this solver wins. That might lead to performance drops, unless ConvHipImplicitGemmV4R4GenWrWXdlops remains the winner (or at least stays on par with the real winner) even after the workaround is applied. Does it?

@zjing14 deleted the workaround_issue_2532 branch on September 17, 2020 at 15:40
@junliume mentioned this pull request on Nov 18, 2021