[BF16] ConvHipImplicitGemmBwdDataV1R1: Memory access faults. #309

Closed
atamazov opened this issue Jun 23, 2020 · 18 comments

@atamazov (Contributor)

Radeon VII, vanilla ROCm 3.5:

$ ./bin/MIOpenDriver convbfp16 -x 3 -y 3 -W 54 -H 54 -c 64 -n 8 -k 64 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -V 1 -s 0 -F 0 -w 2 -t 1 -i 6 -C /spb-hlc-fs2/1/users/atamazov/miopen-verify2 ...
Memory access fault by GPU node-2 (Agent handle: 0x2281ba0) on address 0x7f1690084000. Reason: Page not present or supervisor privilege.

$ ./bin/MIOpenDriver convbfp16 -x 5 -y 5 -W 350 -H 80 -c 64 -n 16 -k 128 -p 1 -q 1 -u 2 -v 2 -l 1 -j 1 -V 1 -s 0 -F 0 -w 2 -t 1 -i 6 -C /spb-hlc-fs2/1/users/atamazov/miopen-verify2 ...
Memory access fault by GPU node-2 (Agent handle: 0x184fba0) on address 0x7feba4004000. Reason: Page not present or supervisor privilege.
@TejashShah (Contributor) commented Jun 24, 2020

@atamazov Thanks for reporting the issue.

I was able to reproduce the issue on develop. One thing I noticed: it skipped all solvers that need workspace, including the bwd-data v1r1 igemm solver.

MIOpen(HIP): Info [EvaluateInvokers] Warning: skipping solver <ConvHipImplicitGemmBwdDataV1R1> due to no workspace provided (5971968 required)
MIOpen(HIP): Info [FindConvBwdDataAlgorithm] miopenConvolutionBwdDataAlgoDirect 0.801439        0
MIOpen(HIP): Info [FindConvBwdDataAlgorithm] BWD Chosen Algorithm: ConvOclDirectFwd , 0, 0.801439
MIOpen(HIP): Info [ConvolutionBackwardData] algo = 1, workspace = 0
MIOpen(HIP): Info [ConvolutionBackwardData] algo = 1, workspace = 0
MIOpen(HIP): Info [ConvolutionBackwardData] algo = 1, workspace = 0
MIOpen(HIP): Info [ConvolutionBackwardData] algo = 1, workspace = 0
MIOpen(HIP): Info [ConvolutionBackwardData] algo = 1, workspace = 0
MIOpen(HIP): Info [ConvolutionBackwardData] algo = 1, workspace = 0
Wall-clock Time Backward Data Conv. Elapsed: 0.889791 ms, Auxiliary API calls: 6939.21 ms (GWSS: 0.539062)
MIOpen Backward Data Conv. Algorithm: 1, Solution: 11/ConvOclDirectFwd
GPU Kernel Time Backward Data Conv. Elapsed: 0.765567 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdd-conv3x3u1, 8, 64, 3, 3, 64, 54, 54,  1719926784, 139264, 2985984, 2247, 4, 0.765567
MIOpen(HIP): Info [FindConvBwdWeightsAlgorithm] requestAlgoCount = 2, workspace = 14745600
MIOpen(HIP): Info [TryLoad] Find-db regenerating.
MIOpen(HIP): Info [FindSolutionImpl] ConvOclBwdWrW2NonTunable (not searchable)
MIOpen(HIP): Info [FindSolutionImpl] ConvOclBwdWrW53 (not searchable)
MIOpen(HIP): Info [EvaluateInvokers] Warning: skipping solver <ConvOclBwdWrW2NonTunable> due to no workspace provided (589824 required)
MIOpen(HIP): Info [EvaluateInvokers] Warning: skipping solver <ConvOclBwdWrW53> due to no workspace provided (589824 required)
MIOpen(HIP): Info [FindSolutionImpl] ConvWinograd3x3MultipassWrW<3-4> (not searchable)
MIOpen(HIP): Info [FindSolutionImpl] ConvWinograd3x3MultipassWrW<3-5> (not searchable)
MIOpen(HIP): Info [FindSolutionImpl] ConvWinograd3x3MultipassWrW<3-6> (not searchable)
Memory access fault by GPU node-2 (Agent handle: 0x3695de0) on address 0xf5000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

This raises a couple of questions:

  1. Why are solvers with a workspace requirement getting skipped?
  2. Which solver/kernel was executed that caused the seg fault?

@TejashShah (Contributor) commented Jun 24, 2020

I just ran with -F 2. The log suggests that the segfault comes from the SetTensor kernel, which initializes the workspace to 0 before the bwd-data v1r1 kernel runs:


MIOpen(HIP): Info [SystemCmd] cd /tmp/miopen-gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp-c8c2-5ec7-e2e4-2527;  /opt/rocm/llvm/bin/clang++  -std=c++14  -DCK_PARAM_PROBLEM_N=8 -DCK_PARAM_PROBLEM_K=64 -DCK_PARAM_PROBLEM_C=64 -DCK_PARAM_PROBLEM_HI=54 -DCK_PARAM_PROBLEM_WI=54 -DCK_PARAM_PROBLEM_HO=54 -DCK_PARAM_PROBLEM_WO=54 -DCK_PARAM_PROBLEM_Y=3 -DCK_PARAM_PROBLEM_X=3 -DCK_PARAM_PROBLEM_CONV_STRIDE_H=1 -DCK_PARAM_PROBLEM_CONV_STRIDE_W=1 -DCK_PARAM_PROBLEM_CONV_DILATION_H=1 -DCK_PARAM_PROBLEM_CONV_DILATION_W=1 -DCK_PARAM_PROBLEM_IN_LEFT_PAD_H=1 -DCK_PARAM_PROBLEM_IN_LEFT_PAD_W=1 -DCK_PARAM_PROBLEM_IN_RIGHT_PAD_H=1 -DCK_PARAM_PROBLEM_IN_RIGHT_PAD_W=1 -DCK_PARAM_TUNABLE_BLOCK_SIZE=64 -DCK_PARAM_TUNABLE_GEMM_M_PER_BLOCK=64 -DCK_PARAM_TUNABLE_GEMM_N_PER_BLOCK=32 -DCK_PARAM_TUNABLE_GEMM_K_PER_BLOCK=16 -DCK_PARAM_TUNABLE_GEMM_M_PER_THREAD=4 -DCK_PARAM_TUNABLE_GEMM_N_PER_THREAD=2 -DCK_PARAM_TUNABLE_GEMM_M_LEVEL0_CLUSTER=4 -DCK_PARAM_TUNABLE_GEMM_N_LEVEL0_CLUSTER=4 -DCK_PARAM_TUNABLE_GEMM_M_LEVEL1_CLUSTER=2 -DCK_PARAM_TUNABLE_GEMM_N_LEVEL1_CLUSTER=2 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_K=4 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_M=16 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_SRC_DATA_PER_READ_GEMM_M=4 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_K=8 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_N=8 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_SRC_DATA_PER_READ_GEMM_N=4 -DCK_PARAM_TUNABLE_GEMM_C_THREAD_COPY_DST_DATA_PER_WRITE_GEMM_N1=1 -DCK_PARAM_DEPENDENT_GRID_SIZE=6561 -DCK_THREADWISE_GEMM_USE_AMD_INLINE_ASM=1 -DCK_USE_AMD_INLINE_ASM=1 -DCK_USE_AMD_BUFFER_ATOMIC_ADD=0 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=0 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=1 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DCK_PARAM_KPACK_LENGTH=2 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_DST_DATA_PER_WRITE_GEMM_KPACK=2 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_DST_DATA_PER_WRITE_GEMM_KPACK=2 -mcpu=gfx906 -Werror -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-disabled-macro-expansion -Wno-undefined-reinterpret-cast --cuda-gpu-arch=gfx906 --cuda-device-only -c -O3  -Wno-unused-command-line-argument -I. -x hip --hip-device-lib-path=/opt/rocm/lib -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -D__HIP_ROCclr__=1 -isystem /opt/rocm-3.5.0/hip/../include -isystem /opt/rocm/llvm/lib/clang/11.0.0/include/.. -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -isystem /opt/rocm-3.5.0/hip/include -isystem /opt/rocm/include --hip-device-lib-path=/opt/rocm/lib --hip-link -mllvm -amdgpu-enable-global-sgpr-addr -mllvm --amdgpu-spill-vgpr-to-agpr=0 gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp -o /tmp/miopen-gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp-c8c2-5ec7-e2e4-2527/gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp.o
MIOpen(HIP): Info [SystemCmd] cd /tmp/miopen-gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp-c8c2-5ec7-e2e4-2527; /opt/rocm/llvm/bin/clang-offload-bundler --type=o --targets=hip-amdgcn-amd-amdhsa-gfx906 --inputs=/tmp/miopen-gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp-c8c2-5ec7-e2e4-2527/gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp.o --outputs=/tmp/miopen-gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp-c8c2-5ec7-e2e4-2527/gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp.o.hsaco --unbundle
SetTensor
real descritor: 8, 64, 54, 54
flat descritor: 1492992
MIOpen(HIP): Info [SystemCmd] cd /tmp/miopen-MIOpenSubTensorOpWithScalarKernel.cl-02b4-120b-cb16-3bba; /opt/rocm/bin/clang-ocl  -DSUBTENSOR_OP_WITH_SCALAR=SUBTENSOR_OP_WITH_SCALAR_SET -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DWORK_LENGTH_0=65536 -mcpu=gfx906 -Werror -Weverything -Wno-shorten-64-to-32 -Wno-unused-macros -Wno-unused-function -Wno-sign-compare -Wno-reserved-id-macro -Wno-sign-conversion -Wno-missing-prototypes -Wno-cast-qual -Wno-cast-align -Wno-conversion -Wno-double-promotion -Wno-float-equal -Wno-pass-failed -Xclang -target-feature -Xclang +code-object-v3 MIOpenSubTensorOpWithScalarKernel.cl -o /tmp/miopen-MIOpenSubTensorOpWithScalarKernel.cl-02b4-120b-cb16-3bba/MIOpenSubTensorOpWithScalarKernel.cl.o
Memory access fault by GPU node-2 (Agent handle: 0x2a9fde0) on address 0x7fdca2e84000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

This is similar to what I am currently debugging for my PR #305. I suspect something is going wrong in the creation of the workspace buffer after the recent invoker rearchitecture. Nothing pertaining to the bwd-data v1r1 kernel has changed on Radeon VII except the move to the invoker design.
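
For context, the allocate/zero/launch pattern under suspicion looks roughly like this in plain HIP — a minimal standalone sketch, not MIOpen's actual SetTensor/invoker code:

```cpp
// Minimal standalone sketch of the allocate/zero/launch pattern, using
// plain HIP runtime calls instead of MIOpen's SetTensor kernel. The size
// matches the 5971968 bytes reported in the "workspace required" log line.
#include <hip/hip_runtime.h>
#include <cstdio>

int main()
{
    const std::size_t workspace_bytes = 5971968;

    void* workspace = nullptr;
    if(hipMalloc(&workspace, workspace_bytes) != hipSuccess)
    {
        std::fprintf(stderr, "hipMalloc failed\n");
        return 1;
    }

    // SetTensor-style zero fill. If the buffer handed to the invoker is
    // undersized or stale, this fill is the first kernel to fault.
    if(hipMemset(workspace, 0, workspace_bytes) != hipSuccess)
    {
        std::fprintf(stderr, "hipMemset failed\n");
        return 1;
    }
    hipDeviceSynchronize(); // surface any asynchronous fault here

    // ... launch the bwd-data v1r1 kernel with `workspace` ...

    hipFree(workspace);
    return 0;
}
```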

Please point me to the invoker design doc or any PR describing its design.

@TejashShah (Contributor) commented Jun 24, 2020

@atamazov I tried rocm3.1 using the rocm/miopen-private:rocm3.1-tf1.15-dev-modified_clamp_device docker image, which uses hcc. It works fine:

./bin/MIOpenDriver convbfp16 -x 3 -y 3 -W 54 -H 54 -c 64 -n 8 -k 64 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -V 1 -s 0 -F 2 -w 2 -t 1 -i 6
MIOpenDriver convbfp16 -x 3 -y 3 -W 54 -H 54 -c 64 -n 8 -k 64 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -V 1 -s 0 -F 2 -w 2 -t 1 -i 6
MIOpen(OpenCL): Info [Handle] Device name: gfx906
MIOpen(OpenCL): Info [Handle] stream: 0x274ba40, device_id: 0x271fa80
Wall-clock Time Backward Data Conv. Elapsed: 16.0617 ms, Auxiliary API calls: 15599.5 ms (GWSS: 0.378906)
MIOpen Backward Data Conv. Algorithm: 5, Solution: 55/ConvHipImplicitGemmBwdDataV1R1
GPU Kernel Time Backward Data Conv. Elapsed: 14.980084 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdd-conv3x3u1, 8, 64, 3, 3, 64, 54, 54, 1719926784, 139264, 2985984, 115, 0, 14.980084
Backward Convolution Data Verifies on CPU and GPU (0.000384929)

@asroy (Contributor) commented Jun 24, 2020

@TejashShah You can try the -O1 flag instead of -O3 for hip-clang, in src/hip/hip_build_util.cpp.
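
A minimal sketch of the kind of change meant here, assuming the build flags are assembled as a string somewhere in src/hip/hip_build_util.cpp (the function name and flag set below are hypothetical, not the actual code):

```cpp
// Hypothetical sketch only; the real change is a one-character edit of
// the optimization flag in src/hip/hip_build_util.cpp.
#include <iostream>
#include <string>

std::string HipCompilerFlags(bool lower_opt_workaround)
{
    std::string flags = "-std=c++14 -mcpu=gfx906";
    // hip-clang in ROCm 3.5 appears to miscompile the BF16 bwd-data v1r1
    // kernel at -O3; dropping to -O1 avoids the memory access fault.
    flags += lower_opt_workaround ? " -O1" : " -O3";
    return flags;
}

int main()
{
    std::cout << HipCompilerFlags(true) << '\n'; // "-std=c++14 -mcpu=gfx906 -O1"
}
```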

@TejashShah (Contributor) commented Jun 24, 2020

@TejashShah You can try the -O1 flag instead of -O3 for hip-clang, in src/hip/hip_build_util.cpp.

Yes, it fixes the issue on rocm3.5, which points to a hip-clang compiler issue.

One more thing: I had suspected the SetTensor kernel to be the cause of the segfault because it was the last entry in the earlier log. However, this lowering of optimization targets only the HIP kernel, not the OpenCL SetTensor kernel. This leads me to conclude that the actual segfault came from the bwd-data v1r1 kernel and that, for some reason, the log lags behind the kernel actually running.

So, until the compiler fix arrives, #310 could be a temporary workaround.

@atamazov (Contributor Author)

@TejashShah

Why are solvers with a workspace requirement getting skipped?

This is a protection implemented in the invokers (see the sketch below).

Which solver/kernel was executed that caused the seg fault?

ConvHipImplicitGemmBwdDataV1R1, as shown in the issue title.
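
Roughly, that protection amounts to a guard like the following — a simplified sketch with illustrative names, not MIOpen's actual invoker code:

```cpp
// Simplified sketch of the invoker-side workspace guard (illustrative
// names; not MIOpen's actual code).
#include <cstddef>
#include <iostream>

struct Solver
{
    const char* name;
    std::size_t required_workspace; // bytes
};

// Returns false and logs a warning when the caller provides less
// workspace than the solver requires, mirroring the log lines above.
bool CanEvaluate(const Solver& s, std::size_t provided_workspace)
{
    if(provided_workspace < s.required_workspace)
    {
        std::cout << "Warning: skipping solver <" << s.name
                  << "> due to no workspace provided ("
                  << s.required_workspace << " required)\n";
        return false;
    }
    return true;
}

int main()
{
    const Solver v1r1{"ConvHipImplicitGemmBwdDataV1R1", 5971968};
    CanEvaluate(v1r1, 0); // prints the skip warning seen in the log
}
```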

@atamazov (Contributor Author)

@TejashShah Could you please check on ROCm 3.3 with a number of BF16 configs and confirm that it is free of this problem? Thanks.

@atamazov (Contributor Author)

This leads me to conclude that the actual segfault came from the bwd-data v1r1 kernel and that, for some reason, the log lags behind the kernel actually running.

Most likely the program is terminated before the relevant log line is printed (console output is buffered). This happens very often with fatal errors.
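
A tiny generic C++ demonstration of the effect (not MIOpen code); run it with stdout redirected to a file, so stdio is fully buffered:

```cpp
// Generic demonstration: when stdout is redirected to a file or pipe,
// stdio is fully buffered, and an abort() loses whatever is still
// sitting in the buffer.
#include <cstdio>
#include <cstdlib>

int main()
{
    std::printf("launching kernel X\n"); // stays in the stdio buffer

    // A fatal GPU fault terminates the process without flushing stdio,
    // so with `./a.out > log.txt` the line above typically never
    // reaches log.txt.
    std::abort();

    // Flushing after each message (std::fflush(stdout)) would make the
    // log trustworthy for crash triage.
}
```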

@TejashShah (Contributor)

@TejashShah Could you please check on ROCm 3.3 with a number of BF16 configs and confirm that it is free of this problem?

It passed.

@asroy (Contributor) commented Jun 25, 2020

@TejashShah Worth another JIRA ticket for hip-clang.

@TejashShah (Contributor)

@asroy Yes, I am trying to pinpoint the particular instruction causing it. Apparently, rocm-debug-agent doesn't work in rocm3.5.

@TejashShah (Contributor)

http://ontrack-internal.amd.com/browse/SWDEV-243048

atamazov removed the bug label on Jul 2, 2020
@atamazov (Contributor Author) commented Dec 7, 2020

@daniellowell Right now we have a workaround in place. I recommend checking whether the problem is fixed in the recent compiler and, if so, disabling the workaround. Could you please make an assignment?

AFAICS http://ontrack-internal.amd.com/browse/SWDEV-243048 is not resolved yet.

@asroy (Contributor) commented Dec 10, 2020

I tried rocm3.9: these cases still fail with -O3 but pass with -O1.

However, with -O3 they no longer hit a memory access fault; instead they produce wrong results. The error can be captured by:
bin/test_conv2d --input 8, 64, 54, 54 --weights 64, 64, 3, 3 --pads_strides_dilations 1, 1, 1, 1, 1, 1 --disable-forward --disable-backward-weights --bfloat16 -v

which tests the same config as the MIOpenDriver command in the problem description.

===== output with -O1:
error: 0
Max diff: 0
Backward convolution: ConvHipImplicitGemmBwdDataV1R1
Input tensor: 8, 64, 54, 54
Weights tensor: 64, 64, 3, 3
Output tensor: 8, 64, 54, 54

===== output with -O3:
error: 0.163848
Max diff: 6720
Mismatch at 1: -376 != 390
Backward convolution: ConvHipImplicitGemmBwdDataV1R1
Input tensor: 8, 64, 54, 54
Weights tensor: 64, 64, 3, 3
Output tensor: 8, 64, 54, 54

@asroy (Contributor) commented Jan 16, 2021

I tried rocm/miopen-private:compute-rocm-dkms-staging-4309.

The issue is gone.

@atamazov (Contributor Author)

@asroy Great! We can switch the workaround OFF and then close this issue as soon as we see that the problem is gone in some ROCm release.

@atamazov (Contributor Author)

As per http://ontrack-internal.amd.com/browse/SWDEV-264644, the compiler fix is in the mainline.

@atamazov (Contributor Author) commented Feb 26, 2021

The fix is expected in the 4.2 release. For validation, refer to the SWDEV ticket for a docker image with the release candidate.
