[BF16] ConvHipImplicitGemmBwdDataV1R1: Memory access faults. #309

Closed
atamazov opened this issue Jun 23, 2020 · 18 comments

@atamazov (Contributor)

Radeon VII, vanilla ROCm 3.5:

$ ./bin/MIOpenDriver convbfp16 -x 3 -y 3 -W 54 -H 54 -c 64 -n 8 -k 64 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -V 1 -s 0 -F 0 -w 2 -t 1 -i 6 -C /spb-hlc-fs2/1/users/atamazov/miopen-verify2 ...
Memory access fault by GPU node-2 (Agent handle: 0x2281ba0) on address 0x7f1690084000. Reason: Page not present or supervisor privilege.

$ ./bin/MIOpenDriver convbfp16 -x 5 -y 5 -W 350 -H 80 -c 64 -n 16 -k 128 -p 1 -q 1 -u 2 -v 2 -l 1 -j 1 -V 1 -s 0 -F 0 -w 2 -t 1 -i 6 -C /spb-hlc-fs2/1/users/atamazov/miopen-verify2 ...
Memory access fault by GPU node-2 (Agent handle: 0x184fba0) on address 0x7feba4004000. Reason: Page not present or supervisor privilege.
@TejashShah (Contributor) commented Jun 24, 2020

@atamazov Thanks for reporting the issue.

I was able to reproduce the issue on develop. One thing I noticed: it skipped all solvers that need workspace, including the bwd-data v1r1 igemm solver.

MIOpen(HIP): Info [EvaluateInvokers] Warning: skipping solver <ConvHipImplicitGemmBwdDataV1R1> due to no workspace provided (5971968 required)
MIOpen(HIP): Info [FindConvBwdDataAlgorithm] miopenConvolutionBwdDataAlgoDirect 0.801439        0
MIOpen(HIP): Info [FindConvBwdDataAlgorithm] BWD Chosen Algorithm: ConvOclDirectFwd , 0, 0.801439
MIOpen(HIP): Info [ConvolutionBackwardData] algo = 1, workspace = 0
MIOpen(HIP): Info [ConvolutionBackwardData] algo = 1, workspace = 0
MIOpen(HIP): Info [ConvolutionBackwardData] algo = 1, workspace = 0
MIOpen(HIP): Info [ConvolutionBackwardData] algo = 1, workspace = 0
MIOpen(HIP): Info [ConvolutionBackwardData] algo = 1, workspace = 0
MIOpen(HIP): Info [ConvolutionBackwardData] algo = 1, workspace = 0
Wall-clock Time Backward Data Conv. Elapsed: 0.889791 ms, Auxiliary API calls: 6939.21 ms (GWSS: 0.539062)
MIOpen Backward Data Conv. Algorithm: 1, Solution: 11/ConvOclDirectFwd
GPU Kernel Time Backward Data Conv. Elapsed: 0.765567 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdd-conv3x3u1, 8, 64, 3, 3, 64, 54, 54,  1719926784, 139264, 2985984, 2247, 4, 0.765567
MIOpen(HIP): Info [FindConvBwdWeightsAlgorithm] requestAlgoCount = 2, workspace = 14745600
MIOpen(HIP): Info [TryLoad] Find-db regenerating.
MIOpen(HIP): Info [FindSolutionImpl] ConvOclBwdWrW2NonTunable (not searchable)
MIOpen(HIP): Info [FindSolutionImpl] ConvOclBwdWrW53 (not searchable)
MIOpen(HIP): Info [EvaluateInvokers] Warning: skipping solver <ConvOclBwdWrW2NonTunable> due to no workspace provided (589824 required)
MIOpen(HIP): Info [EvaluateInvokers] Warning: skipping solver <ConvOclBwdWrW53> due to no workspace provided (589824 required)
MIOpen(HIP): Info [FindSolutionImpl] ConvWinograd3x3MultipassWrW<3-4> (not searchable)
MIOpen(HIP): Info [FindSolutionImpl] ConvWinograd3x3MultipassWrW<3-5> (not searchable)
MIOpen(HIP): Info [FindSolutionImpl] ConvWinograd3x3MultipassWrW<3-6> (not searchable)
Memory access fault by GPU node-2 (Agent handle: 0x3695de0) on address 0xf5000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

This raises a couple of questions:

  1. Why are solvers with a workspace requirement getting skipped?
  2. Which solver/kernel was executed that caused the seg fault?

@TejashShah (Contributor) commented Jun 24, 2020

I just ran with -F 2. The log suggests that the segfault comes from the SetTensor kernel, which initializes the workspace to 0 before the bwd-data v1r1 kernel runs:


MIOpen(HIP): Info [SystemCmd] cd /tmp/miopen-gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp-c8c2-5ec7-e2e4-2527;  /opt/rocm/llvm/bin/clang++  -std=c++14  -DCK_PARAM_PROBLEM_N=8 -DCK_PARAM_PROBLEM_K=64 -DCK_PARAM_PROBLEM_C=64 -DCK_PARAM_PROBLEM_HI=54 -DCK_PARAM_PROBLEM_WI=54 -DCK_PARAM_PROBLEM_HO=54 -DCK_PARAM_PROBLEM_WO=54 -DCK_PARAM_PROBLEM_Y=3 -DCK_PARAM_PROBLEM_X=3 -DCK_PARAM_PROBLEM_CONV_STRIDE_H=1 -DCK_PARAM_PROBLEM_CONV_STRIDE_W=1 -DCK_PARAM_PROBLEM_CONV_DILATION_H=1 -DCK_PARAM_PROBLEM_CONV_DILATION_W=1 -DCK_PARAM_PROBLEM_IN_LEFT_PAD_H=1 -DCK_PARAM_PROBLEM_IN_LEFT_PAD_W=1 -DCK_PARAM_PROBLEM_IN_RIGHT_PAD_H=1 -DCK_PARAM_PROBLEM_IN_RIGHT_PAD_W=1 -DCK_PARAM_TUNABLE_BLOCK_SIZE=64 -DCK_PARAM_TUNABLE_GEMM_M_PER_BLOCK=64 -DCK_PARAM_TUNABLE_GEMM_N_PER_BLOCK=32 -DCK_PARAM_TUNABLE_GEMM_K_PER_BLOCK=16 -DCK_PARAM_TUNABLE_GEMM_M_PER_THREAD=4 -DCK_PARAM_TUNABLE_GEMM_N_PER_THREAD=2 -DCK_PARAM_TUNABLE_GEMM_M_LEVEL0_CLUSTER=4 -DCK_PARAM_TUNABLE_GEMM_N_LEVEL0_CLUSTER=4 -DCK_PARAM_TUNABLE_GEMM_M_LEVEL1_CLUSTER=2 -DCK_PARAM_TUNABLE_GEMM_N_LEVEL1_CLUSTER=2 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_K=4 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_M=16 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_SRC_DATA_PER_READ_GEMM_M=4 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_K=8 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_CLUSTER_LENGTHS_GEMM_N=8 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_SRC_DATA_PER_READ_GEMM_N=4 -DCK_PARAM_TUNABLE_GEMM_C_THREAD_COPY_DST_DATA_PER_WRITE_GEMM_N1=1 -DCK_PARAM_DEPENDENT_GRID_SIZE=6561 -DCK_THREADWISE_GEMM_USE_AMD_INLINE_ASM=1 -DCK_USE_AMD_INLINE_ASM=1 -DCK_USE_AMD_BUFFER_ATOMIC_ADD=0 -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=0 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=1 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DCK_PARAM_KPACK_LENGTH=2 -DCK_PARAM_TUNABLE_GEMM_A_BLOCK_COPY_DST_DATA_PER_WRITE_GEMM_KPACK=2 -DCK_PARAM_TUNABLE_GEMM_B_BLOCK_COPY_DST_DATA_PER_WRITE_GEMM_KPACK=2 -mcpu=gfx906 -Werror -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-disabled-macro-expansion -Wno-undefined-reinterpret-cast --cuda-gpu-arch=gfx906 --cuda-device-only -c -O3  -Wno-unused-command-line-argument -I. -x hip --hip-device-lib-path=/opt/rocm/lib -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -D__HIP_ROCclr__=1 -isystem /opt/rocm-3.5.0/hip/../include -isystem /opt/rocm/llvm/lib/clang/11.0.0/include/.. -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -isystem /opt/rocm-3.5.0/hip/include -isystem /opt/rocm/include --hip-device-lib-path=/opt/rocm/lib --hip-link -mllvm -amdgpu-enable-global-sgpr-addr -mllvm --amdgpu-spill-vgpr-to-agpr=0 gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp -o /tmp/miopen-gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp-c8c2-5ec7-e2e4-2527/gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp.o
MIOpen(HIP): Info [SystemCmd] cd /tmp/miopen-gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp-c8c2-5ec7-e2e4-2527; /opt/rocm/llvm/bin/clang-offload-bundler --type=o --targets=hip-amdgcn-amd-amdhsa-gfx906 --inputs=/tmp/miopen-gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp-c8c2-5ec7-e2e4-2527/gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp.o --outputs=/tmp/miopen-gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp-c8c2-5ec7-e2e4-2527/gridwise_convolution_backward_data_implicit_gemm_v1r1_nchw_kcyx_nkhw.cpp.o.hsaco --unbundle
SetTensor
real descritor: 8, 64, 54, 54
flat descritor: 1492992
MIOpen(HIP): Info [SystemCmd] cd /tmp/miopen-MIOpenSubTensorOpWithScalarKernel.cl-02b4-120b-cb16-3bba; /opt/rocm/bin/clang-ocl  -DSUBTENSOR_OP_WITH_SCALAR=SUBTENSOR_OP_WITH_SCALAR_SET -DMIOPEN_USE_FP16=0 -DMIOPEN_USE_FP32=1 -DMIOPEN_USE_INT8=0 -DMIOPEN_USE_INT8x4=0 -DMIOPEN_USE_BFP16=0 -DMIOPEN_USE_INT32=0 -DMIOPEN_USE_RNE_BFLOAT16=1 -DWORK_LENGTH_0=65536 -mcpu=gfx906 -Werror -Weverything -Wno-shorten-64-to-32 -Wno-unused-macros -Wno-unused-function -Wno-sign-compare -Wno-reserved-id-macro -Wno-sign-conversion -Wno-missing-prototypes -Wno-cast-qual -Wno-cast-align -Wno-conversion -Wno-double-promotion -Wno-float-equal -Wno-pass-failed -Xclang -target-feature -Xclang +code-object-v3 MIOpenSubTensorOpWithScalarKernel.cl -o /tmp/miopen-MIOpenSubTensorOpWithScalarKernel.cl-02b4-120b-cb16-3bba/MIOpenSubTensorOpWithScalarKernel.cl.o
Memory access fault by GPU node-2 (Agent handle: 0x2a9fde0) on address 0x7fdca2e84000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

This is similar to what I am currently debugging for my PR #305. I suspect something is going wrong in the creation of the workspace buffer after the recent invoker rearchitecture. Nothing pertaining to the bwd-data v1r1 kernel has changed on Radeon VII except the move to the invoker design.
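
For context, the allocate/zero/launch pattern under suspicion looks roughly like this in plain HIP — a minimal standalone sketch, not MIOpen's actual SetTensor/invoker code:

```cpp
// Minimal standalone sketch of the allocate/zero/launch pattern, using
// plain HIP runtime calls instead of MIOpen's SetTensor kernel. The size
// matches the 5971968 bytes reported in the "workspace required" log line.
#include <hip/hip_runtime.h>
#include <cstdio>

int main()
{
    const std::size_t workspace_bytes = 5971968;

    void* workspace = nullptr;
    if(hipMalloc(&workspace, workspace_bytes) != hipSuccess)
    {
        std::fprintf(stderr, "hipMalloc failed\n");
        return 1;
    }

    // SetTensor-style zero fill. If the buffer handed to the invoker is
    // undersized or stale, this fill is the first kernel to fault.
    if(hipMemset(workspace, 0, workspace_bytes) != hipSuccess)
    {
        std::fprintf(stderr, "hipMemset failed\n");
        return 1;
    }
    hipDeviceSynchronize(); // surface any asynchronous fault here

    // ... launch the bwd-data v1r1 kernel with `workspace` ...

    hipFree(workspace);
    return 0;
}
```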

Please point me to the invoker design doc or any PR describing its design.

@TejashShah (Contributor) commented Jun 24, 2020

@atamazov I tried rocm3.1 using the rocm/miopen-private:rocm3.1-tf1.15-dev-modified_clamp_device docker image, which uses hcc. It works fine:

./bin/MIOpenDriver convbfp16 -x 3 -y 3 -W 54 -H 54 -c 64 -n 8 -k 64 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -V 1 -s 0 -F 2 -w 2 -t 1 -i 6
MIOpenDriver convbfp16 -x 3 -y 3 -W 54 -H 54 -c 64 -n 8 -k 64 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -V 1 -s 0 -F 2 -w 2 -t 1 -i 6
MIOpen(OpenCL): Info [Handle] Device name: gfx906
MIOpen(OpenCL): Info [Handle] stream: 0x274ba40, device_id: 0x271fa80
Wall-clock Time Backward Data Conv. Elapsed: 16.0617 ms, Auxiliary API calls: 15599.5 ms (GWSS: 0.378906)
MIOpen Backward Data Conv. Algorithm: 5, Solution: 55/ConvHipImplicitGemmBwdDataV1R1
GPU Kernel Time Backward Data Conv. Elapsed: 14.980084 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdd-conv3x3u1, 8, 64, 3, 3, 64, 54, 54, 1719926784, 139264, 2985984, 115, 0, 14.980084
Backward Convolution Data Verifies on CPU and GPU (0.000384929)

@asroy (Contributor) commented Jun 24, 2020

@TejashShah You can try the -O1 flag instead of -O3 for hip-clang, in src/hip/hip_build_util.cpp.
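
A minimal sketch of the kind of change meant here, assuming the build flags are assembled as a string somewhere in src/hip/hip_build_util.cpp (the function name and flag set below are hypothetical, not the actual code):

```cpp
// Hypothetical sketch only; the real change is a one-character edit of
// the optimization flag in src/hip/hip_build_util.cpp.
#include <iostream>
#include <string>

std::string HipCompilerFlags(bool lower_opt_workaround)
{
    std::string flags = "-std=c++14 -mcpu=gfx906";
    // hip-clang in ROCm 3.5 appears to miscompile the BF16 bwd-data v1r1
    // kernel at -O3; dropping to -O1 avoids the memory access fault.
    flags += lower_opt_workaround ? " -O1" : " -O3";
    return flags;
}

int main()
{
    std::cout << HipCompilerFlags(true) << '\n'; // "-std=c++14 -mcpu=gfx906 -O1"
}
```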

@TejashShah (Contributor) commented Jun 24, 2020

@TejashShah You can try the -O1 flag instead of -O3 for hip-clang, in src/hip/hip_build_util.cpp.

Yes, it fixes the issue on rocm3.5, which points to a hip-clang compiler issue.

One more thing: I had suspected the SetTensor kernel to be the cause of the segfault because it was the last entry in the earlier log. However, this lowering of optimization targets only the HIP kernel, not the OpenCL SetTensor kernel. This leads me to conclude that the actual segfault came from the bwd-data v1r1 kernel and that, for some reason, the log lags behind the kernel actually running.

So, until the compiler fix arrives, #310 could be a temporary workaround.

@atamazov (Contributor Author)

@TejashShah

Why are solvers with a workspace requirement getting skipped?

This is a protection implemented in the invokers (see the sketch below).

Which solver/kernel was executed that caused the seg fault?

ConvHipImplicitGemmBwdDataV1R1, as shown in the issue title.
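
Roughly, that protection amounts to a guard like the following — a simplified sketch with illustrative names, not MIOpen's actual invoker code:

```cpp
// Simplified sketch of the invoker-side workspace guard (illustrative
// names; not MIOpen's actual code).
#include <cstddef>
#include <iostream>

struct Solver
{
    const char* name;
    std::size_t required_workspace; // bytes
};

// Returns false and logs a warning when the caller provides less
// workspace than the solver requires, mirroring the log lines above.
bool CanEvaluate(const Solver& s, std::size_t provided_workspace)
{
    if(provided_workspace < s.required_workspace)
    {
        std::cout << "Warning: skipping solver <" << s.name
                  << "> due to no workspace provided ("
                  << s.required_workspace << " required)\n";
        return false;
    }
    return true;
}

int main()
{
    const Solver v1r1{"ConvHipImplicitGemmBwdDataV1R1", 5971968};
    CanEvaluate(v1r1, 0); // prints the skip warning seen in the log
}
```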

@atamazov (Contributor Author)

@TejashShah Could you please check on ROCm 3.3 with a number of BF16 configs and confirm that it is free of this problem? Thanks.

@atamazov (Contributor Author)

This leads me to conclude that the actual segfault came from the bwd-data v1r1 kernel and that, for some reason, the log lags behind the kernel actually running.

Most likely the program is terminated before the relevant log line is printed (console output is buffered). This happens very often with fatal errors.
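
A tiny generic C++ demonstration of the effect (not MIOpen code); run it with stdout redirected to a file, so stdio is fully buffered:

```cpp
// Generic demonstration: when stdout is redirected to a file or pipe,
// stdio is fully buffered, and an abort() loses whatever is still
// sitting in the buffer.
#include <cstdio>
#include <cstdlib>

int main()
{
    std::printf("launching kernel X\n"); // stays in the stdio buffer

    // A fatal GPU fault terminates the process without flushing stdio,
    // so with `./a.out > log.txt` the line above typically never
    // reaches log.txt.
    std::abort();

    // Flushing after each message (std::fflush(stdout)) would make the
    // log trustworthy for crash triage.
}
```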

@TejashShah (Contributor)

@TejashShah Could you please check on ROCm 3.3 with a number of BF16 configs and confirm that it is free of this problem?

It passed.

@asroy (Contributor) commented Jun 25, 2020

@TejashShah Worth another JIRA ticket for hip-clang.

@TejashShah (Contributor)

@asroy Yes, I am trying to pinpoint the particular instruction causing it. Apparently, rocm-debug-agent doesn't work in rocm3.5.

@TejashShah (Contributor)

http://ontrack-internal.amd.com/browse/SWDEV-243048

atamazov removed the bug label on Jul 2, 2020
@atamazov (Contributor Author) commented Dec 7, 2020

@daniellowell Right now we have a workaround in place. I recommend checking whether the problem is fixed in the recent compiler and, if so, disabling the workaround. Could you please make an assignment?

AFAICS http://ontrack-internal.amd.com/browse/SWDEV-243048 is not resolved yet.

@asroy (Contributor) commented Dec 10, 2020

I tried rocm3.9: these cases still fail with -O3 but pass with -O1.

However, with -O3 they no longer hit a memory access fault; instead they produce wrong results. The error can be captured by:
bin/test_conv2d --input 8, 64, 54, 54 --weights 64, 64, 3, 3 --pads_strides_dilations 1, 1, 1, 1, 1, 1 --disable-forward --disable-backward-weights --bfloat16 -v

which tests the same config as the MIOpenDriver command in the problem description.

===== output with -O1:
error: 0
Max diff: 0
Backward convolution: ConvHipImplicitGemmBwdDataV1R1
Input tensor: 8, 64, 54, 54
Weights tensor: 64, 64, 3, 3
Output tensor: 8, 64, 54, 54

===== output with -O3:
error: 0.163848
Max diff: 6720
Mismatch at 1: -376 != 390
Backward convolution: ConvHipImplicitGemmBwdDataV1R1
Input tensor: 8, 64, 54, 54
Weights tensor: 64, 64, 3, 3
Output tensor: 8, 64, 54, 54

@asroy (Contributor) commented Jan 16, 2021

I tried rocm/miopen-private:compute-rocm-dkms-staging-4309.

The issue is gone.

@atamazov (Contributor Author)

@asroy Great! We can switch the workaround OFF and then close this issue as soon as we see that the problem is gone in some ROCm release.

@atamazov (Contributor Author)

As per http://ontrack-internal.amd.com/browse/SWDEV-264644, the compiler fix is in the mainline.

@atamazov (Contributor Author) commented Feb 26, 2021

The fix is expected in the 4.2 release. For validation, refer to the SWDEV ticket for a docker image with the release candidate.
