[gfx1030][FP16][ROCm5.2] test_lrn_test failure due to hipRTC issue #1674

Closed
junliume opened this issue Aug 5, 2022 · 32 comments

Comments

@junliume
Collaborator

junliume commented Aug 5, 2022

[How To Reproduce]:

CXX=/opt/rocm/llvm/bin/clang++ CXXFLAGS='-Werror'  cmake -DMIOPEN_TEST_FLAGS=' --disable-verification-cache ' -DCMAKE_BUILD_TYPE=release -DBUILD_DEV=On -DMIOPEN_USE_MLIR=ON -DMIOPEN_GPU_SYNC=Off  -DMIOPEN_TEST_ALL=On -DMIOPEN_TEST_HALF=On -DCMAKE_PREFIX_PATH="/opt/rocm/;/opt/rocm/hip;/root/MIOpen/install_dir"  .. 

LLVM_PATH=/opt/rocm/llvm CTEST_PARALLEL_LEVEL=4 MIOPEN_CONV_PRECISE_ROCBLAS_TIMING=0 make -j 48 check

test_lrn_test (Failed)

[More Details]:

 4/108 Test  #32: test_lrn_test ..........................................***Failed  Error regular expression found in output. Regex=[FAILED]414.24 sec
/root/MIOpen/build/bin/test_lrn_test --half --input 1, 128, 56, 56 --N 1 --alpha 1 --beta 1 --K 1 --mode Across_Channel 
FAILED: /root/MIOpen/src/hipoc/hipoc_program.cpp:162: Failed creating module from file /tmp/miopen-interim-hsaco-083f-0ba5-deda-fab5/file shared object initialization failed
verify_lrn_bwd
Input Tensor Y 1, 128, 56, 56
Input Tensor DY 1, 128, 56, 56
Input Tensor X 1, 128, 56, 56
 29/108 Test  #27: test_handle_test .......................................***Failed    1.25 sec
MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' error_hip.cpp: HIPRTC_ERROR_COMPILATION (6)
MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: error_hip.cpp
MIOpen(HIP): Warning [BuildHip] /tmp/comgr-b21968/input/CompileSource:6:10: error: use of undeclared identifier 'num'
    data[num] *= 2;
         ^
1 error generated when compiling for gfx1030.
MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' error_hip.cpp: HIPRTC_ERROR_COMPILATION (6)
MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: error_hip.cpp
MIOpen(HIP): Warning [BuildHip] /tmp/comgr-ddd4a0/input/CompileSource:6:10: error: use of undeclared identifier 'num'
    data[num] *= 2;
         ^
1 error generated when compiling for gfx1030.
@junliume
Collaborator Author

junliume commented Aug 5, 2022

@atamazov could you also take a look?

@junliume
Collaborator Author

junliume commented Aug 5, 2022

@muralinr these two tickets suggest it might be a problem with the test itself; could you check?
https://ontrack-internal.amd.com/browse/SWDEV-278594
https://ontrack-internal.amd.com/browse/SWDEV-290352
Could you check why this problem is coming back again?

@muralinr
Contributor

muralinr commented Aug 5, 2022

Hi @junliume, I looked at these errors. The "test_lrn_test" and "test_handle_test" failures are related to HIPRTC compilation errors. We should ask Artem or Paul to look at this issue.

test_lrn_test ..........................................***Failed Error regular expression found in output. Regex=[FAILED]414.24 sec
/root/MIOpen/build/bin/test_lrn_test --half --input 1, 128, 56, 56 --N 1 --alpha 1 --beta 1 --K 1 --mode Across_Channel
FAILED: /root/MIOpen/src/hipoc/hipoc_program.cpp:162: Failed creating module from file /tmp/miopen-interim-hsaco-083f-0ba5-deda-fab5/file shared object initialization failed

Test #27: test_handle_test .......................................***Failed 1.25 sec
MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' error_hip.cpp: HIPRTC_ERROR_COMPILATION (6)
MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: error_hip.cpp
MIOpen(HIP): Warning [BuildHip] /tmp/comgr-b21968/input/CompileSource:6:10: error: use of undeclared identifier 'num'
    data[num] *= 2;
         ^
1 error generated when compiling for gfx1030.
MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' error_hip.cpp: HIPRTC_ERROR_COMPILATION (6)
MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: error_hip.cpp
MIOpen(HIP): Warning [BuildHip] /tmp/comgr-ddd4a0/input/CompileSource:6:10: error: use of undeclared identifier 'num'
    data[num] *= 2;
         ^
1 error generated when compiling for gfx1030.

@shurale-nkn
Contributor

@junliume @atamazov
This always happens during testing in our CI.
CI can't detect the failures because the regexp is in upper case.

https://github.com/ROCmSoftwarePlatform/MIOpen/blob/46b85e8f2dcaaee26b9cacbaf6a9babe2335f5b5/test/CMakeLists.txt#L333

"test_handle_test" start time: Aug 10 19:34 UTC
Output:
----------------------------------------------------------
MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' error_hip.cpp: HIPRTC_ERROR_COMPILATION (6)
MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: error_hip.cpp
MIOpen(HIP): Warning [BuildHip] /tmp/comgr-5caf37/input/CompileSource:6:10: error: use of undeclared identifier 'num'
    data[num] *= 2;
         ^
1 error generated when compiling for gfx1030.
Error: Failed to compile source (from CL or HIP source to LLVM IR).

MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' error_hip.cpp: HIPRTC_ERROR_COMPILATION (6)
MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: error_hip.cpp
MIOpen(HIP): Warning [BuildHip] /tmp/comgr-51ea2f/input/CompileSource:6:10: error: use of undeclared identifier 'num'
    data[num] *= 2;
         ^
1 error generated when compiling for gfx1030.
Error: Failed to compile source (from CL or HIP source to LLVM IR).

MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' nop_hip.cpp: HIPRTC_ERROR_COMPILATION (6)
MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: nop_hip.cpp
MIOpen(HIP): Warning [BuildHip] /tmp/comgr-72bce5/input/CompileSource:5:28: error: unused parameter 'data' [-Werror,-Wunused-parameter]
__global__ void write(int* data) {
                           ^
1 error generated when compiling for gfx1030.
Error: Failed to compile source (from CL or HIP source to LLVM IR).

MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_COMPILE_SOURCE_WITH_DEVICE_LIBS_TO_BC: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] /tmp/comgr-67a46a/input/tinygemm.cl:1:48: error: use of undeclared identifier 'i'
__kernel void write(__global int* data) { data[i] = 0; }
                                               ^
1 error generated.

MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_COMPILE_SOURCE_WITH_DEVICE_LIBS_TO_BC: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] /tmp/comgr-0fabf1/input/tinygemm.cl:1:48: error: use of undeclared identifier 'i'
__kernel void write(__global int* data) { data[i] = 0; }
                                               ^
1 error generated.

MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_COMPILE_SOURCE_WITH_DEVICE_LIBS_TO_BC: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] /tmp/comgr-86efe8/input/tinygemm.cl:1:35: error: unused parameter 'data' [-Werror,-Wunused-parameter]
__kernel void write(__global int* data) {}
                                  ^
1 error generated.

<end of output>

@junliume
Collaborator Author

@muralinr it seems we need to reopen one of the above-mentioned tickets. I think it should be assigned to the compiler team.

@junliume
Collaborator Author

Thanks @shurale-nkn I think we need to fix this testing defect too! :)

@junliume
Collaborator Author

Taken from the official documentation; shall we change the line to the following?

set_tests_properties(${TEST_NAME} PROPERTIES FAIL_REGULAR_EXPRESSION "FAILED;[^a-z]Error;ERROR;Failed")

@shurale-nkn
Contributor

Taken from the official documentation; shall we change the line to the following?

set_tests_properties(${TEST_NAME} PROPERTIES FAIL_REGULAR_EXPRESSION "FAILED;[^a-z]Error;ERROR;Failed")

@junliume
no, that would be too easy. We are not looking for easy ways 😄

The real problem is the variability of the words used in our output. "Error" can appear in verification output to describe the difference between the gold value and the result, and it can also be part of a compilation argument. Therefore, we can't say that it is always a negative marker.
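To make the trade-off concrete, here is a minimal Python sketch of how a list of fail patterns behaves against lines quoted in this thread. It assumes CTest's `FAIL_REGULAR_EXPRESSION` semantics (the test fails if any pattern matches anywhere in the output) and uses Python's `re` module in place of CMake's regex engine. The helper name `fails_by_regex` and the `Average Error` line are hypothetical, made up for illustration.

```python
import re


def fails_by_regex(output, patterns):
    """Return True if any pattern matches anywhere in the output."""
    return any(re.search(p, output) for p in patterns)


CURRENT = ["FAILED"]  # uppercase-only marker, as reported by ctest in this thread
PROPOSED = ["FAILED", "[^a-z]Error", "ERROR", "Failed"]  # the suggestion above

# Lines quoted from the logs in this thread.
LRN_LINE = (
    "FAILED: /root/MIOpen/src/hipoc/hipoc_program.cpp:162: "
    "Failed creating module from file "
    "/tmp/miopen-interim-hsaco-083f-0ba5-deda-fab5/file "
    "shared object initialization failed"
)
HANDLE_LINE = (
    "MIOpen(HIP): Error [BuildHip] HIPRTC status = "
    "HIPRTC_ERROR_COMPILATION (6), source file: error_hip.cpp"
)
# Hypothetical verification output, invented for this example.
VERIFY_LINE = "Average Error: 1e-05"

for name, line in [("lrn", LRN_LINE), ("handle", HANDLE_LINE), ("verify", VERIFY_LINE)]:
    print(name, fails_by_regex(line, CURRENT), fails_by_regex(line, PROPOSED))
```

With the uppercase-only list, only the test_lrn_test line is flagged. The proposed list also flags the intentional build error and the made-up verification line, which is the false-positive risk described above.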

@shurale-nkn
Contributor

shurale-nkn commented Aug 12, 2022

On the one hand, if an error is not critical and the program can complete the task by another method, there is no need to terminate it. On the other hand, we should be able to track such errors in our CI, where we should always get an Error message in that case.

So the mechanism would be similar to the current behavior of the assert function, which works only in debug builds, except that it should work in any build when some flag is defined.
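One way to read this proposal is sketched below in Python: a recoverable error is always logged, and it becomes fatal only when a flag is set, regardless of build type. The environment variable name `EXAMPLE_STRICT_ERRORS` and both function names are hypothetical; nothing in this thread shows that MIOpen defines such a flag.

```python
import os
import sys

# Hypothetical variable name; nothing in the thread defines such a flag.
STRICT_ENV = "EXAMPLE_STRICT_ERRORS"


class StrictModeError(RuntimeError):
    """Raised instead of recovering when strict mode is enabled."""


def report_recoverable(message, environ=None):
    """Always log the error; raise only when the strict flag is set."""
    env = os.environ if environ is None else environ
    print(f"Error: {message}", file=sys.stderr)
    if env.get(STRICT_ENV) == "1":
        raise StrictModeError(message)


def run_with_fallback(primary, fallback, environ=None):
    """Try the primary path; on failure, report it and use the fallback."""
    try:
        return primary()
    except Exception as exc:
        report_recoverable(f"primary path failed: {exc}", environ)
        return fallback()
```

In a normal run the fallback result is returned and the error is only logged. A CI job would set the flag so that the same situation terminates the test.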

@atamazov
Contributor

atamazov commented Aug 13, 2022

[Informative] Current convention is that the failing test should print FAILED (uppercase) or return non-zero code.

@atamazov
Contributor

@junliume Build errors in test_handle_test are not a bug. Among other checks, this test intentionally causes build errors in order to check whether the library properly catches build errors. Error messages on the console are normal.

@atamazov
Contributor

I am looking into test_lrn_test problem. Is it rocm or device specific?

@shurale-nkn
Contributor

@junliume Build errors in test_handle_test are not a bug. Among other checks, this test intentionally causes build errors in order to check whether the library properly catches build errors. Error messages on the console are normal.

In this case, the CI should check the output data for an error and check the correctness of the handling.
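A minimal sketch of such a check, assuming the test's console output is available as a string: rather than treating every `Error` line as a failure, it verifies that each diagnostic the test provokes on purpose actually appeared. The function name and the choice of expected snippets are illustrative; the snippets themselves are copied from the log above.

```python
def missing_expected_errors(output, expected):
    """Return the expected diagnostics that do not appear in the output."""
    return [snippet for snippet in expected if snippet not in output]


# Diagnostics that test_handle_test triggers intentionally, per the log above.
EXPECTED = [
    "use of undeclared identifier 'num'",
    "use of undeclared identifier 'i'",
    "unused parameter 'data' [-Werror,-Wunused-parameter]",
]

# Two lines quoted from the log; the unused-parameter diagnostic is left out.
sample = "\n".join([
    "MIOpen(HIP): Warning [BuildHip] /tmp/comgr-5caf37/input/CompileSource:6:10: "
    "error: use of undeclared identifier 'num'",
    "MIOpen(HIP): Warning [BuildOcl] /tmp/comgr-67a46a/input/tinygemm.cl:1:48: "
    "error: use of undeclared identifier 'i'",
])

print(missing_expected_errors(sample, EXPECTED))
```

An empty list means every intentional build error was reported. A non-empty list points at a diagnostic the library did not produce, which is a handling problem even though no `FAILED` marker was printed.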

@atamazov

This comment was marked as off-topic.

@atamazov

This comment was marked as off-topic.

@atamazov
Contributor

atamazov commented Aug 15, 2022

The previous comment is partially incorrect.

@junliume In which docker container have you found the issue with test_handle_test?

@atamazov
Contributor

atamazov commented Aug 15, 2022

[Informative] I used the 5.2 container with MIOpen installed. Rebuilding the library made no difference because the tests still used the installed library. The tests were built with BUILD_DEV=On and the warning test was enabled, but the installed library was built with BUILD_DEV=Off and was unable to issue build warnings. Therefore the test failed, but the failures were actually false positives, and I got the impression that there were compiler problems in 5.2.

Russian Gandalf forget his staff

image
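The mismatch described in this comment (tests built with one configuration but loading the installed library) can be checked by looking at which shared object the test binary resolves. Below is a minimal Python sketch that parses `ldd` output. The library name `libMIOpen.so.1`, the paths and addresses in the sample, and the helper name are assumptions for illustration, not taken from this thread.

```python
import re


def resolved_library(ldd_output, name_prefix):
    """Return the path ldd resolved for the first library whose name
    starts with name_prefix, or None if no such library was resolved."""
    for line in ldd_output.splitlines():
        # Matches lines like "  libfoo.so.1 => /path/libfoo.so.1 (0x...)".
        # Lines ending in "=> not found" have no absolute path and are skipped.
        m = re.match(r"\s*(\S+)\s+=>\s+(/\S+)", line)
        if m and m.group(1).startswith(name_prefix):
            return m.group(2)
    return None


# Made-up ldd output in the usual format.
SAMPLE = """\
\tlinux-vdso.so.1 (0x00007ffd5a5f2000)
\tlibMIOpen.so.1 => /opt/rocm/lib/libMIOpen.so.1 (0x00007f1c2a000000)
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1c29c00000)
"""

print(resolved_library(SAMPLE, "libMIOpen"))
```

In practice the input would come from running `ldd` on the test binary, for example via `subprocess.run(["ldd", path], capture_output=True, text=True).stdout`. A path under `/opt/rocm` rather than the build tree would indicate that the installed library is being tested.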

@junliume
Collaborator Author

The previous comment is partially incorrect.
@junliume In which docker container have you found the issue with test_handle_test?

@atamazov I am using a standalone gfx1030 node with base OS (bare metal) ROCm-5.2.0. Let me pull a docker and see if the issue is reproducible from a docker too.

@atamazov
Contributor

@junliume Thank you. I have a gfx1030 on hand, but the base driver is 4.3.0, so I may need to upgrade the node. Please update me with your docker results.

@junliume
Collaborator Author

junliume commented Aug 16, 2022

[ENV]:
base: ROCm-5.2.0
docker: latest mainline docker

[Observations]:

  1. Category 1: CMake Error undefined symbol. test_immed_conv3d test_conv2D test_conv3D test_immed_conv2d test_conv_ck_igemm_fwd_v6r1_dlops_nchw
    e.g.
    /root/MIOpen/build/bin/test_immed_conv3d: symbol lookup error: /root/MIOpen/build/bin/test_immed_conv3d: undefined symbol: _ZN6miopen20LoadOrPrepareInvokerERNS_6HandleERNS_18ConvolutionContextENS_6solver2IdENS_4conv9DirectionE, version MIOPEN_HIP_1
    CMake Error at test_test_immed_conv3d.cmake:7 (message):
      Test failed
    
  2. Category 2: error: use of undeclared identifier 'num' test_handle_test
    e.g.
     29/110 Test  #27: test_handle_test .......................................***Failed    1.47 sec
    MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' error_hip.cpp: HIPRTC_ERROR_COMPILATION (6)
    MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: error_hip.cpp
    MIOpen(HIP): Warning [BuildHip] /tmp/comgr-1c7576/input/CompileSource:6:10: error: use of undeclared identifier 'num'
        data[num] *= 2;
             ^
    1 error generated when compiling for gfx1030.
    MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' error_hip.cpp: HIPRTC_ERROR_COMPILATION (6)
    MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: error_hip.cpp
    MIOpen(HIP): Warning [BuildHip] /tmp/comgr-b96a5b/input/CompileSource:6:10: error: use of undeclared identifier 'num'
        data[num] *= 2;
             ^
    1 error generated when compiling for gfx1030.
    CMake Error at test_test_handle_test.cmake:7 (message):
      Test failed
    

@junliume
Collaborator Author

@atamazov it all looks related to handle issues, which may or may not be related to hipRTC

@junliume junliume reopened this Aug 16, 2022
@atamazov
Contributor

[Informative] Not reproducible with 5.2 release docker and 4.3.0 base driver.

@junliume
Collaborator Author

@atamazov let's put this issue on hold then. I just checked my last run with this docker and cannot reproduce the issues either.

We are trying to update the base OS ROCm to 5.2 and the docker ROCm to the same version; let's see if gfx1030-fp16 stability has changed.

Thanks!

@atamazov
Contributor

atamazov commented Aug 17, 2022

Results of running test_handle_test with Mainline build 5.3.0-10659

⚠️ The error shown at #1674 (comment) is most likely because the installed MIOpen library was built with BUILD_DEV=Off, while the test was built with BUILD_DEV=On, see #1674 (comment). The following results were obtained with MIOpen removed from /opt/rocm:

(A) When MIOpen is built with BUILD_DEV=Off and installed, the test passes.

(B) When MIOpen is built with BUILD_DEV=On, the test fails with this:

root@Drakkar:~/MIOpen/build# ctest --output-on-failure -C ./test -R test_handle_test
Test project /root/MIOpen/build
    Start 27: test_handle_test
1/1 Test #27: test_handle_test .................***Failed    0.59 sec
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1030
MIOpen(HIP): Info [Handle] stream: 0x16e8180, device_id: 0
MIOpen(HIP): Info [get_device_name] Raw device name: gfx1030
MIOpen(HIP): Info [Handle] stream: 0x16e8180, device_id: 0
MIOpen(HIP): Info [PrintVersion] HIPRTC v.9.0
MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' test_hip.cpp: HIPRTC_ERROR_COMPILATION (6)
MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: test_hip.cpp
MIOpen(HIP): Warning [BuildHip] In file included from <built-in>:1:
/tmp/comgr-da295e/include/hiprtc_runtime.h:1:5: error: this style of line directive is a GNU extension [-Werror,-Wgnu-line-marker]
# 1 "/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h"
    ^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1:5: error: this style of line directive is a GNU extension [-Werror,-Wgnu-line-marker]
# 1 "<built-in>" 1
    ^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:5:5: error: this style of line directive is a GNU extension [-Werror,-Wgnu-line-marker]
# 1 "/long_pathname_so_that_rpms_can_package_the_debug_info/src/external/hip-on-vdi/include/hip/hip_runtime.h" 1 3
    ^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:6:5: error: this style of line directive is a GNU extension [-Werror,-Wgnu-line-marker]
# 1 "/long_pathname_so_that_rpms_can_package_the_debug_info/src/external/hip-on-vdi/include/hip/hip_fp16.h" 1 3
    ^
4 errors generated when compiling for gfx1030.
terminate called after throwing an instance of 'miopen::Exception'
  what():  /root/MIOpen/src/hipoc/hipoc_program.cpp:299: Code object build failed. Source: test_hip.cpp
CMake Error at test_test_handle_test.cmake:7 (message):
  Test failed

❗ The root reason of the issue is

I am working on isolating the issue and providing a fix or workaround.

@atamazov
Contributor

More info on the guilty GCC extension: As an extension, the preprocessor accepts linemarkers in non-assembler input files (see https://debrouxl.github.io/gcc4ti/cpp.html#SEC43 for details).
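For reference, the two directive forms differ only in spelling: the GNU linemarker is `# linenum "file" flags`, while the standard form is `#line linenum "file"`. The Python sketch below recognizes the GNU form, using two lines quoted from the log above. The regex and function name are illustrative, and this is not the fix that was applied.

```python
import re

# GNU linemarker: '#', whitespace, a line number, a quoted file name,
# then zero or more numeric flags.
GNU_LINEMARKER = re.compile(r'^\s*#\s+\d+\s+"[^"]*"(?:\s+\d+)*\s*$')


def is_gnu_linemarker(line):
    """Return True if the line is a GNU-style linemarker directive."""
    return GNU_LINEMARKER.match(line) is not None


lines = [
    # Quoted from the compiler output above.
    '# 1 "<built-in>" 1',
    '# 1 "/long_pathname_so_that_rpms_can_package_the_debug_info/src/external/'
    'hip-on-vdi/include/hip/hip_fp16.h" 1 3',
    # Made up for contrast: the standard form and an ordinary include.
    '#line 1 "hip_fp16.h"',
    '#include <hip/hip_fp16.h>',
]

for text in lines:
    print(is_gnu_linemarker(text), text[:40])
```

Only the first two lines are reported as linemarkers; these are the directives that `-Wgnu-line-marker` turns into errors when combined with `-Werror`.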

@atamazov
Contributor

atamazov commented Aug 17, 2022

@junliume Please add non_miopen_bug.

I see some issues in pooling kernels (build errors due to unused variables). I am going to continue fixing the issues until DEV builds pass make check with HIP Mainline.

@atamazov
Contributor

@junliume In DEV builds, test_lrn_test and test_activation fail due to memory allocation error. Tested on both gfx906 (16GB) and Navi21 (16GB), on different ROCm versions (5.0.0 and Mainline build 10659) with different base drivers. Non-DEV builds pass! Reason is still unknown. I am working on the workaround.

   640  /root/MIOpen/build/bin/test_lrn_test --half --input 1, 16, 4096, 4096 --N 5 --alpha 1 --beta 1 --K 1 --mode Within_Channel
   641  FAILED: /root/MIOpen/src/hip/handlehip.cpp:85: Memory not available to allocate buffer: 536870912
...
   431  1: /root/MIOpen/build/bin/test_activation --half --input 1, 16, 4096, 4096 --alpha 0.95 --beta 2.3 --gamma 3.4 --mode CLIPPEDRELU --packed 0
   432  1: FAILED: /root/MIOpen/src/hip/handlehip.cpp:85: Memory not available to allocate buffer: 536870912
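The size in the error message matches a single tensor from the failing command lines: 1 × 16 × 4096 × 4096 half-precision elements at 2 bytes each. A quick check in Python (the helper name is illustrative):

```python
from math import prod


def tensor_bytes(dims, bytes_per_element):
    """Size in bytes of a densely packed tensor with the given dimensions."""
    return prod(dims) * bytes_per_element


# --input 1, 16, 4096, 4096 with --half, i.e. 2 bytes per element
size = tensor_bytes((1, 16, 4096, 4096), 2)
print(size, size / 2**20)  # prints: 536870912 512.0
```

So each failed allocation is a 512 MiB buffer, a small fraction of the 16 GB available on the cards mentioned above.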

@shurale-nkn
Contributor

This is SWDEV-345683.
PR1729 is a workaround for this issue; the main problem is in HIP and is fixed in 5.3.0-27.
The changes need to be arranged as a workaround, and we should continue throwing exceptions after HIP 5.3.0-27 in order to monitor unexpected behavior of the system.

@atamazov
Contributor

atamazov commented Sep 7, 2022

@shurale-nkn What is this SWDEV-345683 about? Some quirks of hipMemGetInfo or hipFree or...?

@junliume I can prepare the followup PR but I need 5.3.0-27 where the problem is expected to be fixed.

I recommend removing help_wanted and setting urgency_low or normal (as this affects tests only).

@shurale-nkn
Contributor

RuntimeError: HIP out of memory while running pytorch resnext101 for FP32

@junliume junliume removed this from the ROCm 5.4 milestone Jan 11, 2023
@ppanchad-amd

@junliume Is this fixed with latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks!

@atamazov
Contributor

@ppanchad-amd I think this may be closed.
