[gfx11] MIOpen unit tests are failing - possible false alarm? #2079

Closed
junliume opened this issue Apr 7, 2023 · 14 comments
@junliume
Collaborator

junliume commented Apr 7, 2023

[Symptom]:
The following unit tests from MIOpen are failing on gfx1100:

 74 - test_conv_igemm_mlir_bwd_wrw (Failed)
103 - test_conv_igemm_dynamic_dlops_nchwc_nchwc_fwd_fp16x4 (Failed)
105 - test_conv_igemm_dynamic_dlops_nchwc_nchwc_fwd_fp16x8 (Failed)
132 - smoke_solver_ConvHipImplicitGemmV4R1WrW (Failed)
133 - smoke_solver_ConvHipImplicitGemmV4R1Fwd_fp16_bf16 (Failed)
155 - smoke_solver_ConvMlirIgemm (Failed)

[Analysis]
However, the log shows that

/home/junliu/MIOpen/build/bin/test_conv2d --half --cmode conv --pmode default --group-count 1 --disable-backward-data --disable-backward-weights --input 256 32 27 27 --weights 128 32 1 1 --batch_size 256 --input_channels 32 --output_channels 128 --spatial_dim_elements 27 27 --filter_dims 1 1 --pads_strides_dilations 0 0 1 1 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout NCHW --out_layout NCHW --deterministic 0 --tensor_vect 0 --vector_length 1 --output_type int32 --int8_vectorize 0
FAILED: /home/junliu/MIOpen/src/ocl/convolutionocl.cpp:517: Forward Convolution cannot be executed due to incorrect params

So the test was likely marked as failed because the log contains strings like "FAILED", although here the message only means that no applicable kernel is available.

@muralinr could you check if this is a false alarm? Thanks!

@junliume
Collaborator Author

junliume commented Apr 8, 2023

@JehandadKhan @atamazov

LastTest.log
From the attached log, it seems that

MIOpen(HIP): Trace [GenericSearch] ##(n_current, n_failed, n_runs_total):  0/0/5 elapsed_time: 0.893117, best_time: 3.40282e+38, 16,64,4,2,4,4,2,4,4,4,4,2,16,1,4,32

is causing the issue due to Regex=[(FAILED)|(Error)|(failed)]

We should avoid using such keywords in log printing.
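To illustrate the false positive, here is a minimal sketch that applies the reported pattern to the trace line quoted above. It assumes the square brackets in `Regex=[...]` are only delimiters printed by the test runner and that the actual pattern is `(FAILED)|(Error)|(failed)`; the variable names are made up for this example.

```shell
# The TRACE line quoted above, copied verbatim from the log.
line='MIOpen(HIP): Trace [GenericSearch] ##(n_current, n_failed, n_runs_total):  0/0/5 elapsed_time: 0.893117, best_time: 3.40282e+38, 16,64,4,2,4,4,2,4,4,4,4,2,16,1,4,32'

# Assumption: the failure pattern without the surrounding brackets.
pattern='(FAILED)|(Error)|(failed)'

# Print the first substring that triggers the match, if any.
hit=$(printf '%s\n' "$line" | grep -oE "$pattern" | head -n 1)
echo "matched: ${hit:-<nothing>}"
```

The match comes from the field name `n_failed`, not from an actual failure, which is consistent with the test being flagged even though nothing went wrong.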

@atamazov
Contributor

atamazov commented Apr 8, 2023

@junliume This is a known problem; you can find a detailed description at #2038 (comment). Please do not use log level 7 (TRACE) until it is fixed.

Generally, I do not recommend using log levels > 4 in our CI unless really necessary or explicitly set in the test (as logs become really huge and this may affect testing performance).
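As a rough sketch of that recommendation, a CI wrapper could cap the log level before running the tests. This assumes the level is controlled through the `MIOPEN_LOG_LEVEL` environment variable and that it holds a plain integer; the function name `cap_miopen_log_level` is hypothetical and not part of MIOpen.

```shell
# Hypothetical helper: lower MIOPEN_LOG_LEVEL to 4 when it is set higher.
# An unset variable is treated as 0 (library default) and left untouched.
cap_miopen_log_level() {
  level="${MIOPEN_LOG_LEVEL:-0}"
  if [ "$level" -gt 4 ]; then
    echo "MIOPEN_LOG_LEVEL=$level is too verbose for CI, using 4" >&2
    export MIOPEN_LOG_LEVEL=4
  fi
}

cap_miopen_log_level
```

Tests that explicitly set their own log level would still override this, which matches the exception mentioned above.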

I'll change the format of the guilty TRACE message to match other messages in GenericSearch. Please assign this ticket to me and rename it like "Logs of GenericSearch contain "failed" at TRACE level".

CC @JehandadKhan

@junliume
Collaborator Author

LastTest_no_applicable_solver.log
Hi @atamazov thanks! I think it might be more than just the debug level of print outs. As shown in the attached file, it says:

FAILED: MIOpen/src/ocl/convolutionocl.cpp:1590: Backward Weights Convolution cannot be executed due to incorrect params

Usually this means that we do not have an applicable solver for this combination of config and platform, but I am curious why that is the case here.

@atamazov
Contributor

atamazov commented Apr 10, 2023

This problem is different. I'll look into it ASAP and provide a fix for "Convolution cannot be executed due to incorrect params". I would appreciate it if you attached a full log from your system (including the CMake command), plus the output of this:

env | grep -E "MIO|AMD|ROC|GPU|DEVICE" | sort

@atamazov
Contributor

atamazov commented Apr 10, 2023

@junliume test_conv_igemm_mlir_bwd_wrw is not applicable to gfx11 and should not be run on it (it is not enabled for this target in tests/CMakeLists.txt). Does this problem really happen on gfx11?

@junliume
Collaborator Author

env | grep -E "MIO|AMD|ROC|GPU|DEVICE" | sort

@atamazov Interesting. And yes, I am on a gfx1100 system:

CXX=/opt/rocm/llvm/bin/clang++ CC=/opt/rocm/llvm/bin/clang cmake -DMIOPEN_BACKEND=HIP -DCMAKE_PREFIX_PATH="/opt/rocm/;/opt/rocm/hip" -DMIOPEN_TEST_ALL=On -DMIOPEN_TEST_HALF=On ..

make -j$(nproc) check

I guess these tests should be disabled?

74 - test_conv_igemm_mlir_bwd_wrw (Failed)
155 - smoke_solver_ConvMlirIgemm (Failed)

but how about these ones:

103 - test_conv_igemm_dynamic_dlops_nchwc_nchwc_fwd_fp16x4 (Failed)
105 - test_conv_igemm_dynamic_dlops_nchwc_nchwc_fwd_fp16x8 (Failed)
132 - smoke_solver_ConvHipImplicitGemmV4R1WrW (Failed)
133 - smoke_solver_ConvHipImplicitGemmV4R1Fwd_fp16_bf16 (Failed)

@atamazov
Contributor

@junliume All these tests are NOT enabled for gfx11. Something is wrong on your system. I need the output of env | grep -E "MIO|AMD|ROC|GPU|DEVICE" | sort and the full logs to be able to say something more valuable ;)

@JehandadKhan
Contributor

132 - smoke_solver_ConvHipImplicitGemmV4R1WrW (Failed)
133 - smoke_solver_ConvHipImplicitGemmV4R1Fwd_fp16_bf16 (Failed)

I think we should disable these tests on gfx1100, since these solvers have been restricted to existing architectures only.

@junliume
Collaborator Author

I have both gfx1030 and gfx1100 on my system (two cards), so maybe that is the reason?
nothing special about this though:

env | grep -E "MIO|AMD|ROC|GPU|DEVICE" | sort

@atamazov
Contributor

atamazov commented Apr 13, 2023

@JehandadKhan

132 - smoke_solver_ConvHipImplicitGemmV4R1WrW (Failed)
133 - smoke_solver_ConvHipImplicitGemmV4R1Fwd_fp16_bf16 (Failed)

I think we should disable these tests on gfx1100, since these solvers have been restricted to existing architectures only.

@junliume All these tests are NOT enabled for gfx11.

@atamazov
Contributor

@junliume

I have both gfx1030 and gfx1100 on my system (two cards), so maybe that is the reason?

If you attached the full logs as I requested, I would be able to answer this question 😄

nothing special about this though:

env | grep -E "MIO|AMD|ROC|GPU|DEVICE" | sort

Your system must have ROCR_VISIBLE_DEVICES set to 0 or 1, so rocminfo shows only gfx1100 and ignores gfx1030. More info at ROCm/ROCm#841 (comment)
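A minimal sketch of that setup on a two-card machine follows. Which index belongs to the gfx1100 card is an assumption here (0 is used only as an example) and should be checked with rocminfo on the actual system.

```shell
# Assumption: index 0 is the gfx1100 card; verify this on the real machine.
export ROCR_VISIBLE_DEVICES=0

# Confirm which targets remain visible (skipped when rocminfo is not installed).
if command -v rocminfo >/dev/null 2>&1; then
  rocminfo | grep -oE 'gfx[0-9a-f]+' | sort -u
fi

# Then re-run the tests from the build directory, as earlier in the thread:
#   make -j$(nproc) check
```

If the set of enabled tests is decided at configure time from the rocminfo output, as the discussion suggests, re-running cmake with the variable already exported is probably needed as well.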

@junliume
Collaborator Author

@atamazov I think we need to modify CMakeLists.txt in the test folder so that ROCMINFO shows Device 0 only.
Currently I have both gfx1030 and gfx1100 in the system, which is why these tests are accidentally enabled.

@atamazov
Contributor

atamazov commented Apr 26, 2023

@junliume This won't help because the tests (executables) will likely use both devices. Please use ROCR_VISIBLE_DEVICES as suggested in my previous comment.

Currently MIOpen does not support non-uniform GPU configurations.

@junliume
Collaborator Author

@atamazov Thanks! Yep, ROCR_VISIBLE_DEVICES should work. I see that the MIOpen tests are usually executed only on the first device (Device 0), so I thought that maybe CMakeLists could emit a message saying that, even though multiple devices were found, we are defaulting to the first device only.
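The kind of check proposed here could be sketched outside of CMake as a small shell helper. This is only an illustration: the function names are hypothetical, and it assumes that target names appear in rocminfo output as strings of the form gfxNNNN.

```shell
# Hypothetical helper: list the distinct gfx targets found in
# rocminfo-style text read from stdin.
distinct_gfx_targets() {
  grep -oE 'gfx[0-9a-f]+' | sort -u
}

# Hypothetical helper: print a warning when more than one target is present.
warn_if_mixed_gpus() {
  targets=$(distinct_gfx_targets)
  count=$(printf '%s\n' "$targets" | grep -c .)
  if [ "$count" -gt 1 ]; then
    echo "warning: multiple GPU targets found ($(echo $targets)); set ROCR_VISIBLE_DEVICES to select one"
  fi
}

# Run against the real tool only when it is installed.
if command -v rocminfo >/dev/null 2>&1; then
  rocminfo | warn_if_mixed_gpus
fi
```

On a machine with both cards visible this would name gfx1030 and gfx1100 in the warning; with a single visible target it prints nothing.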


4 participants