Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tests] Fix for DEV builds. Fix testing on 16GB GPUs. #1729

Merged
merged 11 commits into from
Sep 4, 2022

Conversation

atamazov
Copy link
Contributor

@atamazov atamazov commented Sep 1, 2022

A pack of changes that expected to fix problems with testing of DEV builds and problems with testing on 16GB GPUs (e.g. Radeon VII, MI25, Radeon RX 6800/6900). May improve CI stability.

@shurale-nkn @junliume @muralinr Please review. I would recommend urgency_high. Thanks.

Local make check passed on Radeon VII with both FP16 and FP32.

[Informative] COMMANDS TO BUILD AND RUN TESTS:

$ CXX=/opt/rocm/llvm/bin/clang++ \
CXXFLAGS='-Werror' \
cmake \
-DMIOPEN_TEST_FLAGS=' --disable-verification-cache ' \
-DCMAKE_BUILD_TYPE=release \
-DBUILD_DEV=On \
-DMIOPEN_GPU_SYNC=Off \
-DMIOPEN_TEST_ALL=On \
-DMIOPEN_TEST_HALF=On \
../..
$ CTEST_PARALLEL_LEVEL=4 make -j `nproc --ignore=4` check

@atamazov atamazov changed the title Fix issue 1674 2 [tests] Fix for DEV builds. Fix testing on 16GB GPUs. Sep 1, 2022
@atamazov atamazov marked this pull request as ready for review September 1, 2022 23:13
@atamazov
Copy link
Contributor Author

atamazov commented Sep 1, 2022

@junliume @shurale-nkn Locally tested. Description updated. Ready for review.

@atamazov
Copy link
Contributor Author

atamazov commented Sep 1, 2022

@junliume If some memory-consuming tests were disabled due to instability, then we can try enabling them after this PR is merged in.

@@ -284,9 +284,8 @@ mloPoolingMaxBwd(const __global _FLOAT* top_df,

barrier(CLK_LOCAL_MEM_FENCE);
_FLOAT add_val;
int bt_y = (y + lcl_id1 * MLO_POOLBWD_N_VERT_OUT_PIX);
int bt_x = (x + lcl_id0 * MLO_POOLBWD_N_HORIZ_OUT_PIX);
int b_idx = bt_y * mlo_bot_width + bt_x;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is "b_idx" not used anywhere and redundant?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@muralinr it seems that we have met the same issue:
#1732

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes Jun. I looked at code. 'b_idx" is computed and it was not used inside code anywhere. It's should be removed it to avoid compilation issues.

@@ -30,7 +30,7 @@ namespace miopen {

std::string HIPErrorMessage(int error, const std::string& msg)
{
return msg + " " + hipGetErrorString(static_cast<hipError_t>(error));
return msg + ": " + hipGetErrorString(static_cast<hipError_t>(error));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is ":" change mandatory?

Copy link
Contributor Author

@atamazov atamazov Sep 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we want to automatically separate error message from the error code, see https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1729/files#r962057473.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

@@ -211,6 +211,7 @@ void test_warnings(kernel_type_t kern_type)
if(kern_type == miopenOpenCLKernelType)
EXPECT(throws([&] {
h.AddKernel("GEMM", "", WriteNop(kern_type), "write", {1, 1, 1}, {1, 1, 1}, "");
MIOPEN_LOG_E("FAILED: Build of the OpenCL kernel should produce warnings");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is very good.

@@ -223,7 +224,8 @@ void test_warnings(kernel_type_t kern_type)
"",
0,
false,
WriteNop(miopenHIPKernelType));
WriteNop(kern_type));
MIOPEN_LOG_E("FAILED: Build of the HIP kernel 'nop_hip.cpp' should produce warnings");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. This change is good.

auto status = hipMemPtrGetInfo(mem, &size);
if(status != hipSuccess)
MIOPEN_LOG_W("hipMemPtrGetInfo at " << mem << " status: " << status);
status = hipFree(mem);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it ok to use hipFree for device and for host memory?

src/hip/handlehip.cpp Outdated Show resolved Hide resolved
src/hip/handlehip.cpp Outdated Show resolved Hide resolved
Comment on lines -92 to -93
MIOPEN_THROW_HIP_STATUS(status,
"Hip error creating buffer " + std::to_string(sz) + ": ");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@muralinr I do not think we want to append + ": " to each MIOPEN_THROW_HIP_STATUS.

junliume
junliume previously approved these changes Sep 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants