[LTO] Unit test testCudaCheck failing for aarch64 architecture #40834

Closed
aandvalenzuela opened this issue Feb 21, 2023 · 7 comments · Fixed by #40840

@aandvalenzuela
Contributor

Hello,

The test testCudaCheck (module HeterogeneousCore/CUDAUtilities) has been failing since 1 February in LTO IBs (aarch64 architecture only) with a segmentation violation when checking the driver API:

/data/cmsbld/jenkins_a/workspace/build-any-ib/w/tmp/BUILDROOT/d82d494993e018306d251480a039b1b4/opt/cmssw/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_1_LTO_X_2023-02-19-0000/src/HeterogeneousCore/CUDAUtilities/test/testCudaCheck.cpp:13: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGSEGV - Segmentation violation signal

===============================================================================
test cases: 1 | 1 failed
assertions: 2 | 1 passed | 1 failed


 *** Break *** segmentation violation

See stacktrace.

The test was added on 1 February via #40619 and has passed on amd64 since then, but not on aarch64. Tests are supposed to pass on machines with and without GPUs (#40619 (comment)). I can reproduce the issue on all our aarch64 machines.

Thanks,
Andrea.

@cmsbuild
Contributor

A new Issue was created by @aandvalenzuela Andrea Valenzuela.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@aandvalenzuela
Contributor Author

assign heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@aandvalenzuela
Contributor Author

Not sure why it succeeds in non-LTO IBs:

===== Test "testCudaCheck" ====
===============================================================================
All tests passed (4 assertions in 1 test case)


---> test testCudaCheck succeeded

The output above is from CMSSW_13_1_X_2023-02-20-2300, also for el8_aarch64_gcc11.

@makortel
Contributor

I took a look and it seems that the error and message in

const char* error;
const char* message;
cuGetErrorName(result, &error);
cuGetErrorString(result, &message);
abortOnCudaError(file, line, cmd, error, message, description);

remain uninitialized after the cuGetErrorName() and cuGetErrorString() calls. I did not actually verify that, but their values inside abortOnCudaError() are garbage, and initializing them to nullptr seems to fix the crash.
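
The suspected failure mode can be illustrated without a GPU or the CUDA driver. The sketch below is not the code from the fix: `fakeGetErrorName`, `orUnknown` and `describeError` are hypothetical names, and the stand-in function simply assumes that a failing lookup may return without writing its output pointer. With the pointers initialized to nullptr, the caller can detect that case instead of reading garbage.

```cpp
#include <cassert>
#include <cstring>
#include <sstream>
#include <string>

// Hypothetical stand-in for a driver call such as cuGetErrorName().
// Assumption: on failure it returns an error code and leaves the output
// pointer untouched, which is the behaviour suspected in this issue.
inline int fakeGetErrorName(int /*result*/, const char** /*pStr*/) {
  return 1;  // non-zero: lookup failed, *pStr was not written
}

// Hypothetical helper: substitute a placeholder when no string is available.
inline const char* orUnknown(const char* s) { return s != nullptr ? s : "unknown"; }

// Build the text that would be handed on to an abort routine.
inline std::string describeError(int result) {
  const char* error = nullptr;    // initialized, so a skipped write is detectable
  const char* message = nullptr;  // same for the human-readable message
  fakeGetErrorName(result, &error);
  fakeGetErrorName(result, &message);

  std::ostringstream out;
  out << orUnknown(error) << ": " << orUnknown(message);
  return out.str();
}
```

If the two pointers were left uninitialized instead, streaming them would read indeterminate addresses, which is undefined behaviour and may or may not crash depending on the platform and build options.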

@makortel
Contributor

Fixed in #40840

@aandvalenzuela
Contributor Author

Thanks @makortel!
