[BUG] cudaErrorIllegalAddress: an illegal memory access was encountered #563
Comments
I think I've seen similar errors, but haven't been able to reproduce them reliably.
The fact that these errors are picked up in RMM does not mean that they originate in RMM. RMM could just be picking up latent, asynchronous CUDA errors caused somewhere else.
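For context, this is the kind of behavior being described: a kernel launched earlier can touch an invalid address, the launch itself reports success, and the sticky cudaErrorIllegalAddress only shows up at a later synchronization point, which may well be inside RMM. A minimal hypothetical sketch, not code from this issue:

```cpp
// Hypothetical sketch: the kernel dereferences a null device pointer.
// The launch returns cudaSuccess because launches are asynchronous; the
// sticky cudaErrorIllegalAddress is only reported at a later sync point
// (e.g. a stream sync inside an allocator), far from the code that caused it.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void bad_write(int* p) { *p = 42; }

int main() {
  bad_write<<<1, 1>>>(nullptr);

  // Typically still cudaSuccess: the fault has not been observed yet.
  printf("after launch: %s\n", cudaGetErrorString(cudaGetLastError()));

  // The error surfaces here, at the first call that waits on the device.
  cudaError_t err = cudaDeviceSynchronize();
  printf("after sync:   %s\n", cudaGetErrorString(err));
  return 0;
}
```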
Hmm, seems like this can reliably produce a test failure when built with for i in {1..100}; do ./gtests/DEVICE_SCALAR_TEST || break; done
Aha. I see what the problem is. I'll have a PR with the fix in a moment.
This should resolve it: #569
Also see #570 re: asynchrony of
There are two kinds of "illegal" errors: one is cudaErrorIllegalAddress, the other is cudaErrorMisalignedAddress (on the same code lines). TPC-DS queries 14a and 14b have a very high probability of triggering this bug.
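The two error codes come from different kinds of bad accesses: any wild, freed, or out-of-bounds pointer gives cudaErrorIllegalAddress, while a valid address read at the wrong alignment gives cudaErrorMisalignedAddress. A small hypothetical illustration of the misaligned case:

```cpp
// Hypothetical illustration: the address is valid, but reading it as a
// double violates the 8-byte alignment requirement, so the kernel faults
// with cudaErrorMisalignedAddress instead of cudaErrorIllegalAddress.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void misaligned_read(const char* base, double* out) {
  *out = *reinterpret_cast<const double*>(base + 1);  // off-by-one byte offset
}

int main() {
  char* buf = nullptr;
  double* out = nullptr;
  cudaMalloc(&buf, 64);
  cudaMalloc(&out, sizeof(double));

  misaligned_read<<<1, 1>>>(buf, out);
  cudaError_t err = cudaDeviceSynchronize();
  printf("%s\n", cudaGetErrorString(err));  // expected: misaligned address
  return 0;
}
```

Seeing both codes on the same lines would be consistent with a buffer being freed or overwritten underneath a kernel: the resulting garbage pointer can land on an unmapped address (illegal address) or merely an unaligned one (misaligned address).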
@jrhemstad I'm still seeing test failures after syncing to head and rebuilding:
Seems we need to add a
Looked at
TPC-DS queries 9, 14a, 14b, 99, 88, 35, 58, 82, 46, 87, 38, 70, 48, 59, 11, 61, 24b, 78, 50, 8, and 25 are the most affected. @rongou
Maybe it is due to some unknown issue in
@rongou Found a cudaErrorIllegalAddress when using the arena allocator (TPC-DS 64 and TPC-DS 46).
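For anyone trying to reproduce, "arena" here refers to RMM's arena_memory_resource. A rough sketch of wiring it up follows; headers and constructor arguments vary between RMM versions, so treat the exact signatures below as assumptions rather than the plugin's actual configuration:

```cpp
// Sketch only: route RMM device allocations through an arena resource.
// Exact include paths / constructor defaults depend on the RMM version.
#include <rmm/mr/device/arena_memory_resource.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

int main() {
  // Upstream resource that actually calls cudaMalloc/cudaFree.
  rmm::mr::cuda_memory_resource cuda_mr;

  // Arena resource sub-allocating from the upstream (default sizes here;
  // real deployments usually pass explicit arena sizes).
  rmm::mr::arena_memory_resource<rmm::mr::cuda_memory_resource> arena_mr{&cuda_mr};

  // Make it the current device resource so libcudf allocations use it.
  rmm::mr::set_current_device_resource(&arena_mr);

  // ... run the workload that reproduces the error ...
  return 0;
}
```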
@jrhemstad It seems #569 did not resolve this?
Building RMM debug and running under
It's likely there are race conditions in libcudf and/or the spark-rapids plugin. RMM is only surfacing these issues when synchronizing the stream or waiting on events.
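To illustrate the class of bug being suggested here (hypothetical code, not taken from libcudf or spark-rapids; it uses the CUDA runtime's stream-ordered allocator from CUDA 11.2+ for brevity, but a pooled RMM resource has the same stream-ordering requirement):

```cpp
// Hypothetical cross-stream race: a kernel on stream_a is still using a
// buffer when it is freed in stream_b order, so the allocator may hand the
// memory to someone else. The resulting fault may surface later as
// cudaErrorIllegalAddress at a sync point inside the allocator.
#include <cuda_runtime.h>

__global__ void consume(int* p, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) p[i] += 1;
}

int main() {
  cudaStream_t stream_a, stream_b;
  cudaStreamCreate(&stream_a);
  cudaStreamCreate(&stream_b);

  size_t n = 1 << 24;
  int* buf = nullptr;
  cudaMallocAsync(reinterpret_cast<void**>(&buf), n * sizeof(int), stream_a);
  consume<<<(n + 255) / 256, 256, 0, stream_a>>>(buf, n);

  // BUG: freeing in stream_b order does not wait for the kernel on stream_a.
  // Correct code would first record an event on stream_a and make stream_b
  // wait on it (cudaEventRecord + cudaStreamWaitEvent) before the free.
  cudaFreeAsync(buf, stream_b);

  // The error, if it manifests, is reported here rather than at the bug site.
  cudaError_t err = cudaStreamSynchronize(stream_a);
  (void)err;
  return 0;
}
```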
So are you no longer seeing the DEVICE_SCALAR_TEST failures?
I am, but as I said, that seems to be caused by creating and destroying streams in every test. Maybe the CUDA driver is reusing some data structure for streams, and that is causing the problem. In any case, it seems like a different issue from the cudaErrorIllegalAddress errors.
I don't understand why you get these failures and nobody else on the team does (and CI doesn't). CI is currently only testing PTDS mode. (#555)
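For anyone reproducing this, PTDS refers to CUDA's per-thread default stream. Roughly how it gets enabled is shown below; the exact wiring inside the cudf/rmm build scripts may differ, so this is only a sketch of the general mechanism:

```cpp
// Per-thread default stream (PTDS) is usually enabled in one of two ways:
//   1) compile with: nvcc --default-stream per-thread ...
//   2) define this macro before including any CUDA runtime header:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM 1
#include <cuda_runtime.h>

// With PTDS, work launched on the default stream by different host threads
// no longer serializes against other streams, so missing synchronization
// that the legacy (blocking) default stream used to hide can start to race.
int main() { return 0; }  // nothing to run; this file only shows the build knobs
```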
@JustPlay are you still seeing these errors? I ran all the TPC-DS queries in a loop several times and haven't seen any memory errors. I'm using the latest
Sorry, I have not tested RAPIDS recently; I will report back if I find anything. Thanks.
Closing for now. Please reopen if you run into the errors again.
Describe the bug
When running rapids-0.2 with cuDF 0.16 and RMM 0.16, I encountered the following RMM error.
I'm using:
RMM with commit-id: f591436
cuDF with commit-id: 32e6c1d9369f9a5bfe81958bf1a4a81af51bb59e
cuDF and RMM were built using https://github.com/rapidsai/cudf/blob/branch-0.16/java/ci/build-in-docker.sh, but with PTDS turned on.