[BUG] cudaErrorIllegalAddress for q95 (3TB) on GCP with ASYNC allocator #4710
Can you try running without any pooling? It would help us to know if it is a bug in CUDF/Plugin or if it is in the ASYNC allocator. I suspect it is in CUDF and ARENA is hiding it.
Yes, I am trying this now.
I have run q95 100 times without pooling, and the issue doesn't reproduce.
I wonder if it's because the async pool was running out of memory. Can you try to add
Wouldn't that manifest as a CUDA out of memory error code rather than an illegal address error?
Agree, I don't understand this either, but I tried it with
If I synchronize before the call, the error goes away. Here's the shuffled hash join node where I think we need this:
This change "fixes" it, but we need to figure out why exactly:
Note that I previously thought it was the non-mixed gatherer code, but adding a synchronize there didn't help. So it seems specific to mixed joins so far.
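As a rough illustration only (the actual plugin diff isn't shown above, and this is not it), the kind of workaround being described amounts to synchronizing the producing thread's per-thread default stream before the join's gather consumes the buffer. `consume_gather_map` below is a hypothetical stand-in for that downstream call:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical stand-in for the downstream work (e.g. the mixed-join gather)
// that reads the buffer on another thread/stream.
void consume_gather_map(const int* gather_map, std::size_t n);

void handoff(const int* gather_map, std::size_t n) {
  // With per-thread default streams (PTDS), work queued by this thread is not
  // ordered with work queued by other threads. Synchronizing here makes sure
  // the gather map is fully materialized before anyone else reads it.
  cudaStreamSynchronize(cudaStreamPerThread);
  consume_gather_map(gather_map, n);
}
```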
This also just reproed in our local performance cluster.
I saw a similar error with
An update on this issue. I believe a lot of this has to do with running out of memory. The original issue can be reproduced pretty easily if the number of concurrent threads in q95 is set high (16 concurrent allowed on the GPU, for example).

One issue looks to be that we are losing stream information given that we use PTDS, and additionally we don't set the stream for gatherMap and for contiguous_split buffers (these …). ARENA has a per-thread tracking system when PTDS is on (not per stream), so it is adding synchronization in cases where an allocation was created in thread A and freed in thread B. This happens during spill pretty often, and that looks to be one of the areas where we see issues. I think this implies that there are paths where we are not synchronizing appropriately before we hand a buffer to ASYNC to free on stream B. When I run the workers under …

That said, I am able to see an illegal access with ARENA in a single-threaded way as well; this may be a different issue, but it is happening in q95. For this issue I do have a … I should say I tried the concatenate in a sequential way:
And I am able to get through some of these concats, indicating I haven't reached a bad table, or I am somehow running out of memory in the …
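To make the cross-thread free hazard described above concrete, here is a minimal sketch (my own illustration in plain CUDA runtime calls, not plugin or RMM code) of the ordering that stream-ordered allocation expects: if a buffer was last used on stream A but is freed on stream B with `cudaFreeAsync`, B has to be ordered after A's work first, otherwise the pool may hand the memory out for reuse while A's kernel is still touching it.

```cpp
#include <cuda_runtime.h>

// Free a buffer on stream_b that was last used by work enqueued on stream_a.
// Without the event, cudaFreeAsync(buf, stream_b) lets the pool recycle the
// allocation as soon as stream_b reaches the free, racing with stream_a.
void free_on_other_stream(void* buf, cudaStream_t stream_a, cudaStream_t stream_b) {
  cudaEvent_t done;
  cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
  cudaEventRecord(done, stream_a);         // point after the last use of buf on A
  cudaStreamWaitEvent(stream_b, done, 0);  // order B after that point
  cudaFreeAsync(buf, stream_b);            // safe: reuse cannot begin before A is done
  cudaEventDestroy(done);
}
```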
I'll build again with lineinfo to see if we can get to the bottom of the concatenate issue. There is an old issue in cuDF that has a similar output (rapidsai/cudf#7722), but I am not sure if it is related.
OK, with lineinfo, I get that this is the line for the exception: https://github.com/rapidsai/cudf/blob/branch-22.04/cpp/src/copying/concatenate.cu#L186
The concatenate issue that I have reported here is specific to the configuration used to try and reproduce the ASYNC issue. The reason for this is I am starting q95 with 1 executor core, and I used 16 shuffle partitions, which changes the size of the batches that are materialized and sent to cuDF. For the invalid access, I'll link a small repro once I have one, and likely a PR fix in cuDF.
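For readers trying to reproduce this shape of run, the setup described above corresponds roughly to Spark settings like the following (illustrative only, not the exact configuration used for the original report):

```
spark.executor.cores=1
spark.sql.shuffle.partitions=16
```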
This is the cuDF concatenate issue: rapidsai/cudf#10333.
Unfortunately I have seen the error again in Dataproc, even when I am synchronizing events (#4818) in the spill framework (I also didn't see us spilling in this case). Again in the gather stack. Note that this is the error from
These are T4s running driver 460.106.00 (11.2), so this should be a supported configuration. I am failing to see why ASYNC would change things, other than it being another corruption, like rapidsai/cudf#10333, that is otherwise hidden. I'll try to run with compute-sanitizer in the Dataproc cluster; I should be able to make that work.
This specific issue is not reproducible for drivers at or above 11.4.3. The PR (#4947) makes sure we don't use ASYNC by mistake on drivers < 11.5.0, as that's the minimum driver version we can easily test for.
I'm adding a check in RMM: rapidsai/rmm#993
With NVIDIA/spark-rapids#4710 we found some issues with the async pool that may cause memory errors with older drivers. This was confirmed with the CUDA team. For driver versions < 11.5, we'll disable `cudaMemPoolReuseAllowOpportunistic`. @abellina

Authors:
- Rong Ou (https://github.com/rongou)

Approvers:
- Alessandro Bellina (https://github.com/abellina)
- Jake Hemstad (https://github.com/jrhemstad)
- Mark Harris (https://github.com/harrism)
- Leo Fang (https://github.com/leofang)

URL: #993
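Conceptually, the RMM change boils down to something like the sketch below (my paraphrase in plain CUDA runtime calls, not RMM's actual implementation): detect an older driver and turn off opportunistic reuse on the pool used by `cudaMallocAsync`.

```cpp
#include <cuda_runtime.h>

// On drivers older than 11.5 (reported as 11050 by cudaDriverGetVersion),
// disable opportunistic reuse in the device's default memory pool.
void disable_opportunistic_reuse_if_old_driver(int device) {
  int driver_version = 0;
  cudaDriverGetVersion(&driver_version);
  if (driver_version < 11050) {
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, device);
    int disabled = 0;
    cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic, &disabled);
  }
}
```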
I am seeing `cudaErrorIllegalAddress` for q95 pretty consistently with the ASYNC allocator. This was in Dataproc (Spark 3.1.2) with `n1-standard-32` instances with 2 T4s attached. The RAPIDS Shuffle Manager was not used, as opposed to #4695.

JARS used:
I ran q95 100 times with ARENA and I can't reproduce it; with ASYNC I got it to happen 12 times.
I see the `cudaErrorIllegalAddress` in two stacks:

Full stack:
Full stack:
Configs used: