We are seeing an odd race condition in spark-rapids (NVIDIA/spark-rapids#10585) that seems to be very dependent on timing. For example, we enabled debug logging to try to find the root cause, but turning the logging on appears to "fix" the problem.
I was able to narrow it down to state tracked in a spark-rapids class called HostAlloc, where we keep a counter of allocated pinned memory and trigger transitions in the state machine we use for our OOM handling. These state changes are happening too soon: MemoryBuffer.onClosed is triggered before we actually free the buffer, so we update state as if the memory were free when it has not been freed yet. As a result, a blocked thread never comes out of that state, because the event that would have woken it up (triggered by onClosed) fired too early.
The fix we are looking at is unfortunately in cuDF and comes very late for 24.04, but I think it's needed. We would reorder the close path so that the onClosed callback runs right after the free instead of before it.
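To make the ordering concrete, here is a minimal sketch of the two orderings. It is not the cuDF code: `CloseOrderSketch`, `SketchBuffer`, `CloseHandler`, and the `callbackAfterFree` flag are hypothetical names made up for illustration, and the callback receives a boolean only so the sketch can record what a listener would observe.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseOrderSketch {
    /** Hypothetical callback; NOT the cuDF MemoryBuffer event handler interface. */
    interface CloseHandler {
        void onClosed(boolean memoryFreed);
    }

    /** Hypothetical buffer that logs the order of "free" and the close callback. */
    static final class SketchBuffer {
        private final List<String> log;
        private final CloseHandler handler;
        private final boolean callbackAfterFree;
        private boolean freed = false;

        SketchBuffer(List<String> log, CloseHandler handler, boolean callbackAfterFree) {
            this.log = log;
            this.handler = handler;
            this.callbackAfterFree = callbackAfterFree;
        }

        // Stand-in for actually releasing the underlying memory.
        private void free() {
            freed = true;
            log.add("free");
        }

        void close() {
            if (!callbackAfterFree) {
                // Ordering described as the problem: listeners are told the
                // buffer is closed while the memory is still allocated.
                handler.onClosed(freed);
            }
            free();
            if (callbackAfterFree) {
                // Proposed ordering: listeners only run once the memory is gone.
                handler.onClosed(freed);
            }
        }
    }

    /** Closes one buffer with the chosen ordering and returns the event log. */
    static List<String> closeAndRecord(boolean callbackAfterFree) {
        List<String> log = new ArrayList<>();
        CloseHandler handler = memoryFreed -> log.add("onClosed(freed=" + memoryFreed + ")");
        new SketchBuffer(log, handler, callbackAfterFree).close();
        return log;
    }

    public static void main(String[] args) {
        System.out.println(closeAndRecord(false)); // [onClosed(freed=false), free]
        System.out.println(closeAndRecord(true));  // [free, onClosed(freed=true)]
    }
}
```

With the first ordering, a listener that updates a counter or wakes a waiting thread does so while the memory is still held, which is the kind of premature state change described above. With the second ordering, the listener only sees the close once the free has happened.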
I should have a PR today. The main thing I am worried about is that our spill framework uses this callback and the timing will be changed by this work, so I want to make sure we don't cause issues while spilling.
The relevant lines in cuDF:
cudf/java/src/main/java/ai/rapids/cudf/MemoryBuffer.java, line 244 in f9ac427
cudf/java/src/main/java/ai/rapids/cudf/MemoryBuffer.java, line 253 in f9ac427