[BUG] INC AFTER CLOSE for ColumnVector during shutdown in the join code #7581
Comments
Need to revisit this after addressing issue #7255.
It would be worth re-running NDS at 3TB with the restricted config and the INC/DEC debug config enabled. I didn't name the specific test here, but given my configs in that cluster, it was this benchmark.
I ran NDS at 3TB with the GPU memory fraction at 0.34 and did not see a recurrence of this. I also did not see any task failures, so I tried running with GPU memory at 6 GB, and still did not see any INC AFTER CLOSE in the join code. I think this one is fixed.
Thanks for running it @jbrennan333.
In the AbstractGpuJoinIterator/JoinGatherer code, I am seeing INC AFTER CLOSE errors when an OOM exception occurs and we are shutting down. This was seen in our performance cluster with the regular settings, except for:
spark.rapids.memory.gpu.allocFraction=0.34
Here is one of the INC stack traces, showing which ColumnVector was problematic. With the memory debug settings enabled we get much more output, showing all of the INC/DEC stacks, and we would need that enabled to debug further.
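For readers unfamiliar with the error, the sketch below models what an "INC AFTER CLOSE" means: a reference-counted resource is freed when its count reaches zero, and a later attempt to increment it (for example, from code still holding a handle while an OOM shutdown path closes things) is rejected. `RefCountedColumn` and `IncAfterCloseDemo` are hypothetical names for illustration only; this is not the cudf `ColumnVector` API and not the actual join-code fix.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class IncAfterCloseDemo {

    /** Hypothetical stand-in for a reference-counted GPU column (NOT the cudf API). */
    static final class RefCountedColumn implements AutoCloseable {
        private final String name;
        // The creator owns the first reference.
        private final AtomicInteger refCount = new AtomicInteger(1);

        RefCountedColumn(String name) {
            this.name = name;
        }

        /** Take another reference; fails if the resource has already been freed. */
        RefCountedColumn incRefCount() {
            while (true) {
                int current = refCount.get();
                if (current <= 0) {
                    throw new IllegalStateException("INC AFTER CLOSE for " + name);
                }
                // CAS so a concurrent close cannot slip in between the check and the increment.
                if (refCount.compareAndSet(current, current + 1)) {
                    return this;
                }
            }
        }

        /** Drop one reference; the resource is considered freed when the count reaches zero. */
        @Override
        public void close() {
            int remaining = refCount.decrementAndGet();
            if (remaining < 0) {
                throw new IllegalStateException("Close called too many times on " + name);
            }
        }

        int getRefCount() {
            return refCount.get();
        }

        boolean isClosed() {
            return refCount.get() <= 0;
        }
    }

    public static void main(String[] args) {
        RefCountedColumn col = new RefCountedColumn("joinGatherMap");
        // Shutdown path: the owner releases its only reference after an OOM.
        col.close();
        try {
            // A late caller that still holds the handle tries to share it.
            col.incRefCount();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints "INC AFTER CLOSE for joinGatherMap"
        }
    }
}
```

The usual way to avoid this pattern is for any component that needs the column beyond its owner's lifetime to take its own reference before the owner can close it, rather than incrementing a handle that may already have been released.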