[BUG] TPC-DS-like query 24a and 24b at scale=3TB fails with OOM #1628
Comments
These two queries have the problem of horrific skew on join keys followed by an exploding join. One of the join conditions in both queries is We will likely need some kind of chunked join output functionality from libcudf to handle this. |
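To make the failure mode above concrete, here is a minimal sketch in plain Python (not libcudf or the plugin's code). It assumes hypothetical key lists with one hot key. It shows how an inner equi-join's output grows as the product of per-key counts, and how the output could be emitted in bounded chunks rather than materialized at once. The names `join_output_rows` and `chunked_join` are invented for illustration.

```python
from collections import Counter
from itertools import islice

def join_output_rows(left_keys, right_keys):
    """Rows an inner equi-join emits: sum over keys of left_count * right_count."""
    left, right = Counter(left_keys), Counter(right_keys)
    return sum(n * right[k] for k, n in left.items())

def chunked_join(left_keys, right_keys, chunk_rows):
    """Yield inner-join output as lists of at most chunk_rows (left_idx, right_idx) pairs."""
    # Index the right side by key so matches can be produced lazily.
    right_index = {}
    for j, k in enumerate(right_keys):
        right_index.setdefault(k, []).append(j)
    # Lazy stream of matching row-index pairs; nothing is materialized yet.
    pairs = ((i, j) for i, k in enumerate(left_keys) for j in right_index.get(k, ()))
    while True:
        chunk = list(islice(pairs, chunk_rows))
        if not chunk:
            return
        yield chunk

# Hypothetical data: one hot key dominates both sides (assumption, not from the issue).
skewed_left = ["hot"] * 100 + ["a", "b"]
skewed_right = ["hot"] * 100 + ["a", "c"]
# Same input sizes, but every key is unique.
uniform_left = list(range(102))
uniform_right = list(range(102))

print(join_output_rows(uniform_left, uniform_right))  # 102 input rows per side, no explosion
print(join_output_rows(skewed_left, skewed_right))    # the hot key alone contributes 100 * 100 rows
print(max(len(c) for c in chunked_join(skewed_left, skewed_right, 4096)))  # peak chunk size stays bounded
```

With chunking, peak memory for the join output is bounded by `chunk_rows` instead of by the full exploded result, which is the property the comment above asks for.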
I believe that this is likely fixed now that #2310 has been merged in. I was able to run both query 24a and 24b at scale factor 200, but with only 2 shuffle partitions. This should be equivalent to running at scale factor 3000 with 30 partitions. But because this deals with skewed data (specifically |
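The equivalence claimed above is an average-load argument, which can be checked with a small sketch. It assumes data volume scales linearly with the scale factor and that rows spread evenly across shuffle partitions. The helper name `avg_scale_per_partition` is hypothetical.

```python
def avg_scale_per_partition(scale_factor, shuffle_partitions):
    """Average share of the scale factor handled by one shuffle partition.

    Assumes an even spread of rows across partitions, which skewed keys violate.
    """
    return scale_factor / shuffle_partitions

small_run = avg_scale_per_partition(200, 2)    # the reduced test described above
large_run = avg_scale_per_partition(3000, 30)  # the configuration it stands in for

print(small_run, large_run)
```

Both configurations give the same average per partition. This says nothing about the largest partition, which is what matters when the join keys are skewed.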
@revans2 sorry I missed this comment. I ran both 24a and 24b myself at 3TB and they are both passing for me. I used 200 shuffle partitions (default), and ditto with
|
Thanks for the update, @abellina! Based on them now passing with defaults, closing this as fixed. |
I have seen this with and without the RapidsShuffleManager. In this case, the device store I see two tasks wanting to allocate ~1GB each.
This is with 8 executors (each with a 40 GB A100) and 4 concurrent tasks (4 cores per executor).