[BUG] TPC-DS-like query 95 at scale=3TB fails with OOM #1630
Comments
@abellina does the query pass with 1 concurrent task?
@sameerz: Neither 1 nor 2 concurrent tasks reproduces the OOM for me so far. 4 concurrent tasks definitely increases the chance.
With 4 concurrent: The stage that failed had 200 tasks. I don't see task skew in this stage (each task was loading 1.5GB of shuffle data). These tasks were hitting OOM and spilled quite a bit from the shuffle store. One of the joins (a left semi on
With 1 concurrent: this stage had no spilling at all.
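To make the numbers above concrete, here is a rough back-of-the-envelope sketch of how much shuffle input is resident on one GPU at each concurrency level. It uses only the figures quoted in this thread (1.5GB of shuffle data per task, 200 tasks in the stage). The function name is hypothetical, and the sketch assumes each concurrent task holds its full shuffle input on the GPU at once. It counts input only, so join output and intermediate buffers would come on top of this.

```python
# Figures taken from the comment above; everything else is an assumption.
GB_PER_TASK = 1.5      # shuffle data loaded by each task
TASKS_IN_STAGE = 200   # tasks in the failing stage


def resident_shuffle_gb(concurrent_tasks, gb_per_task=GB_PER_TASK):
    """Shuffle input held on one GPU if every concurrent task loads its data at once."""
    return concurrent_tasks * gb_per_task


def stage_shuffle_gb(tasks=TASKS_IN_STAGE, gb_per_task=GB_PER_TASK):
    """Total shuffle input read by the whole stage, assuming no skew."""
    return tasks * gb_per_task


if __name__ == "__main__":
    for concurrency in (1, 2, 4):
        print(concurrency, "concurrent:", resident_shuffle_gb(concurrency), "GB of input per GPU")
    print("whole stage:", stage_shuffle_gb(), "GB")
```

Under these assumptions the input alone grows linearly with concurrency, which is consistent with spilling showing up at 4 concurrent tasks but not at 1.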
@abellina What was the performance difference for the different concurrency levels?
Running cold runs each time: So ~10% difference with some noise.
Now that #2310 is merged, it would be good to test this again. I was able to make this work with.
on a 16GB V100 at scale factor 200. This should be close to a concurrency of 4 on a 40GB A100 at scale factor 3000, though it is not a perfect estimate.
This is working now.
This is with 8 executors (each with a 40GB A100) and 4 concurrent tasks (4 cores/exec).
Other pieces are also failing with OOM (like the filter), but I believe this one is caused by a join that got too big.