[BUG] TPC-DS-like query 67 at scale=3TB fails with OOM #1642
Comments
The failure appears to be happening in the middle of a table concat that is being done to get a single batch, so that the window operation can see all of its input at once. From what I have seen, we probably could increase the number of shuffle partitions and get this query to pass, although I am not 100% sure of that. I have tried to reproduce the error with a smaller data set and less hardware, but I am having some trouble doing that. It looks like the fix for this would be to create an out-of-core sort algorithm that allows parts of the data to spill out of GPU memory. With that we would need to be sure that the sort interacted properly with other operations on the GPU, like window operations. We would either need to make sure that they still get a single batch of data when they want it, or update them to not require a single batch of input and instead split their input on partition key boundaries, so each operation gets all of the needed data for a given key.
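For reference, the shuffle-partition experiment mentioned above is just a config change; this is a minimal sketch, and the value is illustrative rather than a tested recommendation:

```scala
// Raise the shuffle partition count so each post-shuffle task handles a
// smaller slice of the data. 800 is an illustrative value, not a tested one.
spark.conf.set("spark.sql.shuffle.partitions", "800")
```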
I took a look at the data/query a bit too. The query is trying to do a ranking window function that is partitioned by "i_category". There are only 11 "i_category" values (10 real values and one null value, which is not as common as the non-null values). The distribution across the 10 real values, at least on a smaller data set, is fairly even. So it looks like it does not matter how many shuffle partitions we have, so long as there are more than 11: if the data is large enough we run the risk of hitting this issue.

Looking at the data itself, I estimate that each category (in the SF=3k case) is getting about 14.3 GB of data and will be producing another 542 MB result to go with that data as a part of the window function. This means that there is no way we are going to be able to sort 14.3 GB of data unless we have a 40+ GB GPU, and even then, if we get unlucky and two tasks happen to both be running on the GPU at the same time, it will not work. That is what I expect is happening. Because we are looking at implementing rank on the GPU too (#1584), an out-of-core sort is not likely to be enough. We may need to think about whether there are other ways for us to calculate a window function without all of the data for that window being present at once. We might also need some kind of special-case optimization for this query, where it is essentially doing a TopN for rank. But I am not sure about that either. The shape of the query is sketched below.
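For context, here is a minimal Scala sketch (plain Spark, not the plugin internals) of the query shape being described: a `rank()` window partitioned by "i_category" followed by a filter on the rank. The `rk <= 100` cutoff matches the published TPC-DS q67 template; the `aggregated` DataFrame is a hypothetical stand-in for the query's pre-aggregated input.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank every row within its category by descending sales, then keep the
// top 100 per category. Computing rank() forces all rows for a category
// into one window partition, which is where the single-batch requirement
// (and the OOM) comes from.
val byCategory = Window.partitionBy("i_category").orderBy(desc("sumsales"))
val top = aggregated // hypothetical DataFrame holding the pre-aggregated rows
  .withColumn("rk", rank().over(byCategory))
  .where(col("rk") <= 100)
```

Because the rank filter is a fixed cutoff, in principle each batch only needs to retain its top 100 rows per category, which is what makes a TopN-style special case plausible.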
I thought a bit more about rank, and I think there are a number of window operations where we could get some help from cudf to make them operate on chunks of data, without needing all of the data at once. Rank is one of these, because it only needs to know about the last row from the previous batch/chunk. Row number works the same way. Depending on how much memory we are willing to use in between batches, any row-based query that has bounded preceding and following would work too. The hard part with those is that we could not output the entire input, because we will not know the answers for the last set of rows until we get the next batch, know we are done with all the data, or hit a boundary between partitions. In theory we could do the same thing with range-based queries, but the amount of data saved would potentially be unbounded. We can probably implement most of this ourselves, but we might need some help from cudf to know where to split the output/input so that we get the right answer each time. A sketch of how the carried state for rank could work is below.
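As a rough illustration of that carry-over idea, here is a CPU-side sketch in plain Scala (not cudf, and all names here are made up): it ranks an already-sorted stream of batches while only keeping the last row's keys, its rank, and a row count between chunks.

```scala
case class Rec(part: String, order: Long) // window partition key + ORDER BY key
case class Carry(part: String, order: Long, rank: Long, rows: Long)

// Rank one sorted chunk, given the carry-over state from the previous chunk.
// rank() only needs the last row's keys, its rank, and how many rows of the
// current partition we have seen so far.
def rankChunk(chunk: Seq[Rec], in: Option[Carry]): (Seq[(Rec, Long)], Option[Carry]) = {
  var carry = in
  val out = chunk.map { r =>
    val rows = carry match {
      case Some(c) if c.part == r.part => c.rows + 1 // still in the same partition
      case _                           => 1L         // first row of a new partition
    }
    val rk = carry match {
      case Some(c) if c.part == r.part && c.order == r.order => c.rank // tie: same rank
      case _                                                  => rows  // new value: rank = position
    }
    carry = Some(Carry(r.part, r.order, rk, rows))
    (r, rk)
  }
  (out, carry)
}

// Stream batches through; only Carry survives between chunks.
val batches = Iterator(
  Seq(Rec("Books", 90), Rec("Books", 90)),
  Seq(Rec("Books", 75), Rec("Music", 99)))
batches.foldLeft(Option.empty[Carry]) { (c, b) =>
  val (ranked, next) = rankChunk(b, c)
  ranked.foreach { case (r, rk) => println(s"$r -> rank $rk") }
  next
}
```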
I am going to try to implement an out-of-core sort. It should give us an idea of the performance cost of doing it and let us see if it is something we should just have on all the time. A sketch of the general technique is below.
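To make the idea concrete, here is a minimal out-of-core (external merge) sort sketch in plain Scala. It spills each in-memory-sorted chunk to disk as a run, then k-way merges the runs with a priority queue. This is the textbook technique the comment refers to, not the actual spark-rapids implementation, which would spill GPU buffers rather than text files.

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable
import scala.io.Source

// Sort one chunk that fits in memory and spill it to disk as a sorted run.
def spillRun(chunk: Seq[Long]): File = {
  val f = File.createTempFile("sort-run", ".txt")
  f.deleteOnExit()
  val w = new PrintWriter(f)
  chunk.sorted.foreach(x => w.println(x))
  w.close()
  f
}

// K-way merge the sorted runs: a min-heap of run indices, keyed by each
// run's current head element, yields the globally sorted output lazily.
def mergeRuns(runs: Seq[File]): Iterator[Long] = {
  val its = runs.map(f => Source.fromFile(f).getLines().map(_.toLong).buffered)
  val heap = mutable.PriorityQueue.empty[Int](Ordering.by((i: Int) => -its(i).head))
  its.indices.foreach(i => if (its(i).hasNext) heap.enqueue(i))
  new Iterator[Long] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): Long = {
      val i = heap.dequeue()
      val v = its(i).next()
      if (its(i).hasNext) heap.enqueue(i) // re-insert with the run's new head
      v
    }
  }
}

// Usage: three chunks that individually fit in memory.
val runs = Seq(Seq(5L, 1L, 9L), Seq(2L, 8L), Seq(7L, 3L)).map(spillRun)
mergeRuns(runs).foreach(println) // 1 2 3 5 7 8 9
```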
This is working now.

This is with 8 executors (each with an A100 with 40 GB of memory) and 4 concurrent tasks (4 cores per executor). There are other pieces that are failing with OOM (like the filter), but I believe this to be a join that got too big.