-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] TPC-ds 14a and 14b failed to run #650
Comments
From a quick look through the logs, it looks like you ran out of memory on the GPU as a part of a concat operation. Not really sure if it is the same type of concat operation that we have seen in other places where we need all of the data in memory for a join or a sort, but I would guess that is the case. It sounds like we need to work on our out of core processing, but I am not 100% sure on that. |
what is |
out of core processing generally refers to algorithms that support processing data that is larger that fits in memory. |
This may be related to data skew, issue #20. |
We just merged in #2310 to do some initial support for out of core joins. It is not perfect, but should help a lot with large joins. I have tested it on TPC-DS 14a and 14b at scale factor 200, but with a much smaller number of shuffle partitions.
2 shuffle partitions at scale factor 200 should be equivalent to 100 shuffle partitions at scale factor 10,000, assuming that there is not some kind of skew that shows up at larger scale factors. If you could try to retest this it would be great. |
Closing this for now as I think it is fixed. Please reopen if you see more issue with this after the current SNAPSHOT version. |
Signed-off-by: Robert (Bobby) Evans <[email protected]>
TPC-DS query 14a && 14b failed to run
we have two machine, each machine has 4 V100 (16GB) (total 8 executors)
rapids-0.2
cuDF-0.15 with PTDS=off
gpu concurrent = 2
rapids batch size=256M
for the tpc-ds dataset, we use a scale-factor=10000 (the final parquet file is around 3.6T)
i have tried multiple shuffle partition settings from 384 to 1024, all failed to run those two query
I see
RMM failure
andother failure
from the log; But I want to ensure which failure is the root cause?BTW:
when rapids batch-size=64M, gpu concurrent=1 and shuffle partition=3000, then we can run them successfully, but the performance is very very poor.
the tar package is the log (please rm the .txt postfix)
tpc-ds.q14.tgz.txt
Thanks
The text was updated successfully, but these errors were encountered: