
[BUG] TPC-DS 14a and 14b failed to run #650

Closed
JustPlay opened this issue Sep 3, 2020 · 6 comments
Labels
bug Something isn't working

Comments

JustPlay commented Sep 3, 2020

TPC-DS queries 14a and 14b fail to run.

We have two machines, each with four V100 GPUs (16 GB each), for a total of 8 executors.
rapids-0.2
cuDF-0.15 with PTDS=off
GPU concurrent tasks = 2
RAPIDS batch size = 256M
For the TPC-DS dataset we use scale factor 10000 (the final Parquet data is around 3.6 TB).

I have tried multiple shuffle partition settings, from 384 to 1024; all of them failed to run those two queries.

I see RMM failures and other failures in the logs, but I want to confirm which failure is the root cause.

BTW: with RAPIDS batch size = 64M, GPU concurrent tasks = 1, and shuffle partitions = 3000, we can run both queries successfully, but the performance is very poor.
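
For reference, a minimal sketch of those working-but-slow settings expressed as actual configuration keys. The spark.rapids.* key names match the ones revans2 quotes later in this thread; the session setup itself is illustrative, not part of the original report:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: the slow-but-stable settings described above,
// written out as Spark configuration keys.
val spark = SparkSession.builder()
  .appName("tpcds-q14")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  // RAPIDS Accelerator plugin
  .config("spark.rapids.sql.batchSizeBytes", "64m")       // "RAPIDS batch size = 64M"
  .config("spark.rapids.sql.concurrentGpuTasks", "1")     // "GPU concurrent tasks = 1"
  .config("spark.sql.shuffle.partitions", "3000")         // "shuffle partitions = 3000"
  .getOrCreate()
```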

The attached tar package contains the logs (please remove the .txt suffix):
tpc-ds.q14.tgz.txt

Thanks

JustPlay added the '? - Needs Triage' and 'bug' labels Sep 3, 2020
revans2 (Collaborator) commented Sep 3, 2020

From a quick look through the logs, it appears you ran out of memory on the GPU during a concat operation. I am not certain it is the same type of concat we have seen elsewhere, where all of the data must be in memory for a join or a sort, but I would guess that is the case. It sounds like we need to work on our out-of-core processing, but I am not 100% sure of that.
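
For intuition, a toy CPU-side sketch (illustrative only, not the plugin's actual code) of why a join forces this kind of concat: a hash join must materialize its entire build side before probing can start, and on the GPU that materialization is where batches get concatenated and memory can run out.

```scala
object ToyHashJoin {
  def hashJoin[K, V, W](build: Iterator[(K, V)], probe: Iterator[(K, W)]): Iterator[(K, (V, W))] = {
    // The whole build side must be resident in memory before probing begins.
    val table: Map[K, Seq[(K, V)]] = build.toSeq.groupBy(_._1)
    // The probe side, by contrast, can stream through one row at a time.
    probe.flatMap { case (k, w) =>
      table.getOrElse(k, Seq.empty).map { case (_, v) => (k, (v, w)) }
    }
  }

  def main(args: Array[String]): Unit =
    hashJoin(Iterator(1 -> "a", 2 -> "b"), Iterator(1 -> "x", 1 -> "y"))
      .foreach(println)  // (1,(a,x)) and (1,(a,y))
}
```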

JustPlay (Author) commented Sep 8, 2020

> out of core processing

What is out-of-core processing?

revans2 (Collaborator) commented Sep 8, 2020

Out-of-core processing generally refers to algorithms that can process data larger than what fits in memory.
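
To make that concrete, here is a minimal, self-contained Scala sketch of the classic out-of-core technique, an external merge sort: sort fixed-size chunks in memory, spill each sorted chunk to disk, then stream-merge the spill files. This is illustrative only, not how the plugin implements it.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source
import scala.util.Random

object ExternalSort {
  // Sort the input in fixed-size chunks; only one chunk is in memory at a time.
  def sortChunksToSpillFiles(input: File, chunkSize: Int): Seq[File] =
    Source.fromFile(input).getLines().grouped(chunkSize).map { chunk =>
      val spill = File.createTempFile("spill-", ".txt")
      spill.deleteOnExit()
      val out = new PrintWriter(spill)
      chunk.map(_.toLong).sorted.foreach(out.println)
      out.close()
      spill
    }.toSeq

  // k-way merge: repeatedly emit the smallest head among the spill files.
  def merge(spills: Seq[File], output: File): Unit = {
    val out = new PrintWriter(output)
    var live = spills.map(f => Source.fromFile(f).getLines().buffered).filter(_.hasNext)
    while (live.nonEmpty) {
      out.println(live.minBy(_.head.toLong).next())
      live = live.filter(_.hasNext)
    }
    out.close()
  }

  def main(args: Array[String]): Unit = {
    // Generate a sample input file, then sort it without ever holding it all in memory.
    val input = File.createTempFile("input-", ".txt")
    val writer = new PrintWriter(input)
    (1 to 100000).foreach(_ => writer.println(Random.nextInt(1000000)))
    writer.close()
    merge(sortChunksToSpillFiles(input, chunkSize = 10000), new File("sorted.txt"))
  }
}
```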

sameerz (Collaborator) commented Sep 8, 2020

This may be related to data skew; see issue #20.

sameerz removed the '? - Needs Triage' label Sep 8, 2020
revans2 (Collaborator) commented May 3, 2021

@JustPlay

We just merged #2310, which adds initial support for out-of-core joins. It is not perfect, but it should help a lot with large joins. I have tested it on TPC-DS 14a and 14b at scale factor 200, but with a much smaller number of shuffle partitions.

```
--conf 'spark.sql.shuffle.partitions=2'
--conf 'spark.rapids.sql.concurrentGpuTasks=2'
--conf 'spark.rapids.sql.batchSizeBytes=2047m'
--conf 'spark.rapids.memory.pinnedPool.size=32g'
--conf 'spark.rapids.memory.host.spillStorageSize=16g'
```

2 shuffle partitions at scale factor 200 should be roughly equivalent to 100 shuffle partitions at scale factor 10,000, since that keeps the data per partition about the same, assuming there is not some kind of skew that only shows up at larger scale factors.
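
Spelled out, the arithmetic holds the data per shuffle partition constant, so partitions scale linearly with scale factor (an illustrative REPL-style snippet; the variable names are mine):

```scala
val testedScaleFactor = 200
val testedPartitions  = 2
val targetScaleFactor = 10000
// Same data per partition at the larger scale factor:
val equivalentPartitions =
  testedPartitions * targetScaleFactor / testedScaleFactor  // == 100
```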

If you could retest this, it would be great.

revans2 (Collaborator) commented May 18, 2021

Closing this for now, as I think it is fixed. Please reopen if you see more issues with this on the current SNAPSHOT version.

revans2 closed this as completed May 18, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023