[BUG] TPC-DS-like query 95 at scale=3TB fails with OOM #1630

Closed
abellina opened this issue Jan 29, 2021 · 7 comments
Labels: bug (Something isn't working), cudf_dependency (An issue or PR with this label depends on a new feature in cudf)

Comments

@abellina (Collaborator) commented Jan 29, 2021
This is with 8 executors (each an A100 with 40GB) and 4 concurrent tasks (4 cores per executor).

There are other pieces that are failing with OOM (like the filter), but I believe this to be a join that got too big:

java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: CUDA error at: /usr/local/rapids/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
	at ai.rapids.cudf.Table.leftSemiJoin(Native Method)
	at ai.rapids.cudf.Table.access$3700(Table.java:44)
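
For context, here is a minimal sketch of the join shape that is failing, in plain Spark terms. This is an assumption based on the q95-like structure and the ws_order_number key mentioned below, not the actual query plan; the table/view names (web_sales, ws_wh) are placeholders. On the GPU, a join like this maps to the cudf Table.leftSemiJoin call in the stack trace above.

// Hypothetical sketch of the failing join shape, not the actual q95 plan.
// `web_sales` and the `ws_wh` intermediate view are assumed to exist.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val webSales = spark.table("web_sales")
val wsWh     = spark.table("ws_wh")   // assumed intermediate result of earlier joins

// Left semi join keyed on ws_order_number; with several concurrent GPU tasks each
// feeding multi-GB batches into joins like this, device memory can be exhausted.
val kept = webSales.join(wsWh, Seq("ws_order_number"), "left_semi")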
abellina added the bug (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels on Jan 29, 2021
sameerz removed the "? - Needs Triage" (Need team to review and classify) label on Feb 16, 2021
@sameerz (Collaborator) commented Feb 16, 2021

@abellina does the query pass with 1 concurrent task?

@abellina (Collaborator, Author) commented:

Q95 sometimes works fine with 4 concurrent tasks and sometimes does not, so it definitely looks like it is at the edge of GPU memory in this setup. I think this is essentially the same as #1628, but I need to do the skew analysis that @jlowe did for Q24a and Q24b to make sure.

@abellina (Collaborator, Author) commented Feb 17, 2021

@sameerz: neither 1 nor 2 concurrent tasks has reproduced the OOM for me yet; 4 concurrent tasks definitely increases the chance.

With 4 concurrent: the stage that failed had 200 tasks. In this stage I don't see task skew (each task loaded about 1.5GB of shuffle data), but the tasks were hitting OOM and spilled quite a bit from the shuffle store. One of the joins (a left semi on ws_order_number) takes roughly 1.8GB per task from each of two other joins. With four concurrent tasks, that works out to about 4 × 1.8GB × 2 ≈ 14.4GB of input to the join (rough arithmetic sketched below).

With 1 concurrent: this stage had no spilling at all.
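
A back-of-the-envelope restatement of the 4-concurrent estimate above, using only the numbers quoted in this comment (an estimate of resident join input, not a measurement):

// Rough estimate of GPU memory pressure in the 4-concurrent case described above.
val concurrentTasks = 4        // spark.rapids.sql.concurrentGpuTasks in this setup
val inputPerSideGB  = 1.8      // ~1.8GB fed in from each of the two upstream joins, per task
val joinInputSides  = 2
val totalGB = concurrentTasks * inputPerSideGB * joinInputSides
// totalGB ≈ 14.4GB of join input resident on the GPU at once, before any join output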

@revans2 (Collaborator) commented Feb 17, 2021

@abellina What was the performance difference for the different concurrency levels?

@abellina (Collaborator, Author) commented:

Running cold runs each time:
1 concurrent: 22s, 23.1s, 20.4s, 21.59s, 22.3s => avg = 21.8s
2 concurrent: 18.3s, 18.4s, 20.4s, 20.6s, 19.9s => avg = 19.5s
3+ concurrent is unreliable

So ~10% difference with some noise.

sameerz added the cudf_dependency (An issue or PR with this label depends on a new feature in cudf) label on Feb 18, 2021
@revans2 (Collaborator) commented May 3, 2021

Now that #2310 is merged, it would be nice to see if we can test this again. I was able to make this work with:

--conf 'spark.rapids.sql.batchSizeBytes=2047m'
--conf 'spark.sql.shuffle.partitions=15'
--conf 'spark.rapids.sql.concurrentGpuTasks=2'
--conf 'spark.rapids.memory.pinnedPool.size=32g' 
--conf 'spark.rapids.memory.host.spillStorageSize=16g'
--conf 'spark.sql.files.maxPartitionBytes=512m'

on a 16GB V100 at scale factor 200. This should be roughly comparable to a concurrency of 4 on a 40GB A100 at scale factor 3000, though it is not a perfect estimate.
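
For reference, a sketch of setting the same configs programmatically via SparkSession.builder instead of --conf flags. The values are copied verbatim from the flags above and are tuned for the 16GB V100 / SF200 run, so they would need rescaling for a 40GB A100 at SF3000; the app name is a placeholder, and some RAPIDS settings (such as the pinned pool size) likely need to be in place before the executors start.

// Sketch only: same settings as the --conf flags above, applied at session build time.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("q95-oom-repro")   // hypothetical application name
  .config("spark.rapids.sql.batchSizeBytes", "2047m")
  .config("spark.sql.shuffle.partitions", "15")
  .config("spark.rapids.sql.concurrentGpuTasks", "2")
  .config("spark.rapids.memory.pinnedPool.size", "32g")
  .config("spark.rapids.memory.host.spillStorageSize", "16g")
  .config("spark.sql.files.maxPartitionBytes", "512m")
  .getOrCreate()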

@revans2 (Collaborator) commented May 18, 2021

This is working now.

revans2 closed this as completed on May 18, 2021