[BUG] TPC-DS-like query 95 at scale=3TB fails with OOM #1630

Closed
abellina opened this issue Jan 29, 2021 · 7 comments
Labels: bug (Something isn't working), cudf_dependency (An issue or PR with this label depends on a new feature in cudf)

Comments

@abellina (Collaborator) commented Jan 29, 2021
This is with 8 executors (each an A100 with 40GB) and 4 concurrent tasks (4 cores per executor).

There are other pieces that are failing with OOM (like the filter), but I believe this to be a join that got too big:

java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: CUDA error at: /usr/local/rapids/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
	at ai.rapids.cudf.Table.leftSemiJoin(Native Method)
	at ai.rapids.cudf.Table.access$3700(Table.java:44)
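
For context, here is a minimal sketch of the join shape that is failing, in plain Spark terms. This is an assumption based on the q95-like structure and the ws_order_number key mentioned below, not the actual query plan; the table/view names (web_sales, ws_wh) are placeholders. On the GPU, a join like this maps to the cudf Table.leftSemiJoin call in the stack trace above.

// Hypothetical sketch of the failing join shape, not the actual q95 plan.
// `web_sales` and the `ws_wh` intermediate view are assumed to exist.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val webSales = spark.table("web_sales")
val wsWh     = spark.table("ws_wh")   // assumed intermediate result of earlier joins

// Left semi join keyed on ws_order_number; with several concurrent GPU tasks each
// feeding multi-GB batches into joins like this, device memory can be exhausted.
val kept = webSales.join(wsWh, Seq("ws_order_number"), "left_semi")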
abellina added the bug (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels on Jan 29, 2021
sameerz removed the "? - Needs Triage" (Need team to review and classify) label on Feb 16, 2021
@sameerz (Collaborator) commented Feb 16, 2021

@abellina does the query pass with 1 concurrent task?

@abellina (Collaborator, Author) commented:

Q95 sometimes works fine with 4 concurrent tasks and sometimes does not, so it definitely looks like it is at the edge of GPU memory in this setup. I think this is essentially the same as #1628, but I need to do the skew analysis that @jlowe did for Q24a and Q24b to make sure.

@abellina (Collaborator, Author) commented Feb 17, 2021

@sameerz: neither 1 nor 2 concurrent tasks has reproduced the OOM for me yet; 4 concurrent tasks definitely increases the chance.

With 4 concurrent: the stage that failed had 200 tasks. In this stage I don't see task skew (each task loaded about 1.5GB of shuffle data), but the tasks were hitting OOM and spilled quite a bit from the shuffle store. One of the joins (a left semi on ws_order_number) takes roughly 1.8GB per task from each of two other joins. With four concurrent tasks, that works out to about 4 × 1.8GB × 2 ≈ 14.4GB of input to the join (rough arithmetic sketched below).

With 1 concurrent: this stage had no spilling at all.
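
A back-of-the-envelope restatement of the 4-concurrent estimate above, using only the numbers quoted in this comment (an estimate of resident join input, not a measurement):

// Rough estimate of GPU memory pressure in the 4-concurrent case described above.
val concurrentTasks = 4        // spark.rapids.sql.concurrentGpuTasks in this setup
val inputPerSideGB  = 1.8      // ~1.8GB fed in from each of the two upstream joins, per task
val joinInputSides  = 2
val totalGB = concurrentTasks * inputPerSideGB * joinInputSides
// totalGB ≈ 14.4GB of join input resident on the GPU at once, before any join output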

@revans2 (Collaborator) commented Feb 17, 2021

@abellina What was the performance difference for the different concurrency levels?

@abellina (Collaborator, Author) commented:

Running cold runs each time:
1 concurrent: 22s, 23.1s, 20.4s, 21.59s, 22.3s => avg = 21.8s
2 concurrent: 18.3s, 18.4s, 20.4s, 20.6s, 19.9s => avg = 19.5s
3+ concurrent is unreliable

So ~10% difference with some noise.

sameerz added the cudf_dependency (An issue or PR with this label depends on a new feature in cudf) label on Feb 18, 2021
@revans2 (Collaborator) commented May 3, 2021

Now that #2310 is merged, it would be nice to see if we can test this again. I was able to make this work with:

--conf 'spark.rapids.sql.batchSizeBytes=2047m'
--conf 'spark.sql.shuffle.partitions=15'
--conf 'spark.rapids.sql.concurrentGpuTasks=2'
--conf 'spark.rapids.memory.pinnedPool.size=32g' 
--conf 'spark.rapids.memory.host.spillStorageSize=16g'
--conf 'spark.sql.files.maxPartitionBytes=512m'

on a 16GB V100 at scale factor 200. This should be roughly comparable to a concurrency of 4 on a 40GB A100 at scale factor 3000, though it is not a perfect estimate.
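
For reference, a sketch of setting the same configs programmatically via SparkSession.builder instead of --conf flags. The values are copied verbatim from the flags above and are tuned for the 16GB V100 / SF200 run, so they would need rescaling for a 40GB A100 at SF3000; the app name is a placeholder, and some RAPIDS settings (such as the pinned pool size) likely need to be in place before the executors start.

// Sketch only: same settings as the --conf flags above, applied at session build time.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("q95-oom-repro")   // hypothetical application name
  .config("spark.rapids.sql.batchSizeBytes", "2047m")
  .config("spark.sql.shuffle.partitions", "15")
  .config("spark.rapids.sql.concurrentGpuTasks", "2")
  .config("spark.rapids.memory.pinnedPool.size", "32g")
  .config("spark.rapids.memory.host.spillStorageSize", "16g")
  .config("spark.sql.files.maxPartitionBytes", "512m")
  .getOrCreate()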

@revans2 (Collaborator) commented May 18, 2021

This is working now.

revans2 closed this as completed on May 18, 2021