
[BUG] TPC-DS-like query 24a and 24b at scale=3TB fails with OOM #1628

Closed

abellina opened this issue Jan 29, 2021 · 4 comments
Labels
bug (Something isn't working) · cudf_dependency (An issue or PR with this label depends on a new feature in cudf)

Comments

@abellina (Collaborator)

I have seen this with and without the RapidsShuffleManager. In this case the device store is empty, and I see two tasks each trying to allocate ~1 GB.

This is with 8 executors (each with a 40 GB A100) and 4 concurrent tasks per executor (4 cores/exec).

21/01/28 22:44:17 INFO DeviceMemoryEventHandler: Device allocation of 1325112360 bytes failed, device store has 0 bytes. Total RMM allocated is 26554743808 bytes.
21/01/28 22:44:17 INFO DeviceMemoryEventHandler: Device allocation of 1013508408 bytes failed, device store has 0 bytes. Total RMM allocated is 26554923008 bytes.
java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: CUDA error at: /usr/local/rapids/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
        at ai.rapids.cudf.Table.innerJoin(Native Method)
        at ai.rapids.cudf.Table.access$3500(Table.java:44)
        at ai.rapids.cudf.Table$TableOperation.innerJoin(Table.java:2233)
        at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin.doJoinLeftRight(GpuHashJoin.scala:307)
        at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin.com$nvidia$spark$rapids$shims$spark300$GpuHashJoin$$doJoin(GpuHashJoin.scala:274)
        at com.nvidia.spark.rapids.shims.spark300.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:223)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at com.nvidia.spark.rapids.GpuHashAggregateExec.$anonfun$doExecuteColumnar$1(aggregate.scala:433)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
abellina added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Jan 29, 2021
abellina changed the title from "[BUG] TPC-DS-like query 24a at scale=3TB fails with OOM" to "[BUG] TPC-DS-like query 24a and 24b at scale=3TB fails with OOM" on Jan 29, 2021
sameerz removed the ? - Needs Triage (Need team to review and classify) label on Feb 2, 2021
@jlowe (Member) commented Feb 3, 2021

These two queries suffer from horrific skew on the join keys followed by an exploding join. One of the join conditions in both queries is c_birth_country == upper(ca_country). There are only two distinct values for upper(ca_country), null and UNITED STATES, which leads to severe join key skew. The join also explodes because it mixes in a store zip join with customer zip before this, so just a few tasks end up joining two reasonably sized tables into a very large result (a few million rows become billions of rows within the same task).
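For illustration, a minimal Scala sketch (not from this issue) of how one could confirm that skew from a spark-shell, assuming the TPC-DS customer_address table is registered in the session:

import org.apache.spark.sql.functions.{col, desc, upper}

// Count rows per distinct value of upper(ca_country); with TPC-DS data this
// should show essentially all rows landing on null and "UNITED STATES".
spark.table("customer_address")
  .groupBy(upper(col("ca_country")).as("upper_ca_country"))
  .count()
  .orderBy(desc("count"))
  .show(truncate = false)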

We will likely need some kind of chunked join output functionality from libcudf to handle this.

@revans2 (Collaborator) commented May 3, 2021

I believe this is likely fixed now that #2310 has been merged in. I was able to run both query 24a and 24b at scale factor 200, but with only 2 shuffle partitions. That should be roughly equivalent to running at scale factor 3000 with 30 partitions (200 / 2 = 100 scale-factor units per partition, versus 3000 / 30 = 100), although because this deals with skewed data (specifically upper(ca_country)) it is likely not truly equivalent. @jlowe or @abellina can we try to rerun these now and see if they are still failing?

@abellina (Collaborator, Author) commented May 5, 2021

@revans2 sorry I missed this comment. I ran both 24a and 24b myself at 3TB and they are both passing for me.

I used the default 200 shuffle partitions and likewise left batchSizeBytes at its default (2 GB). The remaining settings were:

--conf 'spark.rapids.sql.concurrentGpuTasks=2'
--conf 'spark.rapids.memory.pinnedPool.size=8g' 
--conf 'spark.rapids.memory.host.spillStorageSize=32g'
--conf 'spark.sql.files.maxPartitionBytes=1g'
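As a hedged sketch only (not part of the original report), the same options could also be set programmatically when building the SparkSession; the RAPIDS plugin itself still has to be on the classpath and enabled separately, which is not shown here:

import org.apache.spark.sql.SparkSession

// Programmatic equivalent of the --conf flags above (values from this issue).
val spark = SparkSession.builder()
  .config("spark.rapids.sql.concurrentGpuTasks", "2")
  .config("spark.rapids.memory.pinnedPool.size", "8g")
  .config("spark.rapids.memory.host.spillStorageSize", "32g")
  .config("spark.sql.files.maxPartitionBytes", "1g")
  .getOrCreate()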

@jlowe (Member) commented May 5, 2021

Thanks for the update, @abellina! Since these now pass with the defaults, closing this as fixed.

sameerz closed this as completed on May 18, 2021