[BUG] TPC-DS-like query 67 at scale=3TB fails with OOM #1642
Comments
The failure appears to be happening in the middle of a table concat that is being done to get a single batch, so that the window operation can see all of its input at once. From what I have seen, we probably could increase the number of shuffle partitions and get this query to pass, although I am not 100% sure of that. I have tried to reproduce the error with a smaller data set and less hardware, but I am having some trouble doing that. It looks like the fix for this would be to create an out-of-core sort algorithm that allows parts of the data to spill out of GPU memory. With that we would need to be sure that the sort interacted properly with other operations on the GPU, like window operations. We would either need to make sure that they still get a single batch of data when they want it, or update them to not require a single batch of input and instead split their input on partition key boundaries, so each operation gets all of the needed data for a given key.
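For reference, the shuffle-partition experiment mentioned above is just a config change; this is a minimal sketch, and the value is illustrative rather than a tested recommendation:

```scala
// Raise the shuffle partition count so each post-shuffle task handles a
// smaller slice of the data. 800 is an illustrative value, not a tested one.
spark.conf.set("spark.sql.shuffle.partitions", "800")
```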
I took a look at the data/query a bit too. The query is trying to do a ranking window function that is partitioned by "i_category". There are only 11 "i_category" values (10 real values and one null value, which is not as common as the non-null values). The distribution across the 10 real values, at least on a smaller data set, is fairly even. So it looks like it does not matter how many shuffle partitions we have, so long as there are more than 11: if the data is large enough we run the risk of hitting this issue.

Looking at the data itself, I estimate that each category (in the SF=3k case) is getting about 14.3 GB of data and will be producing another 542 MB result to go with that data as a part of the window function. This means that there is no way we are going to be able to sort 14.3 GB of data unless we have a 40+ GB GPU, and even then, if we get unlucky and two tasks happen to both be running on the GPU at the same time, it will not work. That is what I expect is happening. Because we are looking at implementing rank on the GPU too (#1584), an out-of-core sort is not likely to be enough. We may need to think about whether there are other ways for us to calculate a window function without all of the data for that window being present at once. We might also need some kind of special-case optimization for this query, where it is essentially doing a TopN for rank. But I am not sure about that either. The shape of the query is sketched below.
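For context, here is a minimal Scala sketch (plain Spark, not the plugin internals) of the query shape being described: a `rank()` window partitioned by "i_category" followed by a filter on the rank. The `rk <= 100` cutoff matches the published TPC-DS q67 template; the `aggregated` DataFrame is a hypothetical stand-in for the query's pre-aggregated input.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Rank every row within its category by descending sales, then keep the
// top 100 per category. Computing rank() forces all rows for a category
// into one window partition, which is where the single-batch requirement
// (and the OOM) comes from.
val byCategory = Window.partitionBy("i_category").orderBy(desc("sumsales"))
val top = aggregated // hypothetical DataFrame holding the pre-aggregated rows
  .withColumn("rk", rank().over(byCategory))
  .where(col("rk") <= 100)
```

Because the rank filter is a fixed cutoff, in principle each batch only needs to retain its top 100 rows per category, which is what makes a TopN-style special case plausible.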
I thought a bit more about rank, and I think there are a number of window operations where we could get some help from cudf to make them operate on chunks of data, without needing all of the data at once. Rank is one of these, because it only needs to know about the last row from the previous batch/chunk. Row number works the same way. Depending on how much memory we are willing to use in between batches, any row-based query that has bounded preceding and following would work too. The hard part with those is that we could not output the entire input, because we will not know the answers for the last set of rows until we get the next batch, know we are done with all the data, or hit a boundary between partitions. In theory we could do the same thing with range-based queries, but the amount of data saved would potentially be unbounded. We can probably implement most of this ourselves, but we might need some help from cudf to know where to split the output/input so that we get the right answer each time. A sketch of how the carried state for rank could work is below.
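As a rough illustration of that carry-over idea, here is a CPU-side sketch in plain Scala (not cudf, and all names here are made up): it ranks an already-sorted stream of batches while only keeping the last row's keys, its rank, and a row count between chunks.

```scala
case class Rec(part: String, order: Long) // window partition key + ORDER BY key
case class Carry(part: String, order: Long, rank: Long, rows: Long)

// Rank one sorted chunk, given the carry-over state from the previous chunk.
// rank() only needs the last row's keys, its rank, and how many rows of the
// current partition we have seen so far.
def rankChunk(chunk: Seq[Rec], in: Option[Carry]): (Seq[(Rec, Long)], Option[Carry]) = {
  var carry = in
  val out = chunk.map { r =>
    val rows = carry match {
      case Some(c) if c.part == r.part => c.rows + 1 // still in the same partition
      case _                           => 1L         // first row of a new partition
    }
    val rk = carry match {
      case Some(c) if c.part == r.part && c.order == r.order => c.rank // tie: same rank
      case _                                                  => rows  // new value: rank = position
    }
    carry = Some(Carry(r.part, r.order, rk, rows))
    (r, rk)
  }
  (out, carry)
}

// Stream batches through; only Carry survives between chunks.
val batches = Iterator(
  Seq(Rec("Books", 90), Rec("Books", 90)),
  Seq(Rec("Books", 75), Rec("Music", 99)))
batches.foldLeft(Option.empty[Carry]) { (c, b) =>
  val (ranked, next) = rankChunk(b, c)
  ranked.foreach { case (r, rk) => println(s"$r -> rank $rk") }
  next
}
```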
I am going to try to implement an out-of-core sort. It should give us an idea of the performance cost of doing it and let us see if it is something we should just have on all the time. A sketch of the general technique is below.
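To make the idea concrete, here is a minimal out-of-core (external merge) sort sketch in plain Scala. It spills each in-memory-sorted chunk to disk as a run, then k-way merges the runs with a priority queue. This is the textbook technique the comment refers to, not the actual spark-rapids implementation, which would spill GPU buffers rather than text files.

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable
import scala.io.Source

// Sort one chunk that fits in memory and spill it to disk as a sorted run.
def spillRun(chunk: Seq[Long]): File = {
  val f = File.createTempFile("sort-run", ".txt")
  f.deleteOnExit()
  val w = new PrintWriter(f)
  chunk.sorted.foreach(x => w.println(x))
  w.close()
  f
}

// K-way merge the sorted runs: a min-heap of run indices, keyed by each
// run's current head element, yields the globally sorted output lazily.
def mergeRuns(runs: Seq[File]): Iterator[Long] = {
  val its = runs.map(f => Source.fromFile(f).getLines().map(_.toLong).buffered)
  val heap = mutable.PriorityQueue.empty[Int](Ordering.by((i: Int) => -its(i).head))
  its.indices.foreach(i => if (its(i).hasNext) heap.enqueue(i))
  new Iterator[Long] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): Long = {
      val i = heap.dequeue()
      val v = its(i).next()
      if (its(i).hasNext) heap.enqueue(i) // re-insert with the run's new head
      v
    }
  }
}

// Usage: three chunks that individually fit in memory.
val runs = Seq(Seq(5L, 1L, 9L), Seq(2L, 8L), Seq(7L, 3L)).map(spillRun)
mergeRuns(runs).foreach(println) // 1 2 3 5 7 8 9
```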
This is working now.

This is with 8 executors (each with an A100 with 40 GB of memory) and 4 concurrent tasks (4 cores per executor). There are other pieces that are failing with OOM (like the filter), but I believe this to be a join that got too big.