
[BUG] TPC-DS-like query 67 at scale=3TB fails with OOM #1642

Closed
revans2 opened this issue Feb 1, 2021 · 5 comments
Assignees: revans2
Labels: bug (Something isn't working), cudf_dependency (An issue or PR with this label depends on a new feature in cudf)

Comments

revans2 (Collaborator) commented Feb 1, 2021

This is with 8 executors (each with a 40 GB A100) and 4 concurrent tasks (4 cores/exec).

There are other pieces that are failing with OOM (like the filter), but I believe this one is due to a join that got too big.

java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at:/home/abellina/anaconda3/envs/cudf_dev_0.18_11/include/rmm/mr/device/detail/arena.hpp:382: Maximum pool size exceeded
        at ai.rapids.cudf.Table.concatenate(Native Method)
        at ai.rapids.cudf.Table.concatenate(Table.java:1194)
        at com.nvidia.spark.rapids.ConcatAndConsumeAll$.buildNonEmptyBatch(GpuCoalesceBatches.scala:55)
        at com.nvidia.spark.rapids.GpuCoalesceIterator.concatAllAndPutOnGPU(GpuCoalesceBatches.scala:540)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$next$3(GpuCoalesceBatches.scala:325)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:134)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.$anonfun$next$1(GpuCoalesceBatches.scala:324)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.withResource(GpuCoalesceBatches.scala:134)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.next(GpuCoalesceBatches.scala:257)
        at com.nvidia.spark.rapids.AbstractGpuCoalesceIterator.next(GpuCoalesceBatches.scala:134)
        at com.nvidia.spark.rapids.GpuColumnarBatchSorter$$anon$1.loadNextBatch(GpuSortExec.scala:147)
        at com.nvidia.spark.rapids.GpuColumnarBatchSorter$$anon$1.hasNext(GpuSortExec.scala:198)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.loadNextBatch(GpuColumnarToRowExec.scala:191)
        at com.nvidia.spark.rapids.ColumnarToRowIterator.hasNext(GpuColumnarToRowExec.scala:225)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.execution.window.WindowExec$$anon$1.fetchNextRow(WindowExec.scala:137)
        at org.apache.spark.sql.execution.window.WindowExec$$anon$1.<init>(WindowExec.scala:146)
        at org.apache.spark.sql.execution.window.WindowExec.$anonfun$doExecute$3(WindowExec.scala:126)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:837)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:837)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
revans2 added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on Feb 1, 2021
revans2 (Collaborator) commented Feb 1, 2021

The failure appears to be happening in the middle of a table concat that is being done to get a single batch, so that GpuSort can sort all of the data before it is sent to a Window operation (which happens to be on the CPU because rank is not supported on the GPU yet).

From what I have seen we probably could increase the number of shuffle partitions and get this query to pass, although I am not 100% sure on that. I have tried to reproduce the error with a smaller data set and less hardware, but I am having some trouble doing that.

It looks like the fix for this would be to create an out of core sort algorithm that allows parts of the data to spill from GPU memory.

With that we would need to be sure that the sort interacted properly with other functions, like window operations on the GPU. We would either need to make sure that they still get a single batch of data when they want it, or update them to not require a single batch of input and instead split their input on partition key boundaries so each operation gets all of the needed data for a given key.
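
For illustration, here is a minimal sketch of the out-of-core (external merge) sort idea, written against plain Scala collections rather than the plugin's ColumnarBatch and spill framework; the names (`Row`, `sortOutOfCore`) and the in-memory stand-in for spilling are assumptions for the sketch, not existing plugin APIs.

```scala
import scala.collection.mutable

object ExternalSortSketch {
  type Row = (String, Double) // (sort key, payload); real batches are columnar

  // PriorityQueue is a max-heap, so reverse the key ordering to pop the
  // smallest head first during the merge.
  private val headOrdering: Ordering[(Row, Int, Int)] =
    Ordering.by[(Row, Int, Int), String](_._1._1).reverse

  /** Sort each bounded batch independently (the part that has to fit in GPU
    * memory), set the sorted runs aside ("spill"), then produce the total
    * order with a k-way merge so no step needs all of the data at once. */
  def sortOutOfCore(batches: Iterator[Seq[Row]]): Iterator[Row] = {
    // Phase 1: per-batch sort. A real implementation would spill each sorted
    // run to host memory or disk instead of keeping it in this Seq.
    val runs: IndexedSeq[Seq[Row]] = batches.map(_.sortBy(_._1)).toIndexedSeq

    // Phase 2: k-way merge of the sorted runs using a heap of run heads,
    // tracked as (row, run index, position within the run).
    val heads = mutable.PriorityQueue.empty[(Row, Int, Int)](headOrdering)
    runs.zipWithIndex.foreach { case (run, i) =>
      if (run.nonEmpty) heads.enqueue((run.head, i, 0))
    }

    new Iterator[Row] {
      override def hasNext: Boolean = heads.nonEmpty
      override def next(): Row = {
        val (row, runIdx, pos) = heads.dequeue()
        val run = runs(runIdx)
        if (pos + 1 < run.length) heads.enqueue((run(pos + 1), runIdx, pos + 1))
        row
      }
    }
  }
}
```

The trade-off is an extra merge pass over the spilled runs; whether that cost is low enough to leave the behavior on all the time is an open question.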

revans2 (Collaborator) commented Feb 1, 2021

I took a look at the data/query a bit too.

The query is trying to do a ranking window function that is partitioned by "i_category". There are only 11 "i_category" values (10 real values and one null, which is not as common as the non-null values). For the 10 real values the distribution, at least on a smaller data set, is fairly even. So it does not matter how many shuffle partitions we have, so long as there are more than 11: if the data is large enough we run the risk of hitting this issue.

Looking at the data itself

 |-- i_category: string (nullable = true)
 |-- i_class: string (nullable = true)
 |-- i_brand: string (nullable = true)
 |-- i_product_name: string (nullable = true)
 |-- d_year: integer (nullable = true)
 |-- d_qoy: integer (nullable = true)
 |-- d_moy: integer (nullable = true)
 |-- s_store_id: string (nullable = true)
 |-- sumsales: double (nullable = true)

I estimate that each category (in the SF=3k case) gets about 14.3 GB of data and will produce another 542 MB of results as part of the window function. That means there is no way we can sort 14.3 GB of data unless we have a 40+ GB GPU, and even then, if we get unlucky and two tasks happen to be running on the GPU at the same time, it will not work. That is what I expect is happening.
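
As a rough way to sanity-check that estimate and the skew, something like the following could be run over the aggregated intermediate (a hedged sketch; `preWindowDF` is an assumed name for a DataFrame with the schema above, not something from the plugin):

```scala
import org.apache.spark.sql.functions._

// Rows per i_category in the intermediate that feeds the window; multiplying
// each count by the average encoded row width (the string columns dominate)
// gives a rough per-category byte estimate like the 14.3 GB figure above.
preWindowDF
  .groupBy("i_category")
  .count()
  .orderBy(desc("count"))
  .show(12, truncate = false)
```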

Because we are also looking at implementing rank on the GPU (#1584), an out-of-core sort is not likely to be enough. We may need to think about other ways to calculate a window function without all of the data for that window being present at once. We might also need some kind of special-case optimization for this query, since it is essentially doing a TopN for rank, but I am not sure about that either.
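
To make the TopN idea concrete: the query's rank filter (rk <= 100 in the TPC-DS template) means a row whose rank within any subset of its partition is already above the cutoff can never pass the final filter, so each batch could be pruned to its local top N (plus ties at the cutoff) per "i_category" before anything is concatenated. A rough sketch under those assumptions, using plain tuples instead of columnar batches:

```scala
// A hedged sketch of per-batch pruning for a "rank() ... where rk <= n" query.
// Rows are (i_category, sumsales); real code would operate on columnar batches.
def pruneForTopRank(batch: Seq[(String, Double)], n: Int): Seq[(String, Double)] = {
  batch.groupBy(_._1).valuesIterator.flatMap { rows =>
    // rank() is computed over sumsales descending within each i_category
    val sorted = rows.sortBy(r => -r._2)
    if (sorted.length <= n) sorted
    else {
      // A row's global rank is at least its local rank, so anything past the
      // local rank-n cutoff (keeping ties at the cutoff value) can be dropped.
      val cutoff = sorted(n - 1)._2
      sorted.takeWhile(_._2 >= cutoff)
    }
  }.toSeq
}
```

This only bounds memory well when ties at the cutoff are rare, and the real rank filter still has to be applied after the global sort, so it would be a complement to, not a replacement for, an out-of-core sort.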

revans2 changed the title from "[BUG] TPC-DS-like query 67 at scale=eTB failse with OOM" to "[BUG] TPC-DS-like query 67 at scale=3TB fails with OOM" on Feb 2, 2021
revans2 (Collaborator) commented Feb 2, 2021

I thought a bit more about rank, and I think there are a number of window operations that, with some help from cudf, we could make operate on chunks of data without needing all of the data at once. Rank is one of these, because it only needs to know the last row from the previous batch/chunk. Row number works the same way.

Depending on how much memory we are willing to hold on to between batches, any row-based window query that has bounded preceding and following will work too. The hard part with those is that we could not output the entire input right away, because we will not know the answers for the last set of rows until we either get the next batch, know we are done with all the data, or hit a boundary between partitions.

In theory we could do the same thing with range-based queries, but the amount of data we would have to hold on to is potentially unbounded.

We can probably implement most of this ourselves but might need some help from cudf to know where to split the output/input so that we get the right answer each time.
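
A hedged sketch of what the carry-over state for rank could look like, assuming each chunk arrives already sorted by (partition key, order key); `RankState` and `rankChunk` are illustrative names for this sketch, not cudf or plugin APIs:

```scala
// State carried from the last row of one chunk into the next chunk.
final case class RankState(key: String, orderVal: Long, rank: Long, rowsSeen: Long)

object ChunkedRank {
  /** Computes rank() for one sorted chunk of (partition key, order value) rows
    * and returns the small amount of state needed to continue on the next chunk. */
  def rankChunk(chunk: Seq[(String, Long)],
                prev: Option[RankState]): (Seq[Long], Option[RankState]) = {
    var state = prev
    val ranks = chunk.map { case (key, orderVal) =>
      state match {
        case Some(s) if s.key == key && s.orderVal == orderVal =>
          // Tie with the previous row: same rank, but the row count still advances.
          state = Some(s.copy(rowsSeen = s.rowsSeen + 1))
          s.rank
        case Some(s) if s.key == key =>
          // Same partition, new order value: rank jumps to the running row count.
          val r = s.rowsSeen + 1
          state = Some(RankState(key, orderVal, r, r))
          r
        case _ =>
          // First row of a new partition (or of the whole stream): rank restarts.
          state = Some(RankState(key, orderVal, 1L, 1L))
          1L
      }
    }
    (ranks, state)
  }
}
```

Threading the returned state through the chunks in order gives the same ranks as computing over the whole partition at once, and row number needs even less state (just the running row count).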

revans2 self-assigned this on Feb 5, 2021
revans2 (Collaborator) commented Feb 5, 2021

I am going to try to implement an out-of-core sort. It should give us an idea of the performance cost of doing it and let us see whether it is something we should just have on all the time.

sameerz removed the ? - Needs Triage (Need team to review and classify) label on Feb 9, 2021
sameerz added the cudf_dependency (An issue or PR with this label depends on a new feature in cudf) label on Feb 18, 2021
revans2 (Collaborator) commented May 18, 2021

This is working now

revans2 closed this as completed on May 18, 2021