Host Memory OOM handling for RowToColumnarIterator #10617

Merged 8 commits on Apr 1, 2024

@jbrennan333 jbrennan333 commented Mar 20, 2024

Closes #8887

This adds host memory OOM handling for the slower path of GpuRowToColumnarExec.

Most of this patch adds code to allocate a single spillable host buffer up front and then slice it up for the column builders. A first pass creates one slice per column, and each RapidsHostColumnBuilder then slices its buffer further to pre-allocate data, offsets, and validity buffers for itself and its children. Pre-allocation is optional for the column builders: they can still be used in the old way, dynamically growing host buffers as needed.
The validity buffer is pre-allocated for all nullable columns, but it is only used if nulls were added.
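
The two-level slicing described above can be sketched roughly as follows. This is a minimal Python illustration, not the plugin's code: `ColumnSpec`, `validity_bytes`, and `slice_for_columns` are hypothetical names, a `bytearray` stands in for the spillable host buffer, and the per-column byte estimates are assumed to be supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class ColumnSpec:
    # Hypothetical description of one column's estimated space needs.
    name: str
    data_bytes: int          # bytes reserved for the values
    offsets_bytes: int = 0   # bytes reserved for offsets; 0 for fixed-width types
    nullable: bool = False


def validity_bytes(rows: int) -> int:
    # One validity bit per row, rounded up to whole bytes.
    return (rows + 7) // 8


def slice_for_columns(specs, rows):
    """Allocate one backing buffer and hand out non-overlapping slices of it."""
    def parts_of(spec):
        return (
            ("data", spec.data_bytes),
            ("offsets", spec.offsets_bytes),
            ("validity", validity_bytes(rows) if spec.nullable else 0),
        )

    total = sum(size for spec in specs for _, size in parts_of(spec))
    backing = memoryview(bytearray(total))  # the single up-front allocation

    slices = {}
    pos = 0
    for spec in specs:
        parts = {}
        for part, size in parts_of(spec):
            if size:
                # Slicing a memoryview shares memory with the backing buffer.
                parts[part] = backing[pos:pos + size]
                pos += size
        slices[spec.name] = parts
    return backing, slices


specs = [
    ColumnSpec("id", data_bytes=8 * 100),
    ColumnSpec("name", data_bytes=1000, offsets_bytes=4 * 101, nullable=True),
]
backing, slices = slice_for_columns(specs, rows=100)
```

Because every slice is a view into the same allocation, only one buffer has to be tracked for spilling, and a write through any slice is visible in the backing buffer.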

This also handles cases where we overwrite one of the pre-allocated buffers for the columns. We use the Retryable interface to add checkpoint/restore logic to the RapidsHostColumnBuilders. We checkpoint before writing each row; if we overwrite a buffer while writing a row, we restore all of the columns to the checkpointed state and save the in-progress row for later processing. If it is the very first row, or if the coalesce goal requires a single batch, we re-enable dynamic growth for the builders and retry that row. This risks an OOM, but avoids an outright failure in this case.
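
The checkpoint/restore flow above can be sketched as follows. This is a simplified Python model under stated assumptions, not the plugin's Retryable implementation: `FixedBuilder`, `BufferFull`, and `fill_batch` are hypothetical names, and capacity is counted in values rather than bytes to keep the example short.

```python
class BufferFull(Exception):
    """Raised when a fixed-capacity builder cannot hold another value."""


class FixedBuilder:
    # Hypothetical stand-in for a column builder with a pre-allocated buffer.
    def __init__(self, capacity):
        self.capacity = capacity
        self.values = []
        self.allow_growth = False
        self._checkpoint = 0

    def checkpoint(self):
        # Remember how many values were present before the next row.
        self._checkpoint = len(self.values)

    def restore(self):
        # Drop anything written since the last checkpoint.
        del self.values[self._checkpoint:]

    def append(self, value):
        if len(self.values) >= self.capacity and not self.allow_growth:
            raise BufferFull()
        self.values.append(value)


def fill_batch(rows, builders, single_batch=False):
    """Write rows until a builder fills up; return (rows_written, pending_row)."""
    written = 0
    for row in rows:
        for b in builders:
            b.checkpoint()
        try:
            for b, v in zip(builders, row):
                b.append(v)
        except BufferFull:
            # Some columns may already hold this row's value: roll all of them back.
            for b in builders:
                b.restore()
            if written == 0 or single_batch:
                # Cannot emit a batch yet (or must emit exactly one):
                # fall back to dynamic growth and retry the same row.
                for b in builders:
                    b.allow_growth = True
                for b, v in zip(builders, row):
                    b.append(v)
            else:
                # Hand the row back so the caller can start the next batch with it.
                return written, row
        written += 1
    return written, None


builders = [FixedBuilder(3), FixedBuilder(2)]
written, pending = fill_batch(iter([(1, "a"), (2, "b"), (3, "c")]), builders)
```

In this run the third row fits in the first builder but not the second, so both are rolled back to two values and the row is returned as pending. With `single_batch=True` the same input would instead switch the builders to dynamic growth and write all three rows.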

This has primarily been tested via existing integration and unit tests, and also running nds locally and on a larger cluster. A performance check of nds at 3TB was run and found no significant performance impact. I added some smaller batch sizes to the main row_conversion_test to force it into the overwrite code paths.

@jbrennan333 jbrennan333 self-assigned this Mar 20, 2024
@jbrennan333 jbrennan333 added feature request New feature or request reliability Features to improve reliability or bugs that severly impact the reliability of the plugin labels Mar 20, 2024
@jbrennan333
Contributor Author

build

@jbrennan333

Put up commits to merge up to latest, fix unit test failure, and parameterize batchSizeBytes for the test_row_conversion integration test. By testing with 4mb and 1kb batch sizes, the test now exercises the new code paths that deal with overwriting one of the host columns.

@jbrennan333

build

@jbrennan333

Some integration tests were failing because column views were being created with a validity buffer when there were no nulls. The old code never created the validity buffer if there were no nulls. The new code pre-allocates it in case it is needed, so if it ends up unused we need to close it and set it to null before creating the GPU columns.
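
The fix described here amounts to dropping an unused pre-allocated validity buffer at build time. A minimal Python sketch of that idea, with a hypothetical `NullableBuilder` class (the real builder would also have to close the host buffer, which plain Python objects do not need):

```python
class NullableBuilder:
    # Hypothetical builder that pre-allocates validity for a known row count.
    def __init__(self, rows):
        self.values = []
        # Pre-allocated bitmask, one bit per row, initially all valid.
        self.validity = bytearray([0xFF]) * ((rows + 7) // 8)
        self.null_count = 0

    def append(self, value):
        if value is None:
            idx = len(self.values)
            # Clear the validity bit for this row.
            self.validity[idx // 8] &= ~(1 << (idx % 8)) & 0xFF
            self.null_count += 1
        self.values.append(value)

    def build(self):
        # Only hand out the validity buffer if at least one null was added;
        # otherwise behave as if it had never been allocated.
        validity = self.validity if self.null_count > 0 else None
        return self.values, validity


no_nulls = NullableBuilder(rows=4)
for v in (1, 2, 3):
    no_nulls.append(v)

with_null = NullableBuilder(rows=4)
for v in (1, None, 3):
    with_null.append(v)
```

Here `no_nulls.build()` returns no validity buffer at all, matching the old behavior, while `with_null.build()` returns the mask with the second row's bit cleared.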

@jbrennan333

build

@jbrennan333

build

@jbrennan333

I have added the host memory retries to this. I will update the description.

@jbrennan333

build

@jbrennan333 jbrennan333 marked this pull request as ready for review March 22, 2024 21:54
@jbrennan333

build

@jbrennan333

I think the premerge failures may be unrelated:

[2024-03-24T18:49:20.236Z] Caused by: ai.rapids.cudf.CudfException: CUDF failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-dev-708-cuda11/thirdparty/cudf/cpp/src/io/comp/nvcomp_adapter.cpp:688: Compression error: nvCOMP 2.4 or newer is required for Zstandard compression

It looks like a build issue where spark-rapids-jni failed to pull in the correct nvcomp version.

@jlowe
Member

jlowe commented Mar 25, 2024

It looks like a build issue where spark-rapids-jni failed to pull in the correct nvcomp version.

That seems like a scary error. How could we be pulling in such an old nvcomp version during the build?

Tracked by #10627

@jbrennan333

build

revans2 previously approved these changes Mar 26, 2024
Collaborator

@revans2 revans2 left a comment

The code looks good.

@jbrennan333

This looks like a testing framework failure:

2024-03-26T17:09:05.0979031Z [2024-03-26T17:08:28.028Z] ../../../../integration_tests/src/main/python/join_test.py::test_broadcast_join_right_struct_as_key[Right-Struct(['child0', String],['child1', Byte],['child2', Short],['child3', Integer],['child4', Long],['child5', Boolean],['child6', Date],['child7', Timestamp],['child8', Null],['child9', Decimal(12,2)])][DATAGEN_SEED=1711463898, TZ=UTC, INJECT_OOM, IGNORE_ORDER({'local': True})] Could not connect to ci-scala213-jenkins-rapids-premerge-github-9235-fwfbp-6fdvt to send interrupt signal to process
2024-03-26T17:09:05.0980062Z [2024-03-26T17:08:28.052Z] ci-scala213-jenkins-rapids-premerge-github-9235-fwfbp-6fdvt was marked offline: Connection was broken: java.nio.channels.ClosedChannelException
2024-03-26T17:09:05.0980760Z [2024-03-26T17:08:28.053Z] 	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:155)
2024-03-26T17:09:05.0981428Z [2024-03-26T17:08:28.053Z] 	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:143)
2024-03-26T17:09:05.0981952Z [2024-03-26T17:08:28.053Z] 	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:789)
2024-03-26T17:09:05.0982662Z [2024-03-26T17:08:28.053Z] 	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:30)
2024-03-26T17:09:05.0983346Z [2024-03-26T17:08:28.053Z] 	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:70)
2024-03-26T17:09:05.0984040Z [2024-03-26T17:08:28.053Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2024-03-26T17:09:05.0984768Z [2024-03-26T17:08:28.053Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2024-03-26T17:09:05.0985228Z [2024-03-26T17:08:28.053Z] 	at java.base/java.lang.Thread.run(Thread.java:829)
2024-03-26T17:09:05.0985393Z [2024-03-26T17:08:28.053Z] 
2024-03-26T17:09:05.0986407Z [2024-03-26T17:08:28.066Z] Fail to find or publish report...java.io.IOException: Unable to create live FilePath for ci-scala213-jenkins-rapids-premerge-github-9235-fwfbp-6fdvt
2024-03-26T17:09:05.0987491Z [2024-03-26T17:08:28.084Z] ci-scala213-jenkins-rapids-premerge-github-9235-fwfbp-6fdvt was marked offline: Connection was broken: java.nio.channels.ClosedChannelException
2024-03-26T17:09:05.0988131Z [2024-03-26T17:08:28.084Z] 	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:155)
2024-03-26T17:09:05.0988807Z [2024-03-26T17:08:28.084Z] 	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:143)
2024-03-26T17:09:05.0989327Z [2024-03-26T17:08:28.084Z] 	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:789)
2024-03-26T17:09:05.0990089Z [2024-03-26T17:08:28.084Z] 	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:30)
2024-03-26T17:09:05.0990769Z [2024-03-26T17:08:28.084Z] 	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:70)
2024-03-26T17:09:05.0991459Z [2024-03-26T17:08:28.084Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2024-03-26T17:09:05.0992138Z [2024-03-26T17:08:28.084Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2024-03-26T17:09:05.0992542Z [2024-03-26T17:08:28.084Z] 	at java.base/java.lang.Thread.run(Thread.java:829)
2024-03-26T17:09:05.0992717Z [2024-03-26T17:08:28.084Z] 
2024-03-26T17:09:05.0993619Z Unable to create live FilePath for ci-scala213-jenkins-rapids-premerge-github-9235-fwfbp-6fdvt****** Result of stage Premerge CI 2 is FAILURE ******

@jbrennan333

build

@jbrennan333 jbrennan333 changed the title from "Change GpuRowToColumnarIterator to allocate a single buffer for builders" to "Host Memory OOM handling for RowToColumnarIterator" Mar 26, 2024
@jbrennan333

I have been doing some additional testing with ScaleTest query 7. Running this on my desktop at scale 1, complexity 10, and with ShuffleExchangeExec disabled, I am able to force host memory OOMs in this code (RowToColumnarIterator). To see the impact, I changed the original RapidsHostColumnBuilder code to use HostAlloc.alloc() instead of HostMemoryBuffer.allocate() so I could see where we start running out of memory.

Before this patch, I fail with a CPU OOM at 16GB of heap memory (no OOM at 17GB).
With this patch, I fail with a CPU OOM at 6GB of heap memory (no OOM at 7GB).

I am running with 16 cpu cores, 16GB executor memory, and 4 concurrent GPU tasks.

@jbrennan333

I'm seeing a lot of Java gateway errors in the premerge build log:

2024-03-26T19:43:23.7922131Z [2024-03-26T19:38:38.162Z] ConnectionRefusedError: [Errno 111] Connection refused
2024-03-26T19:43:23.7922604Z [2024-03-26T19:38:38.162Z] --------------------------- Captured stderr teardown ---------------------------
2024-03-26T19:43:23.7923269Z [2024-03-26T19:38:38.162Z] 2024-03-26 19:13:51 ERROR    An error occurred while trying to connect to the Java server (****:35623)
2024-03-26T19:43:23.7923586Z [2024-03-26T19:38:38.162Z] Traceback (most recent call last):
2024-03-26T19:43:23.7924860Z [2024-03-26T19:38:38.162Z]   File "/home/jenkins/agent/workspace/jenkins-rapids_premerge-github-9236-ci-2/.download/spark-3.1.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 977, in _get_connection
2024-03-26T19:43:23.7925175Z [2024-03-26T19:38:38.162Z]     connection = self.deque.pop()
2024-03-26T19:43:23.7925543Z [2024-03-26T19:38:38.162Z] IndexError: pop from an empty deque
2024-03-26T19:43:23.7925719Z [2024-03-26T19:38:38.162Z] 
2024-03-26T19:43:23.7926212Z [2024-03-26T19:38:38.162Z] During handling of the above exception, another exception occurred:
2024-03-26T19:43:23.7926381Z [2024-03-26T19:38:38.162Z] 
2024-03-26T19:43:23.7926694Z [2024-03-26T19:38:38.162Z] Traceback (most recent call last):
2024-03-26T19:43:23.7927922Z [2024-03-26T19:38:38.162Z]   File "/home/jenkins/agent/workspace/jenkins-rapids_premerge-github-9236-ci-2/.download/spark-3.1.1-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1115, in start

@jbrennan333

So far I have not been able to repro any premerge test failures locally, so I merged up to HEAD and will kick off the build again.

@jbrennan333

build

@jbrennan333

I am still having trouble reproducing these premerge integration test failures. I have been able to run the full join_test.py with no failures. All of the tests that are failing during premerge are passing for me locally.

@jbrennan333

I found a bug (a leaked spillable host buffer) while trying to repro. It might explain the premerge test failures.

@jbrennan333

build

@sameerz sameerz removed the feature request New feature or request label Mar 27, 2024
@jbrennan333

As another test, I ran the full nds power run at scale 100 on my desktop, with

spark.rapids.sql.exec.ShuffleExchangeExec=false
spark.rapids.memory.host.offHeapLimit.enabled=true
spark.rapids.memory.host.offHeapLimit.size=2G

All queries passed, and the output was validated.

@jbrennan333

jbrennan333 commented Apr 1, 2024

I filed a follow-up PR to add host memory OOM handling to other places where GpuColumnarBatchBuilder is used.
#10647

@jbrennan333

I ran one final nds ab performance check on an 8-node A100 cluster and there was no measurable performance impact from this change.

@jbrennan333 jbrennan333 merged commit c28c7fa into NVIDIA:branch-24.04 Apr 1, 2024
43 checks passed
jlowe added a commit to jlowe/spark-rapids that referenced this pull request Apr 2, 2024
jlowe added a commit that referenced this pull request Apr 2, 2024
Successfully merging this pull request may close these issues.

[FEA] Add Host Memory Retry for Row to Columnar Conversion