Host Memory OOM handling for RowToColumnarIterator #10617
Conversation
Signed-off-by: Jim Brennan <[email protected]>
build
Put up commits to merge up to latest, fix a unit test failure, and parameterize batchSizeBytes for the test_row_conversion integration test. By testing with 4MB and 1KB batch sizes, the test now exercises the new code paths that deal with overwriting one of the host columns.
build
Some integration tests were failing because column views were being created with a valid buffer even when there were no nulls. The old code would never create the valid buffer if there were no nulls. This code pre-allocates it in case we need it, but if we end up not using it, we need to close it and set it to null before creating the GPU columns.
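To make the fix above concrete, here is a minimal sketch (in Python, as an illustrative model rather than the plugin's actual Scala code) of the "close the unused validity buffer" logic. The `HostBuffer` and `ColumnBuilder` names are hypothetical stand-ins for the plugin's host buffer and builder types.

```python
# Hypothetical model of pre-allocating a validity buffer for a nullable
# column, then closing and dropping it at finish time if no nulls were
# actually written, so the resulting column view has no valid buffer.
class HostBuffer:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class ColumnBuilder:
    def __init__(self, nullable):
        # Pre-allocate the validity buffer for nullable columns so a later
        # null write cannot trigger an allocation (and possible OOM) mid-row.
        self.validity = HostBuffer() if nullable else None
        self.null_count = 0

    def append_null(self):
        self.null_count += 1

    def finish(self):
        # If no nulls were written, close the pre-allocated validity buffer
        # and drop it before the column is built.
        if self.validity is not None and self.null_count == 0:
            self.validity.close()
            self.validity = None
        return self.validity
```

The key point is that the pre-allocation is purely defensive: the buffer must not leak, and it must not be attached to the final column when it was never used.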
build
build
I have added the host memory retries to this. I will update the description.
build
build
I think the premerge failures may be unrelated:
It looks like a build issue where spark-rapids-jni failed to pull in the correct nvcomp version.
That seems like a scary error. How could we be pulling in such an old nvcomp version during the build? Tracked by #10627
build
The code looks good.
This looks like a testing framework failure:
build
I have been doing some additional testing with ScaleTest query 7, running on my desktop at scale 1 and complexity 10. Before this patch, I fail with a CPU OOM at 16GB of heap memory (no OOM at 17GB). I am running with 16 CPU cores, 16GB executor memory, and 4 concurrent GPU tasks.
I'm seeing a lot of java gateway errors in the premerge build log:
So far I have not been able to repro any premerge test failures locally, so I merged up to HEAD and will kick off the build again. |
build
I am still having trouble reproducing these premerge integration test failures. I have been able to run the full join_test.py with no failures. All of the tests that are failing during premerge are passing for me locally. |
I found a bug (leaked spillable host buffer) while trying to repro. Might explain the premerge test failures. |
build
As another test, I ran the full nds power run at scale 100 on my desktop. All queries passed, and the output was validated.
I filed a follow-up PR to add host memory oom handling to other places where |
I ran one final nds A/B performance check on an 8-node A100 cluster and there was no measurable performance impact from this change.
…10657) This reverts commit c28c7fa. Signed-off-by: Jason Lowe <[email protected]>
Closes #8887
This adds host memory OOM handling for the slower path of GpuRowToColumnarExec.
Most of this patch involves adding code to allocate a single spillable host buffer up front and then slice it up for each column builder. It does a first pass of slicing (one slice for each column), and then the RapidsHostColumnBuilder slices its buffer further to pre-allocate data, offsets, and validity buffers for itself and its children. The pre-allocation logic is optional for the column builders - we can still use them in the old way of dynamically growing host buffers as needed.
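The two-pass slicing described above can be sketched as follows. This is a simplified, hypothetical Python model of the idea (the real code works with spillable host memory buffers in Scala); slices are represented as plain `(offset, length)` pairs and the function names are illustrative.

```python
# Model of the two-pass slicing scheme: one up-front allocation is carved
# into per-column slices, and each builder then subdivides its slice into
# data, offsets, and validity regions.
def slice_buffer(total_size, column_sizes):
    """First pass: one (offset, length) slice per column from a single buffer."""
    assert sum(column_sizes) <= total_size, "columns must fit in the allocation"
    slices, offset = [], 0
    for size in column_sizes:
        slices.append((offset, size))
        offset += size
    return slices


def subdivide(col_slice, data_size, offsets_size, validity_size):
    """Second pass: a builder splits its slice into data/offsets/validity."""
    offset, length = col_slice
    assert data_size + offsets_size + validity_size <= length
    data = (offset, data_size)
    offsets = (offset + data_size, offsets_size)
    validity = (offset + data_size + offsets_size, validity_size)
    return data, offsets, validity
```

Doing one large allocation up front means there is a single point where a host OOM can occur and be retried, instead of many small allocations scattered through row conversion, each of which could fail independently.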
The validity buffer is pre-allocated for all nullable columns, but we only use it if nulls were added.
This also handles cases where we overwrite one of the pre-allocated buffers for the columns. We use the Retryable interface to add checkpoint/restore logic to the RapidsHostColumnBuilders. We checkpoint before writing out a row, and then if we would overwrite a buffer while writing a row, we restore all of the columns to the checkpointed state. We then save the row that was in progress for later processing. If it is the very first row, or if we have a coalesce goal for a single batch, we re-enable dynamic growth for the builders and try that row again. This risks an OOM, but prevents an outright failure for this case.
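The checkpoint/restore flow can be modeled like this. Again this is a hedged Python sketch of the mechanism, not the plugin's Scala code; `Builder` and `write_rows` are hypothetical names, and the dynamic-growth fallback for the first row is omitted for brevity.

```python
# Model of Retryable-style checkpoint/restore over a set of column builders:
# checkpoint every builder before a row, and if any builder would overflow
# its pre-allocated capacity mid-row, roll all builders back and hand the
# unwritten rows (starting with the in-progress one) back to the caller.
class Builder:
    def __init__(self, capacity):
        self.capacity = capacity      # pre-allocated size; no dynamic growth
        self.values = []
        self._checkpoint = 0

    def checkpoint(self):
        self._checkpoint = len(self.values)

    def restore(self):
        # Discard anything written since the last checkpoint.
        del self.values[self._checkpoint:]

    def append(self, value):
        if len(self.values) >= self.capacity:
            raise OverflowError("pre-allocated buffer exhausted")
        self.values.append(value)


def write_rows(builders, rows):
    """Write rows until one would overflow; return the rows not written."""
    for i, row in enumerate(rows):
        for b in builders:
            b.checkpoint()
        try:
            for b, value in zip(builders, row):
                b.append(value)
        except OverflowError:
            # Roll every column back to the pre-row state, leaving a
            # consistent batch, and save the rest for later processing.
            for b in builders:
                b.restore()
            return rows[i:]
    return []
```

The important invariant is that a partially written row never survives: if the third column of a row overflows, the first two columns are rolled back too, so the batch handed to the GPU is always row-aligned.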
This has primarily been tested via existing integration and unit tests, and also running nds locally and on a larger cluster. A performance check of nds at 3TB was run and found no significant performance impact. I added some smaller batch sizes to the main row_conversion_test to force it into the overwrite code paths.