Add in retry for ORC writes [databricks] #7972

Merged 6 commits on Mar 31, 2023

revans2 (Collaborator) commented Mar 29, 2023

This fixes #7341
This fixes #7960

I also included a fix for metrics: for some reason, when writing data, the metrics were not deserialized until the entire query had finished. For now I just made the task-level metrics go on most GPU operators.

I did some performance testing with progressively smaller amounts of GPU memory. The issues I saw were mostly from not being able to split inputs on a Parquet read (once I got down to 4 GiB of GPU memory and 4x parallelism). I also saw some issues with running out of memory when trying to read spilled data back in.

chart

The performance drop-off looks logarithmic, which is nice to see.

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 commented Mar 29, 2023

build

revans2 changed the title from "Add in retry for ORC writes" to "Add in retry for ORC writes [databricks]" on Mar 30, 2023

revans2 commented Mar 30, 2023

build


revans2 commented Mar 30, 2023

build

jlowe previously approved these changes Mar 30, 2023

revans2 commented Mar 30, 2023

Looks like the OOM injection PR messed with some things for this PR, so I will spend some time debugging this...


revans2 commented Mar 30, 2023

build


revans2 commented Mar 30, 2023

@jlowe sorry about the additional test failures, but it should be fixed now. Please take another look.

@abellina please take a look at my latest patch, which fixes some issues with OOM injection.

jbrennan333 (Contributor) left a comment

lgtm

revans2 merged commit 663c39a into NVIDIA:branch-23.04 on Mar 31, 2023
revans2 deleted the orc_retry branch on March 31, 2023 at 13:50
sameerz added the "reliability" label (Features to improve reliability or bugs that severely impact the reliability of the plugin) on Apr 1, 2023
Successfully merging this pull request may close these issues.

[FEA] Don't Throw OutOfMemoryError in retry iterator
[BUG] Leverage OOM retry framework for ORC writes
5 participants