Cuda.deviceSynchronize as a last resort if we cannot spill enough #6849

Merged: 4 commits into NVIDIA:branch-22.12 on Oct 20, 2022

@abellina abellina commented Oct 18, 2022

Signed-off-by: Alessandro Bellina [email protected]

Closes #6768

This depends on: rapidsai/cudf#11940

This PR adds Cuda.deviceSynchronize calls when an allocation fails and the device store is empty, so there is nothing left to spill. The idea is that these calls let outstanding GPU work finish (such as pending frees) and let the async pool resolve any temporary state, making blocks that were recently freed available for other streams to consume.

Adds an internal config spark.rapids.memory.gpu.oomMaxRetries. It is set to 2 retries by default.
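
For readers who want the control flow spelled out, here is a minimal standalone sketch of the spill-then-synchronize idea. It is not the code in this PR: `OomRetrySketch`, `DeviceStore`, `allocateWithRetry`, `tryAlloc` and `synchronizeDevice` are hypothetical names, the store is modeled as a plain byte counter, and the synchronize call is passed in as a function so the sketch runs without a GPU. The `oomMaxRetries` parameter plays the role of spark.rapids.memory.gpu.oomMaxRetries.

```scala
import scala.annotation.tailrec

// Standalone model: no cuDF, RMM or Spark dependency. All names are hypothetical.
object OomRetrySketch {

  // Stand-in for the device store: tracks only how many bytes could still be spilled.
  final class DeviceStore(private var bytes: Long) {
    def currentSize: Long = bytes

    // Spill up to `target` bytes and return how many bytes were actually released.
    def spill(target: Long): Long = {
      val freed = math.min(target, bytes)
      bytes -= freed
      freed
    }
  }

  // Try to allocate `allocSize` bytes. On failure, spill while the store has data.
  // Once the store is empty, synchronize the device and retry, at most
  // `oomMaxRetries` times. Returns false when every option is exhausted (fatal OOM).
  def allocateWithRetry(
      allocSize: Long,
      tryAlloc: Long => Boolean,        // assumed: returns true if the allocation succeeded
      store: DeviceStore,
      oomMaxRetries: Int,
      synchronizeDevice: () => Unit     // assumed stand-in for Cuda.deviceSynchronize
  ): Boolean = {
    require(allocSize > 0, "allocSize must be positive")

    @tailrec
    def loop(syncRetries: Int): Boolean = {
      if (tryAlloc(allocSize)) {
        true
      } else if (store.currentSize > 0) {
        // Still something to spill: free it and try again.
        store.spill(allocSize)
        loop(syncRetries)
      } else if (syncRetries < oomMaxRetries) {
        // Nothing left to spill: let pending GPU work (such as frees) complete.
        synchronizeDevice()
        loop(syncRetries + 1)
      } else {
        false
      }
    }

    loop(0)
  }
}
```

In this model a synchronize only helps if it releases memory that `tryAlloc` can then see, which mirrors the pending-frees case described above. If nothing is released, the loop gives up after `oomMaxRetries` synchronizations.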

@abellina added the cudf_dependency and reliability labels Oct 18, 2022
@abellina
Collaborator Author

This is going to fail to build until we get cuDF changes in.

  logInfo(s"Device allocation of $allocSize bytes failed, device store has " +
-   s"$storeSize bytes. Total RMM allocated is ${Rmm.getTotalBytesAllocated} bytes.")
+   s"$storeSize bytes. $attemptMsg" +
Collaborator

nit: spacing is off

Collaborator Author

Should be good here: 6d10d77

@abellina
Collaborator Author

build

@abellina abellina merged commit 69f507e into NVIDIA:branch-22.12 Oct 20, 2022
@abellina abellina deleted the oom/sync_before_fatal_oom branch October 20, 2022 16:44
Labels
cudf_dependency: An issue or PR with this label depends on a new feature in cudf
reliability: Features to improve reliability or bugs that severely impact the reliability of the plugin
Development

Successfully merging this pull request may close these issues.

[FEA] Add synchronize and retry before OOM with async allocator