Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add in support for OOM retry #7822

Merged
merged 3 commits into from
Mar 6, 2023

Conversation

revans2
Copy link
Collaborator

@revans2 revans2 commented Feb 27, 2023

This adds in the framework needed for OOM retry. It coordinates with RMMSpark so that it knows the state of each thread.

Signed-off-by: Robert (Bobby) Evans <[email protected]>
@@ -54,8 +54,15 @@ class GpuDeviceManagerSuite extends FunSuite with Arm with BeforeAndAfter {
// initial allocation should fit within pool size
withResource(DeviceMemoryBuffer.allocate(allocSize)) { _ =>
assertThrows[OutOfMemoryError] {
// this should exceed the specified pool size
DeviceMemoryBuffer.allocate(allocSize).close()
try {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug code?

Comment on lines +364 to +366
if (before != null) {
before()
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add logging here. An exception from before()/after() might be difficult to contextualize since it in a different thread.

Comment on lines +364 to +366
if (before != null) {
before()
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider

Suggested change
if (before != null) {
before()
}
Option(before).foreach(_.apply())

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is more functional. I get that. This is not performance critical code, but it is replacing a check and a branch, probably 3 or 4 instructions with calling a static method to create an object that then calls a method on that object with a function that is probably a separate class that had to be created, possibly as a singleton.

I personally prefer the null check, but if for consistency with other code styles we want the functional one liner I am fine with it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with your preference. Performance considerations are irrelevant here. Thanks for considering the suggestion.

I just realized that we probably need neither version of the null check if you make the default parameter value a nop () => () instead of null

@revans2
Copy link
Collaborator Author

revans2 commented Feb 28, 2023

build

@revans2
Copy link
Collaborator Author

revans2 commented Feb 28, 2023

Had two tests fail where it looks like RMM was partly shut down but SparkRMM was not properly set up again. I will put a fix for this into spark-rapids-jni and hopefully this will fix it. I ran the tests locally and everything passed even before the change.

@sameerz sameerz added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Mar 1, 2023
@revans2 revans2 linked an issue Mar 1, 2023 that may be closed by this pull request
@abellina
Copy link
Collaborator

abellina commented Mar 1, 2023

Looks like the fix in spark-rapids-jni made it to the nightly jar. Should we rekick this?

@revans2
Copy link
Collaborator Author

revans2 commented Mar 1, 2023

build

@revans2
Copy link
Collaborator Author

revans2 commented Mar 6, 2023

build

1 similar comment
@revans2
Copy link
Collaborator Author

revans2 commented Mar 6, 2023

build

@revans2
Copy link
Collaborator Author

revans2 commented Mar 6, 2023

@abellina and @gerashegalov could you please take another look? I upmerged locally and there were no issues with the shim changes so I am hoping we can just merge this.

Copy link
Contributor

@jbrennan333 jbrennan333 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good to me

@revans2 revans2 merged commit 4fce102 into NVIDIA:branch-23.04 Mar 6, 2023
@revans2 revans2 deleted the spark_rmm_integration branch March 6, 2023 21:32
@gerashegalov
Copy link
Collaborator

@abellina and @gerashegalov could you please take another look? I upmerged locally and there were no issues with the shim changes so I am hoping we can just merge this.

this remained unaddressed but it's fine #7822 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reliability Features to improve reliability or bugs that severly impact the reliability of the plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Implement OOM retry framework
5 participants