Add in support for OOM retry #7822
Signed-off-by: Robert (Bobby) Evans <[email protected]>
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsShuffleHeartbeatManager.scala
@@ -54,8 +54,15 @@ class GpuDeviceManagerSuite extends FunSuite with Arm with BeforeAndAfter {
      // initial allocation should fit within pool size
      withResource(DeviceMemoryBuffer.allocate(allocSize)) { _ =>
        assertThrows[OutOfMemoryError] {
          // this should exceed the specified pool size
          DeviceMemoryBuffer.allocate(allocSize).close()
          try {
Is this debug code?
shuffle-plugin/src/main/scala/com/nvidia/spark/rapids/shuffle/ucx/UCX.scala
if (before != null) {
  before()
}
Should we add logging here? An exception from before()/after() might be difficult to contextualize since it is thrown in a different thread.
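One way to act on this suggestion, as a minimal sketch: wrap each hook so that a failure is logged together with the name of the thread it ran on, then rethrown. `HookRunner`, `runHook`, and the plain `String => Unit` logger are hypothetical names for illustration; they are not taken from the PR, and the project's own logging trait would likely be used instead.

```scala
object HookRunner {
  // Hypothetical helper, not part of the PR: runs a callback and, if it
  // throws, logs the failure with the current thread's name before
  // rethrowing, so the caller still sees the original exception.
  def runHook(
      label: String,
      hook: () => Unit,
      log: String => Unit = s => System.err.println(s)): Unit = {
    try {
      hook()
    } catch {
      case t: Throwable =>
        val thread = Thread.currentThread().getName
        log(s"$label hook failed on thread $thread: ${t.getMessage}")
        throw t
    }
  }
}
```

A call site could then read `HookRunner.runHook("before", before)`, which keeps the exception but adds the context needed to trace it back to the right thread.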
if (before != null) {
  before()
}
Consider replacing
if (before != null) {
  before()
}
with
Option(before).foreach(_.apply())
It is more functional, I get that. This is not performance-critical code, but it replaces a check and a branch (probably 3 or 4 instructions) with a call to a static method that creates an object, which then calls a method on that object with a function that is probably a separate class that had to be created, possibly as a singleton.
I personally prefer the null check, but if we want the functional one-liner for consistency with other code styles, I am fine with it.
I am ok with your preference. Performance considerations are irrelevant here. Thanks for considering the suggestion.
I just realized that we probably need neither version of the null check if you make the default parameter value a no-op, () => (), instead of null.
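A minimal sketch of the no-op default, assuming a method that takes optional before/after callbacks. The object name `Hooks`, the method name `runWithHooks`, its signature, and running `after` in a `finally` block are all assumptions for illustration; the actual method in the PR may look different.

```scala
object Hooks {
  // Hypothetical signature, not copied from the PR.
  // Because both callbacks default to a no-op function rather than null,
  // the body can invoke them unconditionally with no null check.
  def runWithHooks[T](
      before: () => Unit = () => (),
      after: () => Unit = () => ())(body: => T): T = {
    before()
    try {
      body
    } finally {
      // Assumption: the after hook should run even if body throws.
      after()
    }
  }
}
```

Callers that do not care about the hooks write `Hooks.runWithHooks() { work() }`, and callers that do pass `before = () => ...` by name. The one caveat is that an explicit `null` argument would now fail with a NullPointerException, so this only works if no caller passes `null` on purpose.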
build
Two tests failed, and it looks like RMM was partly shut down but SparkRMM was not properly set up again. I will put a fix for this into spark-rapids-jni, which will hopefully resolve it. I ran the tests locally and everything passed even before the change.
Looks like the fix in spark-rapids-jni made it to the nightly jar. Should we rekick this?
build
build
build
@abellina and @gerashegalov could you please take another look? I upmerged locally and there were no issues with the shim changes, so I am hoping we can just merge this.
This looks good to me.
This remained unaddressed, but it's fine: #7822 (comment)
This adds in the framework needed for OOM retry. It coordinates with RMMSpark so that it knows the state of each thread.
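To illustrate the general idea of OOM retry, here is a minimal sketch that is not the PR's implementation: a block of work is attempted again when it fails with a retryable out-of-memory signal, on the assumption that memory may have been freed in the meantime. `RetryableOom`, `OomRetry.withRetry`, and `maxAttempts` are hypothetical names. The actual framework tracks per-thread state through RMMSpark, which this sketch does not model.

```scala
// Hypothetical exception standing in for a retryable out-of-memory signal.
class RetryableOom(msg: String) extends RuntimeException(msg)

object OomRetry {
  // Runs `body`, re-running it from the start when it throws RetryableOom,
  // for at most `maxAttempts` attempts in total. On the final attempt the
  // exception is allowed to propagate to the caller.
  // Assumption: `body` is safe to re-run (it holds no partial state).
  def withRetry[T](maxAttempts: Int)(body: => T): T = {
    require(maxAttempts >= 1, "maxAttempts must be at least 1")
    var result: Option[T] = None
    var attempt = 0
    while (result.isEmpty) {
      attempt += 1
      try {
        result = Some(body)
      } catch {
        // Swallow the failure only while attempts remain, then loop again.
        case _: RetryableOom if attempt < maxAttempts => ()
      }
    }
    result.get
  }
}
```

A real implementation would also need to decide when a retry is worthwhile, for example by blocking the thread until another task has released memory, which is where knowing the state of each thread comes in.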