-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Print out the current attempt object when OOM inside a retry block #11733
Conversation
Signed-off-by: Firestarman <[email protected]>
build |
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
@@ -420,6 +427,12 @@ class SpillableHostBuffer(handle: RapidsBufferHandle, | |||
rapidsBuffer.getHostMemoryBuffer | |||
} | |||
} | |||
|
|||
override def sizeInBytes: Long = { | |||
withResource(catalog.acquireBuffer(handle)) { rapidsBuffer => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we just return .length
here? why do we need to acquire the buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, I did not notice the length
here. Will update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Signed-off-by: Firestarman <[email protected]>
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with the direction of the current code. I think it could be better so I am suggesting some changes. If you feel that the size of the data is enough, then that is fine with me.
At a minimum we would need comments explaining the new trait and what is happening with the catch/throw anti-pattern and why it is okay.
override def close(): Unit = () | ||
} | ||
|
||
trait RetrySizeAwareable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have some docs to explain what this is for and how it is intended to be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is removed.
val splitted = try { | ||
splitPolicy(curAttempt) | ||
} catch { | ||
case ex: Throwable => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an anti-pattern. Catching an exception to log and then re-throw can be very confusing. What if someone handles that exception and recovers? We then have a lot message that is confusing. I would prefer to have splitPolicy
updated so that we can ask it to log on an error in some way. We should be able to adjust the signature slightly and update all of the places that it is passed in. There are not that many.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed to catch only some special OOMs to include the message into a new exception that wraps the original one.
} | ||
} | ||
// splitPolicy must take ownership of the argument | ||
val splitted = splitPolicy(attemptStack.pop()) | ||
val curAttempt = attemptStack.pop() | ||
val curSize = closeOnExcept(curAttempt)(sizeAsString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to not just return the size, but make it a generic string where we could add in more information? That way we can just call toString on curAttempt
and we don't need RetrySizeAwareable? It might take a bit of effort to pull on the thread for all of the things that we can spill, but I think it would help with debugging a lot more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion. I did the refactor to use toString
for more info.
override def close(): Unit = () | ||
} | ||
|
||
trait RetrySizeAwareable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
maybe
trait RetrySizeAwareable { | |
trait RetrySizeAware { |
or
trait RetrySizeAwareable { | |
trait RetrySizeable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx for review. This new trait is removed per Bobby's suggestion.
Signed-off-by: Firestarman <[email protected]>
build |
throw new CpuSplitAndRetryOOM( | ||
s"CPU OutOfMemory: Could not split the current attempt: {$attemptAsString}" | ||
).initCause(co) | ||
case ex: Throwable => throw ex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why catch throwable just to rethrow it? If we don't catch it that happens automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good to know, updated
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
build |
closes #11732
This PR adds the support to print out the current attempt object being processed when OOM happens in the retry block.
This is designed for the better OOM issues triage.