Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Print out the current attempt object when OOM inside a retry block #11733

Merged
merged 7 commits into from
Nov 25, 2024

Conversation

firestarman
Copy link
Collaborator

@firestarman firestarman commented Nov 19, 2024

closes #11732

This PR adds the support to print out the current attempt object being processed when OOM happens in the retry block.
This is designed for the better OOM issues triage.

@firestarman
Copy link
Collaborator Author

build

@abellina abellina self-requested a review November 20, 2024 15:25
@@ -420,6 +427,12 @@ class SpillableHostBuffer(handle: RapidsBufferHandle,
rapidsBuffer.getHostMemoryBuffer
}
}

override def sizeInBytes: Long = {
withResource(catalog.acquireBuffer(handle)) { rapidsBuffer =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just return .length here? why do we need to acquire the buffer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, I did not notice the length here. Will update it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: Firestarman <[email protected]>
@firestarman
Copy link
Collaborator Author

build

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with the direction of the current code. I think it could be better so I am suggesting some changes. If you feel that the size of the data is enough, then that is fine with me.

At a minimum we would need comments explaining the new trait and what is happening with the catch/throw anti-pattern and why it is okay.

override def close(): Unit = ()
}

trait RetrySizeAwareable {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have some docs to explain what this is for and how it is intended to be used?

Copy link
Collaborator Author

@firestarman firestarman Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is removed.

val splitted = try {
splitPolicy(curAttempt)
} catch {
case ex: Throwable =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an anti-pattern. Catching an exception to log and then re-throw can be very confusing. What if someone handles that exception and recovers? We then have a lot message that is confusing. I would prefer to have splitPolicy updated so that we can ask it to log on an error in some way. We should be able to adjust the signature slightly and update all of the places that it is passed in. There are not that many.

Copy link
Collaborator Author

@firestarman firestarman Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed to catch only some special OOMs to include the message into a new exception that wraps the original one.

}
}
// splitPolicy must take ownership of the argument
val splitted = splitPolicy(attemptStack.pop())
val curAttempt = attemptStack.pop()
val curSize = closeOnExcept(curAttempt)(sizeAsString)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to not just return the size, but make it a generic string where we could add in more information? That way we can just call toString on curAttempt and we don't need RetrySizeAwareable? It might take a bit of effort to pull on the thread for all of the things that we can spill, but I think it would help with debugging a lot more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. I did the refactor to use toString for more info.

override def close(): Unit = ()
}

trait RetrySizeAwareable {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
maybe

Suggested change
trait RetrySizeAwareable {
trait RetrySizeAware {

or

Suggested change
trait RetrySizeAwareable {
trait RetrySizeable {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for review. This new trait is removed per Bobby's suggestion.

@sameerz sameerz added the reliability Features to improve reliability or bugs that severly impact the reliability of the plugin label Nov 22, 2024
@firestarman firestarman changed the title Print out the current batch size when OOM Print out the current attempt object when OOM inside a retry block Nov 22, 2024
@firestarman
Copy link
Collaborator Author

build

revans2
revans2 previously approved these changes Nov 22, 2024
throw new CpuSplitAndRetryOOM(
s"CPU OutOfMemory: Could not split the current attempt: {$attemptAsString}"
).initCause(co)
case ex: Throwable => throw ex
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why catch throwable just to rethrow it? If we don't catch it that happens automatically.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to know, updated

Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
@firestarman
Copy link
Collaborator Author

build

@firestarman firestarman merged commit 6cba00d into NVIDIA:branch-24.12 Nov 25, 2024
49 checks passed
@firestarman firestarman deleted the oom-size-log branch November 25, 2024 06:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
reliability Features to improve reliability or bugs that severly impact the reliability of the plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Print out the size of the batch currently being processed for GPU OOM.
6 participants