
[FEA] Add synchronize and retry before OOM with async allocator #6768

Closed
abellina opened this issue Oct 12, 2022 · 0 comments · Fixed by #6849
Labels: reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

abellina (Collaborator) commented:

In a dev build, I added a `Cuda.deviceSynchronize` call in `DeviceMemoryEventHandler.onAllocFailure` for the case where we are unable to handle a failed allocation (we can't spill anymore), and a customer query then passed consistently.

Here is one example showing only 15GB allocated, yet we are unable to allocate 2.6GB in a 40GB pool:

Device store exhausted, unable to allocate 2570310640 bytes. Total RMM allocated is 15015998208 bytes

After the synchronize, a retry succeeded. This could be related to the internals of how the ASYNC allocator works, or it could be partly because the limiting memory resource and the async memory resource it wraps are not in sync. It is worth noting that I tried a similar experiment with ARENA: I still hit OOM, and a synchronize doesn't seem to help in that case.

So, I believe there is value in making a change either in the plugin or in RmmJni.cpp.

Essentially what I think we want is:

  1. Try to handle the failure via spilling the regular way (this could also be tweaked, see below).
  2. When we can't spill anymore (https://github.com/NVIDIA/spark-rapids/blob/branch-22.12/sql-plugin/src/main/scala/com/nvidia/spark/rapids/DeviceMemoryEventHandler.scala#L52):
    a. If we haven't retried this allocation more than N times (some sort of count or time-based limit), call `Cuda.deviceSynchronize` and then allow the calling code to retry.
    b. If we have gone above our retry limits, we can let the executor fail as before.

We'll need to reset the limit if the job starts succeeding again without help. That said, there could be many threads failing and spilling, with several concurrent retries, so one approach may be a time-based retry: if the last failure for any thread was more than, say, 1 or 2 seconds ago, we can retry again; otherwise we know we retried and are still retrying, and it is probably best to give up. Keeping a retry counter per task thread also seems like an option. I'd like to hear input on preferences among the above.
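
To make the shape of this concrete, here is a minimal sketch of the sync-and-retry bookkeeping with a time-based reset. The retry limit, the time window, and the `trySpill` helper are assumptions for illustration only, not the plugin's actual code:

```scala
// Sketch of the proposed sync-and-retry logic, not the actual plugin code.
import java.util.concurrent.atomic.AtomicLong
import ai.rapids.cudf.Cuda

object SyncAndRetrySketch {
  private val maxRetries = 3           // hypothetical retry limit
  private val retryWindowMs = 2000L    // hypothetical quiet period before resetting
  private val lastFailureMs = new AtomicLong(0L)
  private val retriesSoFar = new AtomicLong(0L)

  /** Returning true asks RMM to retry the failed allocation. */
  def onAllocFailure(sizeRequested: Long): Boolean = {
    if (trySpill(sizeRequested)) {
      true // step 1: spilling freed memory, retry as usual
    } else {
      val now = System.currentTimeMillis()
      // reset the retry budget if no thread has failed recently
      if (now - lastFailureMs.getAndSet(now) > retryWindowMs) {
        retriesSoFar.set(0L)
      }
      if (retriesSoFar.incrementAndGet() <= maxRetries) {
        // step 2a: nothing left to spill; synchronize so pending async frees
        // become visible, then let the caller retry
        Cuda.deviceSynchronize()
        true
      } else {
        false // step 2b: past the retry limit, let the OOM propagate
      }
    }
  }

  // Placeholder for the existing spill path.
  private def trySpill(needed: Long): Boolean = false
}
```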

I believe we should also experiment with a synchronize before we try to spill, to see if step (1) above could be avoided in some cases. If the failures are due to fragmentation that a sync can resolve, syncing is probably cheaper than spilling in some scenarios.

abellina added the `feature request`, `? - Needs Triage`, and `reliability` labels on Oct 12, 2022.
mattahrens removed the `? - Needs Triage` label on Oct 18, 2022.
rapids-bot pushed a commit to rapidsai/cudf that referenced this issue on Oct 18, 2022:
This adds the method `boolean onAllocFailure(long sizeRequested, int retryCount)` to `RmmEventHandler`, to help the handling code keep track of the number of times an allocation failure has been retried.

With this, callers can perform extra logic depending on whether the callback is for a brand-new allocation failure or for one that has failed in the past and is being retried.

This will be used here: NVIDIA/spark-rapids#6768

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11940
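
For reference, a handler built on the new two-argument callback might look roughly like the following. This is a minimal Scala sketch: the retry limit and the no-op threshold callbacks are assumptions, not the plugin's actual implementation.

```scala
// Sketch of a handler that uses the retryCount argument added in rapidsai/cudf#11940.
import ai.rapids.cudf.{Cuda, RmmEventHandler}

class SyncOnFailureHandler(maxRetries: Int = 2) extends RmmEventHandler {

  // Called by RMM on allocation failure; retryCount says how many times this
  // same allocation has already been retried.
  override def onAllocFailure(sizeRequested: Long, retryCount: Int): Boolean = {
    if (retryCount < maxRetries) {
      Cuda.deviceSynchronize() // let pending async frees complete
      true                     // ask RMM to retry the allocation
    } else {
      false                    // past the limit, surface the OOM
    }
  }

  // Older single-argument callback delegates to the new one.
  override def onAllocFailure(sizeRequested: Long): Boolean =
    onAllocFailure(sizeRequested, 0)

  // Threshold callbacks are unused in this sketch.
  override def getAllocThresholds: Array[Long] = null
  override def getDeallocThresholds: Array[Long] = null
  override def onAllocThreshold(totalAllocSize: Long): Unit = {}
  override def onDeallocThreshold(totalAllocSize: Long): Unit = {}
}
```

A handler along these lines would be registered with `Rmm.setEventHandler`; the real plugin logic would also fold in the spill path described above.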
sameerz removed the `feature request` label on Oct 20, 2022.