[FEA][JNI] Track GPU memory usage at process and at local level #11949
Labels: feature request (New feature or request), Java (Affects Java cuDF API), Spark (Functionality that helps Spark RAPIDS)

This issue tracks the cuDF side of NVIDIA/spark-rapids#6745.

Essentially, we'd like to be able to track the maximum amount of GPU memory used when calling a set of cuDF functions from Spark. This would be something we enable during debugging, at least while new code is being implemented, but I could also see the Spark code having a mode that runs slower yet helps diagnose where memory is going as we call into cuDF.

The proposal is to add a global watermark for the whole process (the maximum GPU memory in use at any point in time) and also a "local" maximum that the user can reset as a debug tool, sketched below.
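As a rough illustration of the bookkeeping this proposal implies (a hypothetical sketch, not the actual RmmJni implementation), each allocation and free would update an outstanding-bytes counter, a process-wide watermark, and a resettable scoped watermark:

```java
// Hypothetical sketch of the proposed bookkeeping. All names here are
// illustrative; they are not the cuDF API.
public class WatermarkTracker {
  private long outstanding;     // bytes currently allocated
  private long processMax;      // global watermark, never reset
  private long scopedMax;       // local watermark, reset by the user
  private long scopedBaseline;  // outstanding bytes at the last reset

  public synchronized void onAlloc(long bytes) {
    outstanding += bytes;
    processMax = Math.max(processMax, outstanding);
    // Scoped usage is measured relative to the reset point.
    scopedMax = Math.max(scopedMax, outstanding - scopedBaseline);
  }

  public synchronized void onFree(long bytes) {
    outstanding -= bytes;
  }

  public synchronized void resetScoped() {
    scopedBaseline = outstanding;
    scopedMax = 0;
  }

  public synchronized long getProcessMaximum() { return processMax; }

  // Can legitimately report 0: if only pre-reset allocations were freed
  // since the reset, usage relative to the reset point is negative.
  public synchronized long getScopedMaximum() { return scopedMax; }
}
```

Note the edge case in `getScopedMaximum`: because frees of pre-reset allocations can push usage below the reset baseline, a scoped maximum of 0 is a valid answer, which is exactly the "surprising" behavior the first PR below calls out.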
abellina added the feature request, Needs Triage (Need team to review and classify), Java, and Spark labels on Oct 20, 2022.
abellina changed the title from "[FEA][JNI] Track held device memory per thread in RmmJni" to "[FEA][JNI] Track GPU memory usage at process and at local level" on Oct 20, 2022.
rapids-bot pushed a commit that referenced this issue on Oct 24, 2022:
This PR addresses #11949. We are adding methods to get the current memory usage watermarks at the whole-process level, and adding a "scoped" maximum, where the user can reset the initial value, run cuDF functions, and then call the API to see what happened since the reset.

For the scoped maximum, `getScopedMaximumOutstanding` could have somewhat surprising results. If the scoped maximum is reset to 0, for example, and we only see frees for allocations made before the reset, the scoped maximum returned will be 0, because our memory usage is literally negative in that scenario.

The APIs here assume that the caller process is using a single thread to call into the GPU (for Spark this would be one concurrent task). Note that `Rmm.initialize` is assumed to have been called; allocations made before that are not tracked.

Authors:
- Alessandro Bellina (https://github.com/abellina)

Approvers:
- Jim Brennan (https://github.com/jbrennan333)
- Jason Lowe (https://github.com/jlowe)

URL: #11950
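A minimal usage sketch of the scoped watermark: `getScopedMaximumOutstanding` is named in the PR text above, while `resetScopedMaximumOutstanding` is an assumed name for the reset API the PR describes:

```java
import ai.rapids.cudf.Rmm;

public class ScopedWatermarkDemo {
  // Measures peak outstanding GPU bytes while gpuWork runs on this thread.
  // Assumes Rmm.initialize(...) has already been called; allocations made
  // before initialization are not tracked.
  static long measurePeak(Runnable gpuWork) {
    Rmm.resetScopedMaximumOutstanding(); // assumed name for the reset API
    gpuWork.run();                       // single-threaded cuDF calls
    // Can return 0 if gpuWork only freed memory allocated before the
    // reset, since usage relative to the reset point went negative.
    return Rmm.getScopedMaximumOutstanding();
  }
}
```

In Spark terms, `gpuWork` would be the body of a single concurrent task, matching the single-thread assumption stated in the PR.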
The watermark API added in #11950 is useful, but we want to be able to factor in our spillable cache in the spark-rapids plugin, which means we need finer-grained optional callbacks on every allocate and free. This isn't high performance, but it doesn't seem to be the end of the world. As such, I am working on a patch that should ultimately close this particular issue.
rapids-bot pushed a commit that referenced this issue on Nov 4, 2022:
This adds `onAllocated` and `onDeallocated` to `RmmEventHandler` as debug callbacks. If the event handler is installed with debug enabled (in `Rmm.setEventHandler`), these callbacks are invoked whenever an allocation or deallocation finishes.

It also fixes a bug from #11950 where the initial allocated amount was not set appropriately: it was being set to 0 when it should have been set to the new initial value/maximum.

Closes #11949.

Authors:
- Alessandro Bellina (https://github.com/abellina)

Approvers:
- Jason Lowe (https://github.com/jlowe)

URL: #12054
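A sketch of what a debug handler built on these callbacks might look like: `onAllocated`, `onDeallocated`, and `Rmm.setEventHandler` come from the PR text, while the remaining handler methods and the debug flag shown below are assumptions about the surrounding interface:

```java
import ai.rapids.cudf.RmmEventHandler;

// Sketch only: the onAllocated/onDeallocated callbacks are from the PR
// description above; the other members are assumed stubs for the rest
// of the handler interface.
public class AllocationLogger implements RmmEventHandler {
  private long outstanding = 0; // single GPU thread assumed, as noted above

  @Override
  public void onAllocated(long sizeBytes) {
    outstanding += sizeBytes;
    System.err.println("alloc " + sizeBytes + " outstanding=" + outstanding);
  }

  @Override
  public void onDeallocated(long sizeBytes) {
    outstanding -= sizeBytes;
    System.err.println("free  " + sizeBytes + " outstanding=" + outstanding);
  }

  // No-op stubs: this logger only cares about the debug callbacks.
  @Override
  public boolean onAllocFailure(long sizeRequested) { return false; }
  @Override
  public long[] getAllocThresholds() { return null; }
  @Override
  public long[] getDeallocThresholds() { return null; }
  @Override
  public void onAllocThreshold(long totalAllocSize) { }
  @Override
  public void onDeallocThreshold(long totalAllocSize) { }
}
```

Installing it with debug enabled, e.g. something like `Rmm.setEventHandler(new AllocationLogger(), /* enableDebug */ true)` (the exact form of the flag is an assumption), would make every allocation and free report its effect on outstanding memory; as the comment above notes, this trades performance for visibility.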