[FEA][JNI] Track GPU memory usage at process and at local level #11949

abellina · 2022-10-20T14:44:02Z

This issue tracks the cuDF side of NVIDIA/spark-rapids#6745.

Essentially, we'd like to be able to track the maximum amount of memory used when calling a set of cuDF functions from Spark. We would enable this during debugging, at least while new code is being implemented, but I could also see the Spark code having a mode that runs slower in exchange for helping diagnose where memory is going as we call into cuDF.

The proposal is to add a global watermark for the whole process (maximum GPU memory used at any point in time) and also add a "local" maximum, that can be reset by the user as a debug tool.

This PR addresses #11949. We are adding methods to get the current memory usage watermarks at the whole process level and adding a "scoped" maximum, where the user can reset the initial value, run cuDF functions, and then call the API to get what happened since the reset.

For the scoped maximum, `getScopedMaximumOutstanding` can return somewhat surprising results. If the scoped maximum is reset to 0, for example, and we only see frees for allocations done before the reset, the scoped maximum returned is 0. This is because our memory usage is literally negative in this scenario.

The APIs here assume that the calling process uses a single thread to call into the GPU (for Spark it would be 1 concurrent task).

Note I assume `Rmm.initialize` has been called, otherwise this doesn't track allocations done before that.

Authors:
- Alessandro Bellina (https://github.com/abellina)

Approvers:
- Jim Brennan (https://github.com/jbrennan333)
- Jason Lowe (https://github.com/jlowe)

URL: #11950
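To make the scoped-maximum behavior above concrete, here is a minimal, self-contained sketch of the bookkeeping. It is a toy model, not the cuDF implementation: the class `WatermarkSketch` and its methods `onAlloc`, `onFree`, `resetScoped`, `getMaximumTotal`, and `getScopedMaximum` are hypothetical names chosen for illustration, and it assumes a single calling thread, as the PR does.

```java
// Toy model of the watermark bookkeeping described above.
// All names here are hypothetical; this is not the cuDF/RMM code.
class WatermarkSketch {
  private long totalAllocated = 0;    // bytes currently outstanding in the process
  private long maxTotalAllocated = 0; // process-level watermark
  private long scopedAllocated = 0;   // net bytes since the last reset (may go negative)
  private long scopedMax = 0;         // scoped watermark

  void onAlloc(long bytes) {
    totalAllocated += bytes;
    maxTotalAllocated = Math.max(maxTotalAllocated, totalAllocated);
    scopedAllocated += bytes;
    scopedMax = Math.max(scopedMax, scopedAllocated);
  }

  void onFree(long bytes) {
    // Frees never raise a maximum, so only the running totals change.
    totalAllocated -= bytes;
    scopedAllocated -= bytes;
  }

  // Start a new scope: both the running amount and the maximum begin at initialValue.
  void resetScoped(long initialValue) {
    scopedAllocated = initialValue;
    scopedMax = initialValue;
  }

  long getMaximumTotal() { return maxTotalAllocated; }

  long getScopedMaximum() { return scopedMax; }

  public static void main(String[] args) {
    WatermarkSketch w = new WatermarkSketch();
    w.onAlloc(100);    // allocation made before the scope starts
    w.resetScoped(0);  // begin the scope at 0
    w.onFree(100);     // free of a pre-reset allocation: scoped net usage is now -100
    System.out.println(w.getScopedMaximum()); // prints 0
    w.onAlloc(40);     // scoped net usage is -60, still below the maximum of 0
    System.out.println(w.getScopedMaximum()); // prints 0
    System.out.println(w.getMaximumTotal());  // prints 100
  }
}
```

In this model the 40-byte allocation after the reset is hidden by the earlier free, which is the surprising case called out above. Resetting only after pre-existing buffers have been released avoids it.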

abellina · 2022-11-01T13:54:14Z

The watermark API added in #11950 is useful, but we want to be able to factor in our spillable cache in the spark-rapids plugin, which so far means that we need finer-grained optional callbacks on every allocate and free. This isn't high performance, but it doesn't seem to be the end of the world. As such, I am working on a patch that should ultimately close this particular issue.

This adds `onAllocated` and `onDeallocated` to `RmmEventHandler` as debug callbacks. If the event handler is installed with debug enabled (in `Rmm.setEventHandler`), these callbacks will be invoked when an allocation or deallocation finishes.

It also fixes a bug with #11950 where the initial allocated amount was not getting set appropriately. It was getting set to 0, but instead it should be set to the new initial value/maximum.

Closes #11949.

Authors:
- Alessandro Bellina (https://github.com/abellina)

Approvers:
- Jason Lowe (https://github.com/jlowe)

URL: #12054
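As a rough illustration of how per-allocation callbacks could let a caller factor in a spillable cache, the sketch below tracks the peak of outstanding bytes minus bytes the caller has marked spillable. Everything in it is an assumption made for the example: the interface `DebugAllocationCallbacks` is a local stand-in rather than `RmmEventHandler` itself, the `long size` parameter is assumed, and `markSpillable`/`unmarkSpillable` are hypothetical hooks that a spill framework would call.

```java
// Local stand-in for the two debug callbacks; the signature is an assumption.
interface DebugAllocationCallbacks {
  void onAllocated(long size);
  void onDeallocated(long size);
}

// Hypothetical tracker: peak of (outstanding bytes - spillable bytes).
// Assumes a single calling thread, like the watermark API.
class NonSpillableTracker implements DebugAllocationCallbacks {
  private long outstanding = 0;      // bytes allocated and not yet freed
  private long spillable = 0;        // bytes the caller says could be spilled
  private long peakNonSpillable = 0; // maximum of outstanding - spillable seen so far

  @Override
  public void onAllocated(long size) {
    outstanding += size;
    updatePeak();
  }

  @Override
  public void onDeallocated(long size) {
    outstanding -= size; // a free can only lower the non-spillable amount
  }

  // Called by the (hypothetical) spill framework when a buffer becomes spillable.
  void markSpillable(long size) {
    spillable += size;
  }

  // Called when a buffer stops being spillable. If the buffer was freed,
  // call this after onDeallocated so the peak is not inflated.
  void unmarkSpillable(long size) {
    spillable -= size;
    updatePeak();
  }

  private void updatePeak() {
    peakNonSpillable = Math.max(peakNonSpillable, outstanding - spillable);
  }

  long getOutstanding() { return outstanding; }

  long getPeakNonSpillable() { return peakNonSpillable; }
}
```

Doing this arithmetic on every allocate and free is the "not high performance" cost mentioned above, which is why the callbacks are only invoked when debug is enabled.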

abellina added the feature request, Needs Triage, Java, and Spark labels Oct 20, 2022

abellina self-assigned this Oct 20, 2022

abellina mentioned this issue Oct 20, 2022

Add gpu memory watermark apis to JNI #11950

Merged

abellina changed the title ~~[FEA][JNI] Track held device memory per thread in RmmJni~~ [FEA][JNI] Track GPU memory usage at process and at local level Oct 20, 2022

GregoryKimball removed the Needs Triage label Oct 21, 2022

abellina mentioned this issue Nov 3, 2022

[JNI] Add debug-only onAllocated/onDeallocated to RmmEventHandler #12054

Merged

rapids-bot bot closed this as completed in #12054 Nov 4, 2022
