[FEA][JNI] Track GPU memory usage at process and at local level #11949

Closed
abellina opened this issue Oct 20, 2022 · 1 comment · Fixed by #12054
Labels
feature request New feature or request Java Affects Java cuDF API. Spark Functionality that helps Spark RAPIDS

Comments

@abellina
Contributor

abellina commented Oct 20, 2022

This issue tracks the cuDF side of NVIDIA/spark-rapids#6745.

Essentially, we'd like to track the maximum amount of GPU memory used while calling a set of cuDF functions from Spark. We would enable this mainly for debugging, at least while new code is being implemented, but I could also see the Spark code having a mode that runs slower and helps diagnose where memory is going as we call into cuDF.

The proposal is to add a global watermark for the whole process (the maximum GPU memory used at any point in time) and a "local" maximum that the user can reset as a debug tool.
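
As a rough illustration of the two watermarks described above, here is a minimal Java sketch. The class `MemoryWatermark` and all of its method names are hypothetical and are not part of cuDF or RMM. It also assumes that resetting the local maximum restarts it from the current usage, which the proposal does not specify.

```java
/** Hypothetical sketch of a process-wide and a resettable "local" watermark. */
final class MemoryWatermark {
    private long current;    // bytes currently allocated
    private long globalMax;  // highest `current` seen over the process lifetime
    private long localMax;   // highest `current` seen since the last reset

    /** Record an allocation and raise both watermarks if needed. */
    synchronized void allocated(long bytes) {
        current += bytes;
        globalMax = Math.max(globalMax, current);
        localMax = Math.max(localMax, current);
    }

    /** Record a free; a free never lowers either watermark. */
    synchronized void freed(long bytes) {
        current -= bytes;
    }

    /** Assumption: the local maximum restarts from the current usage. */
    synchronized void resetLocalMax() {
        localMax = current;
    }

    synchronized long getGlobalMax() { return globalMax; }

    synchronized long getLocalMax() { return localMax; }

    public static void main(String[] args) {
        MemoryWatermark w = new MemoryWatermark();
        w.allocated(300);
        w.freed(200);      // current = 100
        w.resetLocalMax(); // localMax = 100
        w.allocated(50);   // current = 150
        System.out.println(w.getGlobalMax()); // prints 300
        System.out.println(w.getLocalMax());  // prints 150
    }
}
```

The global value keeps the earlier 300-byte peak, while the local value only reflects what happened after the reset.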

@abellina abellina added feature request New feature or request Needs Triage Need team to review and classify Java Affects Java cuDF API. Spark Functionality that helps Spark RAPIDS labels Oct 20, 2022
@abellina abellina self-assigned this Oct 20, 2022
@abellina abellina changed the title [FEA][JNI] Track held device memory per thread in RmmJni [FEA][JNI] Track GPU memory usage at process and at local level Oct 20, 2022
@GregoryKimball GregoryKimball removed the Needs Triage Need team to review and classify label Oct 21, 2022
rapids-bot bot pushed a commit that referenced this issue Oct 24, 2022
This PR addresses #11949.

This adds methods to get the memory usage watermark for the whole process, plus a "scoped" maximum: the user resets the initial value, runs cuDF functions, and then calls the API to see the maximum reached since the reset.

For the scoped maximum, `getScopedMaximumOutstanding` could have somewhat surprising results. If the scoped maximum is reset to 0, for example, and we only see frees for allocations made before the reset, the scoped maximum returned is 0. This is because the net memory usage since the reset is negative in this scenario, so it never rises above the reset value.
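
The case above can be reproduced with a small model of the scoped counter. This is a sketch only: `ScopedMaximum` and its methods are hypothetical names, not the cuDF API, and it assumes a single calling thread. The counter tracks net bytes since the reset, so frees of older allocations push it below zero and the maximum stays at the reset value.

```java
/** Hypothetical model of a scoped maximum; not the cuDF implementation. */
final class ScopedMaximum {
    private long scopedOutstanding; // net bytes since the last reset; may go negative
    private long scopedMax;         // highest value of scopedOutstanding since the reset

    /** Restart the scope; both the counter and the maximum start at initialValue. */
    void reset(long initialValue) {
        scopedOutstanding = initialValue;
        scopedMax = initialValue;
    }

    void onAlloc(long bytes) {
        scopedOutstanding += bytes;
        scopedMax = Math.max(scopedMax, scopedOutstanding);
    }

    void onFree(long bytes) {
        scopedOutstanding -= bytes;
    }

    long getScopedOutstanding() { return scopedOutstanding; }

    long getScopedMaximum() { return scopedMax; }

    public static void main(String[] args) {
        ScopedMaximum s = new ScopedMaximum();
        s.onAlloc(100); // allocation made before the reset
        s.reset(0);
        s.onFree(100);  // free of the pre-reset allocation
        System.out.println(s.getScopedOutstanding()); // prints -100
        System.out.println(s.getScopedMaximum());     // prints 0
    }
}
```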

The APIs here assume that the caller process is using a single thread to call into the GPU (for Spark it would be 1 concurrent task).

Note: I assume `Rmm.initialize` has been called; allocations made before that call are not tracked.

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Jim Brennan (https://github.com/jbrennan333)
  - Jason Lowe (https://github.com/jlowe)

URL: #11950
@abellina
Contributor Author

abellina commented Nov 1, 2022

The watermark API added in #11950 is useful, but we want to be able to factor in our spillable cache in the spark-rapids plugin, which for now means we need finer-grained, optional callbacks on every allocate and free. This isn't high performance, but it doesn't seem to be the end of the world. As such, I am working on a patch that should ultimately close this issue.

rapids-bot bot pushed a commit that referenced this issue Nov 4, 2022
This adds `onAllocated` and `onDeallocated` to `RmmEventHandler` as debug callbacks. If the event handler is installed with debug enabled (in `Rmm.setEventHandler`), these callbacks are invoked when an allocation or deallocation finishes.
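
Below is a minimal sketch of how debug callbacks like these can be gated and consumed. Everything here is a stand-in written for illustration: `AllocationDebugHandler`, `FakeAllocator`, and `PeakTrackingHandler` are hypothetical and do not reproduce the actual `RmmEventHandler` or `Rmm.setEventHandler` signatures.

```java
/** Hypothetical callback interface; the defaults do nothing. */
interface AllocationDebugHandler {
    default void onAllocated(long sizeBytes) {}
    default void onDeallocated(long sizeBytes) {}
}

/** Stand-in allocator that fires callbacks only when debug is enabled. */
final class FakeAllocator {
    private AllocationDebugHandler handler;
    private boolean debugEnabled;

    void setEventHandler(AllocationDebugHandler handler, boolean enableDebug) {
        this.handler = handler;
        this.debugEnabled = enableDebug;
    }

    void allocate(long sizeBytes) {
        // A real allocator would reserve memory here, then notify.
        if (debugEnabled && handler != null) {
            handler.onAllocated(sizeBytes);
        }
    }

    void deallocate(long sizeBytes) {
        // A real allocator would release memory here, then notify.
        if (debugEnabled && handler != null) {
            handler.onDeallocated(sizeBytes);
        }
    }
}

/** Example consumer: tracks outstanding bytes and their peak. */
final class PeakTrackingHandler implements AllocationDebugHandler {
    long outstanding;
    long peak;

    @Override
    public void onAllocated(long sizeBytes) {
        outstanding += sizeBytes;
        peak = Math.max(peak, outstanding);
    }

    @Override
    public void onDeallocated(long sizeBytes) {
        outstanding -= sizeBytes;
    }
}

final class CallbackDemo {
    public static void main(String[] args) {
        FakeAllocator allocator = new FakeAllocator();
        PeakTrackingHandler tracker = new PeakTrackingHandler();
        allocator.setEventHandler(tracker, true);
        allocator.allocate(64);
        allocator.allocate(32);
        allocator.deallocate(64);
        System.out.println(tracker.peak);        // prints 96
        System.out.println(tracker.outstanding); // prints 32
    }
}
```

Because the handler sees every allocate and free, a caller can combine these numbers with its own bookkeeping, at the cost of one extra call per event.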

It also fixes a bug in #11950 where the initial allocated amount was not being set correctly: it was set to 0 when it should have been set to the new initial value/maximum.

Closes #11949.

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #12054