Add gpu memory watermark apis to JNI #11950
Conversation
I'm not sure how much you care about performance here or how often you intend to use this functionality. There's definitely room for optimization, but it may not be worth the effort if you don't care much about the perf.
@jrhemstad fyi, we are going to simplify this a lot, removing the stack you just commented on, so it may not be worth a review right now. The idea now is to simply keep a watermark of the maximum memory used at a global level, and to add a "local" watermark that can be reset (essentially like an odometer). This means we'd use this functionality single-threaded while devs are trying to debug an issue, which removes a whole host of issues as well.
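For intuition, here is a minimal sketch of the odometer idea, assuming a single-threaded caller (all names here are illustrative, not the PR's actual implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Illustrative sketch, not the PR's actual code: a process-wide high-water
// mark plus a "local" one that can be reset like a trip odometer. Assumes a
// single thread performs all allocations.
struct watermark_tracker {
  std::size_t outstanding{0};  // bytes currently allocated
  std::size_t global_max{0};   // high-water mark since process start
  std::size_t baseline{0};     // outstanding at the last reset
  std::int64_t local_max{0};   // high-water mark since the last reset

  void on_alloc(std::size_t bytes) {
    outstanding += bytes;
    global_max = std::max(global_max, outstanding);
    // Usage relative to the reset point; negative if frees have dominated.
    auto const local = static_cast<std::int64_t>(outstanding) -
                       static_cast<std::int64_t>(baseline);
    local_max = std::max(local_max, local);
  }

  void on_free(std::size_t bytes) { outstanding -= bytes; }

  // Restart the local watermark from the current point, odometer-style;
  // the global watermark keeps counting.
  void reset_local_max() {
    baseline = outstanding;
    local_max = 0;
  }
};
```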
You may want to check out the
Yes @jrhemstad. The end result is that we want to figure out ways to stay within some limits, which dictates how big a table we should be aiming for and how much concurrency we should allow. So far we think we can limit concurrency to 1 thread, and run with a reduced pool or some debug flags to tell us what part of a query violated our assumption (not a production setup, obviously). We are also trying to estimate how much memory we might use, but all of that is Spark plugin-side and will be separate changes.
So the above has a bug because I am using
Codecov Report

Base: 88.11% // Head: 88.14% // Increases project coverage by +0.03%.

Additional details and impacted files:

```
@@              Coverage Diff               @@
##           branch-22.12     #11950    +/-  ##
===============================================
+ Coverage        88.11%     88.14%   +0.03%
===============================================
  Files              133        133
  Lines            21982      21982
===============================================
+ Hits             19369      19376       +7
+ Misses            2613       2606       -7
```
☔ View full report at Codecov.
java/src/main/native/src/RmmJni.cpp

```cpp
// `total_allocated - local_allocated` can be negative in the case where we free
// after we call `reset_local_max_outstanding`
std::size_t local_diff = std::max(static_cast<long>(total_allocated - local_allocated), 0L);
```
Maybe use `static_cast<intptr_t>` instead of `static_cast<long>`. I don't think `long` is guaranteed to be big enough to hold a `size_t`.
So I think `long` == `long long` here (I think this is a 32-bit vs 64-bit compiled programs distinction). To cover all of `std::size_t`, I'd have to go to `unsigned long`. That's a lot of GPU memory ;) I am not sure we need to worry too much about that, especially since we are going to send this to Spark shortly, which runs Java, and Java's long is 64-bit and signed.

```
size_t max value:             18446744073709551615
long max value:                9223372036854775807
unsigned long max value:      18446744073709551615
long long max value:           9223372036854775807
unsigned long long max value: 18446744073709551615
```
I think in this case `long` is sufficient because we are on an LP64 architecture (we don't run on Windows, do we?). `std::intptr_t` is guaranteed to be the same width as `std::size_t`, but signed (I don't think `ssize_t` is standard?). You could use `int64_t` here, since as you say we know we are going to pass it to Java, which is using 64 bits. This was more of a technical nit than an actual concern that it will break (too much history cross-porting to different architectures...)
These types are very confusing and can be aliases of other types depending on the system. Therefore, for clarity, please always use the fixed-width types `(u)int32_t` and `(u)int64_t`; they guarantee known limits.
So this is a debug feature, we don't use `(u)int*` types right now (i.e. I'd make things more inconsistent unless I change the whole thing), and I am not sure whether cuDF is moving away from the alias types. I think we can update in one future PR whose purpose is "move away from these old types to the better ones".
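For reference, a minimal sketch of what the fixed-width clamp discussed above might look like (variable names follow the RmmJni.cpp snippet earlier in the thread; this is an illustration, not the merged code):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Sketch of the fixed-width variant discussed above (illustrative, not the
// merged code). The unsigned subtraction wraps when frees outnumber
// allocations since the last reset, so compute the difference in a signed
// 64-bit type and clamp at zero.
std::size_t local_diff(std::size_t total_allocated, std::size_t local_allocated) {
  auto const diff = static_cast<std::int64_t>(total_allocated) -
                    static_cast<std::int64_t>(local_allocated);
  return static_cast<std::size_t>(std::max(diff, std::int64_t{0}));
}
```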
@jlowe this should be ready for another look
I see some unrelated test failures on the Python side. Upmerging for now.
+1 lgtm
@jlowe this should be ready for another look
@gpucibot merge
This adds `onAllocated` and `onDeallocated` to `RmmEventHandler` as debug callbacks. If the event handler is installed with debug enabled (in `Rmm.setEventHandler`), these callbacks will be invoked when an allocation or deallocation finishes. It also fixes a bug in #11950 where the initial allocated amount was not getting set appropriately: it was being set to 0 when it should be set to the new initial value/maximum. Closes #11949.

Authors:
- Alessandro Bellina (https://github.com/abellina)

Approvers:
- Jason Lowe (https://github.com/jlowe)

URL: #12054
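As a rough illustration of the callback pattern this describes (a generic sketch under assumed names, not the actual cuDF JNI implementation):

```cpp
#include <cstddef>

// Generic sketch of the debug-callback idea described above (not the actual
// cuDF JNI code): an allocator wrapper that invokes handler callbacks once
// an allocation or deallocation has finished.
template <typename Upstream, typename Handler>
struct callback_allocator {
  Upstream& upstream;
  Handler& handler;

  void* allocate(std::size_t bytes) {
    void* p = upstream.allocate(bytes);
    handler.on_allocated(bytes);  // fires only after a successful allocation
    return p;
  }

  void deallocate(void* p, std::size_t bytes) {
    upstream.deallocate(p, bytes);
    handler.on_deallocated(bytes);
  }
};
```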
This PR addresses #11949.
We are adding methods to get the current memory usage watermarks at the whole-process level, and adding a "scoped" maximum, where the user can reset the initial value, run cuDF functions, and then call the API to get what happened since the reset.

For the scoped maximum, `getScopedMaximumOutstanding` could have somewhat surprising results. If the scoped maximum is reset to 0, for example, and we only see frees for allocations done before the reset, the scoped maximum returned is 0. This is because our memory usage is literally negative in this scenario.

The APIs here assume that the caller process is using a single thread to call into the GPU (for Spark it would be 1 concurrent task).
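Using the hypothetical `watermark_tracker` sketch from earlier in the thread, the surprising case plays out like this:

```cpp
// The "surprising" case described above, using the watermark_tracker sketch
// from earlier in the thread: after a reset we only observe frees, so the
// scoped maximum legitimately reports 0.
watermark_tracker t;
t.on_alloc(1 << 20);   // 1 MiB allocated before the reset
t.reset_local_max();   // scoped tracking restarts here
t.on_free(1 << 20);    // freeing a pre-reset allocation: usage goes negative
// t.local_max is still 0: nothing after the reset pushed usage above zero.
```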
Note: I assume `Rmm.initialize` has been called; allocations done before that are not tracked.