Add gpu memory watermark apis to JNI #11950
Conversation
I'm not sure how much you care about performance here or how often you intend to use this functionality. There's definitely room for optimization, but it may not be worth the effort if you don't care much about the perf.
@jrhemstad fyi, we are going to simplify this a lot, removing the stack you just commented on, so it may not be worth a review right now. The idea now is to simply keep a watermark of the maximum memory used at a global level, and to add a "local" watermark that can be reset (essentially like an odometer). This means we'd use this functionality single-threaded while devs are trying to debug an issue, which removes a whole host of issues as well.
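For intuition, here is a minimal sketch of the odometer idea, assuming a single-threaded caller (all names here are illustrative, not the PR's actual implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Illustrative sketch, not the PR's actual code: a process-wide high-water
// mark plus a "local" one that can be reset like a trip odometer. Assumes a
// single thread performs all allocations.
struct watermark_tracker {
  std::size_t outstanding{0};  // bytes currently allocated
  std::size_t global_max{0};   // high-water mark since process start
  std::size_t baseline{0};     // outstanding at the last reset
  std::int64_t local_max{0};   // high-water mark since the last reset

  void on_alloc(std::size_t bytes) {
    outstanding += bytes;
    global_max = std::max(global_max, outstanding);
    // Usage relative to the reset point; negative if frees have dominated.
    auto const local = static_cast<std::int64_t>(outstanding) -
                       static_cast<std::int64_t>(baseline);
    local_max = std::max(local_max, local);
  }

  void on_free(std::size_t bytes) { outstanding -= bytes; }

  // Restart the local watermark from the current point, odometer-style;
  // the global watermark keeps counting.
  void reset_local_max() {
    baseline = outstanding;
    local_max = 0;
  }
};
```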
You may want to check out the
Yes @jrhemstad. The end result is that we want to figure out ways to stay within some limits, which dictates how big a table we should be aiming for and how much concurrency we should allow. So far we think we can limit concurrency to 1 thread, and run with a reduced pool or some debug flags to tell us what part of a query violated our assumption (not a production setup, obviously). We are also trying to estimate how much memory we might use, but all of that is Spark plugin-side and will be separate changes.
So the above has a bug because I am using
Codecov Report

Base: 88.11% // Head: 88.14% // Increases project coverage by +0.03%.

Additional details and impacted files:

```
@@              Coverage Diff               @@
##           branch-22.12     #11950    +/-  ##
===============================================
+ Coverage        88.11%     88.14%   +0.03%
===============================================
  Files              133        133
  Lines            21982      21982
===============================================
+ Hits             19369      19376       +7
+ Misses            2613       2606       -7
```
☔ View full report at Codecov.
java/src/main/native/src/RmmJni.cpp

```cpp
// `total_allocated - local_allocated` can be negative in the case where we free
// after we call `reset_local_max_outstanding`
std::size_t local_diff = std::max(static_cast<long>(total_allocated - local_allocated), 0L);
```
Maybe use `static_cast<intptr_t>` instead of `static_cast<long>`. I don't think `long` is guaranteed to be big enough to hold a `size_t`.
So I think `long` == `long long` here (I think this is a 32-bit vs 64-bit compiled programs distinction). To cover all of `std::size_t`, I'd have to go to `unsigned long`. That's a lot of GPU memory ;) I am not sure we need to worry too much about that, especially since we are going to send this to Spark shortly, which runs Java, and Java's long is 64-bit and signed.

```
size_t max value:             18446744073709551615
long max value:                9223372036854775807
unsigned long max value:      18446744073709551615
long long max value:           9223372036854775807
unsigned long long max value: 18446744073709551615
```
I think in this case `long` is sufficient because we are on an LP64 architecture (we don't run on Windows, do we?). `std::intptr_t` is guaranteed to be the same width as `std::size_t`, but signed (I don't think `ssize_t` is standard?). You could use `int64_t` here, since as you say we know we are going to pass it to Java, which is using 64 bits. This was more of a technical nit than an actual concern that it will break (too much history cross-porting to different architectures...)
These types are very confusing and can be aliases of other types depending on the system. Therefore, for clarity, please always use the fixed-width types `(u)int32_t` and `(u)int64_t`; they guarantee known limits.
So this is a debug feature, we don't use `(u)int*` types right now (i.e. I'd make things more inconsistent unless I change the whole thing), and I am not sure whether cuDF is moving away from the alias types. I think we can update in one future PR whose purpose is "move away from these old types to the better ones".
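For reference, a minimal sketch of what the fixed-width clamp discussed above might look like (variable names follow the RmmJni.cpp snippet earlier in the thread; this is an illustration, not the merged code):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Sketch of the fixed-width variant discussed above (illustrative, not the
// merged code). The unsigned subtraction wraps when frees outnumber
// allocations since the last reset, so compute the difference in a signed
// 64-bit type and clamp at zero.
std::size_t local_diff(std::size_t total_allocated, std::size_t local_allocated) {
  auto const diff = static_cast<std::int64_t>(total_allocated) -
                    static_cast<std::int64_t>(local_allocated);
  return static_cast<std::size_t>(std::max(diff, std::int64_t{0}));
}
```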
@jlowe this should be ready for another look
I see some unrelated test failures on the Python side. Upmerging for now.
+1 lgtm
@jlowe this should be ready for another look
@gpucibot merge
This adds `onAllocated` and `onDeallocated` to `RmmEventHandler` as debug callbacks. If the event handler is installed with debug enabled (in `Rmm.setEventHandler`), these callbacks will be invoked when an allocation or deallocation finishes. It also fixes a bug in #11950 where the initial allocated amount was not getting set appropriately: it was being set to 0 when it should be set to the new initial value/maximum. Closes #11949.

Authors:
- Alessandro Bellina (https://github.com/abellina)

Approvers:
- Jason Lowe (https://github.com/jlowe)

URL: #12054
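As a rough illustration of the callback pattern this describes (a generic sketch under assumed names, not the actual cuDF JNI implementation):

```cpp
#include <cstddef>

// Generic sketch of the debug-callback idea described above (not the actual
// cuDF JNI code): an allocator wrapper that invokes handler callbacks once
// an allocation or deallocation has finished.
template <typename Upstream, typename Handler>
struct callback_allocator {
  Upstream& upstream;
  Handler& handler;

  void* allocate(std::size_t bytes) {
    void* p = upstream.allocate(bytes);
    handler.on_allocated(bytes);  // fires only after a successful allocation
    return p;
  }

  void deallocate(void* p, std::size_t bytes) {
    upstream.deallocate(p, bytes);
    handler.on_deallocated(bytes);
  }
};
```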
This PR addresses #11949.
We are adding methods to get the current memory usage watermarks at the whole-process level, and adding a "scoped" maximum, where the user can reset the initial value, run cuDF functions, and then call the API to get what happened since the reset.

For the scoped maximum, `getScopedMaximumOutstanding` could have somewhat surprising results. If the scoped maximum is reset to 0, for example, and we only see frees for allocations done before the reset, the scoped maximum returned is 0. This is because our memory usage is literally negative in this scenario.

The APIs here assume that the caller process is using a single thread to call into the GPU (for Spark it would be 1 concurrent task).
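Using the hypothetical `watermark_tracker` sketch from earlier in the thread, the surprising case plays out like this:

```cpp
// The "surprising" case described above, using the watermark_tracker sketch
// from earlier in the thread: after a reset we only observe frees, so the
// scoped maximum legitimately reports 0.
watermark_tracker t;
t.on_alloc(1 << 20);   // 1 MiB allocated before the reset
t.reset_local_max();   // scoped tracking restarts here
t.on_free(1 << 20);    // freeing a pre-reset allocation: usage goes negative
// t.local_max is still 0: nothing after the reset pushed usage above zero.
```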
Note: I assume `Rmm.initialize` has been called; allocations done before that are not tracked.