-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
set async allocator size to be the same as the limiting adaptor #9505
Conversation
// Use `limiting_resource_adaptor` to set a hard limit on the max pool size since | ||
// `cuda_async_memory_resource` only has a release threshold. | ||
Initialized_resource = rmm::mr::make_owning_wrapper<rmm::mr::limiting_resource_adaptor>( | ||
std::make_shared<rmm::mr::cuda_async_memory_resource>(pool_size, release_threshold), | ||
pool_limit); | ||
std::make_shared<rmm::mr::cuda_async_memory_resource>(pool_size, pool_size), pool_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't pool_size
here just an initial size that can grow up to the maximum size rather than being the absolute limit of the pool size? It seems wrong to ignore the value in max_pool_size
if it is specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the current configs are not exact matches for the async allocator. From what I've seen its memory usage fluctuates up quite a bit beyond the limit we set, so if we use the max size as the limit, the allocator may run into oom errors. This change sort of works with our current assumptions that the pool size is the max free memory minus the reserve. Probably need to do some cleanup down the road as suggested in #9209.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what if the specified pool size is already the maximum size (i.e.: pool_size == max_size)? Doesn't that also lead to a problematic case? It seems like this kind of change just happens to work for your use case with certain pool_size, max_pool_size, and reserved size settings, but it doesn't seem like a general solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current default value for pool_size is the max free memory minus reserve, while the max_size is the total memory minus reserve, so this change is just trying to make the async allocator work better with default settings. If the user changes these settings, they are on their own. :)
We probably just get rid of the max_size and just have one size for the pool size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minimally there needs to be some comments here explaining why this code is written the way it is, as it is making assumptions on what pool_size
is set to relative to what the user wants. The Javadoc should also be updated to explain that the maximum size argument is ignored when the pool mode is using the async allocator and that the pool size cannot grow -- the initial size is the limit size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to just get rid of max_pool_size
? In all the pool implementations, we should really just set a fixed size. In the plugin config we can still have the min/max fractions, but they could just be limits on what the pool size can be set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think just having a single, fixed size should be fine. To avoid breaking the plugin, we may want to do this in phases:
- Add the new Java API in cudf to take a single, fixed pool size and deprecate the old Java API
- Update the plugin to use the new API
- Remove the deprecated Java API from cudf
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #9505 +/- ##
================================================
- Coverage 10.79% 10.66% -0.13%
================================================
Files 116 117 +1
Lines 18869 19725 +856
================================================
+ Hits 2036 2104 +68
- Misses 16833 17621 +788
Continue to review full report at Codecov.
|
Superseded by #9583 |
I was getting out of memory errors from the async allocator. I think previously the limiting adaptor was set too loosely, causing the async allocator to run out of memory.