
set async allocator size to be the same as the limiting adaptor #9505

Closed
rongou wants to merge 2 commits into branch-21.12 from async-allocator-size

Conversation

@rongou (Contributor) commented on Oct 22, 2021

I was getting out of memory errors from the async allocator. I think previously the limiting adaptor was set too loosely, causing the async allocator to run out of memory.

@rongou added the bug, 3 - Ready for Review, RMM, Java, Spark, and non-breaking labels on Oct 22, 2021
@rongou requested a review from abellina on October 22, 2021 22:00
@rongou self-assigned this on Oct 22, 2021
@rongou requested a review from a team as a code owner on October 22, 2021 22:00
  // Use `limiting_resource_adaptor` to set a hard limit on the max pool size since
  // `cuda_async_memory_resource` only has a release threshold.
  Initialized_resource = rmm::mr::make_owning_wrapper<rmm::mr::limiting_resource_adaptor>(
-     std::make_shared<rmm::mr::cuda_async_memory_resource>(pool_size, release_threshold),
-     pool_limit);
+     std::make_shared<rmm::mr::cuda_async_memory_resource>(pool_size, pool_size), pool_size);
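(For illustration only, not part of the PR: a minimal sketch of what the changed construction amounts to. Both the async resource's release threshold and the limiting adaptor's hard cap are set to the same pool_size, so an allocation that would push usage past pool_size fails at the adaptor instead of exhausting the async pool. The helper name is made up; the RMM headers and calls are the ones the snippet above already uses.)

#include <rmm/mr/device/cuda_async_memory_resource.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/mr/device/limiting_resource_adaptor.hpp>
#include <rmm/mr/device/owning_wrapper.hpp>

#include <cstddef>
#include <memory>

// Sketch: build an async pool whose advertised size is also its hard limit.
std::shared_ptr<rmm::mr::device_memory_resource> make_capped_async_mr(std::size_t pool_size)
{
  // pool_size is used three times: initial pool size, release threshold, and the
  // limiting adaptor's allocation limit, mirroring the one-line change above.
  return rmm::mr::make_owning_wrapper<rmm::mr::limiting_resource_adaptor>(
      std::make_shared<rmm::mr::cuda_async_memory_resource>(pool_size, pool_size), pool_size);
}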
Member:
Isn't pool_size here just an initial size that can grow up to the maximum size rather than being the absolute limit of the pool size? It seems wrong to ignore the value in max_pool_size if it is specified.

Contributor Author (@rongou):

Yeah, the current configs are not exact matches for the async allocator. From what I've seen, its memory usage fluctuates quite a bit above the limit we set, so if we use the max size as the limit, the allocator may run into OOM errors. This change sort of works with our current assumption that the pool size is the max free memory minus the reserve. We'll probably need to do some cleanup down the road, as suggested in #9209.

Member:

But what if the specified pool size is already the maximum size (i.e.: pool_size == max_size)? Doesn't that also lead to a problematic case? It seems like this kind of change just happens to work for your use case with certain pool_size, max_pool_size, and reserved size settings, but it doesn't seem like a general solution.

Contributor Author (@rongou):

The current default value for pool_size is the max free memory minus reserve, while the max_size is the total memory minus reserve, so this change is just trying to make the async allocator work better with default settings. If the user changes these settings, they are on their own. :)

We could probably just get rid of max_size and have a single value for the pool size.
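(Illustrative only, not plugin code: a small sketch of the default sizing described above, assuming the sizes are derived from cudaMemGetInfo; the struct and function names are hypothetical.)

#include <cuda_runtime_api.h>

#include <cstddef>

// Defaults as described in the comment above: pool_size is free memory minus the
// reserve, max_size is total memory minus the reserve.
struct DefaultSizes {
  std::size_t pool_size;
  std::size_t max_size;
};

DefaultSizes default_sizes(std::size_t reserve)
{
  std::size_t free_bytes  = 0;
  std::size_t total_bytes = 0;
  cudaMemGetInfo(&free_bytes, &total_bytes);  // assumption: sizes come from the current device
  return {free_bytes - reserve, total_bytes - reserve};
}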

Member:

Minimally, there need to be some comments here explaining why this code is written the way it is, since it makes assumptions about what pool_size is set to relative to what the user wants. The Javadoc should also be updated to explain that the maximum size argument is ignored when the pool mode uses the async allocator and that the pool size cannot grow: the initial size is the limit.

Contributor Author (@rongou):

Would it be better to just get rid of max_pool_size? In all the pool implementations, we should really just set a fixed size. In the plugin config we can still have the min/max fractions, but they would just be limits on what the pool size can be set to.
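(Illustrative only, not plugin code: a sketch of the idea floated above, where the config's min/max fractions merely clamp a single fixed pool size; all names here are hypothetical.)

#include <algorithm>

#include <cstddef>

// Clamp the requested pool size between min/max fractions of total device memory.
std::size_t clamp_pool_size(std::size_t requested, std::size_t total_device_memory,
                            double min_fraction, double max_fraction)
{
  auto const lo = static_cast<std::size_t>(total_device_memory * min_fraction);
  auto const hi = static_cast<std::size_t>(total_device_memory * max_fraction);
  return std::clamp(requested, lo, hi);
}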

@jlowe (Member) commented Oct 26, 2021:

I think just having a single, fixed size should be fine. To avoid breaking the plugin, we may want to do this in phases:

  • Add the new Java API in cudf to take a single, fixed pool size and deprecate the old Java API
  • Update the plugin to use the new API
  • Remove the deprecated Java API from cudf

@codecov bot commented on Oct 22, 2021

Codecov Report

Merging #9505 (2d35e76) into branch-21.12 (ab4bfaa) will decrease coverage by 0.12%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##           branch-21.12    #9505      +/-   ##
================================================
- Coverage         10.79%   10.66%   -0.13%     
================================================
  Files               116      117       +1     
  Lines             18869    19725     +856     
================================================
+ Hits               2036     2104      +68     
- Misses            16833    17621     +788     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.90% <0.00%> (-1.21%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 66 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d8f23c1...2d35e76.

@rongou (Contributor Author) commented on Nov 3, 2021

Superseded by #9583

@rongou closed this on Nov 3, 2021
@rongou deleted the async-allocator-size branch on November 23, 2021 17:24
@vyasr added the 4 - Needs Review label and removed the 4 - Needs cuDF (Java) Reviewer label on Feb 23, 2024