-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Limiting workspace memory resource #1356
Limiting workspace memory resource #1356
Conversation
Open questionWhat should be the default limit? Currently, I set it to the total memory of the current device independently of the upstream resource. Issues:
|
Update: alternative option:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @achirkin for the PR! Overall it looks good, this PR addresses the problems in #1310.
The main question is what shall be the default size.
@ahendriksen could you help addressing the issues for this PR while Artem is away?
@achirkin can you explain why exactly this feature is needed? We don't do this type of thing anywhere else in RAPIDS and we leave it to the user to configure these types of resources. I suggest we document for users how they can configure their |
@cjnolet Without knowing that the
|
Also note that this PR fixes two other problems with the workspace resource.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few more notes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @achirkin for the updates. Attached please find my comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Artem for the update and the discussion! Looks good, apart from a minor change request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Artem for the updates, the PR looks good to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @achirkin. Thanks so much for this change!
if (inserted) { | ||
RAFT_LOG_WARN( | ||
"[%s] the default cuda resource is used for the raft workspace allocations. This may lead " | ||
"to a significant slowdown for this algorithm. Consider using the default pool resource " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks for adding this
@@ -45,7 +45,7 @@ struct check_index_layout { | |||
"paste in the new size and consider updating the serialization logic"); | |||
}; | |||
|
|||
template struct check_index_layout<sizeof(index<double, std::uint64_t>), 296>; | |||
template struct check_index_layout<sizeof(index<double, std::uint64_t>), 328>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@divyegala has mentioned this in another PR. I don't think we need to hold up this PR for it but I'd prefer if we started creating a constexpr for this value so it's easier to spot and manage.
/merge |
PR #1356 increased the maximum internal batch size of IVF-PQ and, by doing so, exposed a bug of uint32_t integer overflow that resulted in incorrectly allocated intermediate buffers. This PR fixes the original bug, but also: - Proofs the places where `max_samples` multiplied by something could cause integer overflow - Make more careful estimation of the workspace size to avoid `out_of_memory` errors from the limiting resource adaptor - Removes an unused argument from `compute_similarity_kernel` which has been slipping between code updates for a really long time Authors: - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #1685
### Brief Add another workspace memory resource that does not have the explicit memory limit. That is, after the change we have the following: 1. `rmm::mr::get_current_device_resource()` is default for all allocations, as before. It is used for the allocations with unlimited lifetime, e.g. returned to the user. 2. `raft::get_workspace_resource()` is for temporary allocations and forced to have fixed size, as before. However, it becomes smaller and should be used only for allocations, which do not scale with problem size. It defaults to a thin layer on top of the `current_device_resource`. 3. `raft::get_large_workspace_resource()` _(new)_ is for temporary allocations, which can scale with the problem size. Unlike `workspace_resource`, its size is not fixed. By default, it points to the `current_device_resource`, but the user can set it to something backed by the host memory (e.g. managed memory) to avoid OOM exceptions when there's not enough device memory left. ## Problem We have a list of issues/preference/requirements, some of which contradict others 1. We rely on RMM to handle all allocations and we often use [`rmm::mr::pool_memory_resource`](https://github.com/rapidsai/raft/blob/9fb05a2ab3d72760a09f1b7051e711d773682ef1/cpp/bench/ann/src/raft/raft_ann_bench_utils.h#L73) for performance reasons (to avoid lots of cudaMalloc calls in the loops) 2. Historically, we've used managed memory allocators as a workaround to [avoid OOM errors](https://github.com/rapidsai/raft/blob/5e80c1d2159e00a204ab5db0f5ca3f9ec43187c7/cpp/include/raft/neighbors/detail/ivf_pq_build.cuh#L1788-L1795) or [improve speed (by increasing batch sizes)](https://github.com/rapidsai/raft/blob/5e80c1d2159e00a204ab5db0f5ca3f9ec43187c7/cpp/include/raft/neighbors/detail/ivf_pq_build.cuh#L1596-L1603). 3. However, the design goal is to avoid setting allocators on our own and to give the full control to the user (hence the workaround in 2 [was removed](addb059#diff-f7f070424d71da5321d470416d1a4ca3605c4290c34c4a1c1d8b2240747000d2)). 4. We introduced the [workspace resource](#1356) earlier to allow querying the available memory reliably and maximize the batch sizes accordingly (see also issue [#1310](#1310)). Without this, some of our batched algorithms either fail with OOM or severely underperform due to small batch sizes. 5. However, we cannot just put all of RAFT temporary allocations into the limited `workspace_resource`, because some of them scale with the problem size and would inevitably fail with OOM at some point. 6. Setting the workspace resource to the managed memory is not advisable as well for performance reasons: we have lots of small allocations in performance critical sections, so we need a pool, but a pool in the managed memory inevitably outgrows the device memory and makes the whole program slow. ## Solution I propose to split the workspace memory into two: 1. small, fixed-size workspace for small, frequent allocations 2. large workspace for the allocations that scale with the problem size Notes: - We still leave the full control over the allocator types to the user. - Neither of the workspace resource should have unlimited lifetime / returned to the user. As a result, if the user sets the managed memory as the large workspace resource, the memory is guaranteed to be released after the function call. - We have the option to use the slow managed memory without a pool for large allocations, while still using a fast pool for small allocations. - We have more flexible control over which allocations are "large" and which are "small", so hopefully using the managed memory is not so bad for performance. Authors: - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #2322
Wrap the
rmm::mr::device_memory_resource
returned byget_workspace_resource
into armm::mr::limiting_resource_adaptor
. This allows to control how much memory is dedicated for the temporary allocations within raft and query this limit in the code (mr->get_allocation_limit()
-mr->get_allocated_bytes()
).Breaking change: workspace resource passed in arguments everywhere via a shared_ptr instead of a raw pointer. This affects
set_workspace_resource
function and the constructors ofdevice_resources
andhandle_t
. Luckily, it's optional everywhere and is likely rarely used, so the impact should be low.Resolves #1310