Limiting workspace memory resource #1356

achirkin · 2023-03-20T10:29:10Z

Wrap the rmm::mr::device_memory_resource returned by get_workspace_resource into a rmm::mr::limiting_resource_adaptor. This allows to control how much memory is dedicated for the temporary allocations within raft and query this limit in the code (mr->get_allocation_limit() - mr->get_allocated_bytes()).

Breaking change: workspace resource passed in arguments everywhere via a shared_ptr instead of a raw pointer. This affects set_workspace_resource function and the constructors of device_resources and handle_t. Luckily, it's optional everywhere and is likely rarely used, so the impact should be low.

Resolves #1310

achirkin · 2023-03-20T10:37:02Z

Open question

What should be the default limit? Currently, I set it to the total memory of the current device independently of the upstream resource. Issues:

(major issue?) This default choice makes it unreliable for calculating the logical workspace size in the algorithms, i.e. the batch size is going to be too big by default.
(minor issue) This default choice is not suitable for managed memory allocations, which can go beyond the available gpu memory.

achirkin · 2023-03-20T18:59:55Z

Update: alternative option:

Using pool memory resource by default.
Set the default limit to be half of free gpu memory, to encourage the user to set the limit manually if they desire.

…rs to keep the change non-breaking

tfeher

Thanks @achirkin for the PR! Overall it looks good, this PR addresses the problems in #1310.

The main question is what shall be the default size.

@ahendriksen could you help addressing the issues for this PR while Artem is away?

cpp/include/raft/core/resource/device_memory_resource.hpp

cjnolet · 2023-03-29T00:38:22Z

@achirkin can you explain why exactly this feature is needed? We don't do this type of thing anywhere else in RAPIDS and we leave it to the user to configure these types of resources. I suggest we document for users how they can configure their device_resources instance with the existing rmm::mr::limiting_memory_resource as the workspace_resource.

tfeher · 2023-03-29T06:33:39Z

why exactly this feature is needed?

@cjnolet Without knowing that the workspace_resource is a limiting_memory_resource we are not able to query the size of it. Dynamic casting of device_memory_resource is problematic, as described here: #1310 (comment)

to check the available memory amount inside raft's code, we'd have to try dynamic cast the allocator to limiting_resource_adaptor before querying get_allocation_limit , but that would only work if we know exactly what's the upstream resource (template parameter).

achirkin · 2023-03-29T09:59:31Z

Also note that this PR fixes two other problems with the workspace resource.

Without

raft/cpp/include/raft/core/resources.hpp

Lines 97 to 100 in 2eeb9e9

    
           // Clear the corresponding resource, so that on next `get_resource` the new factory is used 
        
           if (resources_.at(rtype).first != resource::resource_type::LAST_KEY) { 
        
             resources_.at(rtype) = std::make_pair(resource::resource_type::LAST_KEY, 
        
                                                   std::make_shared<resource::empty_resource>());

, the resource is always cached and thus does not change after the factory has been updated.

The previous semantics of passing raw memory resource pointers without transferring ownership was not very usable. If a user wanted to pass their custom memory resource, they would have to manually construct, delete it after use, and replace it with the previously set workspace resource. That would require them to do all exception handling manually as well to not leak resources.

tfeher

Few more notes.

cpp/include/raft/core/resource/device_memory_resource.hpp

…cessary

tfeher

Thanks @achirkin for the updates. Attached please find my comments.

cpp/include/raft/core/device_container_policy.hpp

cpp/include/raft/neighbors/detail/ivf_pq_search.cuh

cpp/include/raft/core/resource/device_memory_resource.hpp

cpp/include/raft/neighbors/detail/ivf_pq_search.cuh

tfeher

Thanks Artem for the update and the discussion! Looks good, apart from a minor change request.

cpp/include/raft/core/resource/device_memory_resource.hpp

cpp/include/raft/neighbors/detail/ivf_pq_search.cuh

cpp/include/raft/core/device_container_policy.hpp

…ool resource

tfeher

Thanks Artem for the updates, the PR looks good to me!

cjnolet

LGTM @achirkin. Thanks so much for this change!

cjnolet · 2023-07-26T19:04:18Z

cpp/include/raft/core/resource/detail/device_memory_resource.hpp

+    if (inserted) {
+      RAFT_LOG_WARN(
+        "[%s] the default cuda resource is used for the raft workspace allocations. This may lead "
+        "to a significant slowdown for this algorithm. Consider using the default pool resource "


Looks great! Thanks for adding this

cjnolet · 2023-07-26T19:09:56Z

cpp/include/raft/neighbors/detail/ivf_flat_serialize.cuh

@@ -45,7 +45,7 @@ struct check_index_layout {
                "paste in the new size and consider updating the serialization logic");
 };

-template struct check_index_layout<sizeof(index<double, std::uint64_t>), 296>;
+template struct check_index_layout<sizeof(index<double, std::uint64_t>), 328>;


@divyegala has mentioned this in another PR. I don't think we need to hold up this PR for it but I'd prefer if we started creating a constexpr for this value so it's easier to spot and manage.

cjnolet · 2023-07-26T19:14:50Z

/merge

PR #1356 increased the maximum internal batch size of IVF-PQ and, by doing so, exposed a bug of uint32_t integer overflow that resulted in incorrectly allocated intermediate buffers. This PR fixes the original bug, but also: - Proofs the places where `max_samples` multiplied by something could cause integer overflow - Make more careful estimation of the workspace size to avoid `out_of_memory` errors from the limiting resource adaptor - Removes an unused argument from `compute_similarity_kernel` which has been slipping between code updates for a really long time Authors: - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #1685

### Brief Add another workspace memory resource that does not have the explicit memory limit. That is, after the change we have the following: 1. `rmm::mr::get_current_device_resource()` is default for all allocations, as before. It is used for the allocations with unlimited lifetime, e.g. returned to the user. 2. `raft::get_workspace_resource()` is for temporary allocations and forced to have fixed size, as before. However, it becomes smaller and should be used only for allocations, which do not scale with problem size. It defaults to a thin layer on top of the `current_device_resource`. 3. `raft::get_large_workspace_resource()` _(new)_ is for temporary allocations, which can scale with the problem size. Unlike `workspace_resource`, its size is not fixed. By default, it points to the `current_device_resource`, but the user can set it to something backed by the host memory (e.g. managed memory) to avoid OOM exceptions when there's not enough device memory left. ## Problem We have a list of issues/preference/requirements, some of which contradict others 1. We rely on RMM to handle all allocations and we often use [`rmm::mr::pool_memory_resource`](https://github.com/rapidsai/raft/blob/9fb05a2ab3d72760a09f1b7051e711d773682ef1/cpp/bench/ann/src/raft/raft_ann_bench_utils.h#L73) for performance reasons (to avoid lots of cudaMalloc calls in the loops) 2. Historically, we've used managed memory allocators as a workaround to [avoid OOM errors](https://github.com/rapidsai/raft/blob/5e80c1d2159e00a204ab5db0f5ca3f9ec43187c7/cpp/include/raft/neighbors/detail/ivf_pq_build.cuh#L1788-L1795) or [improve speed (by increasing batch sizes)](https://github.com/rapidsai/raft/blob/5e80c1d2159e00a204ab5db0f5ca3f9ec43187c7/cpp/include/raft/neighbors/detail/ivf_pq_build.cuh#L1596-L1603). 3. However, the design goal is to avoid setting allocators on our own and to give the full control to the user (hence the workaround in 2 [was removed](addb059#diff-f7f070424d71da5321d470416d1a4ca3605c4290c34c4a1c1d8b2240747000d2)). 4. We introduced the [workspace resource](#1356) earlier to allow querying the available memory reliably and maximize the batch sizes accordingly (see also issue [#1310](#1310)). Without this, some of our batched algorithms either fail with OOM or severely underperform due to small batch sizes. 5. However, we cannot just put all of RAFT temporary allocations into the limited `workspace_resource`, because some of them scale with the problem size and would inevitably fail with OOM at some point. 6. Setting the workspace resource to the managed memory is not advisable as well for performance reasons: we have lots of small allocations in performance critical sections, so we need a pool, but a pool in the managed memory inevitably outgrows the device memory and makes the whole program slow. ## Solution I propose to split the workspace memory into two: 1. small, fixed-size workspace for small, frequent allocations 2. large workspace for the allocations that scale with the problem size Notes: - We still leave the full control over the allocator types to the user. - Neither of the workspace resource should have unlimited lifetime / returned to the user. As a result, if the user sets the managed memory as the large workspace resource, the memory is guaranteed to be released after the function call. - We have the option to use the slow managed memory without a pool for large allocations, while still using a fast pool for small allocations. - We have more flexible control over which allocations are "large" and which are "small", so hopefully using the managed memory is not so bad for performance. Authors: - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #2322

Wrap the workspace resource into a limiting_resource_adaptor

a939784

achirkin added feature request New feature or request non-breaking Non-breaking change 2 - In Progress Currenty a work in progress labels Mar 20, 2023

achirkin requested a review from a team as a code owner March 20, 2023 10:29

github-actions bot added the cpp label Mar 20, 2023

Set the pool memory resource by default and start the ivf-pq use case

ac5762b

achirkin and others added 3 commits March 21, 2023 08:16

Refactor the resource to more rely on shared_ptr to manage lifetime

c33d519

Preserve the semantics of not transfering the ownership of raw pointe…

de5dd84

…rs to keep the change non-breaking

Merge branch 'branch-23.04' into fea-limited-workspace-resource

36365b5

cjnolet assigned achirkin Mar 23, 2023

tfeher requested changes Mar 28, 2023

View reviewed changes

cpp/include/raft/core/resource/device_memory_resource.hpp Outdated Show resolved Hide resolved

cpp/include/raft/core/resource/device_memory_resource.hpp Outdated Show resolved Hide resolved

cpp/include/raft/core/resource/device_memory_resource.hpp Outdated Show resolved Hide resolved

Merge branch 'branch-23.04' into fea-limited-workspace-resource

2eeb9e9

Merge branch 'branch-23.04' into fea-limited-workspace-resource

b8c5bc3

tfeher reviewed Mar 30, 2023

View reviewed changes

cpp/include/raft/core/resource/device_memory_resource.hpp Show resolved Hide resolved

cpp/include/raft/core/resource/device_memory_resource.hpp Outdated Show resolved Hide resolved

cpp/include/raft/core/resource/device_memory_resource.hpp Outdated Show resolved Hide resolved

Merge branch 'branch-23.06' into fea-limited-workspace-resource

48f90fe

achirkin requested review from a team as code owners May 9, 2023 06:11

achirkin changed the base branch from branch-23.04 to branch-23.06 May 9, 2023 06:11

github-actions bot added ci CMake python labels May 9, 2023

achirkin added 2 commits May 9, 2023 08:29

Fix a missing merge change

79c954e

Make the resource change not permanent

6cf1103

achirkin added 2 commits July 7, 2023 10:53

Remove ANN reference

d6a27c5

Use the plain workspace resource by default and print a warning if ne…

4530423

…cessary

achirkin requested a review from cjnolet July 7, 2023 09:42

cjnolet mentioned this pull request Jul 11, 2023

Fix UMAP sparse input crash rapidsai/cuml#5494

Open

achirkin added 2 commits July 14, 2023 10:10

Merge branch 'branch-23.08' into fea-limited-workspace-resource

f87cbf2

Merge branch 'branch-23.08' into fea-limited-workspace-resource

d4f0c78

tfeher requested changes Jul 17, 2023

View reviewed changes

achirkin added 2 commits July 18, 2023 14:41

Add a note about no deleter

775f718

Use the workspace resource size to determine the batch sizes for ivf-pq

d7fcde9

tfeher reviewed Jul 18, 2023

View reviewed changes

cpp/include/raft/neighbors/detail/ivf_pq_search.cuh Outdated Show resolved Hide resolved

achirkin added 6 commits July 19, 2023 11:23

Merge branch 'branch-23.08' into fea-limited-workspace-resource

eaafd3f

Merge branch 'branch-23.08' into fea-limited-workspace-resource

8eb9b80

Merge branch 'branch-23.08' into fea-limited-workspace-resource

463f409

Merge branch 'branch-23.08' into fea-limited-workspace-resource

e85033e

Merge branch 'branch-23.08' into fea-limited-workspace-resource

044b6ca

Merge branch 'branch-23.08' into fea-limited-workspace-resource

f3edcbc

achirkin requested a review from tfeher July 25, 2023 06:00

tfeher reviewed Jul 25, 2023

View reviewed changes

cpp/include/raft/core/resource/device_memory_resource.hpp Show resolved Hide resolved

cpp/include/raft/neighbors/detail/ivf_pq_search.cuh Outdated Show resolved Hide resolved

cpp/include/raft/core/device_container_policy.hpp Show resolved Hide resolved

Use get_workspace_free_bytes and debug-log the usage of the default p…

ae0f469

…ool resource

achirkin requested a review from tfeher July 25, 2023 12:18

tfeher approved these changes Jul 25, 2023

View reviewed changes

Merge branch 'branch-23.08' into fea-limited-workspace-resource

e082d09

cjnolet approved these changes Jul 26, 2023

View reviewed changes

rapids-bot bot merged commit 617d33a into rapidsai:branch-23.08 Jul 26, 2023

achirkin mentioned this pull request Jul 27, 2023

IVF-PQ: Fix illegal memory access with large max_samples #1685

Merged

achirkin mentioned this pull request Feb 23, 2024

[Outdated] Scaling workspace resources #2194

Closed

achirkin mentioned this pull request May 16, 2024

Scaling workspace resources #2322

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limiting workspace memory resource #1356

Limiting workspace memory resource #1356

achirkin commented Mar 20, 2023 •

edited

Loading

achirkin commented Mar 20, 2023

achirkin commented Mar 20, 2023

tfeher left a comment

cjnolet commented Mar 29, 2023 •

edited

Loading

tfeher commented Mar 29, 2023 •

edited

Loading

achirkin commented Mar 29, 2023

tfeher left a comment

tfeher left a comment

tfeher left a comment

tfeher left a comment

cjnolet left a comment

cjnolet Jul 26, 2023

cjnolet Jul 26, 2023

cjnolet commented Jul 26, 2023

Limiting workspace memory resource #1356

Limiting workspace memory resource #1356

Conversation

achirkin commented Mar 20, 2023 • edited Loading

achirkin commented Mar 20, 2023

Open question

achirkin commented Mar 20, 2023

tfeher left a comment

Choose a reason for hiding this comment

cjnolet commented Mar 29, 2023 • edited Loading

tfeher commented Mar 29, 2023 • edited Loading

achirkin commented Mar 29, 2023

tfeher left a comment

Choose a reason for hiding this comment

tfeher left a comment

Choose a reason for hiding this comment

tfeher left a comment

Choose a reason for hiding this comment

tfeher left a comment

Choose a reason for hiding this comment

cjnolet left a comment

Choose a reason for hiding this comment

cjnolet Jul 26, 2023

Choose a reason for hiding this comment

cjnolet Jul 26, 2023

Choose a reason for hiding this comment

cjnolet commented Jul 26, 2023

achirkin commented Mar 20, 2023 •

edited

Loading

cjnolet commented Mar 29, 2023 •

edited

Loading

tfeher commented Mar 29, 2023 •

edited

Loading