[FEA] Max workspace size #1310

tfeher · 2023-03-01T13:30:47Z

Is your feature request related to a problem? Please describe.

PR #1137 added a workspace resource, which can be used for temporary memory allocations. I would like an interface to

set / get max workspace size,
query the currently allocated workspace.

There are several algorithms, where the workspace size that we actually use depends on an internal batch_size parameter:

ivf-pq:

raft/cpp/include/raft/neighbors/detail/ivf_pq_search.cuh

Lines 1608 to 1611 in ae15be7

    
           // Maximum number of query vectors to search at the same time. 
        
           const auto max_queries = std::min<uint32_t>(std::max<uint32_t>(n_queries, 1), 4096); 
        
           auto max_batch_size    = get_max_batch_size(k, n_probes, max_queries, max_samples);

radix-topk:

raft/cpp/include/raft/matrix/detail/select_radix.cuh

Lines 598 to 603 in ae15be7

    
           rmm::device_uvector<Counter<T, IdxT>> counters(max_chunk_size, stream, mr); 
        
           rmm::device_uvector<IdxT> histograms(max_chunk_size * num_buckets, stream, mr); 
        
           rmm::device_uvector<T> buf1(max_chunk_size * len, stream, mr_buf); 
        
           rmm::device_uvector<IdxT> idx_buf1(max_chunk_size * len, stream, mr_buf); 
        
           rmm::device_uvector<T> buf2(max_chunk_size * len, stream, mr_buf); 
        
           rmm::device_uvector<IdxT> idx_buf2(max_chunk_size * len, stream, mr_buf);

kmeans balanced:

raft/cpp/include/raft/cluster/detail/kmeans_balanced.cuh

Lines 385 to 388 in ae15be7

    
           auto [max_minibatch_size, _mem_per_row] = 
        
             calc_minibatch_size<MathT>(n_clusters, n_rows, dim, params.metric, std::is_same_v<T, MathT>); 
        
           rmm::device_uvector<MathT> cur_dataset( 
        
             std::is_same_v<T, MathT> ? 0 : max_minibatch_size * dim, stream, mr);

cuML svm predict https://github.com/rapidsai/cuml/blob/branch-23.04/cpp/src/svm/svc_impl.cuh#L113-L123

In some cases above, we just hardcode a reasonable parameter. But the performance of an algorithm usually depends on the batch size, so we would like to allow the user to specify the workspace size, and the algorithms can adapt their batch size accordingly.

Describe the solution you'd like

Add an extra max_size attribute to the workspace resource,
track currently allocated size
print warning if we try to allocate larger than max workspace size.

Describe alternatives you've considered

Simply use a pool allocator, with initial_size = max_size = workspace_size.
This allows to specify the max workspace size during initialization.
We can query the pool size.
We don't have access of the currently allocated workspace.

The last point is problematic if we have nested algorithms that need workspace. Concrete example is IVF-PQ search can use radix-top-k.

Additional context

While the difference between max_workspace_size - currently_allocated_size gives the theoretical free space, in practice it is not guaranteed that we can allocate that much memory when a pooling allocator is used (due to fragmentation of the pool).
The workspace is always freed when an algorithm returns, therefore fragmentation might not be an issue.
Even if max_workspace_size - currrently_allocated_size is not accurate, it is an estimate that is useful for setting batch size.

The text was updated successfully, but these errors were encountered:

tfeher · 2023-03-01T13:31:11Z

cc @achirkin

achirkin · 2023-03-01T14:14:52Z

A reasonable alternative: https://github.com/rapidsai/rmm/blob/branch-23.04/include/rmm/mr/device/limiting_resource_adaptor.hpp
One can use the pool allocator as an upstream for this one for better performance.
Though there's some inconvenience to this: to check the available memory amount inside raft's code, we'd have to try dynamic cast the allocator to limiting_resource_adaptor before querying get_allocation_limit , but that would only work if we know exactly what's the upstream resource (template parameter).
On the other hand, we can just force get_workspace_resource to be of type limiting_resource_adaptor<device_memory_resource>, so that the upstream can be anything, and set the default limit to be equal to the current amount of free or total memory (or just int max).

Wrap the `rmm::mr::device_memory_resource` returned by `get_workspace_resource` into a `rmm::mr::limiting_resource_adaptor`. This allows to control how much memory is dedicated for the temporary allocations within raft and query this limit in the code (`mr->get_allocation_limit()` - `mr->get_allocated_bytes()`). **Breaking change**: workspace resource passed in arguments everywhere via a shared_ptr instead of a raw pointer. This affects `set_workspace_resource` function and the constructors of `device_resources` and `handle_t`. Luckily, it's optional everywhere and is likely rarely used, so the impact should be low. Resolves #1310 Authors: - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Tamas Bela Feher (https://github.com/tfeher) - Corey J. Nolet (https://github.com/cjnolet) URL: #1356

### Brief Add another workspace memory resource that does not have the explicit memory limit. That is, after the change we have the following: 1. `rmm::mr::get_current_device_resource()` is default for all allocations, as before. It is used for the allocations with unlimited lifetime, e.g. returned to the user. 2. `raft::get_workspace_resource()` is for temporary allocations and forced to have fixed size, as before. However, it becomes smaller and should be used only for allocations, which do not scale with problem size. It defaults to a thin layer on top of the `current_device_resource`. 3. `raft::get_large_workspace_resource()` _(new)_ is for temporary allocations, which can scale with the problem size. Unlike `workspace_resource`, its size is not fixed. By default, it points to the `current_device_resource`, but the user can set it to something backed by the host memory (e.g. managed memory) to avoid OOM exceptions when there's not enough device memory left. ## Problem We have a list of issues/preference/requirements, some of which contradict others 1. We rely on RMM to handle all allocations and we often use [`rmm::mr::pool_memory_resource`](https://github.com/rapidsai/raft/blob/9fb05a2ab3d72760a09f1b7051e711d773682ef1/cpp/bench/ann/src/raft/raft_ann_bench_utils.h#L73) for performance reasons (to avoid lots of cudaMalloc calls in the loops) 2. Historically, we've used managed memory allocators as a workaround to [avoid OOM errors](https://github.com/rapidsai/raft/blob/5e80c1d2159e00a204ab5db0f5ca3f9ec43187c7/cpp/include/raft/neighbors/detail/ivf_pq_build.cuh#L1788-L1795) or [improve speed (by increasing batch sizes)](https://github.com/rapidsai/raft/blob/5e80c1d2159e00a204ab5db0f5ca3f9ec43187c7/cpp/include/raft/neighbors/detail/ivf_pq_build.cuh#L1596-L1603). 3. However, the design goal is to avoid setting allocators on our own and to give the full control to the user (hence the workaround in 2 [was removed](addb059#diff-f7f070424d71da5321d470416d1a4ca3605c4290c34c4a1c1d8b2240747000d2)). 4. We introduced the [workspace resource](#1356) earlier to allow querying the available memory reliably and maximize the batch sizes accordingly (see also issue [#1310](#1310)). Without this, some of our batched algorithms either fail with OOM or severely underperform due to small batch sizes. 5. However, we cannot just put all of RAFT temporary allocations into the limited `workspace_resource`, because some of them scale with the problem size and would inevitably fail with OOM at some point. 6. Setting the workspace resource to the managed memory is not advisable as well for performance reasons: we have lots of small allocations in performance critical sections, so we need a pool, but a pool in the managed memory inevitably outgrows the device memory and makes the whole program slow. ## Solution I propose to split the workspace memory into two: 1. small, fixed-size workspace for small, frequent allocations 2. large workspace for the allocations that scale with the problem size Notes: - We still leave the full control over the allocator types to the user. - Neither of the workspace resource should have unlimited lifetime / returned to the user. As a result, if the user sets the managed memory as the large workspace resource, the memory is guaranteed to be released after the function call. - We have the option to use the slow managed memory without a pool for large allocations, while still using a fast pool for small allocations. - We have more flexible control over which allocations are "large" and which are "small", so hopefully using the managed memory is not so bad for performance. Authors: - Artem M. Chirkin (https://github.com/achirkin) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #2322

tfeher added the feature request New feature or request label Mar 1, 2023

tfeher mentioned this issue Mar 14, 2023

[BUG] IVF-FLAT OOM for large batch size #1343

Closed

achirkin mentioned this issue Mar 20, 2023

Limiting workspace memory resource #1356

Merged

rapids-bot bot closed this as completed in #1356 Jul 26, 2023

achirkin mentioned this issue Feb 23, 2024

[Outdated] Scaling workspace resources #2194

Closed

achirkin mentioned this issue May 16, 2024

Scaling workspace resources #2322

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Max workspace size #1310

[FEA] Max workspace size #1310

tfeher commented Mar 1, 2023

tfeher commented Mar 1, 2023

achirkin commented Mar 1, 2023 •

edited

Loading

[FEA] Max workspace size #1310

[FEA] Max workspace size #1310

Comments

tfeher commented Mar 1, 2023

tfeher commented Mar 1, 2023

achirkin commented Mar 1, 2023 • edited Loading

achirkin commented Mar 1, 2023 •

edited

Loading