Improve concurrency of stream_ordered_memory_resource by stealing less (#851)
Fixes #850. The `stream_ordered_memory_resource` was too aggressive in stealing blocks. When stream A did not have a block sufficient for an allocation but found one in the free list of another stream B, it would wait on stream B's recorded event and then merge stream B's entire free list into its own. This resulted in excessive synchronization in workloads that cycle among threads and repeatedly allocate, as in the new MULTI_STREAM_ALLOCATION benchmark. That benchmark demonstrates the pathology: one stream would allocate, run a kernel, and free; the next stream would allocate, find no block, and steal all of the first stream's memory; the next stream would then steal from the second, and so on. The result was zero concurrency between the streams.

This PR avoids merging free lists when stealing, and it also returns the unused remainder of a split block to the free list of the stream it was stolen from. This way, when the pool is a single unfragmented block, streams don't repeatedly steal the entire remainder of the pool from each other.
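The revised policy can be sketched as follows. This is a hypothetical, simplified model of the stealing behavior described above (plain offsets instead of device memory, and no CUDA events), not RMM's actual implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <vector>

// Simplified model: each stream owns a free list of (offset, size) blocks.
// An allocation first searches its own list; failing that, it steals a single
// best-fit block from another stream's list and returns any unused remainder
// to that *other* stream's list, rather than merging the whole foreign list.
struct Block { std::size_t offset; std::size_t size; };
using FreeList = std::vector<Block>;

struct Pool {
  std::map<int, FreeList> stream_free;  // stream id -> free list

  std::optional<Block> take_best_fit(FreeList& list, std::size_t size) {
    auto best = list.end();
    for (auto it = list.begin(); it != list.end(); ++it)
      if (it->size >= size && (best == list.end() || it->size < best->size)) best = it;
    if (best == list.end()) return std::nullopt;
    Block b = *best;
    list.erase(best);
    return b;
  }

  // Split `b`, returning the allocated offset; the remainder goes back to
  // `remainder_home`, i.e. the free list the block came from.
  std::size_t split(Block b, std::size_t size, FreeList& remainder_home) {
    if (b.size > size) remainder_home.push_back({b.offset + size, b.size - size});
    return b.offset;
  }

  // Allocate `size` bytes for `stream`; returns an offset, or nullopt.
  std::optional<std::size_t> allocate(int stream, std::size_t size) {
    // 1. Own free list first (no synchronization needed).
    if (auto b = take_best_fit(stream_free[stream], size))
      return split(*b, size, stream_free[stream]);
    // 2. Steal ONE block from another stream; in the real resource this is
    //    where a cudaStreamWaitEvent on the other stream's event is recorded.
    for (auto& [other, list] : stream_free) {
      if (other == stream) continue;
      if (auto b = take_best_fit(list, size)) return split(*b, size, list);
    }
    return std::nullopt;  // caller would fall back to merging, then expanding
  }
};
```

With a single 1024-byte pool block owned by stream 0, a 256-byte allocation on stream 1 steals only what it needs: the 768-byte remainder stays with stream 0, so a subsequent allocation on stream 0 succeeds without stealing back.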

It's possible these changes could increase fragmentation, but I did not change the fallback behavior of merging free lists when a large enough block cannot be found in another stream. Repeatedly merging free lists creates opportunities to coalesce adjacent blocks so that larger blocks become available; the memory should therefore be in its most coalesced state before an allocation fails.
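The value of the merge fallback comes from coalescing: once free lists are combined, physically adjacent blocks can merge into larger ones. A minimal sketch of that coalescing step (a hypothetical helper, not RMM's actual `coalescing_free_list`):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Block { std::size_t offset; std::size_t size; };

// Merge physically adjacent free blocks so that a large request which no
// single block could satisfy before merging can succeed afterwards.
std::vector<Block> coalesce(std::vector<Block> blocks) {
  std::sort(blocks.begin(), blocks.end(),
            [](Block a, Block b) { return a.offset < b.offset; });
  std::vector<Block> out;
  for (Block b : blocks) {
    if (!out.empty() && out.back().offset + out.back().size == b.offset)
      out.back().size += b.size;  // adjacent to the previous block: extend it
    else
      out.push_back(b);
  }
  return out;
}
```

For example, free blocks at offsets 0 and 256 (256 bytes each) coalesce into one 512-byte block, while a non-adjacent block at offset 768 stays separate.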

Performance of `RANDOM_ALLOCATIONS_BENCH` is minimally affected, and performance of `MULTI_STREAM_ALLOCATION_BENCH` is improved, demonstrating full concurrency.

Benchmark results show that performance increases with higher numbers of streams, and pre-warming (last four rows) does not affect performance. 

CC @cwharris 

```
--------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
BM_MultiStreamAllocations/pool_mr/1/4/0       2780 us         2780 us          251 items_per_second=1.4391k/s
BM_MultiStreamAllocations/pool_mr/2/4/0       1393 us         1392 us          489 items_per_second=2.8729k/s
BM_MultiStreamAllocations/pool_mr/4/4/0        706 us          706 us          926 items_per_second=5.66735k/s
BM_MultiStreamAllocations/pool_mr/1/4/1       2775 us         2774 us          252 items_per_second=1.44176k/s
BM_MultiStreamAllocations/pool_mr/2/4/1       1393 us         1393 us          487 items_per_second=2.87209k/s
BM_MultiStreamAllocations/pool_mr/4/4/1        704 us          704 us          913 items_per_second=5.67891k/s
```

MULTI_STREAM_ALLOCATIONS performance change:

```
(rapids) rapids@compose:~/rmm/build/cuda-11.2.0/branch-21.10/release$ _deps/benchmark-src/tools/compare.py benchmarks pool_multistream_allocations_21.10.json pool_multistream_allocations_returnsplit.json 
Comparing pool_multistream_allocations_21.10.json to pool_multistream_allocations_returnsplit.json
Benchmark                                                 Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------
BM_MultiStreamAllocations/pool_mr/1/4/0                -0.0014         -0.0044          2789          2785          2788          2776
BM_MultiStreamAllocations/pool_mr/2/4/0                -0.4989         -0.4982          2779          1393          2775          1392
BM_MultiStreamAllocations/pool_mr/4/4/0                -0.7450         -0.7450          2778           708          2778           708
BM_MultiStreamAllocations/pool_mr/1/4/1                +0.0001         +0.0001          2775          2775          2774          2775
BM_MultiStreamAllocations/pool_mr/2/4/1                +0.0002         +0.0001          1393          1393          1392          1393
BM_MultiStreamAllocations/pool_mr/4/4/1                -0.0531         -0.0531           744           704           744           704
```

RANDOM_ALLOCATIONS performance change:

```
(rapids) rapids@compose:~/rmm/build/cuda-11.2.0/branch-21.10/release$ _deps/benchmark-src/tools/compare.py benchmarks pool_random_allocations_21.10.json pool_random_allocations_returnsplit.json  
Comparing pool_random_allocations_21.10.json to pool_random_allocations_returnsplit.json
Benchmark                                                  Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------
BM_RandomAllocations/pool_mr/1000/1                     +0.0199         +0.0198             1             1             1             1
BM_RandomAllocations/pool_mr/1000/4                     -0.0063         -0.0061             1             1             1             1
BM_RandomAllocations/pool_mr/1000/64                    -0.0144         -0.0145             1             1             1             1
BM_RandomAllocations/pool_mr/1000/256                   +0.0243         +0.0254             1             1             1             1
BM_RandomAllocations/pool_mr/1000/1024                  -0.0313         -0.0311             1             0             1             0
BM_RandomAllocations/pool_mr/1000/4096                  -0.0063         -0.0059             0             0             0             0
BM_RandomAllocations/pool_mr/10000/1                    +0.0105         +0.0105            46            47            46            47
BM_RandomAllocations/pool_mr/10000/4                    -0.0023         -0.0023            50            50            50            50
BM_RandomAllocations/pool_mr/10000/64                   +0.0065         +0.0065            11            11            11            11
BM_RandomAllocations/pool_mr/10000/256                  +0.0099         +0.0099             6             6             6             6
BM_RandomAllocations/pool_mr/10000/1024                 -0.0074         -0.0075             5             5             5             5
BM_RandomAllocations/pool_mr/10000/4096                 -0.0165         -0.0163             5             5             5             5
BM_RandomAllocations/pool_mr/100000/1                   +0.0154         +0.0154          6939          7046          6937          7044
BM_RandomAllocations/pool_mr/100000/4                   +0.0839         +0.0838          2413          2615          2413          2615
BM_RandomAllocations/pool_mr/100000/64                  +0.0050         +0.0050           116           117           116           117
BM_RandomAllocations/pool_mr/100000/256                 -0.0040         -0.0039            64            64            64            64
BM_RandomAllocations/pool_mr/100000/1024                -0.0174         -0.0174            51            50            51            50
BM_RandomAllocations/pool_mr/100000/4096                -0.0447         -0.0448            48            46            48            46
```

Screenshot of kernel concurrency (or lack of) in the profiler before and after this change:

Before:
![Screenshot from 2021-08-24 15-36-17](https://user-images.githubusercontent.com/783069/130563715-7a52bb21-2ee7-4541-967a-a5dc25ab56ff.png)

After:
![Screenshot from 2021-08-24 15-41-33](https://user-images.githubusercontent.com/783069/130563731-aae43a6d-9881-46ed-ada1-98e5277a0dd6.png)

Authors:
  - Mark Harris (https://github.com/harrism)

Approvers:
  - Rong Ou (https://github.com/rongou)
  - Christopher Harris (https://github.com/cwharris)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #851
harrism authored Aug 27, 2021
1 parent b458233 commit bef4377
Showing 3 changed files with 84 additions and 61 deletions.
106 changes: 63 additions & 43 deletions include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020, NVIDIA CORPORATION.
* Copyright (c) 2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -16,12 +16,14 @@
#pragma once

#include <limits>
#include <rmm/detail/aligned.hpp>
#include <rmm/detail/error.hpp>
#include <rmm/logger.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
@@ -66,10 +68,10 @@ struct crtp {
* Classes derived from stream_ordered_memory_resource must implement the following four methods,
* documented separately:
*
* 1. `size_t get_maximum_allocation_size() const`
* 2. `block_type expand_pool(size_t size, free_list& blocks, cuda_stream_view stream)`
* 3. `split_block allocate_from_block(block_type const& b, size_t size)`
* 4. `block_type free_block(void* p, size_t size) noexcept`
* 1. `std::size_t get_maximum_allocation_size() const`
* 2. `block_type expand_pool(std::size_t size, free_list& blocks, cuda_stream_view stream)`
* 3. `split_block allocate_from_block(block_type const& b, std::size_t size)`
* 4. `block_type free_block(void* p, std::size_t size) noexcept`
*/
template <typename PoolResource, typename FreeListType>
class stream_ordered_memory_resource : public crtp<PoolResource>, public device_memory_resource {
@@ -89,18 +91,18 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_

// Derived classes must implement these four methods

/**
/*
* @brief Get the maximum size of a single allocation supported by this suballocator memory
* resource
*
* Default implementation is the maximum `size_t` value, but fixed-size allocators will have a
* lower limit. Override this function in derived classes as necessary.
* Default implementation is the maximum `std::size_t` value, but fixed-size allocators will have
* a lower limit. Override this function in derived classes as necessary.
*
* @return size_t The maximum size of a single allocation supported by this memory resource
* @return std::size_t The maximum size of a single allocation supported by this memory resource
*/
// size_t get_maximum_allocation_size() const { return std::numeric_limits<size_t>::max(); }
// std::size_t get_maximum_allocation_size() const

/**
/*
* @brief Allocate space (typically from upstream) to supply the suballocation pool and return
* a sufficiently sized block.
*
@@ -114,15 +116,12 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
* @param stream The stream on which the memory is to be used.
* @return block_type a block of at least `size` bytes
*/
// block_type expand_pool(size_t size, free_list& blocks, cuda_stream_view stream)
// block_type expand_pool(std::size_t size, free_list& blocks, cuda_stream_view stream)

/// Struct representing a block that has been split for allocation
struct split_block {
void* allocated_pointer; ///< The pointer allocated from a block
block_type remainder; ///< The remainder of the block from which the pointer was allocated
};
/// Pair representing a block that has been split for allocation
using split_block = std::pair<block_type, block_type>;

/**
/*
* @brief Split block `b` if necessary to return a pointer to memory of `size` bytes.
*
* If the block is split, the remainder is returned as the remainder element in the output
@@ -134,17 +133,17 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
* @return A `split_block` comprising the allocated pointer and any unallocated remainder of the
* input block.
*/
// split_block allocate_from_block(block_type const& b, size_t size)
// split_block allocate_from_block(block_type const& b, std::size_t size)

/**
/*
* @brief Finds, frees and returns the block associated with pointer `p`.
*
* @param p The pointer to the memory to free.
* @param size The size of the memory to free. Must be equal to the original allocation size.
* @return The (now freed) block associated with `p`. The caller is expected to return the block
* to the pool.
*/
// block_type free_block(void* p, size_t size) noexcept
// block_type free_block(void* p, std::size_t size) noexcept

/**
* @brief Returns the block `b` (last used on stream `stream_event`) to the pool.
@@ -213,16 +212,13 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
rmm::bad_alloc,
"Maximum allocation size exceeded");
auto const b = this->underlying().get_block(bytes, stream_event);
auto split = this->underlying().allocate_from_block(b, bytes);
if (split.remainder.is_valid()) stream_free_blocks_[stream_event].insert(split.remainder);
RMM_LOG_TRACE("[A][stream {:p}][{}B][{:p}]",
fmt::ptr(stream_event.stream),
bytes,
fmt::ptr(split.allocated_pointer));

RMM_LOG_TRACE(
"[A][stream {:p}][{}B][{:p}]", fmt::ptr(stream_event.stream), bytes, fmt::ptr(b.pointer()));

log_summary_trace();

return split.allocated_pointer;
return b.pointer();
}

/**
@@ -234,9 +230,12 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
*/
virtual void do_deallocate(void* p, std::size_t bytes, cuda_stream_view stream) override
{
RMM_LOG_TRACE("[D][stream {:p}][{}B][{:p}]", fmt::ptr(stream.value()), bytes, p);

if (bytes <= 0 || p == nullptr) return;

lock_guard lock(mtx_);
auto stream_event = get_event(stream);
RMM_LOG_TRACE("[D][stream {:p}][{}B][{:p}]", fmt::ptr(stream_event.stream), bytes, p);

bytes = rmm::detail::align_up(bytes, rmm::detail::CUDA_ALLOCATION_ALIGNMENT);
auto const b = this->underlying().free_block(p, bytes);
@@ -301,19 +300,35 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
}

/**
* @brief Get an avaible memory block of at least `size` bytes
* @brief Splits a block into an allocated block of `size` bytes and a remainder block, and
* inserts the remainder into a free list.
*
* @param b The block to split into allocated and remainder portions.
* @param size The size of the block to allocate from `b`.
* @param blocks The `free_list` in which to insert the remainder block.
* @return The allocated block.
*/
block_type allocate_and_insert_remainder(block_type b, std::size_t size, free_list& blocks)
{
auto const [allocated, remainder] = this->underlying().allocate_from_block(b, size);
if (remainder.is_valid()) blocks.insert(remainder);
return allocated;
}

/**
* @brief Get an available memory block of at least `size` bytes
*
* @param size The number of bytes to allocate
* @param stream_event The stream and associated event on which the allocation will be used.
* @return block_type A block of memory of at least `size` bytes
*/
block_type get_block(size_t size, stream_event_pair stream_event)
block_type get_block(std::size_t size, stream_event_pair stream_event)
{
// Try to find a satisfactory block in free list for the same stream (no sync required)
auto iter = stream_free_blocks_.find(stream_event);
if (iter != stream_free_blocks_.end()) {
block_type b = iter->second.get_block(size);
if (b.is_valid()) { return b; }
block_type const b = iter->second.get_block(size);
if (b.is_valid()) { return allocate_and_insert_remainder(b, size, iter->second); }
}

free_list& blocks =
@@ -334,7 +349,10 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
log_summary_trace();

// no large enough blocks available after merging, so grow the pool
return this->underlying().expand_pool(size, blocks, cuda_stream_view{stream_event.stream});
block_type const b =
this->underlying().expand_pool(size, blocks, cuda_stream_view{stream_event.stream});

return allocate_and_insert_remainder(b, size, blocks);
}

/**
@@ -352,7 +370,7 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
* @return A block with non-null pointer and size >= `size`, or a nullptr block if none is
* available in `blocks`.
*/
block_type get_block_from_other_stream(size_t size,
block_type get_block_from_other_stream(std::size_t size,
stream_event_pair stream_event,
free_list& blocks,
bool merge_first)
@@ -375,10 +393,18 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_

stream_free_blocks_.erase(it);

return blocks.get_block(size); // get the best fit block in merged lists
block_type const b = blocks.get_block(size); // get the best fit block in merged lists
if (b.is_valid()) { return allocate_and_insert_remainder(b, size, blocks); }
} else {
return other_blocks.get_block(size); // get the best fit block in other list
block_type const b = other_blocks.get_block(size);
if (b.is_valid()) {
// Since we found a block associated with a different stream, we have to insert a wait
// on the stream's associated event into the allocating stream.
RMM_CUDA_TRY(cudaStreamWaitEvent(stream_event.stream, other_event, 0));
return allocate_and_insert_remainder(b, size, other_blocks);
}
}
return block_type{};
}();

if (b.is_valid()) {
@@ -387,12 +413,6 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
fmt::ptr(stream_event.stream),
size,
fmt::ptr(it->first.stream));

if (not merge_first) {
merge_lists(stream_event, blocks, other_event, std::move(other_blocks));
stream_free_blocks_.erase(it);
}

return b;
}
}
12 changes: 6 additions & 6 deletions include/rmm/mr/device/fixed_size_memory_resource.hpp
@@ -129,10 +129,10 @@ class fixed_size_memory_resource
/**
* @brief Get the (fixed) size of allocations supported by this memory resource
*
* @return size_t The (fixed) maximum size of a single allocation supported by this memory
* @return std::size_t The (fixed) maximum size of a single allocation supported by this memory
* resource
*/
size_t get_maximum_allocation_size() const { return get_block_size(); }
std::size_t get_maximum_allocation_size() const { return get_block_size(); }

/**
* @brief Allocate a block from upstream to supply the suballocation pool.
@@ -144,7 +144,7 @@ class fixed_size_memory_resource
* @param stream The stream on which the memory is to be used.
* @return block_type The allocated block
*/
block_type expand_pool(size_t size, free_list& blocks, cuda_stream_view stream)
block_type expand_pool(std::size_t size, free_list& blocks, cuda_stream_view stream)
{
blocks.insert(std::move(blocks_from_upstream(stream)));
return blocks.get_block(size);
@@ -181,9 +181,9 @@ class fixed_size_memory_resource
* @return A pair comprising the allocated pointer and any unallocated remainder of the input
* block.
*/
split_block allocate_from_block(block_type const& b, size_t size)
split_block allocate_from_block(block_type const& b, std::size_t size)
{
return split_block{b.pointer(), block_type{nullptr}};
return {b, block_type{nullptr}};
}

/**
@@ -195,7 +195,7 @@ class fixed_size_memory_resource
* @return The (now freed) block associated with `p`. The caller is expected to return the block
* to the pool.
*/
block_type free_block(void* p, size_t size) noexcept
block_type free_block(void* p, std::size_t size) noexcept
{
// Deallocating a fixed-size block just inserts it in the free list, which is
// handled by the parent class
27 changes: 15 additions & 12 deletions include/rmm/mr/device/pool_memory_resource.hpp
@@ -31,7 +31,7 @@
#include <cuda_runtime_api.h>

#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <iostream>
#include <map>
#include <mutex>
@@ -141,11 +141,14 @@ class pool_memory_resource final
* @brief Get the maximum size of allocations supported by this memory resource
*
* Note this does not depend on the memory size of the device. It simply returns the maximum
* value of `size_t`
* value of `std::size_t`
*
* @return size_t The maximum size of a single allocation supported by this memory resource
* @return std::size_t The maximum size of a single allocation supported by this memory resource
*/
size_t get_maximum_allocation_size() const { return std::numeric_limits<size_t>::max(); }
std::size_t get_maximum_allocation_size() const
{
return std::numeric_limits<std::size_t>::max();
}

/**
* @brief Try to expand the pool by allocating a block of at least `min_size` bytes from
@@ -247,7 +250,7 @@ class pool_memory_resource final
* Returns 0 if the requested size cannot be satisfied.
*
* @param size The size of the minimum allocation immediately needed
* @return size_t The computed size to grow the pool.
* @return std::size_t The computed size to grow the pool.
*/
std::size_t size_to_grow(std::size_t size) const
{
@@ -268,7 +271,7 @@ class pool_memory_resource final
* @param stream The stream on which the memory is to be used.
* @return block_type The allocated block
*/
thrust::optional<block_type> block_from_upstream(size_t size, cuda_stream_view stream)
thrust::optional<block_type> block_from_upstream(std::size_t size, cuda_stream_view stream)
{
RMM_LOG_DEBUG("[A][Stream {}][Upstream {}B]", fmt::ptr(stream.value()), size);

@@ -294,7 +297,7 @@ class pool_memory_resource final
* @return A pair comprising the allocated pointer and any unallocated remainder of the input
* block.
*/
split_block allocate_from_block(block_type const& b, size_t size)
split_block allocate_from_block(block_type const& b, std::size_t size)
{
block_type const alloc{b.pointer(), size, b.is_head()};
#ifdef RMM_POOL_TRACK_ALLOCATIONS
@@ -303,7 +306,7 @@ class pool_memory_resource final

auto rest =
(b.size() > size) ? block_type{b.pointer() + size, b.size() - size, false} : block_type{};
return {reinterpret_cast<void*>(alloc.pointer()), rest};
return {alloc, rest};
}

/**
@@ -315,7 +318,7 @@ class pool_memory_resource final
* @return The (now freed) block associated with `p`. The caller is expected to return the block
* to the pool.
*/
block_type free_block(void* p, size_t size) noexcept
block_type free_block(void* p, std::size_t size) noexcept
{
#ifdef RMM_POOL_TRACK_ALLOCATIONS
if (p == nullptr) return block_type{};
@@ -338,9 +341,9 @@
*
* Includes allocated as well as free memory.
*
* @return size_t The total size of the currently allocated pool.
* @return std::size_t The total size of the currently allocated pool.
*/
size_t pool_size() const noexcept { return current_pool_size_; }
std::size_t pool_size() const noexcept { return current_pool_size_; }

/**
* @brief Free all memory allocated from the upstream memory_resource.
@@ -419,7 +422,7 @@ class pool_memory_resource final
* @param stream to execute on
* @return std::pair containing free_size and total_size of memory
*/
std::pair<size_t, size_t> do_get_mem_info(cuda_stream_view stream) const override
std::pair<std::size_t, std::size_t> do_get_mem_info(cuda_stream_view stream) const override
{
std::size_t free_size{};
std::size_t total_size{};
