Improve concurrency of stream_ordered_memory_resource by stealing less (#851)
Fixes #850. The `stream_ordered_memory_resource` was too aggressive in stealing blocks. When stream A did not have a block sufficient for an allocation but found one in the free list of another stream B, it would wait on stream B's recorded event and then merge stream B's entire free list into its own. This resulted in excessive synchronization in workloads that cycle among threads and repeatedly allocate, as in the new MULTI_STREAM_ALLOCATION benchmark. That benchmark demonstrates the pathology: one stream would allocate, run a kernel, and free; the next stream would allocate, find no block, and steal all of the first stream's memory; the next stream would then steal from the second, and so on. The result was zero concurrency between the streams.

This PR avoids merging free lists when stealing, and it also returns the unused remainder of a split block to the free list of the stream it was stolen from. This way, when the pool is a single unfragmented block, streams don't repeatedly steal the entire remainder of the pool from each other.
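The revised policy can be sketched as follows. This is a hypothetical, simplified model of the stealing behavior described above (plain offsets instead of device memory, and no CUDA events), not RMM's actual implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <vector>

// Simplified model: each stream owns a free list of (offset, size) blocks.
// An allocation first searches its own list; failing that, it steals a single
// best-fit block from another stream's list and returns any unused remainder
// to that *other* stream's list, rather than merging the whole foreign list.
struct Block { std::size_t offset; std::size_t size; };
using FreeList = std::vector<Block>;

struct Pool {
  std::map<int, FreeList> stream_free;  // stream id -> free list

  std::optional<Block> take_best_fit(FreeList& list, std::size_t size) {
    auto best = list.end();
    for (auto it = list.begin(); it != list.end(); ++it)
      if (it->size >= size && (best == list.end() || it->size < best->size)) best = it;
    if (best == list.end()) return std::nullopt;
    Block b = *best;
    list.erase(best);
    return b;
  }

  // Split `b`, returning the allocated offset; the remainder goes back to
  // `remainder_home`, i.e. the free list the block came from.
  std::size_t split(Block b, std::size_t size, FreeList& remainder_home) {
    if (b.size > size) remainder_home.push_back({b.offset + size, b.size - size});
    return b.offset;
  }

  // Allocate `size` bytes for `stream`; returns an offset, or nullopt.
  std::optional<std::size_t> allocate(int stream, std::size_t size) {
    // 1. Own free list first (no synchronization needed).
    if (auto b = take_best_fit(stream_free[stream], size))
      return split(*b, size, stream_free[stream]);
    // 2. Steal ONE block from another stream; in the real resource this is
    //    where a cudaStreamWaitEvent on the other stream's event is recorded.
    for (auto& [other, list] : stream_free) {
      if (other == stream) continue;
      if (auto b = take_best_fit(list, size)) return split(*b, size, list);
    }
    return std::nullopt;  // caller would fall back to merging, then expanding
  }
};
```

With a single 1024-byte pool block owned by stream 0, a 256-byte allocation on stream 1 steals only what it needs: the 768-byte remainder stays with stream 0, so a subsequent allocation on stream 0 succeeds without stealing back.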

It's possible these changes could increase fragmentation, but I did not change the fallback behavior of merging free lists when a large enough block cannot be found in another stream. Repeatedly merging free lists creates opportunities to coalesce adjacent blocks so that larger blocks become available; the memory should therefore be in its most coalesced state before an allocation fails.
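The value of the merge fallback comes from coalescing: once free lists are combined, physically adjacent blocks can merge into larger ones. A minimal sketch of that coalescing step (a hypothetical helper, not RMM's actual `coalescing_free_list`):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Block { std::size_t offset; std::size_t size; };

// Merge physically adjacent free blocks so that a large request which no
// single block could satisfy before merging can succeed afterwards.
std::vector<Block> coalesce(std::vector<Block> blocks) {
  std::sort(blocks.begin(), blocks.end(),
            [](Block a, Block b) { return a.offset < b.offset; });
  std::vector<Block> out;
  for (Block b : blocks) {
    if (!out.empty() && out.back().offset + out.back().size == b.offset)
      out.back().size += b.size;  // adjacent to the previous block: extend it
    else
      out.push_back(b);
  }
  return out;
}
```

For example, free blocks at offsets 0 and 256 (256 bytes each) coalesce into one 512-byte block, while a non-adjacent block at offset 768 stays separate.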

Performance of `RANDOM_ALLOCATIONS_BENCH` is minimally affected, and performance of `MULTI_STREAM_ALLOCATION_BENCH` is improved, demonstrating full concurrency.

Benchmark results show that performance increases with higher numbers of streams, and pre-warming (last four rows) does not affect performance. 

CC @cwharris 

```
--------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
BM_MultiStreamAllocations/pool_mr/1/4/0       2780 us         2780 us          251 items_per_second=1.4391k/s
BM_MultiStreamAllocations/pool_mr/2/4/0       1393 us         1392 us          489 items_per_second=2.8729k/s
BM_MultiStreamAllocations/pool_mr/4/4/0        706 us          706 us          926 items_per_second=5.66735k/s
BM_MultiStreamAllocations/pool_mr/1/4/1       2775 us         2774 us          252 items_per_second=1.44176k/s
BM_MultiStreamAllocations/pool_mr/2/4/1       1393 us         1393 us          487 items_per_second=2.87209k/s
BM_MultiStreamAllocations/pool_mr/4/4/1        704 us          704 us          913 items_per_second=5.67891k/s
```

MULTI_STREAM_ALLOCATIONS performance change:

```
(rapids) rapids@compose:~/rmm/build/cuda-11.2.0/branch-21.10/release$ _deps/benchmark-src/tools/compare.py benchmarks pool_multistream_allocations_21.10.json pool_multistream_allocations_returnsplit.json 
Comparing pool_multistream_allocations_21.10.json to pool_multistream_allocations_returnsplit.json
Benchmark                                                 Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------
BM_MultiStreamAllocations/pool_mr/1/4/0                -0.0014         -0.0044          2789          2785          2788          2776
BM_MultiStreamAllocations/pool_mr/2/4/0                -0.4989         -0.4982          2779          1393          2775          1392
BM_MultiStreamAllocations/pool_mr/4/4/0                -0.7450         -0.7450          2778           708          2778           708
BM_MultiStreamAllocations/pool_mr/1/4/1                +0.0001         +0.0001          2775          2775          2774          2775
BM_MultiStreamAllocations/pool_mr/2/4/1                +0.0002         +0.0001          1393          1393          1392          1393
BM_MultiStreamAllocations/pool_mr/4/4/1                -0.0531         -0.0531           744           704           744           704
```

RANDOM_ALLOCATIONS performance change:

```
(rapids) rapids@compose:~/rmm/build/cuda-11.2.0/branch-21.10/release$ _deps/benchmark-src/tools/compare.py benchmarks pool_random_allocations_21.10.json pool_random_allocations_returnsplit.json  
Comparing pool_random_allocations_21.10.json to pool_random_allocations_returnsplit.json
Benchmark                                                  Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------
BM_RandomAllocations/pool_mr/1000/1                     +0.0199         +0.0198             1             1             1             1
BM_RandomAllocations/pool_mr/1000/4                     -0.0063         -0.0061             1             1             1             1
BM_RandomAllocations/pool_mr/1000/64                    -0.0144         -0.0145             1             1             1             1
BM_RandomAllocations/pool_mr/1000/256                   +0.0243         +0.0254             1             1             1             1
BM_RandomAllocations/pool_mr/1000/1024                  -0.0313         -0.0311             1             0             1             0
BM_RandomAllocations/pool_mr/1000/4096                  -0.0063         -0.0059             0             0             0             0
BM_RandomAllocations/pool_mr/10000/1                    +0.0105         +0.0105            46            47            46            47
BM_RandomAllocations/pool_mr/10000/4                    -0.0023         -0.0023            50            50            50            50
BM_RandomAllocations/pool_mr/10000/64                   +0.0065         +0.0065            11            11            11            11
BM_RandomAllocations/pool_mr/10000/256                  +0.0099         +0.0099             6             6             6             6
BM_RandomAllocations/pool_mr/10000/1024                 -0.0074         -0.0075             5             5             5             5
BM_RandomAllocations/pool_mr/10000/4096                 -0.0165         -0.0163             5             5             5             5
BM_RandomAllocations/pool_mr/100000/1                   +0.0154         +0.0154          6939          7046          6937          7044
BM_RandomAllocations/pool_mr/100000/4                   +0.0839         +0.0838          2413          2615          2413          2615
BM_RandomAllocations/pool_mr/100000/64                  +0.0050         +0.0050           116           117           116           117
BM_RandomAllocations/pool_mr/100000/256                 -0.0040         -0.0039            64            64            64            64
BM_RandomAllocations/pool_mr/100000/1024                -0.0174         -0.0174            51            50            51            50
BM_RandomAllocations/pool_mr/100000/4096                -0.0447         -0.0448            48            46            48            46
```

Screenshot of kernel concurrency (or lack of) in the profiler before and after this change:

Before:
![Screenshot from 2021-08-24 15-36-17](https://user-images.githubusercontent.com/783069/130563715-7a52bb21-2ee7-4541-967a-a5dc25ab56ff.png)

After:
![Screenshot from 2021-08-24 15-41-33](https://user-images.githubusercontent.com/783069/130563731-aae43a6d-9881-46ed-ada1-98e5277a0dd6.png)

Authors:
  - Mark Harris (https://github.com/harrism)

Approvers:
  - Rong Ou (https://github.com/rongou)
  - Christopher Harris (https://github.com/cwharris)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #851
harrism authored Aug 27, 2021
1 parent b458233 commit bef4377
Showing 3 changed files with 84 additions and 61 deletions.
106 changes: 63 additions & 43 deletions include/rmm/mr/device/detail/stream_ordered_memory_resource.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020, NVIDIA CORPORATION.
* Copyright (c) 2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -16,12 +16,14 @@
#pragma once

#include <limits>
#include <rmm/detail/aligned.hpp>
#include <rmm/detail/error.hpp>
#include <rmm/logger.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
@@ -66,10 +68,10 @@ struct crtp {
* Classes derived from stream_ordered_memory_resource must implement the following four methods,
* documented separately:
*
* 1. `size_t get_maximum_allocation_size() const`
* 2. `block_type expand_pool(size_t size, free_list& blocks, cuda_stream_view stream)`
* 3. `split_block allocate_from_block(block_type const& b, size_t size)`
* 4. `block_type free_block(void* p, size_t size) noexcept`
* 1. `std::size_t get_maximum_allocation_size() const`
* 2. `block_type expand_pool(std::size_t size, free_list& blocks, cuda_stream_view stream)`
* 3. `split_block allocate_from_block(block_type const& b, std::size_t size)`
* 4. `block_type free_block(void* p, std::size_t size) noexcept`
*/
template <typename PoolResource, typename FreeListType>
class stream_ordered_memory_resource : public crtp<PoolResource>, public device_memory_resource {
@@ -89,18 +91,18 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_

// Derived classes must implement these four methods

/**
/*
* @brief Get the maximum size of a single allocation supported by this suballocator memory
* resource
*
* Default implementation is the maximum `size_t` value, but fixed-size allocators will have a
* lower limit. Override this function in derived classes as necessary.
* Default implementation is the maximum `std::size_t` value, but fixed-size allocators will have
* a lower limit. Override this function in derived classes as necessary.
*
* @return size_t The maximum size of a single allocation supported by this memory resource
* @return std::size_t The maximum size of a single allocation supported by this memory resource
*/
// size_t get_maximum_allocation_size() const { return std::numeric_limits<size_t>::max(); }
// std::size_t get_maximum_allocation_size() const

/**
/*
* @brief Allocate space (typically from upstream) to supply the suballocation pool and return
* a sufficiently sized block.
*
@@ -114,15 +116,12 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
* @param stream The stream on which the memory is to be used.
* @return block_type a block of at least `size` bytes
*/
// block_type expand_pool(size_t size, free_list& blocks, cuda_stream_view stream)
// block_type expand_pool(std::size_t size, free_list& blocks, cuda_stream_view stream)

/// Struct representing a block that has been split for allocation
struct split_block {
void* allocated_pointer; ///< The pointer allocated from a block
block_type remainder; ///< The remainder of the block from which the pointer was allocated
};
/// Pair representing a block that has been split for allocation
using split_block = std::pair<block_type, block_type>;

/**
/*
* @brief Split block `b` if necessary to return a pointer to memory of `size` bytes.
*
* If the block is split, the remainder is returned as the remainder element in the output
@@ -134,17 +133,17 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
* @return A `split_block` comprising the allocated pointer and any unallocated remainder of the
* input block.
*/
// split_block allocate_from_block(block_type const& b, size_t size)
// split_block allocate_from_block(block_type const& b, std::size_t size)

/**
/*
* @brief Finds, frees and returns the block associated with pointer `p`.
*
* @param p The pointer to the memory to free.
* @param size The size of the memory to free. Must be equal to the original allocation size.
* @return The (now freed) block associated with `p`. The caller is expected to return the block
* to the pool.
*/
// block_type free_block(void* p, size_t size) noexcept
// block_type free_block(void* p, std::size_t size) noexcept

/**
* @brief Returns the block `b` (last used on stream `stream_event`) to the pool.
@@ -213,16 +212,13 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
rmm::bad_alloc,
"Maximum allocation size exceeded");
auto const b = this->underlying().get_block(bytes, stream_event);
auto split = this->underlying().allocate_from_block(b, bytes);
if (split.remainder.is_valid()) stream_free_blocks_[stream_event].insert(split.remainder);
RMM_LOG_TRACE("[A][stream {:p}][{}B][{:p}]",
fmt::ptr(stream_event.stream),
bytes,
fmt::ptr(split.allocated_pointer));

RMM_LOG_TRACE(
"[A][stream {:p}][{}B][{:p}]", fmt::ptr(stream_event.stream), bytes, fmt::ptr(b.pointer()));

log_summary_trace();

return split.allocated_pointer;
return b.pointer();
}

/**
@@ -234,9 +230,12 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
*/
virtual void do_deallocate(void* p, std::size_t bytes, cuda_stream_view stream) override
{
RMM_LOG_TRACE("[D][stream {:p}][{}B][{:p}]", fmt::ptr(stream.value()), bytes, p);

if (bytes <= 0 || p == nullptr) return;

lock_guard lock(mtx_);
auto stream_event = get_event(stream);
RMM_LOG_TRACE("[D][stream {:p}][{}B][{:p}]", fmt::ptr(stream_event.stream), bytes, p);

bytes = rmm::detail::align_up(bytes, rmm::detail::CUDA_ALLOCATION_ALIGNMENT);
auto const b = this->underlying().free_block(p, bytes);
@@ -301,19 +300,35 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
}

/**
* @brief Get an avaible memory block of at least `size` bytes
* @brief Splits a block into an allocated block of `size` bytes and a remainder block, and
* inserts the remainder into a free list.
*
* @param b The block to split into allocated and remainder portions.
* @param size The size of the block to allocate from `b`.
* @param blocks The `free_list` in which to insert the remainder block.
* @return The allocated block.
*/
block_type allocate_and_insert_remainder(block_type b, std::size_t size, free_list& blocks)
{
auto const [allocated, remainder] = this->underlying().allocate_from_block(b, size);
if (remainder.is_valid()) blocks.insert(remainder);
return allocated;
}

/**
* @brief Get an available memory block of at least `size` bytes
*
* @param size The number of bytes to allocate
* @param stream_event The stream and associated event on which the allocation will be used.
* @return block_type A block of memory of at least `size` bytes
*/
block_type get_block(size_t size, stream_event_pair stream_event)
block_type get_block(std::size_t size, stream_event_pair stream_event)
{
// Try to find a satisfactory block in free list for the same stream (no sync required)
auto iter = stream_free_blocks_.find(stream_event);
if (iter != stream_free_blocks_.end()) {
block_type b = iter->second.get_block(size);
if (b.is_valid()) { return b; }
block_type const b = iter->second.get_block(size);
if (b.is_valid()) { return allocate_and_insert_remainder(b, size, iter->second); }
}

free_list& blocks =
@@ -334,7 +349,10 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
log_summary_trace();

// no large enough blocks available after merging, so grow the pool
return this->underlying().expand_pool(size, blocks, cuda_stream_view{stream_event.stream});
block_type const b =
this->underlying().expand_pool(size, blocks, cuda_stream_view{stream_event.stream});

return allocate_and_insert_remainder(b, size, blocks);
}

/**
@@ -352,7 +370,7 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
* @return A block with non-null pointer and size >= `size`, or a nullptr block if none is
* available in `blocks`.
*/
block_type get_block_from_other_stream(size_t size,
block_type get_block_from_other_stream(std::size_t size,
stream_event_pair stream_event,
free_list& blocks,
bool merge_first)
@@ -375,10 +393,18 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_

stream_free_blocks_.erase(it);

return blocks.get_block(size); // get the best fit block in merged lists
block_type const b = blocks.get_block(size); // get the best fit block in merged lists
if (b.is_valid()) { return allocate_and_insert_remainder(b, size, blocks); }
} else {
return other_blocks.get_block(size); // get the best fit block in other list
block_type const b = other_blocks.get_block(size);
if (b.is_valid()) {
// Since we found a block associated with a different stream, we have to insert a wait
// on the stream's associated event into the allocating stream.
RMM_CUDA_TRY(cudaStreamWaitEvent(stream_event.stream, other_event, 0));
return allocate_and_insert_remainder(b, size, other_blocks);
}
}
return block_type{};
}();

if (b.is_valid()) {
@@ -387,12 +413,6 @@ class stream_ordered_memory_resource : public crtp<PoolResource>, public device_
fmt::ptr(stream_event.stream),
size,
fmt::ptr(it->first.stream));

if (not merge_first) {
merge_lists(stream_event, blocks, other_event, std::move(other_blocks));
stream_free_blocks_.erase(it);
}

return b;
}
}
12 changes: 6 additions & 6 deletions include/rmm/mr/device/fixed_size_memory_resource.hpp
@@ -129,10 +129,10 @@ class fixed_size_memory_resource
/**
* @brief Get the (fixed) size of allocations supported by this memory resource
*
* @return size_t The (fixed) maximum size of a single allocation supported by this memory
* @return std::size_t The (fixed) maximum size of a single allocation supported by this memory
* resource
*/
size_t get_maximum_allocation_size() const { return get_block_size(); }
std::size_t get_maximum_allocation_size() const { return get_block_size(); }

/**
* @brief Allocate a block from upstream to supply the suballocation pool.
@@ -144,7 +144,7 @@ class fixed_size_memory_resource
* @param stream The stream on which the memory is to be used.
* @return block_type The allocated block
*/
block_type expand_pool(size_t size, free_list& blocks, cuda_stream_view stream)
block_type expand_pool(std::size_t size, free_list& blocks, cuda_stream_view stream)
{
blocks.insert(std::move(blocks_from_upstream(stream)));
return blocks.get_block(size);
@@ -181,9 +181,9 @@ class fixed_size_memory_resource
* @return A pair comprising the allocated pointer and any unallocated remainder of the input
* block.
*/
split_block allocate_from_block(block_type const& b, size_t size)
split_block allocate_from_block(block_type const& b, std::size_t size)
{
return split_block{b.pointer(), block_type{nullptr}};
return {b, block_type{nullptr}};
}

/**
@@ -195,7 +195,7 @@ class fixed_size_memory_resource
* @return The (now freed) block associated with `p`. The caller is expected to return the block
* to the pool.
*/
block_type free_block(void* p, size_t size) noexcept
block_type free_block(void* p, std::size_t size) noexcept
{
// Deallocating a fixed-size block just inserts it in the free list, which is
// handled by the parent class
27 changes: 15 additions & 12 deletions include/rmm/mr/device/pool_memory_resource.hpp
@@ -31,7 +31,7 @@
#include <cuda_runtime_api.h>

#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <iostream>
#include <map>
#include <mutex>
@@ -141,11 +141,14 @@ class pool_memory_resource final
* @brief Get the maximum size of allocations supported by this memory resource
*
* Note this does not depend on the memory size of the device. It simply returns the maximum
* value of `size_t`
* value of `std::size_t`
*
* @return size_t The maximum size of a single allocation supported by this memory resource
* @return std::size_t The maximum size of a single allocation supported by this memory resource
*/
size_t get_maximum_allocation_size() const { return std::numeric_limits<size_t>::max(); }
std::size_t get_maximum_allocation_size() const
{
return std::numeric_limits<std::size_t>::max();
}

/**
* @brief Try to expand the pool by allocating a block of at least `min_size` bytes from
@@ -247,7 +250,7 @@ class pool_memory_resource final
* Returns 0 if the requested size cannot be satisfied.
*
* @param size The size of the minimum allocation immediately needed
* @return size_t The computed size to grow the pool.
* @return std::size_t The computed size to grow the pool.
*/
std::size_t size_to_grow(std::size_t size) const
{
@@ -268,7 +271,7 @@ class pool_memory_resource final
* @param stream The stream on which the memory is to be used.
* @return block_type The allocated block
*/
thrust::optional<block_type> block_from_upstream(size_t size, cuda_stream_view stream)
thrust::optional<block_type> block_from_upstream(std::size_t size, cuda_stream_view stream)
{
RMM_LOG_DEBUG("[A][Stream {}][Upstream {}B]", fmt::ptr(stream.value()), size);

@@ -294,7 +297,7 @@ class pool_memory_resource final
* @return A pair comprising the allocated pointer and any unallocated remainder of the input
* block.
*/
split_block allocate_from_block(block_type const& b, size_t size)
split_block allocate_from_block(block_type const& b, std::size_t size)
{
block_type const alloc{b.pointer(), size, b.is_head()};
#ifdef RMM_POOL_TRACK_ALLOCATIONS
@@ -303,7 +306,7 @@ class pool_memory_resource final

auto rest =
(b.size() > size) ? block_type{b.pointer() + size, b.size() - size, false} : block_type{};
return {reinterpret_cast<void*>(alloc.pointer()), rest};
return {alloc, rest};
}

/**
@@ -315,7 +318,7 @@ class pool_memory_resource final
* @return The (now freed) block associated with `p`. The caller is expected to return the block
* to the pool.
*/
block_type free_block(void* p, size_t size) noexcept
block_type free_block(void* p, std::size_t size) noexcept
{
#ifdef RMM_POOL_TRACK_ALLOCATIONS
if (p == nullptr) return block_type{};
@@ -338,9 +341,9 @@
*
* Includes allocated as well as free memory.
*
* @return size_t The total size of the currently allocated pool.
* @return std::size_t The total size of the currently allocated pool.
*/
size_t pool_size() const noexcept { return current_pool_size_; }
std::size_t pool_size() const noexcept { return current_pool_size_; }

/**
* @brief Free all memory allocated from the upstream memory_resource.
@@ -419,7 +422,7 @@ class pool_memory_resource final
* @param stream to execute on
* @return std::pair containing free_size and total_size of memory
*/
std::pair<size_t, size_t> do_get_mem_info(cuda_stream_view stream) const override
std::pair<std::size_t, std::size_t> do_get_mem_info(cuda_stream_view stream) const override
{
std::size_t free_size{};
std::size_t total_size{};
