Improve concurrency of stream_ordered_memory_resource by stealing less #851
Merged
rapids-bot merged 4 commits into rapidsai:branch-21.10 from harrism:fea-pool-return-remainder on Aug 27, 2021
Conversation
harrism added the bug (Something isn't working), non-breaking (Non-breaking change), and performance labels on Aug 24, 2021
harrism changed the title from "Improve concurrency of pool allocator" to "Improve concurrency of stream_ordered_memory_resource by stealing less" on Aug 24, 2021
rongou approved these changes on Aug 24, 2021
cwharris approved these changes on Aug 26, 2021
Yay! :D
hyperbolic2346 approved these changes on Aug 26, 2021
@gpucibot merge
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request on Aug 27, 2021
Depends on rapidsai/rmm#851, for performance reasons.

There are two parts to this change. First, we remove a workaround for RMM's sync-and-steal behavior which was preventing some work from overlapping; this behavior is significantly improved in rmm#851. The workaround involved allocating long-lived buffers and reusing them. With this change, we create device_uvectors on the fly and return them, which brings us to the second part of the change.

Because the data chunk reader owned the long-lived buffers, it was possible to return `device_span`s from the `get_next_chunk` method. Now that the `device_uvector`s are created on the fly and returned, we need an interface that supports ownership of the data on a per-implementation basis. Different readers can return different implementations of `device_data_chunk` via a `unique_ptr`. Those implementations can be owners of data, or just views.

This PR should merge only after rmm#851, else it will cause a performance degradation in `multibyte_split` (which is the only API to use this reader so far).

Authors:
- Christopher Harris (https://github.com/cwharris)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Elias Stehle (https://github.com/elstehle)

URL: #9129
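As a rough illustration of the ownership-agnostic interface this commit message describes, here is a minimal, hypothetical sketch. The class and function names (`owning_chunk`, `view_chunk`, `get_next_chunk`) and the host-side `std::vector` standing in for a `device_uvector` are simplifications for illustration, not cudf's actual API:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Abstract chunk: callers see only a pointer and a size, not who owns the bytes.
class device_data_chunk {
 public:
  virtual ~device_data_chunk() = default;
  virtual char const* data() const    = 0;
  virtual std::size_t size() const    = 0;
};

// Owning implementation: holds the buffer itself (stand-in for a device_uvector).
class owning_chunk : public device_data_chunk {
 public:
  explicit owning_chunk(std::vector<char> buffer) : buffer_{std::move(buffer)} {}
  char const* data() const override { return buffer_.data(); }
  std::size_t size() const override { return buffer_.size(); }

 private:
  std::vector<char> buffer_;
};

// Non-owning implementation: a view of someone else's long-lived buffer (like a device_span).
class view_chunk : public device_data_chunk {
 public:
  view_chunk(char const* data, std::size_t size) : data_{data}, size_{size} {}
  char const* data() const override { return data_; }
  std::size_t size() const override { return size_; }

 private:
  char const* data_;
  std::size_t size_;
};

// A reader returns whichever implementation fits, behind a unique_ptr, so chunks
// created on the fly can carry their own storage out of the reader.
std::unique_ptr<device_data_chunk> get_next_chunk()
{
  std::vector<char> freshly_read(1024);  // hypothetical per-call buffer
  return std::make_unique<owning_chunk>(std::move(freshly_read));
}
```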
rapids-bot bot pushed a commit that referenced this pull request on Sep 1, 2021
Fixes #861. An implicit copy of `free_list` was being used instead of a reference, which led to duplicate allocations. This never manifested until after #851 because previously the locally modified copy of a free list was merged back into an MR-owned free list. When we removed one of the places where free lists were merged, this copy resulted in the changes to free lists being lost. This only manifested in PTDS usage, but would likely also manifest in use cases with multiple non-default streams.

Authors:
- Mark Harris (https://github.com/harrism)

Approvers:
- Conor Hoekstra (https://github.com/codereport)
- Jake Hemstad (https://github.com/jrhemstad)
- Jason Lowe (https://github.com/jlowe)
- Rong Ou (https://github.com/rongou)

URL: #862
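The pitfall described in that fix comes down to `auto` copying a container where `auto&` was intended. Below is a minimal, self-contained sketch with hypothetical names (a `std::list` standing in for RMM's free-list type, not the actual code), showing why a block returned to the copy is lost:

```cpp
#include <cstddef>
#include <iostream>
#include <list>
#include <map>

using free_list = std::list<std::size_t>;     // stand-in for a per-stream free list

std::map<int, free_list> stream_free_lists;   // MR-owned free lists, keyed by stream id

void return_block_buggy(int stream_id, std::size_t block)
{
  auto blocks = stream_free_lists[stream_id];   // BUG: implicit copy of the free list
  blocks.push_back(block);                      // modifies the copy; the MR's list is unchanged
}

void return_block_fixed(int stream_id, std::size_t block)
{
  auto& blocks = stream_free_lists[stream_id];  // reference to the MR-owned free list
  blocks.push_back(block);                      // change persists; no duplicate allocations
}

int main()
{
  return_block_buggy(0, 256);
  std::cout << stream_free_lists[0].size() << '\n';  // prints 0: the returned block was lost
  return_block_fixed(0, 256);
  std::cout << stream_free_lists[0].size() << '\n';  // prints 1
}
```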
Fixes #850. The stream_ordered_memory_resource was too aggressive in stealing blocks. When stream A did not have a block sufficient for an allocation but found one in the free list of another stream B, it would wait on stream B's recorded event and then merge stream B's entire free list into its own. This resulted in excessive synchronization in workloads that cycle among threads and repeatedly allocate, as in the new MULTI_STREAM_ALLOCATION benchmark. That benchmark demonstrates that a stream would allocate, run a kernel, and free; then the next stream would allocate and, not having a block, steal all the memory from the first stream; then the next stream would steal from the second stream, and so on. The result is zero concurrency between the streams.

This PR avoids merging free lists when stealing, and it also returns the remainder of a block unused by an allocation to the original stream it was taken from. This way, when the pool is a single unfragmented block, streams don't repeatedly steal the entire remainder of the pool from each other.
It's possible these changes could increase fragmentation, but I did not change the fallback to merging free lists when a large enough block cannot be found in another stream. Merging free lists in that fallback creates opportunities to coalesce blocks so that larger blocks become available; the memory should be in its most coalesced state before allocation fails. A sketch of the new stealing policy follows.
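To make the mechanism concrete, here is a minimal, hypothetical sketch of the steal-one-block-and-return-the-remainder policy described above. The data structures (`std::map`, `std::list`, integer stream ids) and names are simplified stand-ins, not RMM's actual free-list or event machinery, and stream-event synchronization is elided:

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <new>
#include <optional>

struct block {
  char* ptr;
  std::size_t size;
};

using free_list = std::list<block>;

// Per-stream free lists owned by the memory resource, keyed by a stream id.
std::map<int, free_list> stream_free_lists;

// Remove and return the first block large enough for `size`, if any.
std::optional<block> take_block(free_list& blocks, std::size_t size)
{
  for (auto it = blocks.begin(); it != blocks.end(); ++it) {
    if (it->size >= size) {
      block found = *it;
      blocks.erase(it);
      return found;
    }
  }
  return std::nullopt;
}

// Split off the requested size; the remainder goes back to `owner_stream`'s free list
// rather than to the allocating stream's list.
block split_and_return_remainder(block b, std::size_t size, int owner_stream)
{
  if (b.size > size) {
    stream_free_lists[owner_stream].push_back({b.ptr + size, b.size - size});
  }
  return {b.ptr, size};
}

block allocate_from(int this_stream, std::size_t size)
{
  // 1. Try this stream's own free list first.
  if (auto b = take_block(stream_free_lists[this_stream], size)) {
    return split_and_return_remainder(*b, size, this_stream);
  }
  // 2. Steal a single block from another stream, but do NOT merge that stream's whole
  //    free list into ours (event wait on the other stream is elided in this sketch).
  for (auto& [other_stream, blocks] : stream_free_lists) {
    if (other_stream == this_stream) { continue; }
    if (auto b = take_block(blocks, size)) {
      // The unused remainder stays with the stream it was stolen from, so the rest of
      // the pool doesn't bounce between streams on every allocation.
      return split_and_return_remainder(*b, size, other_stream);
    }
  }
  // 3. Real code falls back to merging free lists (to coalesce) and growing the pool;
  //    this sketch just gives up.
  throw std::bad_alloc{};
}
```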
Performance of RANDOM_ALLOCATIONS_BENCH is minimally affected, and performance of MULTI_STREAM_ALLOCATION_BENCH is improved, demonstrating full concurrency. Benchmark results show that performance increases with higher numbers of streams, and that pre-warming (last four rows) does not affect performance.
CC @cwharris
MULTI_STREAM_ALLOCATIONS performance change:
RANDOM_ALLOCATIONS performance change:
Screenshot of kernel concurrency (or lack thereof) in the profiler before and after this change:
Before:
After: