
Improve concurrency of stream_ordered_memory_resource by stealing less #851

Merged
merged 4 commits into rapidsai:branch-21.10 on Aug 27, 2021

Conversation

Member

@harrism harrism commented Aug 24, 2021

Fixes #850. The stream_ordered_memory_resource was too aggressive in stealing blocks. When stream A did not have a block sufficient for an allocation but found one in the free list of another stream B, it would wait on stream B's recorded event and then merge stream B's entire free list into its own. This resulted in excessive synchronization in workloads that cycle among threads and repeatedly allocate, as in the new MULTI_STREAM_ALLOCATION benchmark. In that benchmark, a stream would allocate, run a kernel, and free; the next stream would then allocate, find no free block, and steal all of the memory from the first stream; the next stream would steal from the second, and so on. The result was zero concurrency between the streams.

This PR avoids merging free lists when stealing, and it also returns the remainder of a stolen block that is not used by the allocation to the stream it was taken from. This way, when the pool is a single unfragmented block, streams don't repeatedly steal the entire remainder of the pool from each other.

It's possible these changes could increase fragmentation, but I did not change the fallback of merging free lists when a large enough block cannot be found in another stream. Because free lists are still merged in that case, there are opportunities to coalesce adjacent blocks so that larger blocks become available; the memory should be in its most coalesced state before an allocation fails.
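
To make the new policy concrete, here is a minimal, self-contained sketch of the idea (hypothetical names and data structures, not RMM's actual stream_ordered_memory_resource internals): when a stream's own free list has no suitable block, it takes a single block from another stream and returns the unused remainder of the split to the list it came from, rather than merging that stream's whole free list.

```cpp
// Simplified sketch of "steal one block, return the remainder"; all names are
// hypothetical and the event synchronization / merge fallback are only noted
// in comments.
#include <cstddef>
#include <iostream>
#include <list>
#include <map>
#include <optional>

struct Block {
  std::size_t offset{};
  std::size_t size{};
};

class StreamFreeLists {
 public:
  void insert(int stream, Block b) { free_lists_[stream].push_back(b); }

  // Allocate `size` bytes on `stream`. If `stream` has no suitable block,
  // steal a single block from another stream instead of merging that
  // stream's entire free list (the old, overly aggressive behavior).
  std::optional<Block> allocate(int stream, std::size_t size)
  {
    if (auto b = take_and_split(stream, size)) { return b; }
    for (auto& entry : free_lists_) {
      int other = entry.first;
      if (other == stream) { continue; }
      // In RMM this is also where the requesting stream would wait on
      // `other`'s recorded event before reusing the stolen block.
      if (auto b = take_and_split(other, size)) { return b; }
    }
    return std::nullopt;  // fallback (merging free lists) omitted here
  }

 private:
  // Find a block of at least `size` bytes in `owner`'s free list, split it,
  // and return the unused remainder to `owner`'s list, so the requesting
  // stream does not walk away with the rest of the pool.
  std::optional<Block> take_and_split(int owner, std::size_t size)
  {
    auto& list = free_lists_[owner];
    for (auto it = list.begin(); it != list.end(); ++it) {
      if (it->size >= size) {
        Block allocated{it->offset, size};
        Block remainder{it->offset + size, it->size - size};
        list.erase(it);
        if (remainder.size > 0) { list.push_back(remainder); }
        return allocated;
      }
    }
    return std::nullopt;
  }

  std::map<int, std::list<Block>> free_lists_;
};

int main()
{
  StreamFreeLists pool;
  pool.insert(/*stream=*/0, Block{0, 1 << 20});  // one big unfragmented block on stream 0

  auto a = pool.allocate(/*stream=*/1, 1024);    // stream 1 steals only the block it needs
  auto b = pool.allocate(/*stream=*/0, 1024);    // the remainder still belongs to stream 0
  std::cout << std::boolalpha << a.has_value() << ' ' << b.has_value() << '\n';
}
```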

Performance of RANDOM_ALLOCATIONS_BENCH is minimally affected, and performance of MULTI_STREAM_ALLOCATION_BENCH is improved, demonstrating full concurrency.

Benchmark results show that performance increases with higher numbers of streams, and that pre-warming (the rows with the final parameter set to 1) does not affect performance.

CC @cwharris

--------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
BM_MultiStreamAllocations/pool_mr/1/4/0       2780 us         2780 us          251 items_per_second=1.4391k/s
BM_MultiStreamAllocations/pool_mr/2/4/0       1393 us         1392 us          489 items_per_second=2.8729k/s
BM_MultiStreamAllocations/pool_mr/4/4/0        706 us          706 us          926 items_per_second=5.66735k/s
BM_MultiStreamAllocations/pool_mr/1/4/1       2775 us         2774 us          252 items_per_second=1.44176k/s
BM_MultiStreamAllocations/pool_mr/2/4/1       1393 us         1393 us          487 items_per_second=2.87209k/s
BM_MultiStreamAllocations/pool_mr/4/4/1        704 us          704 us          913 items_per_second=5.67891k/s

MULTI_STREAM_ALLOCATIONS performance change:

(rapids) rapids@compose:~/rmm/build/cuda-11.2.0/branch-21.10/release$ _deps/benchmark-src/tools/compare.py benchmarks pool_multistream_allocations_21.10.json pool_multistream_allocations_returnsplit.json 
Comparing pool_multistream_allocations_21.10.json to pool_multistream_allocations_returnsplit.json
Benchmark                                                 Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------
BM_MultiStreamAllocations/pool_mr/1/4/0                -0.0014         -0.0044          2789          2785          2788          2776
BM_MultiStreamAllocations/pool_mr/2/4/0                -0.4989         -0.4982          2779          1393          2775          1392
BM_MultiStreamAllocations/pool_mr/4/4/0                -0.7450         -0.7450          2778           708          2778           708
BM_MultiStreamAllocations/pool_mr/1/4/1                +0.0001         +0.0001          2775          2775          2774          2775
BM_MultiStreamAllocations/pool_mr/2/4/1                +0.0002         +0.0001          1393          1393          1392          1393
BM_MultiStreamAllocations/pool_mr/4/4/1                -0.0531         -0.0531           744           704           744           704

RANDOM_ALLOCATIONS performance change:

(rapids) rapids@compose:~/rmm/build/cuda-11.2.0/branch-21.10/release$ _deps/benchmark-src/tools/compare.py benchmarks pool_random_allocations_21.10.json pool_random_allocations_returnsplit.json  
Comparing pool_random_allocations_21.10.json to pool_random_allocations_returnsplit.json
Benchmark                                                  Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------
BM_RandomAllocations/pool_mr/1000/1                     +0.0199         +0.0198             1             1             1             1
BM_RandomAllocations/pool_mr/1000/4                     -0.0063         -0.0061             1             1             1             1
BM_RandomAllocations/pool_mr/1000/64                    -0.0144         -0.0145             1             1             1             1
BM_RandomAllocations/pool_mr/1000/256                   +0.0243         +0.0254             1             1             1             1
BM_RandomAllocations/pool_mr/1000/1024                  -0.0313         -0.0311             1             0             1             0
BM_RandomAllocations/pool_mr/1000/4096                  -0.0063         -0.0059             0             0             0             0
BM_RandomAllocations/pool_mr/10000/1                    +0.0105         +0.0105            46            47            46            47
BM_RandomAllocations/pool_mr/10000/4                    -0.0023         -0.0023            50            50            50            50
BM_RandomAllocations/pool_mr/10000/64                   +0.0065         +0.0065            11            11            11            11
BM_RandomAllocations/pool_mr/10000/256                  +0.0099         +0.0099             6             6             6             6
BM_RandomAllocations/pool_mr/10000/1024                 -0.0074         -0.0075             5             5             5             5
BM_RandomAllocations/pool_mr/10000/4096                 -0.0165         -0.0163             5             5             5             5
BM_RandomAllocations/pool_mr/100000/1                   +0.0154         +0.0154          6939          7046          6937          7044
BM_RandomAllocations/pool_mr/100000/4                   +0.0839         +0.0838          2413          2615          2413          2615
BM_RandomAllocations/pool_mr/100000/64                  +0.0050         +0.0050           116           117           116           117
BM_RandomAllocations/pool_mr/100000/256                 -0.0040         -0.0039            64            64            64            64
BM_RandomAllocations/pool_mr/100000/1024                -0.0174         -0.0174            51            50            51            50
BM_RandomAllocations/pool_mr/100000/4096                -0.0447         -0.0448            48            46            48            46

Screenshots of kernel concurrency (or lack thereof) in the profiler before and after this change:

Before:
Screenshot from 2021-08-24 15-36-17

After:
Screenshot from 2021-08-24 15-41-33

@harrism harrism requested a review from a team as a code owner August 24, 2021 05:44
@harrism harrism requested review from rongou and codereport August 24, 2021 05:44
@github-actions github-actions bot added the cpp Pertains to C++ code label Aug 24, 2021
@harrism harrism added bug Something isn't working non-breaking Non-breaking change performance labels Aug 24, 2021
@harrism harrism self-assigned this Aug 24, 2021
@harrism harrism requested a review from jrhemstad August 24, 2021 06:00
@harrism harrism changed the title from "Improve concurrency of pool allocator" to "Improve concurrency of stream_ordered_memory_resource by stealing less" Aug 24, 2021
Contributor

@cwharris cwharris left a comment


Yay! :D

@harrism
Member Author

harrism commented Aug 27, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit bef4377 into rapidsai:branch-21.10 Aug 27, 2021
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Aug 27, 2021
Depends on rapidsai/rmm#851, for performance reasons.

There are two parts to this change. First, we remove a workaround for RMM's sync-and-steal behavior which was preventing some work from overlapping. This behavior is significantly improved in rmm#851. The workaround involved allocating long-lived buffers and reusing them. With this change, we create device_uvectors on the fly and return them, which brings us to the second part of the change...

Because the data chunk reader owned the long-lived buffers, it was possible to return `device_span`s from the `get_next_chunk` method. Now that the `device_uvector`s are created on the fly and returned, we need an interface that lets each implementation decide whether it owns the underlying data. Different readers can return different implementations of `device_data_chunk` via a `unique_ptr`. Those implementations can be owners of data, or just views.

This PR should merge only after rmm#851, else it will cause performance degradation in `multibyte_split` (which is the only API to use this reader so far).

Authors:
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Elias Stehle (https://github.com/elstehle)

URL: #9129
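
The ownership-agnostic chunk interface described in that commit can be illustrated with a minimal sketch (these are not the actual cudf declarations; `std::vector` stands in for `rmm::device_uvector`, and `owning_chunk`/`view_chunk` are hypothetical names): the reader hands back a `unique_ptr` to an abstract chunk, and each implementation decides whether it owns its bytes or merely views them.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Abstract chunk: exposes a pointer and size, hides ownership.
class device_data_chunk {
 public:
  virtual ~device_data_chunk() = default;
  virtual char const* data() const = 0;
  virtual std::size_t size() const = 0;
};

// Owning implementation: holds a buffer created on the fly by the reader.
class owning_chunk : public device_data_chunk {
 public:
  explicit owning_chunk(std::vector<char> buffer) : buffer_(std::move(buffer)) {}
  char const* data() const override { return buffer_.data(); }
  std::size_t size() const override { return buffer_.size(); }

 private:
  std::vector<char> buffer_;
};

// Non-owning implementation: just a view over memory owned elsewhere.
class view_chunk : public device_data_chunk {
 public:
  view_chunk(char const* data, std::size_t size) : data_{data}, size_{size} {}
  char const* data() const override { return data_; }
  std::size_t size() const override { return size_; }

 private:
  char const* data_;
  std::size_t size_;
};

// A reader returns chunks by unique_ptr, so callers need not know which
// implementation (owner or view) they received.
std::unique_ptr<device_data_chunk> get_next_chunk(std::size_t bytes)
{
  return std::make_unique<owning_chunk>(std::vector<char>(bytes));
}

int main()
{
  auto chunk = get_next_chunk(64);
  return chunk->size() == 64 ? 0 : 1;
}
```
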
rapids-bot bot pushed a commit that referenced this pull request Sep 1, 2021
Fixes #861.

An implicit copy of `free_list` was being used instead of a reference, which led to duplicate allocations. This never manifested until after #851 because previously the locally modified copy of a free list was being merged into an MR-owned free list.  When we removed one of the places where we merged free lists, this copy resulted in the changes to free lists being lost. This only manifested in PTDS usage, but likely would also manifest in use cases with multiple non-default streams.

Authors:
  - Mark Harris (https://github.com/harrism)

Approvers:
  - Conor Hoekstra (https://github.com/codereport)
  - Jake Hemstad (https://github.com/jrhemstad)
  - Jason Lowe (https://github.com/jlowe)
  - Rong Ou (https://github.com/rongou)

URL: #862
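
The copy-vs-reference bug described in the follow-up commit above can be reduced to a small illustration (hypothetical names, not the actual RMM code): `auto` silently copies the free list, so blocks removed from the copy remain in the real list and can be handed out twice, while `auto&` modifies the list in place.

```cpp
#include <iostream>
#include <list>
#include <map>

using free_list = std::list<int>;  // block ids stand in for memory blocks

std::map<int, free_list> stream_free_lists;

int allocate_buggy(int stream)
{
  auto blocks = stream_free_lists[stream];   // BUG: implicit copy of the free list
  int block   = blocks.front();
  blocks.pop_front();                        // only the copy is modified
  return block;
}

int allocate_fixed(int stream)
{
  auto& blocks = stream_free_lists[stream];  // reference: modifies the real list
  int block    = blocks.front();
  blocks.pop_front();
  return block;
}

int main()
{
  stream_free_lists[0] = {1, 2};
  // With the copy, the same block is returned twice (a "duplicate allocation").
  std::cout << allocate_buggy(0) << ' ' << allocate_buggy(0) << '\n';  // prints: 1 1
  // With the reference, distinct blocks are returned.
  std::cout << allocate_fixed(0) << ' ' << allocate_fixed(0) << '\n';  // prints: 1 2
}
```
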
Labels
bug Something isn't working cpp Pertains to C++ code non-breaking Non-breaking change performance
Successfully merging this pull request may close these issues.

[BUG] pool_memory_resource excessive synchronization