Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] pool_memory_resource excessive synchronization #850

Closed
harrism opened this issue Aug 24, 2021 · 0 comments · Fixed by #851
Closed

[BUG] pool_memory_resource excessive synchronization #850

harrism opened this issue Aug 24, 2021 · 0 comments · Fixed by #851
Assignees
Labels
bug Something isn't working performance

Comments

@harrism
Copy link
Member

harrism commented Aug 24, 2021

Describe the bug
#841 demonstrated that without pre-allocating regions for each stream, applications which cycle between streams can oversynchronize due to the pool's stealing implementation. This results in loss of concurrency in workloads that should otherwise achieve it.

Steps/Code to reproduce bug
Run MULTISTREAM_ALLOCATIONS_BENCHMARK. See #841 for performance details.

Expected behavior
Concurrency -- in other words, performance in the benchmark should be similar regardless of pre-warming.

@harrism harrism added bug Something isn't working performance labels Aug 24, 2021
@harrism harrism self-assigned this Aug 24, 2021
@rapids-bot rapids-bot bot closed this as completed in #851 Aug 27, 2021
rapids-bot bot pushed a commit that referenced this issue Aug 27, 2021
#851)

Fixes #850. The `stream_ordered_memory_resource` was too aggressive in stealing blocks. When stream A did not have a block sufficient for an allocation, if it found one in the free list of another stream B, it would wait on stream B's recorded event and then merge stream B's entire free list into its own. This resulted in excessive synchronization in workloads that cycle among threads and repeatedly allocate, as in the new MULTI_STREAM_ALLOCATION benchmark. That benchmark demonstrates that a stream would allocate, run a kernel, and free, then the next stream would allocate, not have a block, so steal all the memory from the first stream, then the next stream would steal from the second stream, etc.  The result is that there is zero concurrency between the streams.

This PR avoids merging free lists when stealing, and it also returns the remainder of a block unused by an allocation to the original stream that it was taken from. This way when the pool is a single unfragmented block, streams don't steal the entire remainder of the pool from each other repeatedly.

It's possible these changes could increase fragmentation, but I did not change the fallback to merging free lists when a large enough block cannot be found in another stream. By merging the streams repeatedly, there are opportunities to coalesce so that larger blocks become available. The memory should be in its most coalesced state before allocation fails.

Performance of `RANDOM_ALLOCATIONS_BENCH` is minimally affected, and performance of `MULTI_STREAM_ALLOCATION_BENCH` is improved, demonstrating full concurrency.

Benchmark results show that performance increases with higher numbers of streams, and pre-warming (last four rows) does not affect performance. 

CC @cwharris 

```
--------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
BM_MultiStreamAllocations/pool_mr/1/4/0       2780 us         2780 us          251 items_per_second=1.4391k/s
BM_MultiStreamAllocations/pool_mr/2/4/0       1393 us         1392 us          489 items_per_second=2.8729k/s
BM_MultiStreamAllocations/pool_mr/4/4/0        706 us          706 us          926 items_per_second=5.66735k/s
BM_MultiStreamAllocations/pool_mr/1/4/1       2775 us         2774 us          252 items_per_second=1.44176k/s
BM_MultiStreamAllocations/pool_mr/2/4/1       1393 us         1393 us          487 items_per_second=2.87209k/s
BM_MultiStreamAllocations/pool_mr/4/4/1        704 us          704 us          913 items_per_second=5.67891k/s
```

MULTI_STREAM_ALLOCATIONS performance change:

```
(rapids) rapids@compose:~/rmm/build/cuda-11.2.0/branch-21.10/release$ _deps/benchmark-src/tools/compare.py benchmarks pool_multistream_allocations_21.10.json pool_multistream_allocations_returnsplit.json 
Comparing pool_multistream_allocations_21.10.json to pool_multistream_allocations_returnsplit.json
Benchmark                                                 Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------
BM_MultiStreamAllocations/pool_mr/1/4/0                -0.0014         -0.0044          2789          2785          2788          2776
BM_MultiStreamAllocations/pool_mr/2/4/0                -0.4989         -0.4982          2779          1393          2775          1392
BM_MultiStreamAllocations/pool_mr/4/4/0                -0.7450         -0.7450          2778           708          2778           708
BM_MultiStreamAllocations/pool_mr/1/4/1                +0.0001         +0.0001          2775          2775          2774          2775
BM_MultiStreamAllocations/pool_mr/2/4/1                +0.0002         +0.0001          1393          1393          1392          1393
BM_MultiStreamAllocations/pool_mr/4/4/1                -0.0531         -0.0531           744           704           744           704
```

RANDOM_ALLOCATIONS performance change:

```
(rapids) rapids@compose:~/rmm/build/cuda-11.2.0/branch-21.10/release$ _deps/benchmark-src/tools/compare.py benchmarks pool_random_allocations_21.10.json pool_random_allocations_returnsplit.json  
Comparing pool_random_allocations_21.10.json to pool_random_allocations_returnsplit.json
Benchmark                                                  Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------
BM_RandomAllocations/pool_mr/1000/1                     +0.0199         +0.0198             1             1             1             1
BM_RandomAllocations/pool_mr/1000/4                     -0.0063         -0.0061             1             1             1             1
BM_RandomAllocations/pool_mr/1000/64                    -0.0144         -0.0145             1             1             1             1
BM_RandomAllocations/pool_mr/1000/256                   +0.0243         +0.0254             1             1             1             1
BM_RandomAllocations/pool_mr/1000/1024                  -0.0313         -0.0311             1             0             1             0
BM_RandomAllocations/pool_mr/1000/4096                  -0.0063         -0.0059             0             0             0             0
BM_RandomAllocations/pool_mr/10000/1                    +0.0105         +0.0105            46            47            46            47
BM_RandomAllocations/pool_mr/10000/4                    -0.0023         -0.0023            50            50            50            50
BM_RandomAllocations/pool_mr/10000/64                   +0.0065         +0.0065            11            11            11            11
BM_RandomAllocations/pool_mr/10000/256                  +0.0099         +0.0099             6             6             6             6
BM_RandomAllocations/pool_mr/10000/1024                 -0.0074         -0.0075             5             5             5             5
BM_RandomAllocations/pool_mr/10000/4096                 -0.0165         -0.0163             5             5             5             5
BM_RandomAllocations/pool_mr/100000/1                   +0.0154         +0.0154          6939          7046          6937          7044
BM_RandomAllocations/pool_mr/100000/4                   +0.0839         +0.0838          2413          2615          2413          2615
BM_RandomAllocations/pool_mr/100000/64                  +0.0050         +0.0050           116           117           116           117
BM_RandomAllocations/pool_mr/100000/256                 -0.0040         -0.0039            64            64            64            64
BM_RandomAllocations/pool_mr/100000/1024                -0.0174         -0.0174            51            50            51            50
BM_RandomAllocations/pool_mr/100000/4096                -0.0447         -0.0448            48            46            48            46
```

Screenshot of kernel concurrency (or lack of) in the profiler before and after this change:

Before:
![Screenshot from 2021-08-24 15-36-17](https://user-images.githubusercontent.com/783069/130563715-7a52bb21-2ee7-4541-967a-a5dc25ab56ff.png)

After:
![Screenshot from 2021-08-24 15-41-33](https://user-images.githubusercontent.com/783069/130563731-aae43a6d-9881-46ed-ada1-98e5277a0dd6.png)

Authors:
  - Mark Harris (https://github.com/harrism)

Approvers:
  - Rong Ou (https://github.com/rongou)
  - Christopher Harris (https://github.com/cwharris)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #851
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant