[BUG] pool_memory_resource excessive synchronization #850
rapids-bot pushed a commit that referenced this issue on Aug 27, 2021 (#851):

Fixes #850.

The `stream_ordered_memory_resource` was too aggressive in stealing blocks. When stream A did not have a block sufficient for an allocation, if it found one in the free list of another stream B, it would wait on stream B's recorded event and then merge stream B's entire free list into its own. This resulted in excessive synchronization in workloads that cycle among threads and repeatedly allocate, as in the new MULTI_STREAM_ALLOCATION benchmark. That benchmark demonstrates that a stream would allocate, run a kernel, and free; the next stream would then allocate, find no block, and steal all of the memory from the first stream; the next stream would steal from the second stream; and so on. The result is zero concurrency between the streams.

This PR avoids merging free lists when stealing, and it also returns the remainder of a block unused by an allocation to the original stream it was taken from. This way, when the pool is a single unfragmented block, streams don't repeatedly steal the entire remainder of the pool from each other. It is possible these changes could increase fragmentation, but I did not change the fallback to merging free lists when a large enough block cannot be found in another stream. By merging free lists repeatedly, there are opportunities to coalesce so that larger blocks become available; the memory should be in its most coalesced state before allocation fails.

Performance of `RANDOM_ALLOCATIONS_BENCH` is minimally affected, and performance of `MULTI_STREAM_ALLOCATION_BENCH` is improved, demonstrating full concurrency. Benchmark results show that performance increases with higher numbers of streams, and pre-warming (last four rows) does not affect performance. CC @cwharris

```
--------------------------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------
BM_MultiStreamAllocations/pool_mr/1/4/0       2780 us         2780 us          251 items_per_second=1.4391k/s
BM_MultiStreamAllocations/pool_mr/2/4/0       1393 us         1392 us          489 items_per_second=2.8729k/s
BM_MultiStreamAllocations/pool_mr/4/4/0        706 us          706 us          926 items_per_second=5.66735k/s
BM_MultiStreamAllocations/pool_mr/1/4/1       2775 us         2774 us          252 items_per_second=1.44176k/s
BM_MultiStreamAllocations/pool_mr/2/4/1       1393 us         1393 us          487 items_per_second=2.87209k/s
BM_MultiStreamAllocations/pool_mr/4/4/1        704 us          704 us          913 items_per_second=5.67891k/s
```

MULTI_STREAM_ALLOCATIONS performance change:

```
(rapids) rapids@compose:~/rmm/build/cuda-11.2.0/branch-21.10/release$ _deps/benchmark-src/tools/compare.py benchmarks pool_multistream_allocations_21.10.json pool_multistream_allocations_returnsplit.json
Comparing pool_multistream_allocations_21.10.json to pool_multistream_allocations_returnsplit.json
Benchmark                                    Time       CPU   Time Old   Time New   CPU Old   CPU New
--------------------------------------------------------------------------------------------------------------------------------------
BM_MultiStreamAllocations/pool_mr/1/4/0   -0.0014   -0.0044       2789       2785      2788      2776
BM_MultiStreamAllocations/pool_mr/2/4/0   -0.4989   -0.4982       2779       1393      2775      1392
BM_MultiStreamAllocations/pool_mr/4/4/0   -0.7450   -0.7450       2778        708      2778       708
BM_MultiStreamAllocations/pool_mr/1/4/1   +0.0001   +0.0001       2775       2775      2774      2775
BM_MultiStreamAllocations/pool_mr/2/4/1   +0.0002   +0.0001       1393       1393      1392      1393
BM_MultiStreamAllocations/pool_mr/4/4/1   -0.0531   -0.0531        744        704       744       704
```

RANDOM_ALLOCATIONS performance change:

```
(rapids) rapids@compose:~/rmm/build/cuda-11.2.0/branch-21.10/release$ _deps/benchmark-src/tools/compare.py benchmarks pool_random_allocations_21.10.json pool_random_allocations_returnsplit.json
Comparing pool_random_allocations_21.10.json to pool_random_allocations_returnsplit.json
Benchmark                                   Time       CPU   Time Old   Time New   CPU Old   CPU New
---------------------------------------------------------------------------------------------------------------------------------------
BM_RandomAllocations/pool_mr/1000/1      +0.0199   +0.0198          1          1         1         1
BM_RandomAllocations/pool_mr/1000/4      -0.0063   -0.0061          1          1         1         1
BM_RandomAllocations/pool_mr/1000/64     -0.0144   -0.0145          1          1         1         1
BM_RandomAllocations/pool_mr/1000/256    +0.0243   +0.0254          1          1         1         1
BM_RandomAllocations/pool_mr/1000/1024   -0.0313   -0.0311          1          0         1         0
BM_RandomAllocations/pool_mr/1000/4096   -0.0063   -0.0059          0          0         0         0
BM_RandomAllocations/pool_mr/10000/1     +0.0105   +0.0105         46         47        46        47
BM_RandomAllocations/pool_mr/10000/4     -0.0023   -0.0023         50         50        50        50
BM_RandomAllocations/pool_mr/10000/64    +0.0065   +0.0065         11         11        11        11
BM_RandomAllocations/pool_mr/10000/256   +0.0099   +0.0099          6          6         6         6
BM_RandomAllocations/pool_mr/10000/1024  -0.0074   -0.0075          5          5         5         5
BM_RandomAllocations/pool_mr/10000/4096  -0.0165   -0.0163          5          5         5         5
BM_RandomAllocations/pool_mr/100000/1    +0.0154   +0.0154       6939       7046      6937      7044
BM_RandomAllocations/pool_mr/100000/4    +0.0839   +0.0838       2413       2615      2413      2615
BM_RandomAllocations/pool_mr/100000/64   +0.0050   +0.0050        116        117       116       117
BM_RandomAllocations/pool_mr/100000/256  -0.0040   -0.0039         64         64        64        64
BM_RandomAllocations/pool_mr/100000/1024 -0.0174   -0.0174         51         50        51        50
BM_RandomAllocations/pool_mr/100000/4096 -0.0447   -0.0448         48         46        48        46
```

Screenshots of kernel concurrency (or lack thereof) in the profiler before and after this change:

Before: ![Screenshot from 2021-08-24 15-36-17](https://user-images.githubusercontent.com/783069/130563715-7a52bb21-2ee7-4541-967a-a5dc25ab56ff.png)

After: ![Screenshot from 2021-08-24 15-41-33](https://user-images.githubusercontent.com/783069/130563731-aae43a6d-9881-46ed-ada1-98e5277a0dd6.png)

Authors:
- Mark Harris (https://github.com/harrism)

Approvers:
- Rong Ou (https://github.com/rongou)
- Christopher Harris (https://github.com/cwharris)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #851
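For readers unfamiliar with the allocator internals, here is a minimal sketch of the stealing policy the commit message describes. It is not RMM's actual `stream_ordered_memory_resource` code; the `block`, `free_list`, and `steal_block` names are illustrative stand-ins.

```cpp
#include <cstddef>
#include <list>
#include <optional>

// Illustrative stand-ins for the allocator's internal data structures.
struct block {
  char* ptr;
  std::size_t size;
};
using free_list = std::list<block>;

// After the change: take only one sufficiently large block from the victim
// stream's free list, and if that block is split, return the unused remainder
// to the victim. Before the change, the thief would instead wait on the
// victim's event and splice the victim's *entire* free list into its own.
std::optional<block> steal_block(free_list& victim, std::size_t size)
{
  for (auto it = victim.begin(); it != victim.end(); ++it) {
    if (it->size >= size) {
      block stolen = *it;
      victim.erase(it);
      if (stolen.size > size) {
        // Return the split remainder to the stream it was stolen from.
        victim.push_back(block{stolen.ptr + size, stolen.size - size});
        stolen.size = size;
      }
      return stolen;
    }
  }
  return std::nullopt;  // caller falls back to merging free lists (unchanged)
}
```

The key point is that the victim stream keeps the rest of its free memory, so a single large pool block is no longer handed back and forth in its entirety between streams.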
Describe the bug
#841 demonstrated that without pre-allocating regions for each stream, applications that cycle between streams can over-synchronize due to the pool's block-stealing implementation, resulting in a loss of concurrency in workloads that should otherwise achieve it.
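As a rough illustration (not the benchmark's actual code), the problematic access pattern looks like the following round-robin loop over streams; note that the `pool_memory_resource` constructor arguments may differ between RMM versions.

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

#include <cstddef>
#include <vector>

int main()
{
  rmm::mr::cuda_memory_resource cuda_mr;
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&cuda_mr};

  std::vector<rmm::cuda_stream> streams(4);
  constexpr std::size_t size = 1 << 20;  // 1 MiB per allocation

  // Round-robin over streams: allocate, do work, free. Before the fix, each
  // allocation stole (and synchronized with) the previous stream's memory,
  // serializing work that should run concurrently.
  for (int iter = 0; iter < 100; ++iter) {
    for (auto& stream : streams) {
      void* p = pool_mr.allocate(size, stream.view());
      // ... launch a kernel on stream.view().value() that uses p ...
      pool_mr.deallocate(p, size, stream.view());
    }
  }
  return 0;
}
```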
Steps/Code to reproduce bug
Run `MULTISTREAM_ALLOCATIONS_BENCHMARK`. See #841 for performance details.

Expected behavior
Concurrency -- in other words, performance in the benchmark should be similar regardless of pre-warming.
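For context, one plausible reading of "pre-warming" here (the benchmark's actual warm-up code may differ) is an untimed pass that touches the pool on every stream, so each stream's free list already holds memory before measurement. The helper below is hypothetical, not part of RMM or the benchmark.

```cpp
#include <rmm/cuda_stream.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cstddef>
#include <vector>

// Hypothetical helper: allocate and free once on each stream so that no
// stream needs to steal blocks during the timed section.
void pre_warm(rmm::mr::device_memory_resource& mr,
              std::vector<rmm::cuda_stream> const& streams,
              std::size_t size)
{
  for (auto const& stream : streams) {
    void* p = mr.allocate(size, stream.view());
    mr.deallocate(p, size, stream.view());
  }
}
```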