Improve concurrency of stream_ordered_memory_resource by stealing less #851
Merged
rapids-bot merged 4 commits into rapidsai:branch-21.10 from harrism:fea-pool-return-remainder on Aug 27, 2021
Conversation
harrism added the bug (Something isn't working), non-breaking (Non-breaking change), and performance labels on Aug 24, 2021
harrism changed the title from "Improve concurrency of pool allocator" to "Improve concurrency of stream_ordered_memory_resource by stealing less" on Aug 24, 2021
rongou approved these changes on Aug 24, 2021
cwharris approved these changes on Aug 26, 2021
Yay! :D
hyperbolic2346 approved these changes on Aug 26, 2021
@gpucibot merge
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request on Aug 27, 2021
Depends on rapidsai/rmm#851, for performance reasons.

There are two parts to this change. First, we remove a workaround for RMM's sync-and-steal behavior which was preventing some work from overlapping; this behavior is significantly improved in rmm#851. The workaround involved allocating long-lived buffers and reusing them. With this change, we create device_uvectors on the fly and return them, which brings us to the second part of the change.

Because the data chunk reader owned the long-lived buffers, it was possible to return `device_span`s from the `get_next_chunk` method. Now that the `device_uvector`s are created on the fly and returned, we need an interface that supports ownership of the data on a per-implementation basis. Different readers can return different implementations of `device_data_chunk` via a `unique_ptr`. Those implementations can be owners of data, or just views.

This PR should merge only after rmm#851, else it will cause a performance degradation in `multibyte_split` (which is the only API to use this reader so far).

Authors:
- Christopher Harris (https://github.com/cwharris)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Elias Stehle (https://github.com/elstehle)

URL: #9129
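As a rough illustration of the ownership-agnostic interface this commit message describes, here is a minimal, hypothetical sketch. The class and function names (`owning_chunk`, `view_chunk`, `get_next_chunk`) and the host-side `std::vector` standing in for a `device_uvector` are simplifications for illustration, not cudf's actual API:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Abstract chunk: callers see only a pointer and a size, not who owns the bytes.
class device_data_chunk {
 public:
  virtual ~device_data_chunk() = default;
  virtual char const* data() const    = 0;
  virtual std::size_t size() const    = 0;
};

// Owning implementation: holds the buffer itself (stand-in for a device_uvector).
class owning_chunk : public device_data_chunk {
 public:
  explicit owning_chunk(std::vector<char> buffer) : buffer_{std::move(buffer)} {}
  char const* data() const override { return buffer_.data(); }
  std::size_t size() const override { return buffer_.size(); }

 private:
  std::vector<char> buffer_;
};

// Non-owning implementation: a view of someone else's long-lived buffer (like a device_span).
class view_chunk : public device_data_chunk {
 public:
  view_chunk(char const* data, std::size_t size) : data_{data}, size_{size} {}
  char const* data() const override { return data_; }
  std::size_t size() const override { return size_; }

 private:
  char const* data_;
  std::size_t size_;
};

// A reader returns whichever implementation fits, behind a unique_ptr, so chunks
// created on the fly can carry their own storage out of the reader.
std::unique_ptr<device_data_chunk> get_next_chunk()
{
  std::vector<char> freshly_read(1024);  // hypothetical per-call buffer
  return std::make_unique<owning_chunk>(std::move(freshly_read));
}
```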
rapids-bot bot pushed a commit that referenced this pull request on Sep 1, 2021
Fixes #861. An implicit copy of `free_list` was being used instead of a reference, which led to duplicate allocations. This never manifested until after #851 because previously the locally modified copy of a free list was merged back into an MR-owned free list. When we removed one of the places where free lists were merged, this copy resulted in the changes to free lists being lost. This only manifested in PTDS usage, but would likely also manifest in use cases with multiple non-default streams.

Authors:
- Mark Harris (https://github.com/harrism)

Approvers:
- Conor Hoekstra (https://github.com/codereport)
- Jake Hemstad (https://github.com/jrhemstad)
- Jason Lowe (https://github.com/jlowe)
- Rong Ou (https://github.com/rongou)

URL: #862
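The pitfall described in that fix comes down to `auto` copying a container where `auto&` was intended. Below is a minimal, self-contained sketch with hypothetical names (a `std::list` standing in for RMM's free-list type, not the actual code), showing why a block returned to the copy is lost:

```cpp
#include <cstddef>
#include <iostream>
#include <list>
#include <map>

using free_list = std::list<std::size_t>;     // stand-in for a per-stream free list

std::map<int, free_list> stream_free_lists;   // MR-owned free lists, keyed by stream id

void return_block_buggy(int stream_id, std::size_t block)
{
  auto blocks = stream_free_lists[stream_id];   // BUG: implicit copy of the free list
  blocks.push_back(block);                      // modifies the copy; the MR's list is unchanged
}

void return_block_fixed(int stream_id, std::size_t block)
{
  auto& blocks = stream_free_lists[stream_id];  // reference to the MR-owned free list
  blocks.push_back(block);                      // change persists; no duplicate allocations
}

int main()
{
  return_block_buggy(0, 256);
  std::cout << stream_free_lists[0].size() << '\n';  // prints 0: the returned block was lost
  return_block_fixed(0, 256);
  std::cout << stream_free_lists[0].size() << '\n';  // prints 1
}
```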
Fixes #850. The stream_ordered_memory_resource was too aggressive in stealing blocks. When stream A did not have a block sufficient for an allocation but found one in the free list of another stream B, it would wait on stream B's recorded event and then merge stream B's entire free list into its own. This resulted in excessive synchronization in workloads that cycle among threads and repeatedly allocate, as in the new MULTI_STREAM_ALLOCATION benchmark. That benchmark demonstrates that a stream would allocate, run a kernel, and free; then the next stream would allocate and, not having a block, steal all the memory from the first stream; then the next stream would steal from the second stream, and so on. The result is zero concurrency between the streams.

This PR avoids merging free lists when stealing, and it also returns the remainder of a block unused by an allocation to the original stream it was taken from. This way, when the pool is a single unfragmented block, streams don't repeatedly steal the entire remainder of the pool from each other.
It's possible these changes could increase fragmentation, but I did not change the fallback to merging free lists when a large enough block cannot be found in another stream. Merging free lists in that fallback creates opportunities to coalesce blocks so that larger blocks become available; the memory should be in its most coalesced state before allocation fails. A sketch of the new stealing policy follows.
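To make the mechanism concrete, here is a minimal, hypothetical sketch of the steal-one-block-and-return-the-remainder policy described above. The data structures (`std::map`, `std::list`, integer stream ids) and names are simplified stand-ins, not RMM's actual free-list or event machinery, and stream-event synchronization is elided:

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <new>
#include <optional>

struct block {
  char* ptr;
  std::size_t size;
};

using free_list = std::list<block>;

// Per-stream free lists owned by the memory resource, keyed by a stream id.
std::map<int, free_list> stream_free_lists;

// Remove and return the first block large enough for `size`, if any.
std::optional<block> take_block(free_list& blocks, std::size_t size)
{
  for (auto it = blocks.begin(); it != blocks.end(); ++it) {
    if (it->size >= size) {
      block found = *it;
      blocks.erase(it);
      return found;
    }
  }
  return std::nullopt;
}

// Split off the requested size; the remainder goes back to `owner_stream`'s free list
// rather than to the allocating stream's list.
block split_and_return_remainder(block b, std::size_t size, int owner_stream)
{
  if (b.size > size) {
    stream_free_lists[owner_stream].push_back({b.ptr + size, b.size - size});
  }
  return {b.ptr, size};
}

block allocate_from(int this_stream, std::size_t size)
{
  // 1. Try this stream's own free list first.
  if (auto b = take_block(stream_free_lists[this_stream], size)) {
    return split_and_return_remainder(*b, size, this_stream);
  }
  // 2. Steal a single block from another stream, but do NOT merge that stream's whole
  //    free list into ours (event wait on the other stream is elided in this sketch).
  for (auto& [other_stream, blocks] : stream_free_lists) {
    if (other_stream == this_stream) { continue; }
    if (auto b = take_block(blocks, size)) {
      // The unused remainder stays with the stream it was stolen from, so the rest of
      // the pool doesn't bounce between streams on every allocation.
      return split_and_return_remainder(*b, size, other_stream);
    }
  }
  // 3. Real code falls back to merging free lists (to coalesce) and growing the pool;
  //    this sketch just gives up.
  throw std::bad_alloc{};
}
```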
Performance of RANDOM_ALLOCATIONS_BENCH is minimally affected, and performance of MULTI_STREAM_ALLOCATION_BENCH is improved, demonstrating full concurrency. Benchmark results show that performance increases with higher numbers of streams, and that pre-warming (last four rows) does not affect performance.
CC @cwharris
MULTI_STREAM_ALLOCATIONS performance change:
RANDOM_ALLOCATIONS performance change:
Screenshot of kernel concurrency (or lack thereof) in the profiler before and after this change:
Before:
After: