cub::ThreadLoadAsync and friends, abstractions for asynchronous data movement #209
This exposes Ampere's asynchronous copy mechanism, based on CUTLASS's implementation.
These primitives are useful for people writing their own kernels, but we can also potentially use them transparently in existing CUB block mechanisms, like `BlockLoad` and `BlockStore`. Essentially, any time we have a repeated series of copies, we could use this. For example, code along these lines:
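A minimal sketch of the kind of loop meant; the names here (`BLOCK_THREADS`, `ITEMS_PER_THREAD`, `block_itr`, `smem`) are assumptions standing in for the PR's actual snippet, not the exact code:

```cpp
// Each thread copies ITEMS_PER_THREAD items from global memory into shared
// memory. Its destination slots are contiguous, but its global source
// addresses are strided across the block.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD, typename T>
__device__ void LoadStriped(T (&smem)[BLOCK_THREADS * ITEMS_PER_THREAD],
                            const T *block_itr)
{
    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
    {
        smem[threadIdx.x * ITEMS_PER_THREAD + ITEM] =
            block_itr[ITEM * BLOCK_THREADS + threadIdx.x];
    }
}
```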
The above code does a series of copies whose source addresses are not contiguous in memory. You couldn't replace the whole loop with a single memcpy: the destination is contiguous, but the source is not.
We don't have any compute work to overlap with the copies here, but it is still beneficial to replace them with asynchronous copies (@ogiroux and @griwes can explain why).
So that code could become:
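The exact `cub::ThreadLoadAsync` interface proposed in this PR isn't reproduced here; as a rough sketch of the underlying mechanism, the same loop written against CUDA's `__pipeline_memcpy_async` intrinsics (CUDA 11+, sm_80+) might look like the following, assuming `smem` is a `__shared__` array and `sizeof(T)` is 4, 8, or 16 bytes:

```cpp
#include <cuda_pipeline.h>  // __pipeline_memcpy_async and friends

template <int BLOCK_THREADS, int ITEMS_PER_THREAD, typename T>
__device__ void LoadStripedAsync(T (&smem)[BLOCK_THREADS * ITEMS_PER_THREAD],
                                 const T *block_itr)
{
    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
    {
        // Issue the copy asynchronously; the hardware moves the data directly
        // from global to shared memory without staging through registers.
        __pipeline_memcpy_async(
            &smem[threadIdx.x * ITEMS_PER_THREAD + ITEM],
            &block_itr[ITEM * BLOCK_THREADS + threadIdx.x],
            sizeof(T));
    }
    __pipeline_commit();       // batch the copies issued above
    __pipeline_wait_prior(0);  // wait until that batch has completed
}
```

In practice the wait would typically be deferred until just before the loaded data is consumed, which is where the overlap benefit comes from.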
TODO:
- `BlockLoad`
- `BlockStore`
- `BlockExchange`
- Specific algorithms - basically anywhere there's a series of copies.