cub::ThreadLoadAsync and friends, abstractions for asynchronous data movement #209
This exposes Ampere's asynchronous copy mechanism, based on CUTLASS's implementation.
These primitives are useful for people writing their own kernels, but we can also potentially use them transparently in existing CUB block mechanisms, like `BlockLoad` and `BlockStore`. Essentially, any time we have a repeated series of copies, we could use this. For example, code along these lines:
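A minimal sketch of the kind of loop meant; the names here (`BLOCK_THREADS`, `ITEMS_PER_THREAD`, `block_itr`, `smem`) are assumptions standing in for the PR's actual snippet, not the exact code:

```cpp
// Each thread copies ITEMS_PER_THREAD items from global memory into shared
// memory. Its destination slots are contiguous, but its global source
// addresses are strided across the block.
template <int BLOCK_THREADS, int ITEMS_PER_THREAD, typename T>
__device__ void LoadStriped(T (&smem)[BLOCK_THREADS * ITEMS_PER_THREAD],
                            const T *block_itr)
{
    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
    {
        smem[threadIdx.x * ITEMS_PER_THREAD + ITEM] =
            block_itr[ITEM * BLOCK_THREADS + threadIdx.x];
    }
}
```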
The above code does a series of copies whose source addresses are not contiguous in memory. You couldn't replace the whole loop with a single memcpy: the destination is contiguous, but the source is not.
We don't have any compute work to overlap with the copies here, but it is still beneficial to replace them with asynchronous copies (@ogiroux and @griwes can explain why).
So that code could become:
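The exact `cub::ThreadLoadAsync` interface proposed in this PR isn't reproduced here; as a rough sketch of the underlying mechanism, the same loop written against CUDA's `__pipeline_memcpy_async` intrinsics (CUDA 11+, sm_80+) might look like the following, assuming `smem` is a `__shared__` array and `sizeof(T)` is 4, 8, or 16 bytes:

```cpp
#include <cuda_pipeline.h>  // __pipeline_memcpy_async and friends

template <int BLOCK_THREADS, int ITEMS_PER_THREAD, typename T>
__device__ void LoadStripedAsync(T (&smem)[BLOCK_THREADS * ITEMS_PER_THREAD],
                                 const T *block_itr)
{
    #pragma unroll
    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM)
    {
        // Issue the copy asynchronously; the hardware moves the data directly
        // from global to shared memory without staging through registers.
        __pipeline_memcpy_async(
            &smem[threadIdx.x * ITEMS_PER_THREAD + ITEM],
            &block_itr[ITEM * BLOCK_THREADS + threadIdx.x],
            sizeof(T));
    }
    __pipeline_commit();       // batch the copies issued above
    __pipeline_wait_prior(0);  // wait until that batch has completed
}
```

In practice the wait would typically be deferred until just before the loaded data is consumed, which is where the overlap benefit comes from.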
TODO:
- `BlockLoad`
- `BlockStore`
- `BlockExchange`
- Specific algorithms - basically anywhere there's a series of copies.