This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

cub::ThreadLoadAsync and friends, abstractions for asynchronous data movement #209

Closed · wants to merge 1 commit

Conversation

@brycelelbach (Collaborator) commented Oct 5, 2020

This exposes Ampere's asynchronous copy mechanism, based on CUTLASS's implementation.

These primitives are useful for people writing their own kernels, but we can also potentially use them transparently in existing CUB block-level mechanisms like BlockLoad and BlockStore.

Essentially, any time we have a repeated series of copies, we could use this. For example, this code:

    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
    {
        if ((linear_tid * ITEMS_PER_THREAD) + ITEM < valid_items)
        {
            items[ITEM] = block_itr[(linear_tid * ITEMS_PER_THREAD) + ITEM];
        }
    }

The above code performs a series of copies that are not contiguous in memory. You couldn't replace the whole loop with a single memcpy: the destination is contiguous, but the source is not.

We don't have any compute work to overlap with the copies here, but it is still beneficial to replace them with asynchronous copies (@ogiroux and @griwes can explain why).

So that code could become:

    for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ITEM++)
    {
        if ((linear_tid * ITEMS_PER_THREAD) + ITEM < valid_items)
        {
            cub::ThreadLoadAsync<cub::LOAD_DEFAULT>(items + ITEM, block_itr + (linear_tid * ITEMS_PER_THREAD) + ITEM);
        }
    }
    cub::ThreadLoadWait();
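For context, here is a minimal sketch of what primitives like these might lower to on sm_80: the `cp.async` PTX instruction (LDGSTS), which copies from global to shared memory without staging through registers. The function names and the pre-Ampere fallback below are illustrative, not the actual CUB implementation.

```cuda
#include <cstdint>

template <typename T>
__device__ void thread_load_async_sketch(T* smem_dst, const T* gmem_src)
{
#if __CUDA_ARCH__ >= 800
    static_assert(sizeof(T) == 4 || sizeof(T) == 8 || sizeof(T) == 16,
                  "cp.async supports 4-, 8-, and 16-byte copies");
    // Convert the generic shared-memory pointer to its shared-state-space
    // address, as cp.async requires a shared-memory destination.
    uint32_t dst = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
    asm volatile("cp.async.ca.shared.global [%0], [%1], %2;\n"
                 :: "r"(dst), "l"(gmem_src), "n"(sizeof(T)));
#else
    // Pre-Ampere fallback: a plain synchronous copy through registers.
    *smem_dst = *gmem_src;
#endif
}

// The matching wait that something like ThreadLoadWait might wrap:
__device__ void thread_load_wait_sketch()
{
#if __CUDA_ARCH__ >= 800
    asm volatile("cp.async.wait_all;\n");
#endif
}
```

Avoiding the register round-trip is one reason the replacement can pay off even with no compute to overlap; the exact instruction variant (`.ca` vs `.cg`) and grouping strategy would be implementation choices.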

TODO:

  • Add tests
  • Deploy this in BlockLoad, BlockStore, BlockExchange, and specific algorithms: basically anywhere there's a series of copies.
  • Add peeling, widening, and remaindering, possibly unifying this with the vectorization machinery.

@brycelelbach brycelelbach added this to the 1.11.0 milestone Oct 5, 2020
@brycelelbach brycelelbach changed the title cub::ThreadLoadAsync and friends, abstractions for asynchronous data movement. cub::ThreadLoadAsync and friends, abstractions for asynchronous data movement Oct 5, 2020


/**
* \brief Establishes an ordering w.r.t previously issued ThreadLoadAsync operations.
@brycelelbach (Collaborator, Author) commented on the diff:

This should probably have a comment saying that this means that prior operations have been read from the source, although they have not necessarily been stored to the destination.
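To make that caveat concrete, here is a hedged usage sketch. Regardless of the exact completion semantics, a block-wide barrier is still required before one thread reads shared-memory elements loaded on behalf of another (standard CUDA shared-memory practice; the kernel and tile size below are illustrative):

```cuda
__global__ void staged_consume(const int* gmem, int* out)
{
    __shared__ int tile[128];

    // Each thread issues one asynchronous load into shared memory.
    cub::ThreadLoadAsync<cub::LOAD_DEFAULT>(tile + threadIdx.x,
                                            gmem + threadIdx.x);

    // Orders this thread's prior async loads. Per the comment above, this
    // means the sources have been read; it does not by itself make every
    // destination write visible to *other* threads.
    cub::ThreadLoadWait();

    // A block-wide barrier is still required before consuming elements of
    // `tile` that were written by other threads.
    __syncthreads();

    out[threadIdx.x] = tile[(threadIdx.x + 1) % 128];
}
```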

@alliepiper alliepiper modified the milestones: 1.11.0, 1.11.1 Oct 19, 2020
@alliepiper alliepiper modified the milestones: 1.12.0, 1.13.0 Nov 30, 2020
@alliepiper alliepiper modified the milestones: 1.13.0, 1.14.0 Mar 1, 2021
@mnicely (Collaborator) commented Mar 24, 2021

@brycelelbach
If it is possible to pass a pointer within a SMEM array, will cub::ThreadLoadAsync<cub::LOAD_DEFAULT> be smart enough to convert the instructions to LDGSTS?
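For context (not an answer on the author's behalf): LDGSTS requires the destination to be in the shared state space, so an implementation would need to detect that, either statically or at runtime. A hedged sketch of a runtime check using CUDA's address-space query intrinsic (the function name is hypothetical):

```cuda
template <typename T>
__device__ void load_maybe_async(T* dst, const T* src)
{
    // __isShared reports whether a generic pointer refers to shared memory;
    // only then could the copy be lowered to LDGSTS (cp.async).
    if (__isShared(dst)) {
        // ... issue the cp.async path here ...
        *dst = *src;  // placeholder for the async path
    } else {
        // Otherwise, only a plain load/store through registers is possible.
        *dst = *src;
    }
}
```

In practice a compile-time distinction (e.g. a dedicated shared-memory overload) would likely be preferred over a per-call runtime query.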

@brycelelbach (Collaborator, Author) commented Mar 26, 2021 via email

@alliepiper alliepiper removed their assignment Jul 1, 2021
@alliepiper alliepiper removed this from the 1.14.0 milestone Aug 17, 2021
@gevtushenko (Collaborator) commented:
I'm not sure if we want to expose this in CUB. Closing for now.

@gevtushenko gevtushenko closed this Jun 7, 2023
4 participants