You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Motivation
It's increasingly harder to reach SOL on newer GPU architectures, starting with A100 and H100, especially for simple kernels, like: thrust::transform(..., thrust::plus{}), which basically load a few values and perform little compute. CUB algorithms already counter this by processing several elements per thread, but internal research hints at the necessity to further increase the amount of data in flight.
Use case thrust::transform is an important primitive for many algorithms and also occurs in BabelStream, i highly relevant HPC benchmark often used to produce representative numbers to compare the performance of hardware architectures. We should therefore dedicate some effort to ensure thrust::transform performs well.
Approach
The main strategy is to have more "bytes in flight" when reading, with the concrete amount depending on the target architecture (tuning parameter). There are multiple ways to generate more loads. Again, internal research points to using either prefetching or the tensor memory accelerator (TMA, e.g. via memcpy_async) on newer architectures. Excessive unrolling and loading to registers works as well, but has the drawback of consuming large amount of registers for architectures requiring a large number of bytes in flight.
Address stability
For the loading strategy we have to consider the address stability of data items as well. Users sometimes rely on the ability to retrieve the index inside an input array from the reference of a loaded element:
transform(par, a, a + n, a, [a,b,c](const T& e) {
constauto i = &e – a; // &e expected to point into global memoryreturn e + b[i] + c[i];
});
Such a user-provided function object inhibits any optimization which loads elements from global memory into registers or shared memory before passing them as arguments, thus only allowing prefetching as optimization. Address oblivious function objects can benefit from a larger variety of optimizations (like TMA or pipelined loading to registers.
Further concerns
Furthermore, the computational intensity and shared memory/register consumption of the user provided function object influence the loading strategy. Longer computations seem to require more data in flight. Shared memory is contested by TMA and user-side computation. Register pressure limits unrolling.
Status quo thrust::transform (CUDA) is currently built on top of cub::DeviceFor::Bulk, which eventually dispatches independently of the uses data types or number of input and output streams. Because cub::DeviceFor::Bulk is index based, the involved input and output data streams are not visible and no tuning based on this information is possible. The situation is similar with cub::DeviceFor::ForEach et al.
Strategy
I propose to add a new CUB algorithm cub::DeviceTransform governing transformations of N input streams into a single output stream (maybe M output streams if use cases arrise) and rebasing thrust::transform on top of it.
Address stability: Because I just encountered it in the tests on my A6000. If we use a kernel serving parameters from shared memory and the user performs pointer arithmetic with a pointer to global memory, the kernel crashes (Release build) and the following error code is reported at the next cudaDeviceSynchronize:
717 (operation not supported on global/shared address space)
That's at least better than a garbage result and the kernel continuing with wrong data.
Motivation
It's increasingly harder to reach SOL on newer GPU architectures, starting with A100 and H100, especially for simple kernels, like:
thrust::transform(..., thrust::plus{})
, which basically load a few values and perform little compute. CUB algorithms already counter this by processing several elements per thread, but internal research hints at the necessity to further increase the amount of data in flight.Use case
thrust::transform
is an important primitive for many algorithms and also occurs in BabelStream, i highly relevant HPC benchmark often used to produce representative numbers to compare the performance of hardware architectures. We should therefore dedicate some effort to ensurethrust::transform
performs well.Approach
The main strategy is to have more "bytes in flight" when reading, with the concrete amount depending on the target architecture (tuning parameter). There are multiple ways to generate more loads. Again, internal research points to using either prefetching or the tensor memory accelerator (TMA, e.g. via
memcpy_async
) on newer architectures. Excessive unrolling and loading to registers works as well, but has the drawback of consuming large amount of registers for architectures requiring a large number of bytes in flight.Address stability
For the loading strategy we have to consider the address stability of data items as well. Users sometimes rely on the ability to retrieve the index inside an input array from the reference of a loaded element:
Such a user-provided function object inhibits any optimization which loads elements from global memory into registers or shared memory before passing them as arguments, thus only allowing prefetching as optimization. Address oblivious function objects can benefit from a larger variety of optimizations (like TMA or pipelined loading to registers.
Further concerns
Furthermore, the computational intensity and shared memory/register consumption of the user provided function object influence the loading strategy. Longer computations seem to require more data in flight. Shared memory is contested by TMA and user-side computation. Register pressure limits unrolling.
Status quo
thrust::transform
(CUDA) is currently built on top ofcub::DeviceFor::Bulk
, which eventually dispatches independently of the uses data types or number of input and output streams. Becausecub::DeviceFor::Bulk
is index based, the involved input and output data streams are not visible and no tuning based on this information is possible. The situation is similar withcub::DeviceFor::ForEach
et al.Strategy
I propose to add a new CUB algorithm
cub::DeviceTransform
governing transformations of N input streams into a single output stream (maybe M output streams if use cases arrise) and rebasingthrust::transform
on top of it.Future tasks after merging
cub::DeviceTransform
cub::DeviceTransform
#2091thrust::transform
to usecub::DeviceTransform
#2263thrust::transform
#2717num_items
tocub::DeviceTransform
#2714__restrict__
thrust::transform_if
tocub::DeviceTransform
as wellthrust::transform_if
with stencil tocub::DeviceTransform
as welltransform
(especially replacingfor_each
)The text was updated successfully, but these errors were encountered: