
Introduce CUB ForEach algorithms #1302

Merged
gevtushenko merged 21 commits into NVIDIA:main on Jan 25, 2024

Conversation

gevtushenko
Collaborator

Description

closes #1231

This PR introduces a family of ForEach algorithms into CUB.
Apart from ForEach and ForEachN, the PR provides *Copy versions of the algorithms that vectorize loads, providing about 15% better performance on U8. There's machinery that allows vectorization to be enabled automatically for the non-copy versions, but it's disabled for now since it leads to generating twice as many kernels (for aligned and unaligned pointers). There's also a new feature that uses the occupancy calculator to determine the block size leading to maximal occupancy. This feature is currently disabled as well. Some follow-up work on tuning ForEach would allow us to discover scenarios where a dynamic block size is beneficial.

As of now, there's no difference in generated SASS.
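As a minimal sketch of how the new entry points can be invoked (assuming the overloads that allocate temporary storage internally; the functors and sizes below are made up for illustration, not taken from the PR):

```cuda
#include <cub/device/device_for.cuh>
#include <thrust/device_vector.h>

struct square_t
{
  int* d_out;
  __device__ void operator()(int i) { d_out[i] = i * i; }
};

struct odd_count_t
{
  int* d_count;
  // The *Copy versions pass a copy of each element to the operator,
  // which is what allows the implementation to vectorize the loads
  __device__ void operator()(int item)
  {
    if (item % 2)
    {
      atomicAdd(d_count, 1);
    }
  }
};

int main()
{
  thrust::device_vector<int> out(16);
  thrust::device_vector<int> count(1);

  // Bulk applies the operator to each index in [0, shape)
  cub::DeviceFor::Bulk(out.size(), square_t{thrust::raw_pointer_cast(out.data())});

  // ForEachCopyN applies the operator to a copy of each of the n elements
  cub::DeviceFor::ForEachCopyN(out.begin(), out.size(),
                               odd_count_t{thrust::raw_pointer_cast(count.data())});

  cudaDeviceSynchronize();
}
```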

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@gevtushenko gevtushenko requested review from a team as code owners January 19, 2024 19:50
Collaborator

@elstehle elstehle left a comment


Nice work 👍 I'm about one third through, but wanted to flush the few minor comments I have before signing out for the day.

cub/benchmarks/bench/for_each/base.cu (outdated, resolved)
cub/cub/agent/agent_for.cuh (outdated, resolved)
cub/docs/test_overview.rst (outdated, resolved)
Collaborator

@elstehle elstehle left a comment


Made my way through 🙂 Great work! 👏
Just a few more minor comments.

cub/cub/device/device_for.cuh (outdated, resolved)
cub/cub/device/device_for.cuh (outdated, resolved)
//! Overview
//! +++++++++++++++++++++++++++++++++++++++++++++
//!
//! Applies the function object ``op`` to each index in the provided shape
Collaborator


question: Can you shed some light on why we call this shape? I'm probably just lacking the relevant context; I would have found it more intuitive to refer to this as OffsetT num_indexes or something similar.

Collaborator Author


I don't like the notion of offset in this context, because it implies an offset into something. The shape terminology comes from P2300. The idea behind it is that we might extend the shape to be multidimensional at some point, potentially providing forward-progress annotations, so that we could enable shared memory.

//! CUDA stream to launch kernels within. Default stream is `0`.
template <class ShapeT, class OpT>
CUB_RUNTIME_FUNCTION static cudaError_t
Bulk(void* d_temp_storage, size_t& temp_storage_bytes, ShapeT shape, OpT op, cudaStream_t stream = {})
Collaborator


question: I was wondering if we're losing some flexibility by not providing an interface that would take a ShapeT first_index. But, I guess, the user could modify their operator to have a member of ShapeT first_index and add it as offset within their operator() member function. Just want to confirm we're not losing some performance optimization that we could apply for such a scenario.

Collaborator Author


This is a good question! Implementing this functionality would definitely change the generated SASS. Having a different overload that takes first_index, on the other hand, would preserve SASS for existing use cases. A new overload could be added when we support problem sizes that do not fit into the maximal grid size.
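The workaround mentioned above (baking the starting index into the user-provided operator) can be sketched roughly as follows; offset_op_t is a hypothetical example, not code from the PR:

```cuda
// Hypothetical sketch: emulate a `first_index` parameter by storing the
// offset as a member of the operator, rather than passing it to Bulk.
struct offset_op_t
{
  int first_index; // offset applied inside operator() instead of by the API
  int* d_out;

  __device__ void operator()(int i)
  {
    // Behaves as if the launch iterated over [first_index, first_index + n)
    d_out[i] = first_index + i;
  }
};

// Usage (assuming d_out points to device memory of at least n ints):
//   cub::DeviceFor::Bulk(n, offset_op_t{first_index, d_out});
```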

cub/cub/device/device_for.cuh (outdated, resolved)
cub/test/catch2_test_device_bulk.cu (resolved)
cub/test/catch2_test_device_bulk.cu (resolved)
cub/test/catch2_test_device_for.cu (outdated, resolved)
cub/test/catch2_test_device_for_api.cu (resolved)
cub/test/catch2_test_device_for_copy.cu (outdated, resolved)
CUB_RUNTIME_FUNCTION static cudaError_t for_each_n(
InputIteratorT first, OffsetT num_items, OpT op, cudaStream_t stream, ::cuda::std::true_type /* vectorize */)
{
auto unwrapped_first = THRUST_NS_QUALIFIER::raw_pointer_cast(&*first);
Collaborator

@jrhemstad jrhemstad Jan 22, 2024


Minor suggestion: Should this use cuda::std::addressof?

Collaborator Author


What happens here is:

  thrust::device_vector<int> vec(10);
  thrust::device_vector<int>::iterator begin = vec.begin();
  thrust::device_reference<int> thrust_ref = *begin;
  thrust::device_ptr<int> thrust_ptr = &thrust_ref;
  int* actual_ptr = thrust::raw_pointer_cast(thrust_ptr);

There's an actual operator& that we need to invoke here, as opposed to taking the address of the thrust::device_reference object itself.

Collaborator

@miscco miscco left a comment


Great work!

Some minor nits

thrust/thrust/system/cuda/detail/for_each.h (outdated, resolved)
cub/benchmarks/bench/for_each/base.cu (outdated, resolved)
cub/cub/device/dispatch/dispatch_for.cuh (outdated, resolved)
cub/cub/device/dispatch/dispatch_for.cuh (outdated, resolved)
cub/cub/device/dispatch/dispatch_for.cuh (outdated, resolved)
cub/cub/device/dispatch/dispatch_for.cuh (resolved)
cub/test/catch2_test_device_for.cu (outdated, resolved)
// check for out-of-bounds access here.
if (i != partially_filled_vector_id)
{ // Case of fully filled vector
const vector_t vec = *reinterpret_cast<const vector_t*>(input + vec_size * i);
Collaborator


Screams in aliasing rule.

No change requested

Collaborator Author


UB in CUB stands for Undefined Behavior :)
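For context, one standard-conforming way to express the vectorized load discussed above, should the aliasing ever become an issue in practice, is to go through memcpy, which compilers typically fold into the same single vector load when the alignment is statically known. This is a hedged sketch, not the code in the PR; vector_t stands in for a type like uchar4:

```cuda
#include <cstring>

// Strict-aliasing-safe alternative to
//   const vector_t vec = *reinterpret_cast<const vector_t*>(ptr);
// memcpy into a local of the vector type instead of casting the pointer.
template <class vector_t, class T>
__host__ __device__ vector_t load_vector(const T* ptr)
{
  vector_t vec;
  // Usually compiled down to one vector load when alignment is provable
  memcpy(&vec, ptr, sizeof(vector_t));
  return vec;
}
```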

cub/cub/device/device_for.cuh (outdated, resolved)
@gevtushenko gevtushenko merged commit b7d4228 into NVIDIA:main Jan 25, 2024
538 checks passed

Successfully merging this pull request may close these issues.

Port thrust::cuda_cub::parallel_for to CUB
4 participants