
Introduce CUB ForEach algorithms #1302

Merged
gevtushenko merged 21 commits into NVIDIA:main on Jan 25, 2024

Conversation

gevtushenko
Collaborator

Description

closes #1231

This PR introduces a family of ForEach algorithms into CUB.
Apart from ForEach and ForEachN, the PR provides *Copy versions of the algorithms that vectorize loads, providing about 15% better performance on U8. There's machinery that allows vectorization to be enabled automatically for the non-copy versions, but it's disabled for now since it leads to generating twice as many kernels (for aligned and unaligned pointers). There's also a new feature that uses the occupancy calculator to determine the block size leading to maximal occupancy. This feature is currently disabled as well. Some follow-up work on tuning ForEach would allow us to discover scenarios where a dynamic block size is beneficial.

As of now, there's no difference in generated SASS.
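As a minimal sketch of how the new entry points can be invoked (assuming the overloads that allocate temporary storage internally; the functors and sizes below are made up for illustration, not taken from the PR):

```cuda
#include <cub/device/device_for.cuh>
#include <thrust/device_vector.h>

struct square_t
{
  int* d_out;
  __device__ void operator()(int i) { d_out[i] = i * i; }
};

struct odd_count_t
{
  int* d_count;
  // The *Copy versions pass a copy of each element to the operator,
  // which is what allows the implementation to vectorize the loads
  __device__ void operator()(int item)
  {
    if (item % 2)
    {
      atomicAdd(d_count, 1);
    }
  }
};

int main()
{
  thrust::device_vector<int> out(16);
  thrust::device_vector<int> count(1);

  // Bulk applies the operator to each index in [0, shape)
  cub::DeviceFor::Bulk(out.size(), square_t{thrust::raw_pointer_cast(out.data())});

  // ForEachCopyN applies the operator to a copy of each of the n elements
  cub::DeviceFor::ForEachCopyN(out.begin(), out.size(),
                               odd_count_t{thrust::raw_pointer_cast(count.data())});

  cudaDeviceSynchronize();
}
```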

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@gevtushenko gevtushenko requested review from a team as code owners January 19, 2024 19:50
Collaborator

@elstehle elstehle left a comment


Nice work 👍 I'm about one third through, but wanted to flush the few minor comments I have before signing out for the day.

cub/benchmarks/bench/for_each/base.cu (outdated, resolved)
cub/cub/agent/agent_for.cuh (outdated, resolved)
cub/docs/test_overview.rst (outdated, resolved)
Collaborator

@elstehle elstehle left a comment


Made my way through 🙂 Great work! 👏
Just a few more minor comments.

cub/cub/device/device_for.cuh (outdated, resolved)
cub/cub/device/device_for.cuh (outdated, resolved)
//! Overview
//! +++++++++++++++++++++++++++++++++++++++++++++
//!
//! Applies the function object ``op`` to each index in the provided shape
Collaborator


question: Can you shed some light on why we call this shape? I'm probably just lacking the relevant context; I would have found it more intuitive to refer to this as OffsetT num_indexes or something similar.

Collaborator Author


I don't like the notion of offset in this context, because it implies an offset into something. The shape terminology comes from P2300. The idea behind it is that we might extend the shape to be multidimensional at some point, potentially providing forward-progress annotations, so that we could enable shared memory.

//! CUDA stream to launch kernels within. Default stream is `0`.
template <class ShapeT, class OpT>
CUB_RUNTIME_FUNCTION static cudaError_t
Bulk(void* d_temp_storage, size_t& temp_storage_bytes, ShapeT shape, OpT op, cudaStream_t stream = {})
Collaborator


question: I was wondering if we're losing some flexibility by not providing an interface that would take a ShapeT first_index. But, I guess, the user could modify their operator to have a member of ShapeT first_index and add it as offset within their operator() member function. Just want to confirm we're not losing some performance optimization that we could apply for such a scenario.

Collaborator Author


This is a good question! Implementing this functionality would definitely change the generated SASS. Having a different overload that takes first_index, on the other hand, would preserve SASS for existing use cases. A new overload could be added when we support problem sizes that do not fit into the maximal grid size.
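The workaround mentioned above (baking the starting index into the user-provided operator) can be sketched roughly as follows; offset_op_t is a hypothetical example, not code from the PR:

```cuda
// Hypothetical sketch: emulate a `first_index` parameter by storing the
// offset as a member of the operator, rather than passing it to Bulk.
struct offset_op_t
{
  int first_index; // offset applied inside operator() instead of by the API
  int* d_out;

  __device__ void operator()(int i)
  {
    // Behaves as if the launch iterated over [first_index, first_index + n)
    d_out[i] = first_index + i;
  }
};

// Usage (assuming d_out points to device memory of at least n ints):
//   cub::DeviceFor::Bulk(n, offset_op_t{first_index, d_out});
```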

cub/cub/device/device_for.cuh (outdated, resolved)
cub/test/catch2_test_device_bulk.cu (resolved)
cub/test/catch2_test_device_bulk.cu (resolved)
cub/test/catch2_test_device_for.cu (outdated, resolved)
cub/test/catch2_test_device_for_api.cu (resolved)
cub/test/catch2_test_device_for_copy.cu (outdated, resolved)
CUB_RUNTIME_FUNCTION static cudaError_t for_each_n(
InputIteratorT first, OffsetT num_items, OpT op, cudaStream_t stream, ::cuda::std::true_type /* vectorize */)
{
auto unwrapped_first = THRUST_NS_QUALIFIER::raw_pointer_cast(&*first);
Collaborator

@jrhemstad jrhemstad Jan 22, 2024


Minor suggestion: Should this use cuda::std::addressof?

Collaborator Author


What happens here is:

  thrust::device_vector<int> vec(10);
  thrust::device_vector<int>::iterator begin = vec.begin();
  thrust::device_reference<int> thrust_ref = *begin;
  thrust::device_ptr<int> thrust_ptr = &thrust_ref;
  int* actual_ptr = thrust::raw_pointer_cast(thrust_ptr);

There's an actual operator& that we need to invoke here, as opposed to taking the address of the thrust::device_reference object itself.

Collaborator

@miscco miscco left a comment


Great work!

Some minor nits

thrust/thrust/system/cuda/detail/for_each.h (outdated, resolved)
cub/benchmarks/bench/for_each/base.cu (outdated, resolved)
cub/cub/device/dispatch/dispatch_for.cuh (outdated, resolved)
cub/cub/device/dispatch/dispatch_for.cuh (outdated, resolved)
cub/cub/device/dispatch/dispatch_for.cuh (outdated, resolved)
cub/cub/device/dispatch/dispatch_for.cuh (resolved)
cub/test/catch2_test_device_for.cu (outdated, resolved)
// check for out-of-bounds access here.
if (i != partially_filled_vector_id)
{ // Case of fully filled vector
const vector_t vec = *reinterpret_cast<const vector_t*>(input + vec_size * i);
Collaborator


Screams in aliasing rule.

No change requested

Collaborator Author


UB in CUB stands for Undefined Behavior :)
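For context, one standard-conforming way to express the vectorized load discussed above, should the aliasing ever become an issue in practice, is to go through memcpy, which compilers typically fold into the same single vector load when the alignment is statically known. This is a hedged sketch, not the code in the PR; vector_t stands in for a type like uchar4:

```cuda
#include <cstring>

// Strict-aliasing-safe alternative to
//   const vector_t vec = *reinterpret_cast<const vector_t*>(ptr);
// memcpy into a local of the vector type instead of casting the pointer.
template <class vector_t, class T>
__host__ __device__ vector_t load_vector(const T* ptr)
{
  vector_t vec;
  // Usually compiled down to one vector load when alignment is provable
  memcpy(&vec, ptr, sizeof(vector_t));
  return vec;
}
```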

cub/cub/device/device_for.cuh (outdated, resolved)
@gevtushenko gevtushenko merged commit b7d4228 into NVIDIA:main Jan 25, 2024
538 checks passed

Successfully merging this pull request may close these issues.

Port thrust::cuda_cub::parallel_for to CUB
4 participants