Introduce CUB ForEach algorithms #1302
Conversation
Nice work 👍 I'm about one third of the way through, but wanted to flush the few minor comments I have before signing out for the day.
Made my way through 🙂 Great work! 👏
Just a few more minor comments.
//! Overview
//! +++++++++++++++++++++++++++++++++++++++++++++
//!
//! Applies the function object ``op`` to each index in the provided shape
question: Can you shed some light on why we call this shape? I'm probably just lacking the relevant context; I would have found it more intuitive to refer to this as OffsetT num_indexes or something similar.
I don't like the notion of an offset in this context, because it implies an offset into something. The shape terminology comes from P2300. The idea behind it is that we might extend the shape to be multidimensional at some point, potentially providing forward-progress annotations, so that we could enable shared memory.
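To illustrate, a minimal usage sketch of the shape-based interface (hedged: it assumes the algorithm lives in cub::DeviceFor, matching the Bulk signature quoted just below, and square_op is a made-up functor):

#include <cub/device/device_for.cuh>

// Made-up functor: receives each index in [0, shape).
struct square_op
{
  int* d_out;
  __device__ void operator()(int i) const { d_out[i] = i * i; }
};

cudaError_t run_bulk(int* d_out, int shape, cudaStream_t stream = 0)
{
  void* d_temp_storage      = nullptr;
  size_t temp_storage_bytes = 0;

  // First call computes the temporary storage requirement.
  cub::DeviceFor::Bulk(d_temp_storage, temp_storage_bytes, shape, square_op{d_out}, stream);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);

  // Second call launches the kernel: op(0), op(1), ..., op(shape - 1).
  cudaError_t status =
    cub::DeviceFor::Bulk(d_temp_storage, temp_storage_bytes, shape, square_op{d_out}, stream);
  cudaFree(d_temp_storage);
  return status;
}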
//! CUDA stream to launch kernels within. Default stream is `0`.
template <class ShapeT, class OpT>
CUB_RUNTIME_FUNCTION static cudaError_t
Bulk(void* d_temp_storage, size_t& temp_storage_bytes, ShapeT shape, OpT op, cudaStream_t stream = {})
question: I was wondering if we're losing some flexibility by not providing an interface that also takes a ShapeT first_index. But, I guess, the user could give their operator a ShapeT first_index member and add it as an offset within their operator() member function. I just want to confirm we're not losing some performance optimization that we could apply for such a scenario.
This is a good question! Implementing this functionality would definitely change the SASS. Having a different overload that takes first_index, on the other hand, would preserve the SASS for existing use cases. A new overload could be added when we support problem sizes that do not fit into the maximum grid size.
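For concreteness, a hedged sketch of the operator-side workaround mentioned above (offset_op is illustrative and not part of the PR):

// Illustrative workaround: bake the starting index into the function object
// rather than passing a separate first_index to Bulk.
struct offset_op
{
  int first_index; // start of the sub-range this launch covers
  int* d_out;

  __device__ void operator()(int i) const
  {
    const int global = first_index + i; // shift the dense [0, shape) index
    d_out[global] = global;
  }
};

// Covers indices [first, first + shape) of the larger problem:
// cub::DeviceFor::Bulk(d_temp_storage, temp_storage_bytes, shape,
//                      offset_op{first, d_out}, stream);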
CUB_RUNTIME_FUNCTION static cudaError_t for_each_n(
  InputIteratorT first, OffsetT num_items, OpT op, cudaStream_t stream, ::cuda::std::true_type /* vectorize */)
{
  auto unwrapped_first = THRUST_NS_QUALIFIER::raw_pointer_cast(&*first);
Minor suggestion: Should this use cuda::std::addressof?
What happens here is:
thrust::device_vector<int> vec(10);
thrust::device_vector<int>::iterator begin = vec.begin();
thrust::device_reference<int> thrust_ref = *begin;        // dereferencing yields a proxy reference
thrust::device_ptr<int> thrust_ptr = &thrust_ref;         // its operator& returns a fancy pointer
int* actual_ptr = thrust::raw_pointer_cast(thrust_ptr);   // unwrap to the raw device pointer
There's an actual operator& that we need to invoke here, as opposed to taking the address of the thrust::device_reference itself.
Great work!
Some minor nits
// check for out-of-bounds access here.
if (i != partially_filled_vector_id)
{ // Case of fully filled vector
  const vector_t vec = *reinterpret_cast<const vector_t*>(input + vec_size * i);
Screams in aliasing rule.
No change requested
UB in CUB stands for Undefined Behavior :)
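For reference, a strict-aliasing-safe alternative would be a memcpy-based load; this is purely illustrative (load_vectorized and its parameters mirror the snippet above and are not part of the PR), and compilers typically lower it to the same vectorized load when alignment is provable:

#include <cstring>

// Illustrative: a byte-wise copy into vector_t avoids the aliasing UB of the
// reinterpret_cast load above.
template <class vector_t, class T>
__device__ vector_t load_vectorized(const T* input, int vec_size, int i)
{
  vector_t vec;
  memcpy(&vec, input + vec_size * i, sizeof(vector_t));
  return vec;
}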
Description
closes #1231
This PR introduces a family of ForEach algorithms into CUB.
Apart from ForEach and ForEachN, the PR provides *Copy versions of the algorithms that vectorize loads, providing about 15% better performance on U8. There's machinery that allows vectorization to be enabled automatically for the non-copy versions, but it's disabled for now since it leads to generating twice as many kernels (for aligned and unaligned pointers). There's also a new feature that uses the occupancy calculator to determine the block size leading to maximal occupancy. This feature is currently disabled as well. Some follow-up work on tuning for-each would allow us to discover scenarios where a dynamic block size is beneficial.

As of now, there's no difference in generated SASS.
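For orientation, a hedged usage sketch of the new interface (it assumes the algorithms live in cub::DeviceFor and follow the same two-phase temp-storage convention as Bulk above; increment is a made-up functor):

#include <cub/device/device_for.cuh>
#include <thrust/device_vector.h>

// Made-up functor: the non-copy version passes a reference to each element.
struct increment
{
  __device__ void operator()(int& x) const { ++x; }
};

void example(cudaStream_t stream = 0)
{
  thrust::device_vector<int> vec(1024, 0);
  int num_items = static_cast<int>(vec.size());

  void* d_temp_storage      = nullptr;
  size_t temp_storage_bytes = 0;
  cub::DeviceFor::ForEachN(d_temp_storage, temp_storage_bytes, vec.begin(), num_items, increment{}, stream);
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceFor::ForEachN(d_temp_storage, temp_storage_bytes, vec.begin(), num_items, increment{}, stream);
  cudaFree(d_temp_storage);

  // The *Copy variants (e.g. ForEachCopyN) load elements, potentially
  // vectorized, and pass them to op by value; that is where the ~15% win
  // on U8 quoted above comes from.
}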
Checklist