Add set retrieve #442
auto constexpr flushing_tile_size = cuco::detail::warp_size() / window_size;
// random choice to tune
auto constexpr flushing_buffer_size = 2 * flushing_tile_size;
I'm curious. Why did you choose that particular size?
No particular reason. Tested with 1, 2, 3 and 4 and there is no big difference between those options.
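To make the sizes under discussion concrete, here is a minimal host-side sketch of the arithmetic. It assumes a warp size of 32; `assumed_warp_size` and the `factor` parameter are hypothetical names introduced for illustration and are not part of cuco.

```cpp
#include <cstdint>

// Assumption: 32 threads per warp, standing in for cuco::detail::warp_size().
constexpr std::int32_t assumed_warp_size = 32;

// Mirrors `warp_size() / window_size` from the snippet under review.
constexpr std::int32_t flushing_tile_size(std::int32_t window_size)
{
  return assumed_warp_size / window_size;
}

// Mirrors `2 * flushing_tile_size`; `factor` is the tuning knob that was
// tried with 1, 2, 3 and 4 according to the reply above.
constexpr std::int32_t flushing_buffer_size(std::int32_t window_size, std::int32_t factor = 2)
{
  return factor * flushing_tile_size(window_size);
}

static_assert(flushing_tile_size(1) == 32, "one slot per window");
static_assert(flushing_buffer_size(1) == 64, "factor 2, window size 1");
static_assert(flushing_buffer_size(4) == 16, "factor 2, window size 4");
```

With these assumptions, a larger window size shrinks both the flushing tile and the buffer, while the factor only scales the buffer.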
auto const found = ref.find(tile, *(first + idx));
#if defined(CUCO_HAS_CG_INVOKE_ONE)
if (found != ref.end()) {
cg::invoke_one(tile, [&]() {
question: invoke_one is logically collective over the group defined by tile, and the hardware could select any thread in [0, tile.num_threads()) to execute the functor. However, it seems to me that not all threads in tile could reach this line (because both found and active_flag are divergent, to my understanding). Is this a problem?
Thanks for explaining your concern. The tile-based ref.find(tile, ...) guarantees that all threads of the same tile have the same found. active_flag could diverge between different tiles but not for threads of the same tile.
Thanks!
__shared__ Size offset;

#if defined(CUCO_HAS_CG_INVOKE_ONE)
cooperative_groups::invoke_one(
block, [&]() { offset = counter->fetch_add(buffer_size, cuda::std::memory_order_relaxed); });
#else
if (i == 0) { offset = counter->fetch_add(buffer_size, cuda::std::memory_order_relaxed); }
#endif
block.sync();
question: In the CG invoke_one case, is this better written without the explicit __shared__ offset, as:
#if defined(CUCO_HAS_CG_INVOKE_ONE)
Size offset = cg::invoke_one_broadcast(block, [&] { return counter->fetch_add(buffer_size, cuda::std::memory_order_relaxed); });
#else
__shared__ Size offset;
if (i == 0) { offset = counter->fetch_add(buffer_size, cuda::std::memory_order_relaxed); }
block.sync();
#endif
?
I see your point. cg::invoke_one_broadcast only works for tiles, not thread blocks, so it doesn't work in this particular case. However, your suggestion is valid for numerous other cases in cuco, and I will make a PR to update them all. 👍 Love it.
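The reservation pattern discussed in this thread (one participant claims a contiguous output range with a single relaxed fetch_add, then the whole group writes into it) can be sketched on the host as follows. This is only an analogue using std::atomic, not the kernel code from the PR; `slot_range`, `reserve`, and `flush` are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of output slots owned by one caller.
struct slot_range {
  std::size_t begin;
  std::size_t end;
};

// Claims `buffer_size` contiguous slots. fetch_add returns the previous
// counter value, so concurrent callers always receive disjoint ranges.
slot_range reserve(std::atomic<std::size_t>& counter, std::size_t buffer_size)
{
  assert(buffer_size > 0);
  auto const offset = counter.fetch_add(buffer_size, std::memory_order_relaxed);
  return {offset, offset + buffer_size};
}

// Copies a locally staged buffer into its reserved range of `output`.
void flush(std::vector<int> const& buffer,
           std::vector<int>& output,
           std::atomic<std::size_t>& counter)
{
  if (buffer.empty()) { return; }
  auto const range = reserve(counter, buffer.size());
  assert(range.end <= output.size());
  std::copy(buffer.begin(),
            buffer.end(),
            output.begin() + static_cast<std::ptrdiff_t>(range.begin));
}
```

In the kernel, the value returned by fetch_add still has to reach every thread of the block, which is what the `__shared__` variable plus `block.sync()` do; the host sketch has no counterpart for that step because a single caller performs both the reservation and the copy.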
 *
 * @note Behavior is undefined if the size of the output range exceeds
 * `std::distance(output_begin, output_end)`.
 * @note Behavior is undefined if the given key has multiple matches in the set.
Is it undefined or do we return the first matching occurrence of the key?
It's the first element for scalar probing but undefined behavior for CG-based algorithms, so "undefined behavior" is the accurate description.
ProbeHash const& probe_hash,
cuda_stream_ref stream) const
{
CUCO_FAIL("Unsupported code path: retrieve_async with custom hash/equal");
We should add a note about this in the inline docs
Awesome work! Thanks!
This PR adds host-bulk set retrieve APIs. For now, they use device find APIs to get matches since the benefit of creating a dedicated device retrieve is unclear. It also adds a placeholder for an overload of retrieve that takes custom key_equal and hasher.
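As a rough illustration of "retrieve built on find", the sketch below models the idea on the host with std::unordered_set standing in for the cuco set. `retrieve_via_find` is a hypothetical helper and not the API added by this PR; in particular, the sequential loop writes matches in query order, which a parallel device implementation may not do.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Copies every query key that is present in `set` into `output` and
// returns the number of matches written.
template <typename Key>
std::size_t retrieve_via_find(std::unordered_set<Key> const& set,
                              std::vector<Key> const& queries,
                              std::vector<Key>& output)
{
  // The output must be able to hold one match per query key.
  assert(output.size() >= queries.size());
  std::size_t count = 0;
  for (auto const& key : queries) {
    auto const found = set.find(key);  // plain lookup, as in the PR description
    if (found != set.end()) { output[count++] = *found; }
  }
  return count;
}
```

For example, retrieving the queries {2, 5, 3, 7} from the set {1, 2, 3} writes 2 and 3 to the output and returns 2.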