Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add set retrieve #442

Merged
merged 22 commits into from
Mar 19, 2024
Merged

Add set retrieve #442

merged 22 commits into from
Mar 19, 2024

Conversation

PointKernel
Copy link
Member

@PointKernel PointKernel commented Feb 25, 2024

This PR adds host-bulk set retrieve APIs. For now, they use device find APIs to get matches since the benefit of creating a dedicated device retrieve is unclear.

It also adds a placeholder for an overload of retrieve that takes custom key_equal and hasher.

@PointKernel PointKernel added type: feature request New feature request helps: rapids Helps or needed by RAPIDS topic: static_set Issue related to the static_set labels Feb 25, 2024
@PointKernel PointKernel added the In Progress Currently a work in progress label Mar 2, 2024
@PointKernel PointKernel marked this pull request as ready for review March 12, 2024 18:44
@PointKernel PointKernel added Needs Review Awaiting reviews before merging and removed In Progress Currently a work in progress labels Mar 12, 2024
include/cuco/detail/static_set/kernels.cuh Outdated Show resolved Hide resolved
include/cuco/detail/static_set/kernels.cuh Outdated Show resolved Hide resolved

auto constexpr flushing_tile_size = cuco::detail::warp_size() / window_size;
// random choice to tune
auto constexpr flushing_buffer_size = 2 * flushing_tile_size;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious. Why did you choose that particular size?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular reason. Tested with 1, 2, 3 and 4 and there is no big difference between those options.

include/cuco/detail/static_set/kernels.cuh Outdated Show resolved Hide resolved
include/cuco/detail/static_set/kernels.cuh Outdated Show resolved Hide resolved
include/cuco/detail/static_set/kernels.cuh Outdated Show resolved Hide resolved
include/cuco/detail/static_set/static_set.inl Outdated Show resolved Hide resolved
include/cuco/detail/static_set/static_set.inl Outdated Show resolved Hide resolved
include/cuco/detail/static_set/kernels.cuh Show resolved Hide resolved
include/cuco/detail/static_set/kernels.cuh Outdated Show resolved Hide resolved
auto const found = ref.find(tile, *(first + idx));
#if defined(CUCO_HAS_CG_INVOKE_ONE)
if (found != ref.end()) {
cg::invoke_one(tile, [&]() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: invoke_one is logically collective over the group defined by tile and the hardware could select any thread in [0, tile.num_threads()) to execute the functor. However, it seems to me that not all threads in tile could reach this line (because both found and active_flag are divergent to my understanding). Is this a problem?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining your concern. The tile-based ref.find(tile, ...) guarantees that all threads of the same tile have the same found. active_flag could diverge between different tiles but not for threads of the same tile.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines +247 to +255
__shared__ Size offset;

#if defined(CUCO_HAS_CG_INVOKE_ONE)
cooperative_groups::invoke_one(
block, [&]() { offset = counter->fetch_add(buffer_size, cuda::std::memory_order_relaxed); });
#else
if (i == 0) { offset = counter->fetch_add(buffer_size, cuda::std::memory_order_relaxed); }
#endif
block.sync();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: In the CG invoke_one case, is this better written without the explicit __shared__ offset as:

#if defined(CUCO_HAS_CG_INVOKE_ONE)
  Size offset = cg::invoke_one_broadcast(block, [&] { return counter->fetch_add(buffer_size, cuda::std::memory_order_relaxed) });
#else
  __shared__ Size offset;
  if (i == 0) { offset = counter->fetch_add(buffer_size, cuda::std::memory_order_relaxed); }
  block.sync()
#endif

?

Copy link
Member Author

@PointKernel PointKernel Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. cg::invoke_one_broadcast only works for tiles but not thread block thus it doesn't work in this particular case. However, your suggestion is valid for numerous other cases in cuco and I will make a PR to update them all. 👍 Love it.

@PointKernel PointKernel requested a review from sleeepyjack March 14, 2024 23:01
*
* @note Behavior is undefined if the size of the output range exceeds
* `std::distance(output_begin, output_end)`.
* @note Behavior is undefined if the given key has multiple matches in the set.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it undefined or do we return the first matching occurrence of the key?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the first element for scalar probing but undefined behavior for CG-based algorithms so undefined behavior is accurate.

ProbeHash const& probe_hash,
cuda_stream_ref stream) const
{
CUCO_FAIL("Unsupported code path: retrieve_async with custom hash/equal");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a note about this in the inline docs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@sleeepyjack sleeepyjack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work! Thanks!

@PointKernel PointKernel merged commit dd51a21 into NVIDIA:dev Mar 19, 2024
15 checks passed
@PointKernel PointKernel deleted the add-set-retrieve branch March 19, 2024 16:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
helps: rapids Helps or needed by RAPIDS Needs Review Awaiting reviews before merging topic: static_set Issue related to the static_set type: feature request New feature request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants