Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add device subsets example #346

Merged
merged 34 commits into from
Sep 26, 2023
Merged
Show file tree
Hide file tree
Changes from 17 commits
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
f6f37fa
Add device subsets example
PointKernel Jul 3, 2023
d35db89
Merge remote-tracking branch 'upstream/dev' into subset-example
PointKernel Aug 5, 2023
482a14e
Cleanups: fix typo, remove printf etc
PointKernel Aug 5, 2023
8e559e0
Remove unrelated code
PointKernel Aug 5, 2023
40b9b15
Merge remote-tracking branch 'upstream/dev' into subset-example
PointKernel Aug 8, 2023
fda8f88
Add default extent template parameter
PointKernel Aug 8, 2023
2c92f84
Add missing headers
PointKernel Aug 9, 2023
233c668
Add missing header + temporarily disable asserts
PointKernel Aug 9, 2023
4eb25c9
Update subset example
PointKernel Aug 9, 2023
b68761d
Merge remote-tracking branch 'upstream/dev' into subset-example
PointKernel Aug 10, 2023
56b5dc3
Update example
PointKernel Aug 10, 2023
eff6faa
Resolve merging conflict
PointKernel Aug 10, 2023
871424a
Add default parameters to aow_storage for convenience
PointKernel Aug 10, 2023
635988b
Add storage initialized_async
PointKernel Aug 15, 2023
393ee3b
Update subset example
PointKernel Aug 15, 2023
70c3df7
Renaming
PointKernel Aug 15, 2023
085d1bb
Minor cleanups
PointKernel Aug 16, 2023
755db26
Add docs and comments
PointKernel Aug 17, 2023
8c746d7
Merge remote-tracking branch 'upstream/dev' into subset-example
PointKernel Sep 1, 2023
cce72b4
Merge remote-tracking branch 'upstream/dev' into subset-example
PointKernel Sep 6, 2023
b8028f4
Remove CGSize from window_extent
PointKernel Sep 6, 2023
d913720
Add more headers
PointKernel Sep 6, 2023
02eabf6
Temporarily disable window extent checks in open addressing ref base …
PointKernel Sep 11, 2023
3d016f6
Remove window_size tparam from window_extent
sleeepyjack Sep 12, 2023
5d88ea7
Add operator-agnostic static_set_ref move ctor and with() helper func…
sleeepyjack Sep 13, 2023
2433c09
Update device subset example
sleeepyjack Sep 13, 2023
c46d6fe
Partially re-enable checks
sleeepyjack Sep 13, 2023
adc6a54
Merge remote-tracking branch 'upstream/dev' into subset-example
PointKernel Sep 22, 2023
770a2ad
Merge remote-tracking branch 'origin/subset-example' into subset-example
PointKernel Sep 25, 2023
c632a9f
Remove window_extent static check and use size_t in the example
PointKernel Sep 26, 2023
2fa3810
With function to static_map_ref
PointKernel Sep 26, 2023
9e29ea7
Add TODO reminder
PointKernel Sep 26, 2023
8c8c3c1
Clean up example code
PointKernel Sep 26, 2023
2e7658c
Move the sentinel
PointKernel Sep 26, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions examples/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,7 @@ endfunction(ConfigureExample)

ConfigureExample(STATIC_SET_HOST_BULK_EXAMPLE "${CMAKE_CURRENT_SOURCE_DIR}/static_set/host_bulk_example.cu")
ConfigureExample(STATIC_SET_DEVICE_REF_EXAMPLE "${CMAKE_CURRENT_SOURCE_DIR}/static_set/device_ref_example.cu")
ConfigureExample(STATIC_SET_DEVICE_SUBSETS_EXAMPLE "${CMAKE_CURRENT_SOURCE_DIR}/static_set/device_subsets_example.cu")
ConfigureExample(STATIC_MAP_HOST_BULK_EXAMPLE "${CMAKE_CURRENT_SOURCE_DIR}/static_map/host_bulk_example.cu")
ConfigureExample(STATIC_MAP_DEVICE_SIDE_EXAMPLE "${CMAKE_CURRENT_SOURCE_DIR}/static_map/device_view_example.cu")
ConfigureExample(STATIC_MAP_CUSTOM_TYPE_EXAMPLE "${CMAKE_CURRENT_SOURCE_DIR}/static_map/custom_type_example.cu")
Expand Down
154 changes: 154 additions & 0 deletions examples/static_set/device_subsets_example.cu
Original file line number Diff line number Diff line change
@@ -0,0 +1,154 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <cuco/static_set_ref.cuh>
#include <cuco/storage.cuh>

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>

#include <cooperative_groups.h>

#include <cuda/std/array>

#include <algorithm>
#include <cstddef>
#include <iostream>

auto constexpr cg_size = 8; ///< A CUDA Cooperative Group of 8 threads to handle each subset
auto constexpr window_size = 1; ///< TODO: how to explain window size (vector length) to users
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memory access granularity (which may impact perfomance depending on the size of the slot type)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still like referring to it as "items per thread" or "thread granularity" as it controls how many elements an individual thread processes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, items_per_thread itself is definitely a less abstractive name than window_size.

  cuco::storage<key_type, items_per_thread> { ... };

v.s.

  cuco::aow_storage<key_type, window_size> { ... };
  // or
  cuco::window_storage<key_type, window_size> { ... };

Actually, the former one is not bad at all.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only confusion that might occur is that items_per_thread in e.g. CUB refers to input items per thread, whilst our items_per_thread means slots per thread. Just a minor thing. I'm ok with it. thread_granularity would remove the items part which might be less confusing but also less descriptive. Meh.

auto constexpr N = 10; ///< Number of elements to insert and query

using key_type = int;
using probing_scheme_type =
cuco::experimental::linear_probing<cg_size, cuco::default_hash_function<key_type>>;
using storage_type = cuco::experimental::aow_storage<key_type, window_size>;
using storage_ref_type = typename storage_type::ref_type;
template <typename Operator>
using ref_type = cuco::experimental::static_set_ref<key_type,
cuda::thread_scope_device,
thrust::equal_to<key_type>,
probing_scheme_type,
storage_ref_type,
Operator>;

/// data to insert/query
__device__ constexpr std::array<key_type, N> data = {1, 3, 5, 7, 9, 11, 13, 15, 17, 19};
/// Empty slots are represented by reserved "sentinel" values. These values should be selected such
/// that they never occur in your input data.
key_type constexpr empty_key_sentinel = -1;

template <typename WindowT>
__global__ void initialize(WindowT* windows, std::size_t n, typename WindowT::value_type value)
{
using T = typename WindowT::value_type;

auto const loop_stride = gridDim.x * blockDim.x;
auto idx = blockDim.x * blockIdx.x + threadIdx.x;

while (idx < n) {
auto& window_slots = *(windows + idx);
#pragma unroll
for (auto& slot : window_slots) {
new (&slot) T{value};
}
idx += loop_stride;
}
}

// insert a set of keys into a hash set using one cooperative group for each task
template <typename Window, typename Size, typename Offset>
__global__ void insert(Window* windows, Size* sizes, Offset* offsets)
{
namespace cg = cooperative_groups;

auto const tile = cg::tiled_partition<cg_size>(cg::this_thread_block());
auto const idx = (blockDim.x * blockIdx.x + threadIdx.x) / cg_size;

auto set_ref = ref_type<cuco::experimental::insert_tag>{
cuco::empty_key<key_type>{-1}, {}, {}, storage_ref_type{sizes[idx], windows + offsets[idx]}};

// Each cooperative_groups inserts all elements in `data` into its own subset
for (int i = 0; i < N; i++) {
set_ref.insert(tile, data[i]);
}
}

// insert a set of keys into a hash set using one cooperative group for each task
template <typename Window, typename Size, typename Offset>
__global__ void find(Window* windows, Size* sizes, Offset* offsets)
{
namespace cg = cooperative_groups;

auto const tile = cg::tiled_partition<cg_size>(cg::this_thread_block());
auto const idx = (blockDim.x * blockIdx.x + threadIdx.x) / cg_size;

auto set_ref = ref_type<cuco::experimental::find_tag>{
cuco::empty_key<key_type>{-1}, {}, {}, storage_ref_type{sizes[idx], windows + offsets[idx]}};

__shared__ int result;
if (threadIdx.x == 0) { result = 0; }
__syncthreads();

for (int i = 0; i < N; i++) {
auto const found = set_ref.find(tile, data[i]);
// Record if the inserted data has been found
atomicOr(&result, *found != data[i]);
}
__syncthreads();

if (threadIdx.x == 0) {
if (result == 0) { printf("Success! Found all inserted elements.\n"); }
}
}

/**
* @file device_subsets_example.cu
* @brief Demonstrates usage of the static_set device-side APIs.
PointKernel marked this conversation as resolved.
Show resolved Hide resolved
*
* static_set provides a non-owning reference which can be used to interact with
* the container from within device code.
*/
int main()
{
// Number of subsets
auto constexpr num = 16;
// Sizes of the 16 subsets to be created on the device
auto constexpr subset_sizes =
std::array<std::size_t, num>{20, 20, 20, 20, 30, 30, 30, 30, 40, 40, 40, 40, 50, 50, 50, 50};

auto valid_sizes = std::vector<std::size_t>(num);
std::generate(valid_sizes.begin(), valid_sizes.end(), [&, n = 0]() mutable {
return cuco::experimental::make_window_extent<cg_size, window_size>(subset_sizes[n++]);
});

auto const d_sizes = thrust::device_vector<std::size_t>{valid_sizes};
auto d_offsets = thrust::device_vector<std::size_t>(num);
thrust::exclusive_scan(d_sizes.begin(), d_sizes.end(), d_offsets.begin());

auto const num_windows = thrust::reduce(valid_sizes.begin(), valid_sizes.end());
PointKernel marked this conversation as resolved.
Show resolved Hide resolved

// One allocation for all subsets
auto d_set_storage = storage_type{num_windows};
// Initializes the storage with the given sentinel
d_set_storage.initialize(empty_key_sentinel);

insert<<<1, 128>>>(d_set_storage.data(), d_sizes.data().get(), d_offsets.data().get());
find<<<1, 128>>>(d_set_storage.data(), d_sizes.data().get(), d_offsets.data().get());

return 0;
}
23 changes: 17 additions & 6 deletions include/cuco/aow_storage.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -16,10 +16,10 @@

#pragma once

#include <cuco/detail/storage/aow_storage_base.cuh>

#include <cuco/cuda_stream_ref.hpp>
#include <cuco/detail/storage/aow_storage_base.cuh>
#include <cuco/extent.cuh>
#include <cuco/utility/allocator.hpp>

#include <cuda/std/array>

Expand Down Expand Up @@ -47,7 +47,10 @@ class aow_storage_ref;
* @tparam Extent Type of extent denoting number of windows
* @tparam Allocator Type of allocator used for device storage (de)allocation
*/
template <typename T, int32_t WindowSize, typename Extent, typename Allocator>
template <typename T,
int32_t WindowSize,
typename Extent = cuco::experimental::extent<std::size_t>,
typename Allocator = cuco::cuda_allocator<cuco::experimental::window<T, WindowSize>>>
class aow_storage : public detail::aow_storage_base<T, WindowSize, Extent> {
public:
using base_type = detail::aow_storage_base<T, WindowSize, Extent>; ///< AoW base class type
Expand Down Expand Up @@ -78,7 +81,7 @@ class aow_storage : public detail::aow_storage_base<T, WindowSize, Extent> {
* @param size Number of windows to (de)allocate
* @param allocator Allocator used for (de)allocating device storage
*/
explicit constexpr aow_storage(Extent size, Allocator const& allocator) noexcept;
explicit constexpr aow_storage(Extent size, Allocator const& allocator = {}) noexcept;

aow_storage(aow_storage&&) = default; ///< Move constructor
/**
Expand Down Expand Up @@ -119,7 +122,15 @@ class aow_storage : public detail::aow_storage_base<T, WindowSize, Extent> {
* @param key Key to which all keys in `slots` are initialized
* @param stream Stream used for executing the kernel
*/
void initialize(value_type key, cuda_stream_ref stream) noexcept;
void initialize(value_type key, cuda_stream_ref stream = {}) noexcept;

/**
* @brief Asynchronously initializes each slot in the AoW storage to contain `key`.
*
* @param key Key to which all keys in `slots` are initialized
* @param stream Stream used for executing the kernel
*/
void initialize_async(value_type key, cuda_stream_ref stream = {}) noexcept;

private:
allocator_type allocator_; ///< Allocator used to (de)allocate windows
Expand All @@ -134,7 +145,7 @@ class aow_storage : public detail::aow_storage_base<T, WindowSize, Extent> {
* @tparam WindowSize Number of slots in each window
* @tparam Extent Type of extent denoting storage capacity
*/
template <typename T, int32_t WindowSize, typename Extent>
template <typename T, int32_t WindowSize, typename Extent = cuco::experimental::extent<std::size_t>>
class aow_storage_ref : public detail::aow_storage_base<T, WindowSize, Extent> {
public:
using base_type = detail::aow_storage_base<T, WindowSize, Extent>; ///< AoW base class type
Expand Down
8 changes: 2 additions & 6 deletions include/cuco/detail/open_addressing_impl.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -141,11 +141,7 @@ class open_addressing_impl {
*
* @param stream CUDA stream this operation is executed in
*/
void clear(cuda_stream_ref stream) noexcept
{
this->clear_async(stream);
stream.synchronize();
}
void clear(cuda_stream_ref stream) noexcept { storage_.initialize(empty_slot_sentinel_, stream); }

/**
* @brief Asynchronously erases all elements from the container. After this call, `size()` returns
Expand All @@ -155,7 +151,7 @@ class open_addressing_impl {
*/
void clear_async(cuda_stream_ref stream) noexcept
{
storage_.initialize(empty_slot_sentinel_, stream);
storage_.initialize_async(empty_slot_sentinel_, stream);
}

/**
Expand Down
13 changes: 7 additions & 6 deletions include/cuco/detail/open_addressing_ref_impl.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
#include <cuco/detail/equal_wrapper.cuh>
#include <cuco/extent.cuh>
#include <cuco/pair.cuh>
#include <cuco/probing_scheme.cuh>

#include <thrust/distance.h>
#include <thrust/pair.h>
Expand Down Expand Up @@ -63,12 +64,12 @@ class open_addressing_ref_impl {
ProbingScheme>,
"ProbingScheme must inherit from cuco::detail::probing_scheme_base");

static_assert(is_window_extent_v<typename StorageRef::extent_type>,
"Extent is not a valid cuco::window_extent");
static_assert(ProbingScheme::cg_size == StorageRef::extent_type::cg_size,
"Extent has incompatible CG size");
static_assert(StorageRef::window_size == StorageRef::extent_type::window_size,
"Extent has incompatible window size");
// static_assert(is_window_extent_v<typename StorageRef::extent_type>,
// "Extent is not a valid cuco::window_extent");
// static_assert(ProbingScheme::cg_size == StorageRef::extent_type::cg_size,
// "Extent has incompatible CG size");
// static_assert(StorageRef::window_size == StorageRef::extent_type::window_size,
// "Extent has incompatible window size");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How could we solve the issue where the sum of window_extents is not a window_extent itself?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Losing these checks isn't ideal. We could create a new window_extent from the sum using make_window_extent and pass that to the ctor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Losing these checks isn't ideal.

Agreed. That's the complicated part.

In general, when users have a pointer and a size on hand. Creating a ref should be as simple as:

auto ref = ref_type{&data, size};

That's what I'm trying to achieve for storage_ref in the subset example:

  auto set_ref = ref_type<cuco::experimental::find_tag>{
    cuco::empty_key<key_type>{-1}, {}, {}, storage_ref_type{sizes[idx], windows + offsets[idx]}};

Enabling those checks enforces users to invoke make_window_extent again over sizes[i] (Note sizes[i] is already the output of make_window_extent). Also, default template parameters are no longer valid either thus users need to specify them explicitly. The above code would turn into:

  using extent_type =
    decltype(make_window_extent<cg_size, window_size>(std::declval<cuco::experimental::extent<size_t>>()));
  auto set_ref = ref_type<cuco::experimental::find_tag>{
    cuco::empty_key<key_type>{-1},
    {},
    {},
    aow_storage_ref<key_type,
                    window_size,
                    extent_type>{make_valid_extent<cg_size, window_size>(sizes[idx]), windows + offsets[idx]}};

This is way more complex than needed.

One solution I can think of is to set the proper data type for sizes array, so instead of:

  auto valid_sizes = std::vector<std::size_t>(num);

Users should get the return type of make_window_extent first and then declare the array:

  using extent_type =
    decltype(make_window_extent<cg_size, window_size>(std::declval<cuco::experimental::extent<size_t>>()));
  auto valid_sizes = std::vector<extent_type>(num);

One thing I don't like here is the obscure way that users have to follow to set up the proper extent type. Isn't all those fiddlings around cg_size, window_size, and size_t, etc too complicated for people who just want a size? By all means, this is a doable workaround but doesn't solve the core problem that we are prohibiting users to create a data ref with a pointer and a size.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agree on the unnecessary complexity. It should be as simple as passing a pointer and a size (or a cuda::std::span once we have it).

Can we provide an additional ctor with signature storage_ref_type(ptr, size), which then internally constructs a window_extent from the size?

We can discuss this in today's dev sync.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which then internally constructs a window_extent from the size?

We lose the motivation of having a window_extent strong type in that way.


public:
using key_type = Key; ///< Key type
Expand Down
8 changes: 8 additions & 0 deletions include/cuco/detail/storage/aow_storage.inl
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,14 @@ aow_storage<T, WindowSize, Extent, Allocator>::ref() const noexcept
template <typename T, int32_t WindowSize, typename Extent, typename Allocator>
void aow_storage<T, WindowSize, Extent, Allocator>::initialize(value_type key,
cuda_stream_ref stream) noexcept
{
this->initialize_async(key, stream);
stream.synchronize();
}

template <typename T, int32_t WindowSize, typename Extent, typename Allocator>
void aow_storage<T, WindowSize, Extent, Allocator>::initialize_async(
value_type key, cuda_stream_ref stream) noexcept
{
auto constexpr stride = 4;
auto const grid_size = (this->num_windows() + stride * detail::CUCO_DEFAULT_BLOCK_SIZE - 1) /
Expand Down
1 change: 1 addition & 0 deletions include/cuco/detail/storage/storage.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,7 @@ class storage : StorageImpl::template impl<T, Extent, Allocator> {
using impl_type::capacity;
using impl_type::data;
using impl_type::initialize;
using impl_type::initialize_async;
using impl_type::num_windows;
using impl_type::ref;

Expand Down
3 changes: 3 additions & 0 deletions include/cuco/static_set_ref.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -18,8 +18,11 @@

#include <cuco/detail/equal_wrapper.cuh>
#include <cuco/detail/open_addressing_ref_impl.cuh>
#include <cuco/hash_functions.cuh>
#include <cuco/operator.hpp>
#include <cuco/probing_scheme.cuh>
#include <cuco/sentinel.cuh>
#include <cuco/storage.cuh>

#include <cuda/std/atomic>

Expand Down
1 change: 1 addition & 0 deletions include/cuco/storage.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@

namespace cuco {
namespace experimental {

/**
* @brief Public storage class.
*
Expand Down