Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor hash join with cuCollections multimap #8934

Merged
merged 75 commits into from
Nov 2, 2021
Merged
Show file tree
Hide file tree
Changes from 51 commits
Commits
Show all changes
75 commits
Select commit Hold shift + click to select a range
506e610
Add cuco dependency in CMake
PointKernel Jun 14, 2021
0bc5881
Add CUDF_GetCUCO CMake file
PointKernel Jun 14, 2021
04d588c
Add cuco CPM build options
PointKernel Jun 14, 2021
108a0aa
Add cuco include dir
PointKernel Jun 14, 2021
fea4331
Merge remote-tracking branch 'upstream/branch-21.08' into cuco-integr…
PointKernel Jun 14, 2021
9f8b68c
Merge remote-tracking branch 'upstream/branch-21.08' into cuco-integr…
PointKernel Jun 21, 2021
23d1ab1
Merge remote-tracking branch 'upstream/branch-21.08' into cuco-integr…
PointKernel Jun 22, 2021
720922a
Set cuco::static_multimap as default multimap
PointKernel Jun 22, 2021
c192e71
Refactor join APIs: take cuco multimap view as argument
PointKernel Jun 22, 2021
a3a92ea
Update docs
PointKernel Jun 22, 2021
d265477
Merge remote-tracking branch 'upstream/branch-21.08' into cuco-integr…
PointKernel Jun 23, 2021
0e84203
Merge remote-tracking branch 'upstream/branch-21.08' into cuco-integr…
PointKernel Jun 24, 2021
0837404
Merge remote-tracking branch 'upstream/branch-21.08' into cuco-integr…
PointKernel Jun 25, 2021
c3e67ad
Insert using cuco multimap
PointKernel Jun 28, 2021
8c36b6e
Merge remote-tracking branch 'upstream/branch-21.08' into cuco-integr…
PointKernel Jul 19, 2021
6376ca6
Add pair_equality callable
PointKernel Jul 20, 2021
8c5adb4
Optimize pair_equality callable
PointKernel Jul 22, 2021
9920ccb
Refactor build_hash_table and compute_join_output_size to use cuco mu…
PointKernel Jul 27, 2021
8c1e04e
Merge remote-tracking branch 'upstream/branch-21.10' into cuco-integr…
PointKernel Jul 27, 2021
3f5a444
Minor cleanups in common utils
PointKernel Jul 29, 2021
9e7662e
Use cuco host bulk function instead of cudf multimap
PointKernel Jul 29, 2021
627b669
Merge remote-tracking branch 'upstream/branch-21.10' into cuco-integr…
PointKernel Jul 29, 2021
7acb608
Refactor probe_join_hash_table and get_full_join_size to use cuco mul…
PointKernel Jul 29, 2021
16ff127
Cleanups: get rid of join kernels
PointKernel Jul 29, 2021
da2ca5d
Merge remote-tracking branch 'upstream/branch-21.10' into cuco-integr…
PointKernel Jul 30, 2021
7af07f8
Remove join kernels
PointKernel Jul 30, 2021
e16aa8b
Use JoinNoneValue as empty value sentinel
PointKernel Aug 3, 2021
2952b02
Code formatting
PointKernel Aug 3, 2021
bc0530e
More descriptive naming: row_contains_null instead of build_predicate
PointKernel Aug 3, 2021
31e24a6
Get rid of multimap unique_ptr
PointKernel Aug 3, 2021
29298a9
Use char as default alloactor data type
PointKernel Aug 3, 2021
7025ae6
Get rid of const reference
PointKernel Aug 4, 2021
a98f156
Pass stream to insert_if function
PointKernel Aug 5, 2021
a485cb7
Use make_transform_iterator instead of naive declarations
PointKernel Aug 5, 2021
115a5e3
Move pair equality functor to cuh header
PointKernel Aug 6, 2021
1a6e3cd
Merge remote-tracking branch 'upstream/branch-21.10' into cuco-integr…
PointKernel Aug 6, 2021
fe1757a
Fix a wrong logic in full_join
PointKernel Aug 6, 2021
cb591c8
Minor improvement: use constexpr
PointKernel Aug 6, 2021
5aacf19
Fix a bug in JoinDictionaryTest.InnerJoinNoNulls: sort before compare
PointKernel Aug 6, 2021
f379426
Sort before compare for all JoinDictionaryTest tests
PointKernel Aug 9, 2021
b2da418
Merge remote-tracking branch 'upstream/branch-21.10' into cuco-integr…
PointKernel Aug 9, 2021
d0d481d
Multimap takes stream as argument
PointKernel Aug 9, 2021
d49988a
Minor optimization of pair_equality
PointKernel Aug 10, 2021
ef18f18
Refactor generate_input_tables: enable key multiplicity control
PointKernel Aug 22, 2021
8d8741f
Fix a minor bug: stream instead of stream view
PointKernel Aug 22, 2021
993e1cc
Merge remote-tracking branch 'upstream/branch-21.10' into cuco-integr…
PointKernel Aug 24, 2021
9d9cd55
Fetch cuco static multimap branch
PointKernel Aug 24, 2021
31cd017
Remove redundant CUDF_GetCUCO.cmake
PointKernel Aug 24, 2021
46f5166
Update test_joining pytests: sort before compare
PointKernel Aug 24, 2021
57ac731
Use assert_join_results_equal instead of naive assert_eq
PointKernel Aug 25, 2021
05817b2
Merge remote-tracking branch 'upstream/branch-21.10' into cuco-integr…
PointKernel Aug 25, 2021
c5690e0
Update pytests: sort before compare
PointKernel Aug 25, 2021
8a00fd2
Corrections: use sort_values()
PointKernel Aug 25, 2021
1385d50
Minor corrections
PointKernel Aug 25, 2021
a317428
Use insert instead of insert_if if row_bitmask is null
PointKernel Aug 26, 2021
5cfc15e
Compare against nullptr to avoid implicit type conversion
PointKernel Aug 26, 2021
a8ddcf7
Add default cg size = 2
PointKernel Aug 30, 2021
290a49d
Merge remote-tracking branch 'upstream/branch-21.10' into cuco-integr…
PointKernel Aug 30, 2021
6165d3f
Avoid unnecessary row_bitmask build
PointKernel Aug 31, 2021
12618fb
Merge remote-tracking branch 'upstream/branch-21.10' into cuco-integr…
PointKernel Aug 31, 2021
dd5b177
Sort before assert_eq to avoid out-of-order comparison in pytests
PointKernel Sep 1, 2021
f6d3df4
Update cuco cmake: use rapids_cpm_find instead of CPMFindPackage
PointKernel Sep 1, 2021
d2252b6
Remove redundant if TARGET check in cuco cmake
PointKernel Sep 1, 2021
af85966
Merge remote-tracking branch 'upstream/branch-21.10' into cuco-integr…
PointKernel Sep 1, 2021
8eeaf75
Minor correction on comment
PointKernel Sep 2, 2021
7175e35
Minor improvement: insert_if predicate on the stencil sequence
PointKernel Sep 3, 2021
0af148b
Cleanups: remove unnecessary nullptr check + insert_if instead of ins…
PointKernel Sep 3, 2021
bc205e4
Merge remote-tracking branch 'upstream/branch-21.10' into cuco-integr…
PointKernel Sep 9, 2021
7ee81c8
Updates: proper variable names + rename _empty
PointKernel Sep 10, 2021
8cb0762
Minor update: double hashing moved to cuco::detail namespace
PointKernel Sep 17, 2021
ce16ad6
Merge remote-tracking branch 'upstream/branch-21.10' into cuco-integr…
PointKernel Sep 17, 2021
bb66362
Address review comments: add docs + use make_counting_transform_iterator
PointKernel Sep 20, 2021
df0da48
Minor updates + rename row_contains_null as row_is_valid
PointKernel Sep 20, 2021
2783b48
Merge remote-tracking branch 'upstream/branch-21.12' into cuco-integr…
PointKernel Nov 1, 2021
c49166e
Updates: fetching the latest cuco tag
PointKernel Nov 1, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
117 changes: 22 additions & 95 deletions cpp/benchmarks/join/generate_input_tables.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -41,46 +41,20 @@ __global__ static void init_curand(curandState* state, const int nstates)
template <typename key_type, typename size_type>
__global__ static void init_build_tbl(key_type* const build_tbl,
const size_type build_tbl_size,
const key_type rand_max,
const bool uniq_build_tbl_keys,
key_type* const lottery,
const size_type lottery_size,
const int multiplicity,
curandState* state,
const int num_states)
{
static_assert(std::is_signed<key_type>::value, "key_type needs to be signed for lottery to work");

const int start_idx = blockIdx.x * blockDim.x + threadIdx.x;
const key_type stride = blockDim.x * gridDim.x;
auto const start_idx = blockIdx.x * blockDim.x + threadIdx.x;
auto const stride = blockDim.x * gridDim.x;
assert(start_idx < num_states);

curandState localState = state[start_idx];

for (size_type idx = start_idx; idx < build_tbl_size; idx += stride) {
const double x = curand_uniform_double(&localState);

if (uniq_build_tbl_keys) {
// If the build table keys need to be unique, go through lottery array from lottery_idx until
// finding a key which has not been used (-1). Mark the key as been used by atomically setting
// the spot to -1.

size_type lottery_idx = x * lottery_size;
key_type lottery_val = -1;

while (-1 == lottery_val) {
lottery_val = lottery[lottery_idx];

if (-1 != lottery_val) {
lottery_val = atomicCAS<key_type>(lottery + lottery_idx, lottery_val, -1);
}

lottery_idx = (lottery_idx + 1) % lottery_size;
}

build_tbl[idx] = lottery_val;
} else {
build_tbl[idx] = x * rand_max;
}
build_tbl[idx] = static_cast<key_type>(x * (build_tbl_size / multiplicity));
}

state[start_idx] = localState;
Expand All @@ -89,16 +63,15 @@ __global__ static void init_build_tbl(key_type* const build_tbl,
template <typename key_type, typename size_type>
__global__ void init_probe_tbl(key_type* const probe_tbl,
const size_type probe_tbl_size,
const key_type* const build_tbl,
const size_type build_tbl_size,
const key_type* const lottery,
const size_type lottery_size,
const key_type rand_max,
const double selectivity,
const int multiplicity,
curandState* state,
const int num_states)
{
const int start_idx = blockIdx.x * blockDim.x + threadIdx.x;
const size_type stride = blockDim.x * gridDim.x;
auto const start_idx = blockIdx.x * blockDim.x + threadIdx.x;
auto const stride = blockDim.x * gridDim.x;
assert(start_idx < num_states);

curandState localState = state[start_idx];
Expand All @@ -109,21 +82,15 @@ __global__ void init_probe_tbl(key_type* const probe_tbl,

if (x <= selectivity) {
// x <= selectivity means this key in the probe table should be present in the build table, so
// we pick a key from build_tbl
x = curand_uniform_double(&localState);
size_type build_tbl_idx = x * build_tbl_size;

if (build_tbl_idx >= build_tbl_size) { build_tbl_idx = build_tbl_size - 1; }

val = build_tbl[build_tbl_idx];
// we pick a key from [0, build_tbl_size / multiplicity]
x = curand_uniform_double(&localState);
val = static_cast<key_type>(x * (build_tbl_size / multiplicity));
} else {
// This key in the probe table should not be present in the build table, so we pick a key from
// lottery.
x = curand_uniform_double(&localState);
size_type lottery_idx = x * lottery_size;
val = lottery[lottery_idx];
// [build_tbl_size, rand_max].
x = curand_uniform_double(&localState);
val = static_cast<key_type>(x * (rand_max - build_tbl_size) + build_tbl_size);
}

probe_tbl[idx] = val;
}

Expand Down Expand Up @@ -152,28 +119,23 @@ __global__ void init_probe_tbl(key_type* const probe_tbl,
* @param[in] build_tbl_size number of keys in the build table
* @param[in] selectivity probability with which an element of the probe table is
* present in the build table.
* @param[in] rand_max maximum random number to generate. I.e. random numbers are
* integers from [0,rand_max].
* @param[in] uniq_build_tbl_keys if each key in the build table should appear exactly once.
* @param[in] multiplicity number of matches for each key.
*/
template <typename key_type, typename size_type>
void generate_input_tables(key_type* const build_tbl,
const size_type build_tbl_size,
key_type* const probe_tbl,
const size_type probe_tbl_size,
const double selectivity,
const key_type rand_max,
const bool uniq_build_tbl_keys)
const int multiplicity)
{
// With large values of rand_max the a lot of temporary storage is needed for the lottery. At the
// expense of not being that accurate with applying the selectivity an especially more memory
// efficient implementations would be to partition the random numbers into two intervals and then
// let one table choose random numbers from only one interval and the other only select with
// selective probability from the same interval and from the other in the other cases.

static_assert(std::is_signed<key_type>::value, "key_type needs to be signed for lottery to work");

const int block_size = 128;
constexpr int block_size = 128;

// Maximize exposed parallelism while minimizing storage for curand state
int num_blocks_init_build_tbl{-1};
Expand All @@ -198,55 +160,20 @@ void generate_input_tables(key_type* const build_tbl,

CHECK_CUDA(0);

size_type lottery_size =
rand_max < std::numeric_limits<key_type>::max() - 1 ? rand_max + 1 : rand_max;
rmm::device_uvector<key_type> lottery(lottery_size, rmm::cuda_stream_default);

if (uniq_build_tbl_keys) {
thrust::sequence(rmm::exec_policy(), lottery.begin(), lottery.end(), 0);
}

init_build_tbl<key_type, size_type>
<<<num_sms * num_blocks_init_build_tbl, block_size>>>(build_tbl,
build_tbl_size,
rand_max,
uniq_build_tbl_keys,
lottery.data(),
lottery_size,
devStates.data(),
num_states);
init_build_tbl<key_type, size_type><<<num_sms * num_blocks_init_build_tbl, block_size>>>(
build_tbl, build_tbl_size, multiplicity, devStates.data(), num_states);

CHECK_CUDA(0);

rmm::device_uvector<key_type> build_tbl_sorted(build_tbl_size, rmm::cuda_stream_default);

CUDA_TRY(cudaMemcpy(build_tbl_sorted.data(),
build_tbl,
build_tbl_size * sizeof(key_type),
cudaMemcpyDeviceToDevice));

thrust::sort(rmm::exec_policy(), build_tbl_sorted.begin(), build_tbl_sorted.end());

// Exclude keys used in build table from lottery
thrust::counting_iterator<key_type> first_lottery_elem(0);
thrust::counting_iterator<key_type> last_lottery_elem = first_lottery_elem + lottery_size;
key_type* lottery_end = thrust::set_difference(rmm::exec_policy(),
first_lottery_elem,
last_lottery_elem,
build_tbl_sorted.begin(),
build_tbl_sorted.end(),
lottery.data());

lottery_size = thrust::distance(lottery.data(), lottery_end);
auto const rand_max = std::numeric_limits<key_type>::max();

init_probe_tbl<key_type, size_type>
<<<num_sms * num_blocks_init_build_tbl, block_size>>>(probe_tbl,
probe_tbl_size,
build_tbl,
build_tbl_size,
lottery.data(),
lottery_size,
rand_max,
selectivity,
multiplicity,
devStates.data(),
num_states);

Expand Down
20 changes: 9 additions & 11 deletions cpp/benchmarks/join/join_benchmark_common.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -60,9 +60,8 @@ static void BM_join(state_type& state, Join JoinFunc)
}
}();

const cudf::size_type rand_max_val{build_table_size * 2};
const double selectivity = 0.3;
const bool is_build_table_key_unique = true;
const double selectivity = 0.3;
const int multiplicity = 1;

// Generate build and probe tables
cudf::test::UniformRandomGenerator<cudf::size_type> rand_gen(0, build_table_size);
Expand Down Expand Up @@ -95,8 +94,7 @@ static void BM_join(state_type& state, Join JoinFunc)
probe_key_column->mutable_view().data<key_type>(),
probe_table_size,
selectivity,
rand_max_val,
is_build_table_key_unique);
multiplicity);

auto payload_data_it = thrust::make_counting_iterator(0);
cudf::test::fixed_width_column_wrapper<payload_type> build_payload_column(
Expand Down Expand Up @@ -125,12 +123,12 @@ static void BM_join(state_type& state, Join JoinFunc)
if constexpr (std::is_same_v<state_type, nvbench::state> and (not is_conditional)) {
state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
rmm::cuda_stream_view stream_view{launch.get_stream()};
JoinFunc(probe_table,
build_table,
columns_to_join,
columns_to_join,
cudf::null_equality::UNEQUAL,
stream_view);
auto result = JoinFunc(probe_table,
build_table,
columns_to_join,
columns_to_join,
cudf::null_equality::UNEQUAL,
stream_view);
});
}

Expand Down
4 changes: 2 additions & 2 deletions cpp/cmake/thirdparty/CUDF_GetcuCollections.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -22,8 +22,8 @@ function(find_and_configure_cucollections)

# Find or install cuCollections
CPMFindPackage(NAME cuco
GITHUB_REPOSITORY NVIDIA/cuCollections
GIT_TAG 0d602ae21ea4f38d23ed816aa948453d97b2ee4e
GITHUB_REPOSITORY PointKernel/cuCollections
GIT_TAG static-multi-map
OPTIONS "BUILD_TESTS OFF"
"BUILD_BENCHMARKS OFF"
"BUILD_EXAMPLES OFF"
Expand Down
Loading