Skip to content

Commit

Permalink
[FEA] Performance improvement for mixed left semi/anti join (#15288)
Browse files Browse the repository at this point in the history
Current implementation of mixed semi/anti join probes the built hash table twice -- once to find the output table size and once to build the output. Since the upper bound on output table size is O(N) where N is the size of the left table, we can avoid probing twice and achieve a faster join implementation.

This implementation reserves the required upper memory bound, builds the output, and then collects the relevant output rows. This probes the hash table only once.

This PR also removes the size kernels for mixed semi join and output size parameters passed to the mixed semi join.

Closes #15250

# Benchmark Results from cudf repository

## mixed_left_semi_join_32bit (New implementation)

### [0] NVIDIA TITAN V
```
| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Samples |  CPU Time  | Noise |  GPU Time  | Noise |
|----------|--------------|----------|------------------|------------------|---------|------------|-------|------------|-------|
|      I32 |          I32 |        0 |           100000 |           100000 |   1920x | 266.239 us | 3.43% | 261.324 us | 2.84% |
|      I32 |          I32 |        0 |           100000 |           400000 |   1024x | 495.434 us | 1.18% | 490.544 us | 0.63% |
|      I32 |          I32 |        0 |         10000000 |         10000000 |     24x |  20.919 ms | 0.04% |  20.914 ms | 0.03% |
|      I32 |          I32 |        0 |         10000000 |         40000000 |     11x |  54.697 ms | 0.03% |  54.692 ms | 0.03% |
|      I32 |          I32 |        0 |         10000000 |        100000000 |     11x | 122.171 ms | 0.03% | 122.166 ms | 0.03% |
|      I32 |          I32 |        0 |         80000000 |        100000000 |     11x | 192.979 ms | 0.01% | 192.975 ms | 0.01% |
|      I32 |          I32 |        0 |        100000000 |        100000000 |     11x | 212.878 ms | 0.01% | 212.874 ms | 0.01% |
|      I32 |          I32 |        0 |         10000000 |        240000000 |     11x | 279.794 ms | 0.01% | 279.790 ms | 0.01% |
|      I32 |          I32 |        0 |         80000000 |        240000000 |     11x | 351.186 ms | 0.01% | 351.183 ms | 0.01% |
|      I32 |          I32 |        0 |        100000000 |        240000000 |     11x | 370.794 ms | 0.01% | 370.790 ms | 0.01% |
```

## mixed_left_semi_join_32bit (Old implementation)

### [0] NVIDIA TITAN V
```
| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Samples |  CPU Time  | Noise |  GPU Time  | Noise |
|----------|--------------|----------|------------------|------------------|---------|------------|-------|------------|-------|
|      I32 |          I32 |        0 |           100000 |           100000 |   1392x | 368.030 us | 3.05% | 363.065 us | 2.70% |
|      I32 |          I32 |        0 |           100000 |           400000 |    832x | 832.492 us | 0.84% | 827.586 us | 0.60% |
|      I32 |          I32 |        0 |         10000000 |         10000000 |     16x |  32.310 ms | 0.03% |  32.305 ms | 0.03% |
|      I32 |          I32 |        0 |         10000000 |         40000000 |     11x | 100.222 ms | 0.03% | 100.218 ms | 0.03% |
|      I32 |          I32 |        0 |         10000000 |        100000000 |     11x | 235.874 ms | 0.01% | 235.870 ms | 0.01% |
|      I32 |          I32 |        0 |         80000000 |        100000000 |     11x | 307.042 ms | 0.01% | 307.038 ms | 0.01% |
|      I32 |          I32 |        0 |        100000000 |        100000000 |     11x | 326.797 ms | 0.01% | 326.794 ms | 0.01% |
|      I32 |          I32 |        0 |         10000000 |        240000000 |     11x | 552.730 ms | 0.01% | 552.728 ms | 0.01% |
|      I32 |          I32 |        0 |         80000000 |        240000000 |     11x | 624.958 ms | 0.01% | 624.956 ms | 0.01% |
|      I32 |          I32 |        0 |        100000000 |        240000000 |     11x | 644.148 ms | 0.00% | 644.146 ms | 0.00% |
```

Authors:
  - Tanmay Gujar (https://github.com/tgujar)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Yunsong Wang (https://github.com/PointKernel)
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Bradley Dice (https://github.com/bdice)

URL: #15288
  • Loading branch information
tgujar authored Apr 3, 2024
1 parent 5192b60 commit fbaad8a
Show file tree
Hide file tree
Showing 10 changed files with 42 additions and 992 deletions.
1 change: 0 additions & 1 deletion cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -453,7 +453,6 @@ add_library(
src/join/mixed_join_semi.cu
src/join/mixed_join_size_kernel.cu
src/join/mixed_join_size_kernel_nulls.cu
src/join/mixed_join_size_kernels_semi.cu
src/join/semi_join.cu
src/json/json_path.cu
src/lists/contains.cu
Expand Down
90 changes: 2 additions & 88 deletions cpp/include/cudf/join.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -944,9 +944,6 @@ mixed_full_join(
* @param right_conditional The right table used for the conditional join
* @param binary_predicate The condition on which to join
* @param compare_nulls Whether or not null values join to each other or not
* @param output_size_data An optional pair of values indicating the exact output size and the
* number of matches for each row in the larger of the two input tables, left or right (may be
* precomputed using the corresponding mixed_full_join_size API).
* @param mr Device memory resource used to allocate the returned table and columns' device memory
*
* @return A pair of vectors [`left_indices`, `right_indices`] that can be used to construct
Expand All @@ -958,8 +955,7 @@ std::unique_ptr<rmm::device_uvector<size_type>> mixed_left_semi_join(
table_view const& left_conditional,
table_view const& right_conditional,
ast::expression const& binary_predicate,
null_equality compare_nulls = null_equality::EQUAL,
std::optional<std::pair<std::size_t, device_span<size_type const>>> output_size_data = {},
null_equality compare_nulls = null_equality::EQUAL,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
Expand Down Expand Up @@ -996,9 +992,6 @@ std::unique_ptr<rmm::device_uvector<size_type>> mixed_left_semi_join(
* @param right_conditional The right table used for the conditional join
* @param binary_predicate The condition on which to join
* @param compare_nulls Whether or not null values join to each other or not
* @param output_size_data An optional pair of values indicating the exact output size and the
* number of matches for each row in the larger of the two input tables, left or right (may be
* precomputed using the corresponding mixed_full_join_size API).
* @param mr Device memory resource used to allocate the returned table and columns' device memory
*
* @return A pair of vectors [`left_indices`, `right_indices`] that can be used to construct
Expand All @@ -1010,8 +1003,7 @@ std::unique_ptr<rmm::device_uvector<size_type>> mixed_left_anti_join(
table_view const& left_conditional,
table_view const& right_conditional,
ast::expression const& binary_predicate,
null_equality compare_nulls = null_equality::EQUAL,
std::optional<std::pair<std::size_t, device_span<size_type const>>> output_size_data = {},
null_equality compare_nulls = null_equality::EQUAL,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
Expand Down Expand Up @@ -1094,84 +1086,6 @@ std::pair<std::size_t, std::unique_ptr<rmm::device_uvector<size_type>>> mixed_le
null_equality compare_nulls = null_equality::EQUAL,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns the exact number of matches (rows) when performing a mixed
* left semi join between the specified tables where the columns of the
* equality table are equal and the predicate evaluates to true on the
* conditional tables.
*
* If the provided predicate returns NULL for a pair of rows (left, right),
* that pair is not included in the output. It is the user's responsibility to
* choose a suitable compare_nulls value AND use appropriate null-safe
* operators in the expression.
*
* @throw cudf::logic_error If the binary predicate outputs a non-boolean result.
* @throw cudf::logic_error If the number of rows in left_equality and left_conditional do not
* match.
* @throw cudf::logic_error If the number of rows in right_equality and right_conditional do not
* match.
*
* @param left_equality The left table used for the equality join
* @param right_equality The right table used for the equality join
* @param left_conditional The left table used for the conditional join
* @param right_conditional The right table used for the conditional join
* @param binary_predicate The condition on which to join
* @param compare_nulls Whether or not null values join to each other or not
* @param mr Device memory resource used to allocate the returned table and columns' device memory
*
* @return A pair containing the size that would result from performing the
* requested join and the number of matches for each row in one of the two
* tables. Which of the two tables is an implementation detail and should not
* be relied upon, simply passed to the corresponding `mixed_left_join` API as
* is.
*/
std::pair<std::size_t, std::unique_ptr<rmm::device_uvector<size_type>>> mixed_left_semi_join_size(
table_view const& left_equality,
table_view const& right_equality,
table_view const& left_conditional,
table_view const& right_conditional,
ast::expression const& binary_predicate,
null_equality compare_nulls = null_equality::EQUAL,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns the exact number of matches (rows) when performing a mixed
* left anti join between the specified tables.
*
* If the provided predicate returns NULL for a pair of rows (left, right),
* that pair is not included in the output. It is the user's responsibility to
* choose a suitable compare_nulls value AND use appropriate null-safe
* operators in the expression.
*
* @throw cudf::logic_error If the binary predicate outputs a non-boolean result.
* @throw cudf::logic_error If the number of rows in left_equality and left_conditional do not
* match.
* @throw cudf::logic_error If the number of rows in right_equality and right_conditional do not
* match.
*
* @param left_equality The left table used for the equality join
* @param right_equality The right table used for the equality join
* @param left_conditional The left table used for the conditional join
* @param right_conditional The right table used for the conditional join
* @param binary_predicate The condition on which to join
* @param compare_nulls Whether or not null values join to each other or not
* @param mr Device memory resource used to allocate the returned table and columns' device memory
*
* @return A pair containing the size that would result from performing the
* requested join and the number of matches for each row in one of the two
* tables. Which of the two tables is an implementation detail and should not
* be relied upon, simply passed to the corresponding `mixed_left_join` API as
* is.
*/
std::pair<std::size_t, std::unique_ptr<rmm::device_uvector<size_type>>> mixed_left_anti_join_size(
table_view const& left_equality,
table_view const& right_equality,
table_view const& left_conditional,
table_view const& right_conditional,
ast::expression const& binary_predicate,
null_equality compare_nulls = null_equality::EQUAL,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns the exact number of matches (rows) when performing a
* conditional inner join between the specified tables where the predicate
Expand Down
31 changes: 10 additions & 21 deletions cpp/src/join/mixed_join_kernels_semi.cu
Original file line number Diff line number Diff line change
Expand Up @@ -41,12 +41,9 @@ __attribute__((visibility("hidden"))) __launch_bounds__(block_size) __global__
table_device_view build,
row_hash const hash_probe,
row_equality const equality_probe,
join_kind const join_type,
cudf::detail::semi_map_type::device_view hash_table_view,
size_type* join_output_l,
cudf::ast::detail::expression_device_view device_expression_data,
cudf::size_type const* join_result_offsets,
bool const swap_tables)
cudf::device_span<bool> left_table_keep_mask,
cudf::ast::detail::expression_device_view device_expression_data)
{
// Normally the casting of a shared memory array is used to create multiple
// arrays of different types from the shared memory buffer, but here it is
Expand All @@ -60,7 +57,7 @@ __attribute__((visibility("hidden"))) __launch_bounds__(block_size) __global__

cudf::size_type const left_num_rows = left_table.num_rows();
cudf::size_type const right_num_rows = right_table.num_rows();
auto const outer_num_rows = (swap_tables ? right_num_rows : left_num_rows);
auto const outer_num_rows = left_num_rows;

cudf::size_type outer_row_index = threadIdx.x + blockIdx.x * block_size;

Expand All @@ -70,12 +67,10 @@ __attribute__((visibility("hidden"))) __launch_bounds__(block_size) __global__
if (outer_row_index < outer_num_rows) {
// Figure out the number of elements for this key.
auto equality = single_expression_equality<has_nulls>{
evaluator, thread_intermediate_storage, swap_tables, equality_probe};
evaluator, thread_intermediate_storage, false, equality_probe};

if ((join_type == join_kind::LEFT_ANTI_JOIN) !=
(hash_table_view.contains(outer_row_index, hash_probe, equality))) {
*(join_output_l + join_result_offsets[outer_row_index]) = outer_row_index;
}
left_table_keep_mask[outer_row_index] =
hash_table_view.contains(outer_row_index, hash_probe, equality);
}
}

Expand All @@ -86,12 +81,9 @@ template __global__ void mixed_join_semi<DEFAULT_JOIN_BLOCK_SIZE, true>(
table_device_view build,
row_hash const hash_probe,
row_equality const equality_probe,
join_kind const join_type,
cudf::detail::semi_map_type::device_view hash_table_view,
size_type* join_output_l,
cudf::ast::detail::expression_device_view device_expression_data,
cudf::size_type const* join_result_offsets,
bool const swap_tables);
cudf::device_span<bool> left_table_keep_mask,
cudf::ast::detail::expression_device_view device_expression_data);

template __global__ void mixed_join_semi<DEFAULT_JOIN_BLOCK_SIZE, false>(
table_device_view left_table,
Expand All @@ -100,12 +92,9 @@ template __global__ void mixed_join_semi<DEFAULT_JOIN_BLOCK_SIZE, false>(
table_device_view build,
row_hash const hash_probe,
row_equality const equality_probe,
join_kind const join_type,
cudf::detail::semi_map_type::device_view hash_table_view,
size_type* join_output_l,
cudf::ast::detail::expression_device_view device_expression_data,
cudf::size_type const* join_result_offsets,
bool const swap_tables);
cudf::device_span<bool> left_table_keep_mask,
cudf::ast::detail::expression_device_view device_expression_data);

} // namespace detail

Expand Down
64 changes: 5 additions & 59 deletions cpp/src/join/mixed_join_kernels_semi.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -27,53 +27,7 @@ namespace cudf {
namespace detail {

/**
* @brief Computes the output size of joining the left table to the right table for semi/anti joins.
*
* This method probes the hash table with each row in the probe table using a
* custom equality comparator that also checks that the conditional expression
* evaluates to true between the left/right tables when a match is found
* between probe and build rows.
*
* @tparam block_size The number of threads per block for this kernel
* @tparam has_nulls Whether or not the inputs may contain nulls.
*
* @param[in] left_table The left table
* @param[in] right_table The right table
* @param[in] probe The table with which to probe the hash table for matches.
* @param[in] build The table with which the hash table was built.
* @param[in] hash_probe The hasher used for the probe table.
* @param[in] equality_probe The equality comparator used when probing the hash table.
* @param[in] join_type The type of join to be performed
* @param[in] hash_table_view The hash table built from `build`.
* @param[in] device_expression_data Container of device data required to evaluate the desired
* expression.
* @param[in] swap_tables If true, the kernel was launched with one thread per right row and
* the kernel needs to internally loop over left rows. Otherwise, loop over right rows.
* @param[out] output_size The resulting output size
* @param[out] matches_per_row The number of matches in one pair of
* equality/conditional tables for each row in the other pair of tables. If
* swap_tables is true, matches_per_row corresponds to the right_table,
* otherwise it corresponds to the left_table. Note that corresponding swap of
* left/right tables to determine which is the build table and which is the
* probe table has already happened on the host.
*/
template <int block_size, bool has_nulls>
__global__ void compute_mixed_join_output_size_semi(
table_device_view left_table,
table_device_view right_table,
table_device_view probe,
table_device_view build,
row_hash const hash_probe,
row_equality const equality_probe,
join_kind const join_type,
cudf::detail::semi_map_type::device_view hash_table_view,
ast::detail::expression_device_view device_expression_data,
bool const swap_tables,
std::size_t* output_size,
cudf::device_span<cudf::size_type> matches_per_row);

/**
* @brief Performs a semi/anti join using the combination of a hash lookup to
* @brief Performs a semi join using the combination of a hash lookup to
* identify equal rows between one pair of tables and the evaluation of an
* expression containing an arbitrary expression.
*
Expand All @@ -91,16 +45,11 @@ __global__ void compute_mixed_join_output_size_semi(
* @param[in] build The table with which the hash table was built.
* @param[in] hash_probe The hasher used for the probe table.
* @param[in] equality_probe The equality comparator used when probing the hash table.
* @param[in] join_type The type of join to be performed
* @param[in] hash_table_view The hash table built from `build`.
* @param[out] join_output_l The left result of the join operation
* @param[out] left_table_keep_mask The result of the join operation with "true" element indicating
* the corresponding index from left table is present in output
* @param[in] device_expression_data Container of device data required to evaluate the desired
* expression.
* @param[in] join_result_offsets The starting indices in join_output[l|r]
* where the matches for each row begin. Equivalent to a prefix sum of
* matches_per_row.
* @param[in] swap_tables If true, the kernel was launched with one thread per right row and
* the kernel needs to internally loop over left rows. Otherwise, loop over right rows.
*/
template <cudf::size_type block_size, bool has_nulls>
__global__ void mixed_join_semi(table_device_view left_table,
Expand All @@ -109,12 +58,9 @@ __global__ void mixed_join_semi(table_device_view left_table,
table_device_view build,
row_hash const hash_probe,
row_equality const equality_probe,
join_kind const join_type,
cudf::detail::semi_map_type::device_view hash_table_view,
size_type* join_output_l,
cudf::ast::detail::expression_device_view device_expression_data,
cudf::size_type const* join_result_offsets,
bool const swap_tables);
cudf::device_span<bool> left_table_keep_mask,
cudf::ast::detail::expression_device_view device_expression_data);

} // namespace detail

Expand Down
Loading

0 comments on commit fbaad8a

Please sign in to comment.