Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the performance of low cardinality groupby #16619

Merged
merged 175 commits into from
Nov 8, 2024
Merged
Show file tree
Hide file tree
Changes from 174 commits
Commits
Show all changes
175 commits
Select commit Hold shift + click to select a range
1fa441e
Update docs
PointKernel Aug 19, 2024
65e1b5a
Minor improvement
PointKernel Aug 19, 2024
c58ddef
Migrate the GQE shared memory groupby to cudf
PointKernel Aug 20, 2024
bb24053
Merge remote-tracking branch 'upstream/branch-24.10' into shm-groupby
PointKernel Aug 20, 2024
d604d0a
Many cleanups
PointKernel Aug 21, 2024
9ab1c02
Minor cleanups: use CCCL traits in device APIs
PointKernel Aug 21, 2024
db1b26a
Move more constexpr to the helper
PointKernel Aug 21, 2024
9993283
More cleanups with constexprs
PointKernel Aug 21, 2024
c96d02c
Add doc
PointKernel Aug 21, 2024
7cd14d6
Renaming
PointKernel Aug 21, 2024
1e04c10
Fix cardinality bench
PointKernel Aug 21, 2024
b21909a
Merge remote-tracking branch 'upstream/branch-24.10' into shm-groupby
PointKernel Aug 22, 2024
47aee18
More cleanups with CG
PointKernel Aug 22, 2024
6eb3459
Use custom cuco
PointKernel Aug 28, 2024
f9adaad
Merge branch 'branch-24.10' into shm-groupby
PointKernel Aug 28, 2024
c08e9aa
Merge remote-tracking branch 'upstream/branch-24.10' into shm-groupby
PointKernel Aug 28, 2024
ee5f7fa
Cleanups with new key_eq and hash_function
PointKernel Aug 28, 2024
aa4e957
Remove the redundant num_sms function
PointKernel Aug 28, 2024
4fdb4b8
Add missing header + minor cleanup
PointKernel Aug 28, 2024
4049aeb
Clean up grid_size and shmem_size utilities
PointKernel Aug 28, 2024
690fcee
Minor cleanups with CG
PointKernel Aug 28, 2024
716a73c
Improve docs for aggregation details
PointKernel Aug 28, 2024
3c8403d
Minor cleanup
PointKernel Aug 29, 2024
25932e8
Merge remote-tracking branch 'upstream/branch-24.10' into shm-groupby
PointKernel Sep 17, 2024
e7224cb
Update device operator overloads to agg identity_initializer
PointKernel Sep 19, 2024
124aac0
Clean up groupby details for ODR
PointKernel Sep 19, 2024
50094f7
Revert back to GQE init
PointKernel Sep 19, 2024
13620c7
Pass null policies to agg kernels
PointKernel Sep 20, 2024
47de4b3
Add notes + cleanups
PointKernel Sep 20, 2024
2f04781
Fix null bugs
PointKernel Sep 20, 2024
4a0d7a0
Make var const
PointKernel Sep 21, 2024
398c9f4
Make vars const
PointKernel Sep 23, 2024
c1c53a3
Cleanups for ODR
PointKernel Sep 23, 2024
367d698
Fix a typo
PointKernel Sep 23, 2024
df15519
Renaming for clarity
PointKernel Sep 23, 2024
2a39f8f
Remove unused file
PointKernel Sep 23, 2024
890ef45
Add missing pragma once for header
PointKernel Sep 23, 2024
57bdf2c
Minor fixes
PointKernel Sep 23, 2024
d585678
Fix dictionary test failures
PointKernel Sep 24, 2024
f75f2c9
Add missing headers
PointKernel Sep 24, 2024
6b8230a
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Sep 24, 2024
feb93c3
Separate files to reduce build time
PointKernel Sep 24, 2024
29cba47
Minor cleanups
PointKernel Sep 24, 2024
81c0e19
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Sep 24, 2024
523737f
More explicit instantiations
PointKernel Sep 25, 2024
a574345
Test rollback
PointKernel Sep 26, 2024
4b2b55f
More explicit instantiations
PointKernel Sep 26, 2024
905ae9d
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Sep 26, 2024
44806ba
Add missing headers + more explicit instantiations
PointKernel Sep 26, 2024
85bf877
Reorder files
PointKernel Sep 26, 2024
049acff
Fix typos + add missing header
PointKernel Sep 26, 2024
2d42b9b
Revert temp rollback
PointKernel Sep 26, 2024
b2fb181
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Sep 26, 2024
45573e0
Cleanups
PointKernel Sep 26, 2024
dec49a8
Header cleanups
PointKernel Sep 26, 2024
7774009
More cleanups for hash_compound_agg_finalizer
PointKernel Sep 26, 2024
2acd15f
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Sep 26, 2024
b4422c0
Separate create_sparse_results_table
PointKernel Sep 26, 2024
8ce4cda
Add groupby multi-aggs test
PointKernel Sep 27, 2024
40e3bfa
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Sep 27, 2024
06cf48f
Further separate compute_single_pass_aggs
PointKernel Sep 27, 2024
4a95298
test
PointKernel Sep 27, 2024
7d2d00a
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Sep 27, 2024
80ac0f9
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Sep 29, 2024
4b247b1
Renaming + minor cleanups
PointKernel Sep 30, 2024
9059728
Remove unused code
PointKernel Sep 30, 2024
87312fa
Make compute_aggregations return sparse table
PointKernel Sep 30, 2024
bb7187d
Add rollback if encounting CUDA errors
PointKernel Sep 30, 2024
5f05ca7
Add explicit instantiations for compute_aggregations
PointKernel Sep 30, 2024
0d01c16
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Sep 30, 2024
2e30d9b
Clean up the shmem agg determination logic
PointKernel Sep 30, 2024
c242475
Fix mismatch
PointKernel Sep 30, 2024
30e572e
Clean up device aggregators
PointKernel Sep 30, 2024
0916fe7
Header cleanups
PointKernel Sep 30, 2024
e7ff94d
More header cleanups
PointKernel Sep 30, 2024
d01f0a2
Switch to cuda::std utilities for device APIs
PointKernel Sep 30, 2024
221bed4
Clean up shared aggregator early exit logic
PointKernel Sep 30, 2024
b31f16f
Clean up global aggregator early exit logic
PointKernel Sep 30, 2024
64a9065
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 1, 2024
9cea918
Fix merge conflicts
PointKernel Oct 1, 2024
fe9c212
Clean up device aggregator early exit logic
PointKernel Oct 1, 2024
1933edd
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 1, 2024
a7a9d75
Add traits to minimize code duplication
PointKernel Oct 1, 2024
cb042ef
Use traits to avoid code duplication
PointKernel Oct 1, 2024
ecdd3fd
Cannot query shmem with nested type dispatcher
PointKernel Oct 4, 2024
8c5c655
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 7, 2024
2a96255
Remove unused overloads
PointKernel Oct 7, 2024
aa30df0
Formatting
PointKernel Oct 7, 2024
c0e1a32
Fix dict request determination logic
PointKernel Oct 7, 2024
fc5dc01
Remove can_use_shmem_aggs logic
PointKernel Oct 7, 2024
c1a421f
Remove groupby multi-aggs cpp tests
PointKernel Oct 7, 2024
7a7ad61
Renaming for clarity
PointKernel Oct 7, 2024
9a7d432
Renaming
PointKernel Oct 7, 2024
328c13d
Merge branch 'branch-24.12' into shm-groupby
PointKernel Oct 8, 2024
c81cbdd
Add rollback for insufficient shared memory case
PointKernel Oct 8, 2024
2d601a7
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 8, 2024
ed3e92b
Minor cleanups
PointKernel Oct 8, 2024
7c1aa4a
Minor fix
PointKernel Oct 8, 2024
e976678
Revert custom cuco
PointKernel Oct 8, 2024
4b8974f
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 8, 2024
e028fa5
Set proper ref type on host
PointKernel Oct 8, 2024
43ac320
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 8, 2024
6fb9b27
Merge remote-tracking branch 'origin/shm-groupby' into shm-groupby
PointKernel Oct 8, 2024
1354db2
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 10, 2024
5ea276c
Clean up mapping indices calculations
PointKernel Oct 10, 2024
d32b1e7
Minor cleanups for find_global_mapping
PointKernel Oct 10, 2024
32655cf
Use size_type instead of int
PointKernel Oct 10, 2024
2548871
Renaming + spacing for clarity
PointKernel Oct 11, 2024
1b09ec1
Clean up shared memory agg init
PointKernel Oct 11, 2024
56d75fb
Move compute_mapping_indices to its own TU to reduce build time
PointKernel Oct 11, 2024
4eed822
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 11, 2024
ab5ef60
Clean up the shared memory init function
PointKernel Oct 11, 2024
5bfe6ea
Add reminder
PointKernel Oct 11, 2024
d597ea7
Remove unused header
PointKernel Oct 11, 2024
6403b3c
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 11, 2024
1e85f08
Renaming + API cleanups
PointKernel Oct 11, 2024
80f9275
Get rid of redundant bitmask calculation
PointKernel Oct 11, 2024
5baa2cf
Add missing header
PointKernel Oct 11, 2024
53e0e00
Add missing header
PointKernel Oct 11, 2024
57a450a
Clean up headers
PointKernel Oct 11, 2024
4c35ced
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 14, 2024
5b92cd0
Minor cleanup on ref type determination
PointKernel Oct 14, 2024
98aa468
Add device num_bitmask_words device utility
PointKernel Oct 14, 2024
8be8d15
Fix a minor bug determining column C++ type
PointKernel Oct 14, 2024
d3c465b
Bug fix: use type_dispatcher
PointKernel Oct 14, 2024
f109b81
Pass block to compute_final_aggregations
PointKernel Oct 14, 2024
280db67
Cleanup: use offsets instead pointers to save memory space
PointKernel Oct 14, 2024
8a0551e
Rename for clarity
PointKernel Oct 14, 2024
5c49300
Minor improvement to reduce build time
PointKernel Oct 14, 2024
99010b3
Use mask logic instead of null logic
PointKernel Oct 14, 2024
0a8ef96
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 14, 2024
d071662
Minor header cleanup
PointKernel Oct 15, 2024
4e2c2cc
Remove unused code + clean up null check
PointKernel Oct 15, 2024
c2514f6
Use cuda::std::byte on device
PointKernel Oct 15, 2024
51114c9
Revert agg details
PointKernel Oct 19, 2024
a3c6eb2
Fetch trunk aggregators
PointKernel Oct 19, 2024
a8f8ab3
Fetch trunk hash_compound_agg_finalizer
PointKernel Oct 19, 2024
9746891
Fetch trunk groupby
PointKernel Oct 19, 2024
91c75a2
Fetch trunk compute_groupby
PointKernel Oct 19, 2024
5319fcc
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 20, 2024
17072b0
Make mask const
PointKernel Oct 21, 2024
4672734
Use size_type instead of int
PointKernel Oct 21, 2024
948810e
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 22, 2024
f8220d9
Move global agg to its own TU + renaming
PointKernel Oct 22, 2024
aeef28b
Rename for clarity
PointKernel Oct 22, 2024
6b323f0
Rename direct_aggregations as needs_global_memory_fallback
PointKernel Oct 22, 2024
ed9243b
Use atomic_flag to avoid UB
PointKernel Oct 22, 2024
bcce437
Cleanups
PointKernel Oct 22, 2024
086cbe8
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 22, 2024
e3726e3
Further split compute_global_memory_aggs
PointKernel Oct 22, 2024
9f49f5f
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 23, 2024
0318417
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 23, 2024
3775ec8
Remove unused code
PointKernel Oct 23, 2024
8dd5535
Sync to make sure the data is valid
PointKernel Oct 23, 2024
59accf6
Add comments
PointKernel Oct 23, 2024
8be28d0
Add comments
PointKernel Oct 23, 2024
91da22e
Remove redundant sync
PointKernel Oct 23, 2024
cf289d1
Add CUDF_UNREACHABLE instead of silent break + remove outdated comments
PointKernel Oct 23, 2024
a1d139a
Add doc
PointKernel Oct 23, 2024
6cdf36b
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 28, 2024
f9f201a
Fix leftover
PointKernel Oct 28, 2024
a693932
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 30, 2024
ac03ce8
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 31, 2024
fef1ca8
Renaming for clarity + add missing func
PointKernel Oct 31, 2024
8ccd817
Minor fix
PointKernel Oct 31, 2024
540503d
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Oct 31, 2024
5d5e7ff
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Nov 4, 2024
0c315f8
Update comments
PointKernel Nov 4, 2024
7131c9f
Apply suggestions from code review
PointKernel Nov 6, 2024
c520f41
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Nov 6, 2024
5c6b33c
Make compute_shmem_offsets_size constexpr
PointKernel Nov 6, 2024
b05fab4
Formatting
PointKernel Nov 7, 2024
f32bbf8
Merge remote-tracking branch 'upstream/branch-24.12' into shm-groupby
PointKernel Nov 7, 2024
6a5d582
Merge branch 'branch-24.12' into shm-groupby
hyperbolic2346 Nov 8, 2024
96fbaa9
Update cpp/src/groupby/hash/single_pass_functors.cuh
PointKernel Nov 8, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -368,11 +368,14 @@ add_library(
src/filling/repeat.cu
src/filling/sequence.cu
src/groupby/groupby.cu
src/groupby/hash/compute_aggregations.cu
src/groupby/hash/compute_aggregations_null.cu
src/groupby/hash/compute_global_memory_aggs.cu
src/groupby/hash/compute_global_memory_aggs_null.cu
src/groupby/hash/compute_groupby.cu
src/groupby/hash/compute_mapping_indices.cu
src/groupby/hash/compute_mapping_indices_null.cu
src/groupby/hash/compute_shared_memory_aggs.cu
src/groupby/hash/compute_single_pass_aggs.cu
src/groupby/hash/create_sparse_results_table.cu
src/groupby/hash/flatten_single_pass_aggs.cpp
src/groupby/hash/groupby.cu
Expand Down
1 change: 0 additions & 1 deletion cpp/src/groupby/groupby.cu
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,6 @@
#include <cudf/dictionary/dictionary_column_view.hpp>
#include <cudf/groupby.hpp>
#include <cudf/reduction/detail/histogram.hpp>
#include <cudf/strings/string_view.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/types.hpp>
Expand Down
29 changes: 29 additions & 0 deletions cpp/src/groupby/hash/compute_aggregations.cu
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "compute_aggregations.cuh"
#include "compute_aggregations.hpp"

namespace cudf::groupby::detail::hash {
template rmm::device_uvector<cudf::size_type> compute_aggregations<global_set_t>(
int64_t num_rows,
bool skip_rows_with_nulls,
bitmask_type const* row_bitmask,
global_set_t& global_set,
cudf::host_span<cudf::groupby::aggregation_request const> requests,
cudf::detail::result_cache* sparse_results,
rmm::cuda_stream_view stream);
} // namespace cudf::groupby::detail::hash
185 changes: 185 additions & 0 deletions cpp/src/groupby/hash/compute_aggregations.cuh
Original file line number Diff line number Diff line change
@@ -0,0 +1,185 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include "compute_aggregations.hpp"
#include "compute_global_memory_aggs.hpp"
#include "compute_mapping_indices.hpp"
#include "compute_shared_memory_aggs.hpp"
#include "create_sparse_results_table.hpp"
#include "flatten_single_pass_aggs.hpp"
#include "helpers.cuh"
#include "single_pass_functors.cuh"

#include <cudf/detail/aggregation/result_cache.hpp>
#include <cudf/detail/utilities/vector_factories.hpp>
#include <cudf/groupby.hpp>
#include <cudf/table/table_device_view.cuh>
#include <cudf/types.hpp>
#include <cudf/utilities/span.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_scalar.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>

#include <cuco/static_set.cuh>
#include <cuda/std/atomic>
#include <thrust/for_each.h>

#include <algorithm>
#include <memory>
#include <vector>

namespace cudf::groupby::detail::hash {
/**
* @brief Computes all aggregations from `requests` that require a single pass
* over the data and stores the results in `sparse_results`
*/
template <typename SetType>
rmm::device_uvector<cudf::size_type> compute_aggregations(
int64_t num_rows,
bool skip_rows_with_nulls,
bitmask_type const* row_bitmask,
SetType& global_set,
cudf::host_span<cudf::groupby::aggregation_request const> requests,
cudf::detail::result_cache* sparse_results,
rmm::cuda_stream_view stream)
{
// flatten the aggs to a table that can be operated on by aggregate_row
auto [flattened_values, agg_kinds, aggs] = flatten_single_pass_aggs(requests);
auto const d_agg_kinds = cudf::detail::make_device_uvector_async(
agg_kinds, stream, rmm::mr::get_current_device_resource());

auto const grid_size =
max_occupancy_grid_size<typename SetType::ref_type<cuco::insert_and_find_tag>>(num_rows);
auto const available_shmem_size = get_available_shared_memory_size(grid_size);
auto const has_sufficient_shmem =
available_shmem_size > (compute_shmem_offsets_size(flattened_values.num_columns()) * 2);
auto const has_dictionary_request = std::any_of(
requests.begin(), requests.end(), [](cudf::groupby::aggregation_request const& request) {
return cudf::is_dictionary(request.values.type());
});
auto const is_shared_memory_compatible = !has_dictionary_request and has_sufficient_shmem;

// Performs naive global memory aggregations when the workload is not compatible with shared
// memory, such as when aggregating dictionary columns or when there is insufficient dynamic
// shared memory for shared memory aggregations.
if (!is_shared_memory_compatible) {
return compute_global_memory_aggs(num_rows,
skip_rows_with_nulls,
row_bitmask,
flattened_values,
d_agg_kinds.data(),
agg_kinds,
global_set,
aggs,
sparse_results,
stream);
}

// 'populated_keys' contains inserted row_indices (keys) of global hash set
rmm::device_uvector<cudf::size_type> populated_keys(num_rows, stream);
// 'local_mapping_index' maps from the global row index of the input table to its block-wise rank
rmm::device_uvector<cudf::size_type> local_mapping_index(num_rows, stream);
// 'global_mapping_index' maps from the block-wise rank to the row index of global aggregate table
rmm::device_uvector<cudf::size_type> global_mapping_index(grid_size * GROUPBY_SHM_MAX_ELEMENTS,
stream);
rmm::device_uvector<cudf::size_type> block_cardinality(grid_size, stream);

// Flag indicating whether a global memory aggregation fallback is required or not
rmm::device_scalar<cuda::std::atomic_flag> needs_global_memory_fallback(stream);
PointKernel marked this conversation as resolved.
Show resolved Hide resolved

auto global_set_ref = global_set.ref(cuco::op::insert_and_find);

compute_mapping_indices(grid_size,
num_rows,
global_set_ref,
row_bitmask,
skip_rows_with_nulls,
local_mapping_index.data(),
global_mapping_index.data(),
block_cardinality.data(),
needs_global_memory_fallback.data(),
stream);

cuda::std::atomic_flag h_needs_fallback;
// Cannot use `device_scalar::value` as it requires a copy constructor, which
// `atomic_flag` doesn't have.
CUDF_CUDA_TRY(cudaMemcpyAsync(&h_needs_fallback,
needs_global_memory_fallback.data(),
sizeof(cuda::std::atomic_flag),
cudaMemcpyDefault,
stream.value()));
stream.synchronize();
auto const needs_fallback = h_needs_fallback.test();

// make table that will hold sparse results
cudf::table sparse_table = create_sparse_results_table(flattened_values,
d_agg_kinds.data(),
agg_kinds,
needs_fallback,
global_set,
populated_keys,
stream);
// prepare to launch kernel to do the actual aggregation
auto d_values = table_device_view::create(flattened_values, stream);
auto d_sparse_table = mutable_table_device_view::create(sparse_table, stream);

compute_shared_memory_aggs(grid_size,
available_shmem_size,
num_rows,
row_bitmask,
skip_rows_with_nulls,
local_mapping_index.data(),
global_mapping_index.data(),
block_cardinality.data(),
*d_values,
*d_sparse_table,
d_agg_kinds.data(),
stream);
davidwendt marked this conversation as resolved.
Show resolved Hide resolved

// The shared memory groupby is designed so that each thread block can handle up to 128 unique
// keys. When a block reaches this cardinality limit, shared memory becomes insufficient to store
// the temporary aggregation results. In these situations, we must fall back to a global memory
// aggregator to process the remaining aggregation requests.
if (needs_fallback) {
auto const stride = GROUPBY_BLOCK_SIZE * grid_size;
thrust::for_each_n(rmm::exec_policy_nosync(stream),
thrust::counting_iterator{0},
num_rows,
global_memory_fallback_fn{global_set_ref,
*d_values,
*d_sparse_table,
d_agg_kinds.data(),
block_cardinality.data(),
stride,
row_bitmask,
skip_rows_with_nulls});
extract_populated_keys(global_set, populated_keys, stream);
}

// Add results back to sparse_results cache
auto sparse_result_cols = sparse_table.release();
for (size_t i = 0; i < aggs.size(); i++) {
// Note that the cache will make a copy of this temporary aggregation
sparse_results->add_result(
flattened_values.column(i), *aggs[i], std::move(sparse_result_cols[i]));
}

return populated_keys;
}
} // namespace cudf::groupby::detail::hash
Original file line number Diff line number Diff line change
Expand Up @@ -21,18 +21,20 @@
#include <cudf/utilities/span.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

namespace cudf::groupby::detail::hash {
/**
* @brief Computes all aggregations from `requests` that require a single pass
* over the data and stores the results in `sparse_results`
*/
template <typename SetType>
void compute_single_pass_aggs(int64_t num_keys,
bool skip_rows_with_nulls,
bitmask_type const* row_bitmask,
SetType set,
cudf::host_span<cudf::groupby::aggregation_request const> requests,
cudf::detail::result_cache* sparse_results,
rmm::cuda_stream_view stream);
rmm::device_uvector<cudf::size_type> compute_aggregations(
int64_t num_rows,
bool skip_rows_with_nulls,
bitmask_type const* row_bitmask,
SetType& global_set,
cudf::host_span<cudf::groupby::aggregation_request const> requests,
cudf::detail::result_cache* sparse_results,
rmm::cuda_stream_view stream);
} // namespace cudf::groupby::detail::hash
29 changes: 29 additions & 0 deletions cpp/src/groupby/hash/compute_aggregations_null.cu
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "compute_aggregations.cuh"
#include "compute_aggregations.hpp"

namespace cudf::groupby::detail::hash {
template rmm::device_uvector<cudf::size_type> compute_aggregations<nullable_global_set_t>(
int64_t num_rows,
bool skip_rows_with_nulls,
bitmask_type const* row_bitmask,
nullable_global_set_t& global_set,
cudf::host_span<cudf::groupby::aggregation_request const> requests,
cudf::detail::result_cache* sparse_results,
rmm::cuda_stream_view stream);
} // namespace cudf::groupby::detail::hash
32 changes: 32 additions & 0 deletions cpp/src/groupby/hash/compute_global_memory_aggs.cu
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include "compute_global_memory_aggs.cuh"
#include "compute_global_memory_aggs.hpp"

namespace cudf::groupby::detail::hash {
template rmm::device_uvector<cudf::size_type> compute_global_memory_aggs<global_set_t>(
cudf::size_type num_rows,
bool skip_rows_with_nulls,
bitmask_type const* row_bitmask,
cudf::table_view const& flattened_values,
cudf::aggregation::Kind const* d_agg_kinds,
std::vector<cudf::aggregation::Kind> const& agg_kinds,
global_set_t& global_set,
std::vector<std::unique_ptr<aggregation>>& aggregations,
cudf::detail::result_cache* sparse_results,
rmm::cuda_stream_view stream);
} // namespace cudf::groupby::detail::hash
Loading
Loading