Support structs column in min, max, argmin and argmax groupby aggregate() and scan() #9545

Merged on Nov 10, 2021 (53 commits)
Commits
1d1a35b
Add condition to fallback to sort-based aggregates if the input value…
ttnghia Oct 26, 2021
f36abed
Rename function
ttnghia Oct 26, 2021
b0b4535
Implement argmin/argmax for structs
ttnghia Oct 26, 2021
6387ae2
Add comments and cleanup
ttnghia Oct 27, 2021
fddbac9
Cleanup
ttnghia Oct 27, 2021
b86665e
Simplify code
ttnghia Oct 27, 2021
96683ac
Fix null order
ttnghia Oct 28, 2021
b26cc93
Add unit tests
ttnghia Oct 28, 2021
e5d6475
Merge branch 'branch-21.12' into min_max_for_structs
ttnghia Oct 28, 2021
895cabb
Rename functor
ttnghia Oct 28, 2021
bfc0585
Move `has_struct` condition check into `can_use_has_groupby`
ttnghia Oct 28, 2021
cc5c8c4
Rename structs and function
ttnghia Oct 29, 2021
d9703c8
Merge branch 'branch-21.12' into min_max_for_structs
ttnghia Oct 29, 2021
cd7f7a4
Fix SFINAE condition, and extract a struct functor
ttnghia Nov 1, 2021
f7d1b3e
Implement groupby scan for struct min/max
ttnghia Nov 1, 2021
5d77d4f
Implement unit tests
ttnghia Nov 1, 2021
bce93e4
Rewrite SFINAE style
ttnghia Nov 1, 2021
b1b916f
Add missing `mr` parameter
ttnghia Nov 1, 2021
75e201f
Refactor `row_arg_minmax`
ttnghia Nov 2, 2021
08a60f8
Adopt "dispatch to static invoke" pattern
ttnghia Nov 2, 2021
3a0c580
Rename functors to better expressive names
ttnghia Nov 2, 2021
42e8f23
Merge branch 'branch-21.12' into min_max_for_structs
ttnghia Nov 2, 2021
f4c53c2
Fix formatting style
ttnghia Nov 2, 2021
b1a3628
Fix formatting style
ttnghia Nov 2, 2021
57858d5
Merge branch 'branch-21.12' into min_max_for_structs
ttnghia Nov 3, 2021
d868d66
Remove redundant template argument
ttnghia Nov 3, 2021
94eed99
Rewrite SFINAE into specialization
ttnghia Nov 3, 2021
cb6fb5f
Attempt to patch thrust
ttnghia Nov 4, 2021
e6885d4
Revert "Attempt to patch thrust"
ttnghia Nov 4, 2021
d4d4644
Add declaration for new internal APIs
ttnghia Nov 4, 2021
ad7998f
Call the specialized functions for struct type values
ttnghia Nov 4, 2021
0c2b0b4
Add new .cu files
ttnghia Nov 4, 2021
f063e26
Remove `struct_view` specialization
ttnghia Nov 4, 2021
9456ea3
Implement `struct_view` specialization
ttnghia Nov 4, 2021
43be509
Fix output order
ttnghia Nov 4, 2021
7cee90c
Fix EXPECT conditions
ttnghia Nov 4, 2021
534648a
Merge branch 'branch-21.12' into min_max_for_structs
ttnghia Nov 4, 2021
fee55e3
Refactor `row_operators.cuh`
ttnghia Nov 5, 2021
2fde89a
Fix function name typo
ttnghia Nov 5, 2021
ac9c603
Remove redundant header
ttnghia Nov 5, 2021
f5d27ae
Revert "Refactor `row_operators.cuh`"
ttnghia Nov 5, 2021
7a7c706
Prevent functor code from inlining
ttnghia Nov 5, 2021
2a6a106
Revert "Remove redundant header"
ttnghia Nov 5, 2021
88ef471
Revert "Remove `struct_view` specialization"
ttnghia Nov 5, 2021
494af01
Revert "Add new .cu files"
ttnghia Nov 5, 2021
3833bbc
Revert "Call the specialized functions for struct type values"
ttnghia Nov 5, 2021
61c774a
Revert "Add declaration for new internal APIs"
ttnghia Nov 5, 2021
a5ab52d
Fix function name
ttnghia Nov 5, 2021
731426a
Fix CMakeList.txt
ttnghia Nov 5, 2021
49259f3
Remove files
ttnghia Nov 5, 2021
cbf386f
Add groupby struct benchmark
ttnghia Nov 8, 2021
8b3f72e
Implement benchmark
ttnghia Nov 8, 2021
cdfc602
Unify 2 functions into a template function
ttnghia Nov 8, 2021
2 changes: 1 addition & 1 deletion cpp/benchmarks/CMakeLists.txt
@@ -157,7 +157,7 @@ ConfigureBench(FILL_BENCH filling/repeat_benchmark.cpp)
# * groupby benchmark -----------------------------------------------------------------------------
ConfigureBench(
GROUPBY_BENCH groupby/group_sum_benchmark.cu groupby/group_nth_benchmark.cu
groupby/group_shift_benchmark.cu
groupby/group_shift_benchmark.cu groupby/group_struct_benchmark.cu
)

# ##################################################################################################
107 changes: 107 additions & 0 deletions cpp/benchmarks/groupby/group_struct_benchmark.cu
@@ -0,0 +1,107 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <benchmarks/common/generate_benchmark_input.hpp>
#include <benchmarks/fixture/benchmark_fixture.hpp>
#include <benchmarks/synchronization/synchronization.hpp>

#include <cudf_test/column_wrapper.hpp>

#include <cudf/aggregation.hpp>
#include <cudf/column/column_factories.hpp>
#include <cudf/groupby.hpp>
#include <cudf/structs/structs_column_view.hpp>
#include <cudf/table/table.hpp>

#include <thrust/iterator/transform_iterator.h>

static constexpr cudf::size_type num_struct_members = 8;
static constexpr cudf::size_type max_int = 100;
static constexpr cudf::size_type max_str_length = 32;

static auto create_data_table(cudf::size_type n_rows)
{
data_profile table_profile;
table_profile.set_distribution_params(cudf::type_id::INT32, distribution_id::UNIFORM, 0, max_int);
table_profile.set_distribution_params(
cudf::type_id::STRING, distribution_id::NORMAL, 0, max_str_length);

// The first two struct members are int32 and string.
// The first column is also used as keys in groupby.
auto col_ids = std::vector<cudf::type_id>{cudf::type_id::INT32, cudf::type_id::STRING};

// The subsequent struct members are int32 and string again.
for (cudf::size_type i = 3; i <= num_struct_members; ++i) {
if (i % 2) {
col_ids.push_back(cudf::type_id::INT32);
} else {
col_ids.push_back(cudf::type_id::STRING);
}
}

return create_random_table(col_ids, num_struct_members, row_count{n_rows}, table_profile);
}

// Max aggregation/scan is expected to perform the same as min, so only min is benchmarked.
template <typename OpType>
void BM_groupby_min_struct(benchmark::State& state)
{
auto const n_rows = static_cast<cudf::size_type>(state.range(0));
auto data_cols = create_data_table(n_rows)->release();

auto const keys_view = data_cols.front()->view();
auto const values =
cudf::make_structs_column(keys_view.size(), std::move(data_cols), 0, rmm::device_buffer());

using RequestType = std::conditional_t<std::is_same_v<OpType, cudf::groupby_aggregation>,
cudf::groupby::aggregation_request,
cudf::groupby::scan_request>;

auto gb_obj = cudf::groupby::groupby(cudf::table_view({keys_view}));
auto requests = std::vector<RequestType>();
requests.emplace_back(RequestType());
requests.front().values = values->view();
requests.front().aggregations.push_back(cudf::make_min_aggregation<OpType>());

for (auto _ : state) {
[[maybe_unused]] auto const timer = cuda_event_timer(state, true);
if constexpr (std::is_same_v<OpType, cudf::groupby_aggregation>) {
[[maybe_unused]] auto const result = gb_obj.aggregate(requests);
} else {
[[maybe_unused]] auto const result = gb_obj.scan(requests);
}
}
}
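The benchmark above selects the request type with `std::conditional_t` and the call with `if constexpr`. A minimal standalone sketch of that compile-time selection follows. The tag types, request types and `fake_groupby` are hypothetical stand-ins, not the cudf API.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical tags standing in for cudf::groupby_aggregation and
// cudf::groupby_scan_aggregation.
struct reduce_tag {};
struct scan_tag {};

// Hypothetical request types; the real ones carry values and aggregations.
struct reduce_request {};
struct scan_request {};

// Hypothetical groupby object: each entry point accepts only its own request type.
struct fake_groupby {
  std::string aggregate(reduce_request const&) const { return "aggregate"; }
  std::string scan(scan_request const&) const { return "scan"; }
};

// Pick the request type from the tag at compile time.
template <typename Op>
using request_for =
  std::conditional_t<std::is_same_v<Op, reduce_tag>, reduce_request, scan_request>;

// Only the branch matching Op is instantiated, so the mismatched call
// (for example aggregate() with a scan_request) is never compiled.
template <typename Op>
std::string run(fake_groupby const& gb)
{
  request_for<Op> request{};
  if constexpr (std::is_same_v<Op, reduce_tag>) {
    return gb.aggregate(request);
  } else {
    return gb.scan(request);
  }
}
```

With a plain `if`, both branches would have to compile for every `Op`, which fails here because each entry point takes a different request type.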

class Groupby : public cudf::benchmark {
};

#define MIN_RANGE 10'000
#define MAX_RANGE 10'000'000

#define REGISTER_BENCHMARK(name, op_type) \
BENCHMARK_DEFINE_F(Groupby, name)(::benchmark::State & state) \
{ \
BM_groupby_min_struct<op_type>(state); \
} \
BENCHMARK_REGISTER_F(Groupby, name) \
->UseManualTime() \
->Unit(benchmark::kMillisecond) \
->RangeMultiplier(4) \
->Ranges({{MIN_RANGE, MAX_RANGE}});

REGISTER_BENCHMARK(Aggregation, cudf::groupby_aggregation)
REGISTER_BENCHMARK(Scan, cudf::groupby_scan_aggregation)
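To see which row counts the registration above is likely to produce, here is a rough sketch of how `Range` with `RangeMultiplier(4)` expands. It assumes Google Benchmark emits the lower bound, every power of the multiplier strictly between the bounds, and then the upper bound. `expand_range` is a hypothetical helper written for illustration, not library code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed expansion rule: lo, then each power of mult strictly between lo and hi,
// then hi. This is a reading of the documented behavior, not the library source.
std::vector<std::int64_t> expand_range(std::int64_t lo, std::int64_t hi, int mult)
{
  std::vector<std::int64_t> sizes{lo};
  for (std::int64_t p = 1; p < hi; p *= mult) {
    if (p > lo) { sizes.push_back(p); }
  }
  if (hi != lo) { sizes.push_back(hi); }
  return sizes;
}
```

Under that assumption, `expand_range(10'000, 10'000'000, 4)` yields seven sizes: the two bounds plus the powers of four from 16,384 to 4,194,304.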
19 changes: 15 additions & 4 deletions cpp/src/groupby/hash/groupby.cu
@@ -632,11 +632,22 @@ std::unique_ptr<table> groupby_null_templated(table_view const& keys,
*/
bool can_use_hash_groupby(table_view const& keys, host_span<aggregation_request const> requests)
{
return std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) {
return std::all_of(r.aggregations.begin(), r.aggregations.end(), [](auto const& a) {
return is_hash_aggregation(a->kind);
auto const all_hash_aggregations =
std::all_of(requests.begin(), requests.end(), [](aggregation_request const& r) {
return std::all_of(r.aggregations.begin(), r.aggregations.end(), [](auto const& a) {
return is_hash_aggregation(a->kind);
});
});
});

// Currently, structs are not supported in any of the hash-based aggregations.
// Therefore, if any request contains structs then we must fall back to sort-based aggregations.
// TODO: Support structs in hash-based aggregations.
auto const has_struct =
std::any_of(requests.begin(), requests.end(), [](aggregation_request const& r) {
return r.values.type().id() == type_id::STRUCT;
});

return all_hash_aggregations && !has_struct;
}

// Hash-based groupby
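The comment in this hunk states the rule as "if any request contains structs". A minimal standalone sketch of that rule follows, with a hypothetical `request` type in place of `aggregation_request`. `std::any_of` expresses "any request", whereas `std::all_of` would be true only when every request holds structs.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical subset of cudf::type_id.
enum class type_id { INT32, STRUCT };

// Hypothetical stand-in for aggregation_request.
struct request {
  type_id values_type;
  bool only_hash_kinds;  // true if every aggregation in the request is hash-capable
};

// Hash-based groupby is usable only if every request is hash-capable
// and no request carries struct values.
bool can_use_hash(std::vector<request> const& requests)
{
  bool const all_hash = std::all_of(
    requests.begin(), requests.end(), [](request const& r) { return r.only_hash_kinds; });
  bool const has_struct = std::any_of(requests.begin(), requests.end(), [](request const& r) {
    return r.values_type == type_id::STRUCT;
  });
  return all_hash && !has_struct;
}
```

A batch that mixes an `INT32` request with a `STRUCT` request is the case that separates the two predicates: `any_of` reports a struct and forces the sort-based path, `all_of` would not.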
2 changes: 1 addition & 1 deletion cpp/src/groupby/sort/group_argmax.cu
@@ -34,7 +34,7 @@ std::unique_ptr<column> group_argmax(column_view const& values,
rmm::mr::device_memory_resource* mr)
{
auto indices = type_dispatcher(values.type(),
reduce_functor<aggregation::ARGMAX>{},
group_reduction_dispatcher<aggregation::ARGMAX>{},
values,
num_groups,
group_labels,
2 changes: 1 addition & 1 deletion cpp/src/groupby/sort/group_argmin.cu
@@ -34,7 +34,7 @@ std::unique_ptr<column> group_argmin(column_view const& values,
rmm::mr::device_memory_resource* mr)
{
auto indices = type_dispatcher(values.type(),
reduce_functor<aggregation::ARGMIN>{},
group_reduction_dispatcher<aggregation::ARGMIN>{},
values,
num_groups,
group_labels,
9 changes: 7 additions & 2 deletions cpp/src/groupby/sort/group_max.cu
@@ -30,8 +30,13 @@ std::unique_ptr<column> group_max(column_view const& values,
auto values_type = cudf::is_dictionary(values.type())
? dictionary_column_view(values).keys().type()
: values.type();
return type_dispatcher(
values_type, reduce_functor<aggregation::MAX>{}, values, num_groups, group_labels, stream, mr);
return type_dispatcher(values_type,
group_reduction_dispatcher<aggregation::MAX>{},
values,
num_groups,
group_labels,
stream,
mr);
}

} // namespace detail
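The rename from `reduce_functor` to `group_reduction_dispatcher` lines up with the "dispatch to static invoke" commit. As a rough illustration of that pattern, the sketch below has a dispatcher templated on the aggregation kind whose `operator()` only forwards to a per-type functor's static `invoke`. All names are hypothetical, and `std::variant` stands in for the runtime type dispatch, so this is not the libcudf implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <variant>
#include <vector>

enum class kind { MIN, MAX };

// Hypothetical type-erased column and result; the runtime type is the active alternative.
using column = std::variant<std::vector<int>, std::vector<std::string>>;
using scalar = std::variant<int, std::string>;

// The per-(kind, type) implementation lives in a static invoke().
// Precondition for this sketch: values is non-empty.
template <kind K, typename T>
struct reduction_functor {
  static T invoke(std::vector<T> const& values)
  {
    return K == kind::MIN ? *std::min_element(values.begin(), values.end())
                          : *std::max_element(values.begin(), values.end());
  }
};

// The dispatcher only forwards. Specializing reduction_functor for a given T
// changes the implementation without touching the dispatcher.
template <kind K>
struct reduction_dispatcher {
  template <typename T>
  auto operator()(std::vector<T> const& values) const
  {
    return reduction_functor<K, T>::invoke(values);
  }
};

// Minimal stand-in for a type dispatcher: call f with the concrete element type.
template <typename Functor>
scalar type_dispatch(column const& col, Functor f)
{
  return std::visit([&](auto const& values) -> scalar { return f(values); }, col);
}
```

Keeping the dispatcher thin means a type that needs different handling, such as structs in this PR, can be given its own functor specialization rather than more SFINAE conditions on `operator()`.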
9 changes: 7 additions & 2 deletions cpp/src/groupby/sort/group_max_scan.cu
@@ -27,8 +27,13 @@ std::unique_ptr<column> max_scan(column_view const& values,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
return type_dispatcher(
values.type(), scan_functor<aggregation::MAX>{}, values, num_groups, group_labels, stream, mr);
return type_dispatcher(values.type(),
group_scan_dispatcher<aggregation::MAX>{},
values,
num_groups,
group_labels,
stream,
mr);
}

} // namespace detail
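For readers unfamiliar with what a groupby max scan computes, here is a small host-side sketch: a running maximum that restarts at every group boundary. It assumes the rows are already ordered so that each group is contiguous, and it ignores nulls. The function name and signature are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Running max within each group. group_labels[i] names the group of row i,
// and rows of the same group are assumed to be contiguous.
std::vector<int> group_max_scan(std::vector<int> const& values,
                                std::vector<int> const& group_labels)
{
  std::vector<int> out(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    bool const starts_group = (i == 0) || (group_labels[i] != group_labels[i - 1]);
    out[i] = starts_group ? values[i] : std::max(out[i - 1], values[i]);
  }
  return out;
}
```

For values `{1, 3, 2, 5, 4}` with labels `{0, 0, 0, 1, 1}`, the scan restarts at the fourth row, where the second group begins.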
9 changes: 7 additions & 2 deletions cpp/src/groupby/sort/group_min.cu
@@ -30,8 +30,13 @@ std::unique_ptr<column> group_min(column_view const& values,
auto values_type = cudf::is_dictionary(values.type())
? dictionary_column_view(values).keys().type()
: values.type();
return type_dispatcher(
values_type, reduce_functor<aggregation::MIN>{}, values, num_groups, group_labels, stream, mr);
return type_dispatcher(values_type,
group_reduction_dispatcher<aggregation::MIN>{},
values,
num_groups,
group_labels,
stream,
mr);
}

} // namespace detail
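The commit list mentions implementing argmin/argmax for structs. One plausible reading, not confirmed by the hunks shown here, is that a struct min can be obtained by finding the argmin of each group under a lexicographic row comparison and then gathering those rows. The sketch below illustrates that idea on the host with `std::tuple`. Every name in it is hypothetical, and it ignores nulls.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical two-member struct row; std::tuple compares lexicographically.
using row = std::tuple<int, std::string>;

// For each group, return the index of its smallest row. Ties keep the earliest
// index. A group with no rows keeps the sentinel value rows.size().
std::vector<std::size_t> group_argmin(std::vector<row> const& rows,
                                      std::vector<int> const& group_labels,
                                      int num_groups)
{
  std::vector<std::size_t> best(num_groups, rows.size());
  for (std::size_t i = 0; i < rows.size(); ++i) {
    auto& b = best[group_labels[i]];
    if (b == rows.size() || rows[i] < rows[b]) { b = i; }
  }
  return best;
}

// Gather the rows at the given indices; applied to argmin indices this yields
// the per-group min. Throws std::out_of_range for the empty-group sentinel.
std::vector<row> gather(std::vector<row> const& rows, std::vector<std::size_t> const& indices)
{
  std::vector<row> out;
  out.reserve(indices.size());
  for (auto i : indices) { out.push_back(rows.at(i)); }
  return out;
}
```

Working with indices rather than values means only a row comparator is needed for the struct type, and the final result is produced by a single gather.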
9 changes: 7 additions & 2 deletions cpp/src/groupby/sort/group_min_scan.cu
@@ -27,8 +27,13 @@ std::unique_ptr<column> min_scan(column_view const& values,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
return type_dispatcher(
values.type(), scan_functor<aggregation::MIN>{}, values, num_groups, group_labels, stream, mr);
return type_dispatcher(values.type(),
group_scan_dispatcher<aggregation::MIN>{},
values,
num_groups,
group_labels,
stream,
mr);
}

} // namespace detail
2 changes: 1 addition & 1 deletion cpp/src/groupby/sort/group_product.cu
@@ -33,7 +33,7 @@ std::unique_ptr<column> group_product(column_view const& values,
? dictionary_column_view(values).keys().type()
: values.type();
return type_dispatcher(values_type,
reduce_functor<aggregation::PRODUCT>{},
group_reduction_dispatcher<aggregation::PRODUCT>{},
values,
num_groups,
group_labels,