Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add groupby scan operations (sort groupby) #7387

Merged
merged 56 commits into from
Mar 23, 2021
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
e3c6c53
rename sort/groupby.cu to sort/aggregate.cpp
karthikeyann Feb 16, 2021
92414a0
cpp error order instantiation fix
karthikeyann Feb 16, 2021
6e7910f
rename store_result_functor to aggregate_result_functor
karthikeyann Feb 16, 2021
ab9daec
add groupby::sort_scan, scan_result_functor
karthikeyann Feb 16, 2021
cdc1bb5
add group_cumsum (initial commit, compilable)
karthikeyann Feb 16, 2021
37c3144
Add groupby::scan interface
isVoid Feb 17, 2021
dff1268
review comments (jake)
karthikeyann Mar 1, 2021
3d5c022
rename group_cumsum.cu to group_sum_scan.cu
karthikeyann Mar 1, 2021
ba8b2bc
groupby sum_scan, add unit tests (null cases)
karthikeyann Mar 1, 2021
e4639cc
enable fixed_point for groupby sum_scan
karthikeyann Mar 2, 2021
26c5668
add groupby min_scan, max_scan
karthikeyann Mar 2, 2021
3e097d7
add groupby count_scan
karthikeyann Mar 3, 2021
35085b6
minor test code fix
karthikeyann Mar 3, 2021
ff97904
add groupby min_scan, max_scan tests
karthikeyann Mar 3, 2021
62e7dec
add unit test unsupported throw
karthikeyann Mar 3, 2021
a4d5a04
add groupby count_scan tests
karthikeyann Mar 3, 2021
8e725db
review comments (davidwendt)
karthikeyann Mar 3, 2021
0d09c0a
include cleanup
karthikeyann Mar 3, 2021
0e1f601
Merge branch 'branch-0.19' of github.com:rapidsai/cudf into fea-group…
karthikeyann Mar 3, 2021
39e1bd6
fix file rename in cmake
karthikeyann Mar 3, 2021
65e7378
indentation fix
karthikeyann Mar 3, 2021
970a054
fix missing file cmake
karthikeyann Mar 3, 2021
5b7ea1c
missing files include cmake
karthikeyann Mar 3, 2021
dddfd94
move shift to another PR
karthikeyann Mar 3, 2021
3b588f1
replace device_vector with device_uvector in sort groupby
karthikeyann Mar 5, 2021
837c0f1
replace device_vector const& with device_span<const>
karthikeyann Mar 5, 2021
69cac99
Merge branch 'branch-0.19' of github.com:rapidsai/cudf into fea-group…
karthikeyann Mar 5, 2021
f83c5a0
Merge branch 'enh-groupby_uvector_span' of github.com:karthikeyann/cu…
karthikeyann Mar 5, 2021
13bbaf8
replace device_vector const& with device_span
karthikeyann Mar 5, 2021
511d1be
Update cpp/src/groupby/sort/sort_helper.cu
karthikeyann Mar 6, 2021
f5d0f5b
use device_span (group_scan.hpp)
karthikeyann Mar 8, 2021
4b5aa41
fix segmentation fault in thrust::copy
karthikeyann Mar 8, 2021
8c375fb
fix zero init in group_label
karthikeyann Mar 8, 2021
d9fbda0
Merge branch 'enh-groupby_uvector_span' of github.com:karthikeyann/cu…
karthikeyann Mar 8, 2021
7830260
Apply suggestions from code review (vuule)
karthikeyann Mar 9, 2021
206d463
update copyright year in all files
karthikeyann Mar 9, 2021
8d1bd8a
add mismatch num_rows error test
karthikeyann Mar 9, 2021
42825d3
stylefix
karthikeyann Mar 9, 2021
2175a85
Merge branch 'branch-0.19' of github.com:rapidsai/cudf into fea-group…
karthikeyann Mar 9, 2021
43add1a
fix device_span error
karthikeyann Mar 9, 2021
f841f9f
Merge branch 'branch-0.19' of github.com:rapidsai/cudf into fea-group…
karthikeyann Mar 9, 2021
592b26a
in scan don't return sorted grouped values
karthikeyann Mar 11, 2021
003d2f2
Apply suggestions from code review
karthikeyann Mar 16, 2021
9fad1b6
review comments
karthikeyann Mar 16, 2021
216dc09
Merge branch 'fea-groupby_scan' of github.com:karthikeyann/cudf into …
karthikeyann Mar 16, 2021
7c9d18a
dictionary column unsupported
karthikeyann Mar 16, 2021
f5e31cb
style fix, include cleanup
karthikeyann Mar 17, 2021
51e505d
remove reduntant tests
karthikeyann Mar 18, 2021
632c62b
address review comments(ttnghia)
karthikeyann Mar 19, 2021
0b3f92a
Revert "address review comments(ttnghia)"
karthikeyann Mar 19, 2021
0998a07
address review comments
karthikeyann Mar 19, 2021
6b7ef34
Merge branch 'branch-0.19' of github.com:rapidsai/cudf into fea-group…
karthikeyann Mar 19, 2021
3902d3d
Apply suggestions from code review (davidwendt)
karthikeyann Mar 22, 2021
118513d
include cleanup
karthikeyann Mar 22, 2021
bb5dfd7
update unit test style, type aliases
karthikeyann Mar 22, 2021
c8e4b99
more tests cleanup
karthikeyann Mar 22, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions cpp/include/cudf/detail/groupby/sort_helper.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -63,8 +63,8 @@ struct sort_groupby_helper {
sorted keys_pre_sorted = sorted::NO)
: _keys(keys),
_num_keys(-1),
_include_null_keys(include_null_keys),
_keys_pre_sorted(keys_pre_sorted)
_keys_pre_sorted(keys_pre_sorted),
_include_null_keys(include_null_keys)
{
if (keys_pre_sorted == sorted::YES and include_null_keys == null_policy::EXCLUDE and
has_nulls(keys)) {
Expand Down
5 changes: 5 additions & 0 deletions cpp/include/cudf/groupby.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -231,6 +231,11 @@ class groupby {
std::vector<aggregation_request> const& requests,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr);

std::pair<std::unique_ptr<table>, std::vector<aggregation_result>> sort_scan(
std::vector<aggregation_request> const& requests,
rmm::cuda_stream_view stream,
karthikeyann marked this conversation as resolved.
Show resolved Hide resolved
rmm::mr::device_memory_resource* mr);
};
/** @} */
} // namespace groupby
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2019-2021, NVIDIA CORPORATION.
karthikeyann marked this conversation as resolved.
Show resolved Hide resolved
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -15,6 +15,7 @@
*/

#include <groupby/common/utils.hpp>
ttnghia marked this conversation as resolved.
Show resolved Hide resolved
#include "functors.hpp"
karthikeyann marked this conversation as resolved.
Show resolved Hide resolved
#include "group_reductions.hpp"

#include <cudf/aggregation.hpp>
Expand Down Expand Up @@ -51,71 +52,16 @@ namespace detail {
* memoised sorted and/or grouped values and re-using will save on computation
* of these values.
*/
struct store_result_functor {
store_result_functor(size_type col_idx,
column_view const& values,
sort::sort_groupby_helper& helper,
cudf::detail::result_cache& cache,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
: col_idx(col_idx), helper(helper), cache(cache), values(values), stream(stream), mr(mr)
{
}

struct aggregrate_result_functor : store_result_functor {
using store_result_functor::store_result_functor;
template <aggregation::Kind k>
void operator()(aggregation const& agg)
{
}

private:
/**
* @brief Get the grouped values
*
* Computes the grouped values from @p values on first invocation and returns
* the stored result on subsequent invocation
*/
column_view get_grouped_values()
{
// TODO (dm): After implementing single pass multi-agg, explore making a
// cache of all grouped value columns rather than one at a time
if (grouped_values)
return grouped_values->view();
else if (sorted_values)
// TODO (dm): When we implement scan, it wouldn't be ok to return sorted
// values when asked for grouped values. Change this then.
return sorted_values->view();
else
grouped_values = helper.grouped_values(values);
return grouped_values->view();
};

/**
* @brief Get the grouped and sorted values
*
* Computes the grouped and sorted (within each group) values from @p values
* on first invocation and returns the stored result on subsequent invocation
*/
column_view get_sorted_values()
{
if (not sorted_values) sorted_values = helper.sorted_values(values);
return sorted_values->view();
};

private:
size_type col_idx; ///< Index of column in requests being operated on
sort::sort_groupby_helper& helper; ///< Sort helper
cudf::detail::result_cache& cache; ///< cache of results to store into
column_view const& values; ///< Column of values to group and aggregate

rmm::cuda_stream_view stream; ///< CUDA stream on which to execute kernels
rmm::mr::device_memory_resource* mr; ///< Memory resource to allocate space for results

std::unique_ptr<column> sorted_values; ///< Memoised grouped and sorted values
std::unique_ptr<column> grouped_values; ///< Memoised grouped values
};

template <>
void store_result_functor::operator()<aggregation::COUNT_VALID>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::COUNT_VALID>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand All @@ -129,7 +75,7 @@ void store_result_functor::operator()<aggregation::COUNT_VALID>(aggregation cons
}

template <>
void store_result_functor::operator()<aggregation::COUNT_ALL>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::COUNT_ALL>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand All @@ -138,7 +84,7 @@ void store_result_functor::operator()<aggregation::COUNT_ALL>(aggregation const&
}

template <>
void store_result_functor::operator()<aggregation::SUM>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::SUM>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand All @@ -149,7 +95,7 @@ void store_result_functor::operator()<aggregation::SUM>(aggregation const& agg)
};

template <>
void store_result_functor::operator()<aggregation::ARGMAX>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::ARGMAX>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand All @@ -164,7 +110,7 @@ void store_result_functor::operator()<aggregation::ARGMAX>(aggregation const& ag
};

template <>
void store_result_functor::operator()<aggregation::ARGMIN>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::ARGMIN>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand All @@ -179,7 +125,7 @@ void store_result_functor::operator()<aggregation::ARGMIN>(aggregation const& ag
};

template <>
void store_result_functor::operator()<aggregation::MIN>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::MIN>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand Down Expand Up @@ -216,7 +162,7 @@ void store_result_functor::operator()<aggregation::MIN>(aggregation const& agg)
};

template <>
void store_result_functor::operator()<aggregation::MAX>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::MAX>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand Down Expand Up @@ -253,7 +199,7 @@ void store_result_functor::operator()<aggregation::MAX>(aggregation const& agg)
};

template <>
void store_result_functor::operator()<aggregation::MEAN>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::MEAN>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand All @@ -277,7 +223,7 @@ void store_result_functor::operator()<aggregation::MEAN>(aggregation const& agg)
};

template <>
void store_result_functor::operator()<aggregation::VARIANCE>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::VARIANCE>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand All @@ -300,7 +246,7 @@ void store_result_functor::operator()<aggregation::VARIANCE>(aggregation const&
};

template <>
void store_result_functor::operator()<aggregation::STD>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::STD>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand All @@ -314,7 +260,7 @@ void store_result_functor::operator()<aggregation::STD>(aggregation const& agg)
};

template <>
void store_result_functor::operator()<aggregation::QUANTILE>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::QUANTILE>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand All @@ -335,7 +281,7 @@ void store_result_functor::operator()<aggregation::QUANTILE>(aggregation const&
};

template <>
void store_result_functor::operator()<aggregation::MEDIAN>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::MEDIAN>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand All @@ -355,7 +301,7 @@ void store_result_functor::operator()<aggregation::MEDIAN>(aggregation const& ag
};

template <>
void store_result_functor::operator()<aggregation::NUNIQUE>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::NUNIQUE>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand All @@ -372,7 +318,7 @@ void store_result_functor::operator()<aggregation::NUNIQUE>(aggregation const& a
};

template <>
void store_result_functor::operator()<aggregation::NTH_ELEMENT>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::NTH_ELEMENT>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand Down Expand Up @@ -401,7 +347,7 @@ void store_result_functor::operator()<aggregation::NTH_ELEMENT>(aggregation cons
}

template <>
void store_result_functor::operator()<aggregation::COLLECT>(aggregation const& agg)
void aggregrate_result_functor::operator()<aggregation::COLLECT>(aggregation const& agg)
{
if (cache.has_result(col_idx, agg)) return;

Expand All @@ -426,7 +372,7 @@ std::pair<std::unique_ptr<table>, std::vector<aggregation_result>> groupby::sort

for (size_t i = 0; i < requests.size(); i++) {
auto store_functor =
detail::store_result_functor(i, requests[i].values, helper(), cache, stream, mr);
detail::aggregrate_result_functor(i, requests[i].values, helper(), cache, stream, mr);
for (size_t j = 0; j < requests[i].aggregations.size(); j++) {
// TODO (dm): single pass compute all supported reductions
cudf::detail::aggregation_dispatcher(
Expand Down
97 changes: 97 additions & 0 deletions cpp/src/groupby/sort/functors.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/detail/aggregation/result_cache.hpp>
#include <cudf/detail/groupby/sort_helper.hpp>
#include <cudf/types.hpp>

#include <rmm/cuda_stream_view.hpp>

#include <memory>

namespace cudf {
namespace groupby {
namespace detail {
/**
* @brief Functor to dispatch aggregation with
*
* This functor is to be used with `aggregation_dispatcher` to compute the
* appropriate aggregation. If the values on which to run the aggregation are
* unchanged, then this functor should be re-used. This is because it stores
* memoised sorted and/or grouped values and re-using will save on computation
* of these values.
*/
struct store_result_functor {
store_result_functor(size_type col_idx,
karthikeyann marked this conversation as resolved.
Show resolved Hide resolved
column_view const& values,
sort::sort_groupby_helper& helper,
cudf::detail::result_cache& cache,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
: col_idx(col_idx), helper(helper), cache(cache), values(values), stream(stream), mr(mr)
{
}

protected:
/**
* @brief Get the grouped values
*
* Computes the grouped values from @p values on first invocation and returns
* the stored result on subsequent invocation
*/
column_view get_grouped_values()
{
// TODO (dm): After implementing single pass multi-agg, explore making a
// cache of all grouped value columns rather than one at a time
if (grouped_values)
return grouped_values->view();
else if (sorted_values)
// TODO (dm): When we implement scan, it wouldn't be ok to return sorted
// values when asked for grouped values. Change this then.
karthikeyann marked this conversation as resolved.
Show resolved Hide resolved
return sorted_values->view();
else
grouped_values = helper.grouped_values(values);
return grouped_values->view();
karthikeyann marked this conversation as resolved.
Show resolved Hide resolved
};

/**
* @brief Get the grouped and sorted values
*
* Computes the grouped and sorted (within each group) values from @p values
* on first invocation and returns the stored result on subsequent invocation
*/
column_view get_sorted_values()
{
if (not sorted_values) sorted_values = helper.sorted_values(values);
return sorted_values->view();
karthikeyann marked this conversation as resolved.
Show resolved Hide resolved
};

protected:
size_type col_idx; ///< Index of column in requests being operated on
sort::sort_groupby_helper& helper; ///< Sort helper
cudf::detail::result_cache& cache; ///< cache of results to store into
column_view const& values; ///< Column of values to group and aggregate

rmm::cuda_stream_view stream; ///< CUDA stream on which to execute kernels
rmm::mr::device_memory_resource* mr; ///< Memory resource to allocate space for results

std::unique_ptr<column> sorted_values; ///< Memoised grouped and sorted values
std::unique_ptr<column> grouped_values; ///< Memoised grouped values
};
} // namespace detail
} // namespace groupby
} // namespace cudf
40 changes: 40 additions & 0 deletions cpp/src/groupby/sort/group_cumsum.cu
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
/*
* Copyright (c) 2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <cudf/dictionary/dictionary_column_view.hpp>
#include <groupby/sort/group_scan_util.cuh>

#include <rmm/cuda_stream_view.hpp>

namespace cudf {
namespace groupby {
namespace detail {
std::unique_ptr<column> group_cumsum(column_view const& values,
size_type num_groups,
rmm::device_vector<size_type> const& group_labels,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
auto values_type = cudf::is_dictionary(values.type())
? dictionary_column_view(values).keys().type()
: values.type();
return type_dispatcher(
values_type, scan_functor<aggregation::SUM>{}, values, num_groups, group_labels, stream, mr);
}

} // namespace detail
} // namespace groupby
} // namespace cudf
Loading