Add groupby scan operations (sort groupby) #7387

Merged
merged 56 commits into from
Mar 23, 2021
Changes from 48 commits
e3c6c53
rename sort/groupby.cu to sort/aggregate.cpp
karthikeyann Feb 16, 2021
92414a0
cpp error order instantiation fix
karthikeyann Feb 16, 2021
6e7910f
rename store_result_functor to aggregate_result_functor
karthikeyann Feb 16, 2021
ab9daec
add groupby::sort_scan, scan_result_functor
karthikeyann Feb 16, 2021
cdc1bb5
add group_cumsum (initial commit, compilable)
karthikeyann Feb 16, 2021
37c3144
Add groupby::scan interface
isVoid Feb 17, 2021
dff1268
review comments (jake)
karthikeyann Mar 1, 2021
3d5c022
rename group_cumsum.cu to group_sum_scan.cu
karthikeyann Mar 1, 2021
ba8b2bc
groupby sum_scan, add unit tests (null cases)
karthikeyann Mar 1, 2021
e4639cc
enable fixed_point for groupby sum_scan
karthikeyann Mar 2, 2021
26c5668
add groupby min_scan, max_scan
karthikeyann Mar 2, 2021
3e097d7
add groupby count_scan
karthikeyann Mar 3, 2021
35085b6
minor test code fix
karthikeyann Mar 3, 2021
ff97904
add groupby min_scan, max_scan tests
karthikeyann Mar 3, 2021
62e7dec
add unit test unsupported throw
karthikeyann Mar 3, 2021
a4d5a04
add groupby count_scan tests
karthikeyann Mar 3, 2021
8e725db
review comments (davidwendt)
karthikeyann Mar 3, 2021
0d09c0a
include cleanup
karthikeyann Mar 3, 2021
0e1f601
Merge branch 'branch-0.19' of github.com:rapidsai/cudf into fea-group…
karthikeyann Mar 3, 2021
39e1bd6
fix file rename in cmake
karthikeyann Mar 3, 2021
65e7378
indentation fix
karthikeyann Mar 3, 2021
970a054
fix missing file cmake
karthikeyann Mar 3, 2021
5b7ea1c
missing files include cmake
karthikeyann Mar 3, 2021
dddfd94
move shift to another PR
karthikeyann Mar 3, 2021
3b588f1
replace device_vector with device_uvector in sort groupby
karthikeyann Mar 5, 2021
837c0f1
replace device_vector const& with device_span<const>
karthikeyann Mar 5, 2021
69cac99
Merge branch 'branch-0.19' of github.com:rapidsai/cudf into fea-group…
karthikeyann Mar 5, 2021
f83c5a0
Merge branch 'enh-groupby_uvector_span' of github.com:karthikeyann/cu…
karthikeyann Mar 5, 2021
13bbaf8
replace device_vector const& with device_span
karthikeyann Mar 5, 2021
511d1be
Update cpp/src/groupby/sort/sort_helper.cu
karthikeyann Mar 6, 2021
f5d0f5b
use device_span (group_scan.hpp)
karthikeyann Mar 8, 2021
4b5aa41
fix segmentation fault in thrust::copy
karthikeyann Mar 8, 2021
8c375fb
fix zero init in group_label
karthikeyann Mar 8, 2021
d9fbda0
Merge branch 'enh-groupby_uvector_span' of github.com:karthikeyann/cu…
karthikeyann Mar 8, 2021
7830260
Apply suggestions from code review (vuule)
karthikeyann Mar 9, 2021
206d463
update copyright year in all files
karthikeyann Mar 9, 2021
8d1bd8a
add mismatch num_rows error test
karthikeyann Mar 9, 2021
42825d3
stylefix
karthikeyann Mar 9, 2021
2175a85
Merge branch 'branch-0.19' of github.com:rapidsai/cudf into fea-group…
karthikeyann Mar 9, 2021
43add1a
fix device_span error
karthikeyann Mar 9, 2021
f841f9f
Merge branch 'branch-0.19' of github.com:rapidsai/cudf into fea-group…
karthikeyann Mar 9, 2021
592b26a
in scan don't return sorted grouped values
karthikeyann Mar 11, 2021
003d2f2
Apply suggestions from code review
karthikeyann Mar 16, 2021
9fad1b6
review comments
karthikeyann Mar 16, 2021
216dc09
Merge branch 'fea-groupby_scan' of github.com:karthikeyann/cudf into …
karthikeyann Mar 16, 2021
7c9d18a
dictionary column unsupported
karthikeyann Mar 16, 2021
f5e31cb
style fix, include cleanup
karthikeyann Mar 17, 2021
51e505d
remove reduntant tests
karthikeyann Mar 18, 2021
632c62b
address review comments(ttnghia)
karthikeyann Mar 19, 2021
0b3f92a
Revert "address review comments(ttnghia)"
karthikeyann Mar 19, 2021
0998a07
address review comments
karthikeyann Mar 19, 2021
6b7ef34
Merge branch 'branch-0.19' of github.com:rapidsai/cudf into fea-group…
karthikeyann Mar 19, 2021
3902d3d
Apply suggestions from code review (davidwendt)
karthikeyann Mar 22, 2021
118513d
include cleanup
karthikeyann Mar 22, 2021
bb5dfd7
update unit test style, type aliases
karthikeyann Mar 22, 2021
c8e4b99
more tests cleanup
karthikeyann Mar 22, 2021
9 changes: 7 additions & 2 deletions cpp/CMakeLists.txt
@@ -1,5 +1,5 @@
#=============================================================================
# Copyright (c) 2018-2020, NVIDIA CORPORATION.
# Copyright (c) 2018-2021, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -189,7 +189,7 @@ add_library(cudf
src/groupby/hash/groupby.cu
src/groupby/sort/group_argmax.cu
src/groupby/sort/group_argmin.cu
src/groupby/sort/groupby.cu
src/groupby/sort/aggregate.cpp
src/groupby/sort/group_collect.cu
src/groupby/sort/group_count.cu
src/groupby/sort/group_max.cu
@@ -199,6 +199,11 @@ add_library(cudf
src/groupby/sort/group_quantiles.cu
src/groupby/sort/group_std.cu
src/groupby/sort/group_sum.cu
src/groupby/sort/scan.cpp
src/groupby/sort/group_count_scan.cu
src/groupby/sort/group_max_scan.cu
src/groupby/sort/group_min_scan.cu
src/groupby/sort/group_sum_scan.cu
src/groupby/sort/sort_helper.cu
src/hash/hashing.cu
src/interop/dlpack.cpp
10 changes: 5 additions & 5 deletions cpp/include/cudf/detail/groupby/sort_helper.hpp
@@ -22,7 +22,7 @@
#include <cudf/types.hpp>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_vector.hpp>
#include <rmm/device_uvector.hpp>

namespace cudf {
namespace groupby {
@@ -40,8 +40,8 @@ namespace sort {
* value column
*/
struct sort_groupby_helper {
using index_vector = rmm::device_vector<size_type>;
using bitmask_vector = rmm::device_vector<bitmask_type>;
using index_vector = rmm::device_uvector<size_type>;
using bitmask_vector = rmm::device_uvector<bitmask_type>;
using column_ptr = std::unique_ptr<column>;
using index_vector_ptr = std::unique_ptr<index_vector>;
using bitmask_vector_ptr = std::unique_ptr<bitmask_vector>;
@@ -63,8 +63,8 @@ struct sort_groupby_helper {
sorted keys_pre_sorted = sorted::NO)
: _keys(keys),
_num_keys(-1),
_include_null_keys(include_null_keys),
_keys_pre_sorted(keys_pre_sorted)
_keys_pre_sorted(keys_pre_sorted),
_include_null_keys(include_null_keys)
{
if (keys_pre_sorted == sorted::YES and include_null_keys == null_policy::EXCLUDE and
has_nulls(keys)) {
2 changes: 0 additions & 2 deletions cpp/include/cudf/detail/null_mask.cuh
@@ -23,8 +23,6 @@

#include <rmm/cuda_stream_view.hpp>

using cudf::device_span;

namespace cudf {
namespace detail {
/**
62 changes: 61 additions & 1 deletion cpp/include/cudf/groupby.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2019-2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -166,6 +166,61 @@ class groupby {
std::vector<aggregation_request> const& requests,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Performs grouped scans on the specified values.
*
* The values to aggregate and the aggregations to perform are specified in an
* `aggregation_request`. Each request contains a `column_view` of values to
* aggregate and a set of `aggregation`s to perform on those elements.
*
* For each `aggregation` in a request, `values[i]` is scan aggregated with
* all previous `values[j]` where rows `i` and `j` in `keys` are equivalent.
*
* The `size()` of the request column must equal `keys.num_rows()`.
*
* For every `aggregation_request` an `aggregation_result` will be returned.
* The `aggregation_result` holds the resulting column(s) for each requested
* aggregation on the `request`s values. The order of the columns in each
* result is the same order as was specified in the request.
*
* The returned `table` contains the group labels for each row, i.e., the
* `keys` given to groupby object. Element `i` across all aggregation results
* belongs to the group at row `i` in the group labels table.
*
* The order of the rows in the group labels is arbitrary. Furthermore,
* successive `groupby::scan` calls may return results in different orders.
*
* @throws cudf::logic_error If `requests[i].values.size() !=
* keys.num_rows()`.
*
* Example:
* ```
* Input:
* keys: {1 2 1 3 1}
* {1 2 1 4 1}
* request:
* values: {3 1 4 9 2}
* aggregations: {{SUM}, {MIN}}
*
* result:
*
* keys: {3 1 1 1 2}
* {4 1 1 1 2}
* values:
* SUM: {9 3 7 9 1}
* MIN: {9 3 3 2 1}
* ```
*
* @param requests The set of columns to scan and the scans to perform
* @param mr Device memory resource used to allocate the returned table and columns' device memory
* @return Pair containing the table with each group's key and
* a vector of aggregation_results for each request in the same order as
* specified in `requests`.
*/
std::pair<std::unique_ptr<table>, std::vector<aggregation_result>> scan(
std::vector<aggregation_request> const& requests,
Review comment (Contributor):
Based on the latest guidelines

Suggested change
- std::vector<aggregation_request> const& requests,
+ host_span<aggregation_request const> requests,

Reply (Contributor Author):
I will move it to another PR.

rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief The grouped data corresponding to a groupby operation on a set of values.
*
@@ -231,6 +286,11 @@ class groupby {
std::vector<aggregation_request> const& requests,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr);

std::pair<std::unique_ptr<table>, std::vector<aggregation_result>> sort_scan(
std::vector<aggregation_request> const& requests,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr);
};
/** @} */
} // namespace groupby
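
To make the `groupby::scan` semantics documented above concrete, here is a small host-side reference sketch in plain C++. It is not libcudf code: `reference_grouped_scan` and `grouped_scan_result` are hypothetical names, the two key columns are modeled as one `std::pair<int, int>` per row, and results are kept in input row order, which is only one of the orders the documentation permits. It computes the inclusive running SUM and MIN within each group.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical result type for this sketch (not a libcudf type).
struct grouped_scan_result {
  std::vector<int> sum;  // inclusive running sum within each group
  std::vector<int> min;  // inclusive running minimum within each group
};

// Hypothetical host-side reference: for every row i, combine values[i] with
// all earlier values[j] whose key row equals key row i. Output row i
// corresponds to input row i (assumption: input order is preserved).
grouped_scan_result reference_grouped_scan(std::vector<std::pair<int, int>> const& keys,
                                           std::vector<int> const& values)
{
  // Same precondition as the documented @throws clause.
  assert(keys.size() == values.size());

  grouped_scan_result out;
  out.sum.reserve(values.size());
  out.min.reserve(values.size());

  std::map<std::pair<int, int>, int> running_sum;
  std::map<std::pair<int, int>, int> running_min;

  for (std::size_t i = 0; i < keys.size(); ++i) {
    // emplace only inserts when the group is seen for the first time.
    auto sum_it = running_sum.emplace(keys[i], 0).first;
    sum_it->second += values[i];

    auto min_it = running_min.emplace(keys[i], values[i]).first;
    min_it->second = std::min(min_it->second, values[i]);

    out.sum.push_back(sum_it->second);
    out.min.push_back(min_it->second);
  }
  return out;
}
```

For the documented input (keys `{1 2 1 3 1}` / `{1 2 1 4 1}`, values `{3 1 4 9 2}`), this yields SUM `{3 1 7 9 9}` and MIN `{3 1 3 9 2}` in input order. Regrouping those rows by key gives the `{9 3 7 9 1}` and `{9 3 3 2 1}` shown in the doc comment.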
2 changes: 0 additions & 2 deletions cpp/src/bitmask/null_mask.cu
@@ -44,8 +44,6 @@
#include <numeric>
#include <type_traits>

using cudf::device_span;

namespace cudf {
size_type state_null_count(mask_state state, size_type size)
{
39 changes: 29 additions & 10 deletions cpp/src/groupby/groupby.cu
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2020, NVIDIA CORPORATION.
* Copyright (c) 2019-2021, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -159,23 +159,42 @@ std::pair<std::unique_ptr<table>, std::vector<aggregation_result>> groupby::aggr
return dispatch_aggregation(requests, 0, mr);
}

// Compute scan requests
std::pair<std::unique_ptr<table>, std::vector<aggregation_result>> groupby::scan(
std::vector<aggregation_request> const& requests, rmm::mr::device_memory_resource* mr)
{
CUDF_FUNC_RANGE();
CUDF_EXPECTS(
std::all_of(requests.begin(),
requests.end(),
[this](auto const& request) { return request.values.size() == _keys.num_rows(); }),
"Size mismatch between request values and groupby keys.");

verify_valid_requests(requests);

if (_keys.num_rows() == 0) { return std::make_pair(empty_like(_keys), empty_results(requests)); }

return sort_scan(requests, rmm::cuda_stream_default, mr);
}

groupby::groups groupby::get_groups(table_view values, rmm::mr::device_memory_resource* mr)
{
CUDF_FUNC_RANGE();
auto grouped_keys = helper().sorted_keys(rmm::cuda_stream_default, mr);

auto group_offsets = helper().group_offsets(0);
auto const& group_offsets = helper().group_offsets(rmm::cuda_stream_default);
std::vector<size_type> group_offsets_vector(group_offsets.size());
thrust::copy(group_offsets.begin(), group_offsets.end(), group_offsets_vector.begin());
thrust::copy(thrust::device_pointer_cast(group_offsets.begin()),
thrust::device_pointer_cast(group_offsets.end()),
group_offsets_vector.begin());

std::unique_ptr<table> grouped_values{nullptr};
if (values.num_columns()) {
grouped_values = cudf::detail::gather(values,
helper().key_sort_order(),
cudf::out_of_bounds_policy::DONT_CHECK,
cudf::detail::negative_index_policy::NOT_ALLOWED,
rmm::cuda_stream_default,
mr);
auto grouped_values = cudf::detail::gather(values,
helper().key_sort_order(),
cudf::out_of_bounds_policy::DONT_CHECK,
cudf::detail::negative_index_policy::NOT_ALLOWED,
rmm::cuda_stream_default,
mr);
return groupby::groups{
std::move(grouped_keys), std::move(group_offsets_vector), std::move(grouped_values)};
} else {