Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add groupby scan aggregation to cudf #7759

Merged
merged 72 commits into from
Apr 22, 2021
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
Show all changes
72 commits
Select commit Hold shift + click to select a range
26bafd0
Don't identify decimals as strings.
vyasr Mar 24, 2021
babcdfc
Reject all extension types as string types.
vyasr Mar 25, 2021
2b00611
Create separate lists for extension type methods.
vyasr Mar 25, 2021
76ab556
Merge branch 'branch-0.19' into fix/issue7687_part2
vyasr Mar 25, 2021
1ebde51
Enable collect for decimals.
vyasr Mar 25, 2021
4c5d876
Enable argmin and argmax.
vyasr Mar 25, 2021
4134e43
Fix variance key name.
vyasr Mar 25, 2021
43cf580
Move groupby aggregation list to groupby.py and clean up the assignme…
vyasr Mar 25, 2021
474a179
Disable aggs that are overrides of actual methods.
vyasr Mar 25, 2021
25e74ef
Move more logic out of the GroupBy class.
vyasr Mar 25, 2021
8a44827
Simplify getattr usage.
vyasr Mar 25, 2021
6b5c67f
Clearly documented unknown failures.
vyasr Mar 25, 2021
8e45ad0
Match other class groupbys to strings.
vyasr Mar 25, 2021
81ffe0a
Fix style and remove unsupported operations.
vyasr Mar 25, 2021
6d3fad3
Apply black reformattings.
vyasr Mar 25, 2021
714742d
Remove variance from obviously unsupported types.
vyasr Mar 25, 2021
ea4ed2e
Defer getattr to getitem if possible.
vyasr Mar 25, 2021
026bb4e
Make getattr safe for copying.
vyasr Mar 25, 2021
1259032
Remove support for aggregating structs.
vyasr Mar 25, 2021
6c61806
Update documented list of groupby operations.
vyasr Mar 25, 2021
a14d30f
Move function out of loop.
vyasr Mar 25, 2021
12caa06
Merge branch 'branch-0.19' into fix/issue7687_part2
vyasr Mar 26, 2021
5c71bfe
Remove redundant test, add test of decimal.
vyasr Mar 27, 2021
25811b0
Fix formatting.
vyasr Mar 27, 2021
1450f2d
Merge branch 'branch-0.19' into fix/issue7687_part2
vyasr Mar 28, 2021
4a0f27d
Simplify drop aggs logic.
vyasr Mar 28, 2021
596496f
Inline drop logic.
vyasr Mar 28, 2021
1f5dadd
Remove empty aggregations in place rather than beforehand.
vyasr Mar 29, 2021
f02231f
Reuse aggregation object.
vyasr Mar 29, 2021
5835f56
add cumsum, cummin, cummax, cumcount to Aggregation
karthikeyann Mar 30, 2021
87f9a59
expose groupby.scan()
karthikeyann Mar 30, 2021
5145501
add scan aggregrations as attributes
karthikeyann Mar 30, 2021
c3602f9
Cython code for groupby scan in groupby.agg()
karthikeyann Mar 30, 2021
1d48585
add python test for grouby.scan - sum, min, max, count
karthikeyann Mar 30, 2021
f557406
unit test mixed agg, scan error case
karthikeyann Mar 30, 2021
0e09786
Write a new make_aggregation2 function that returns the cdef class ra…
vyasr Mar 30, 2021
63897c1
Remove constructor of Aggregation to avoid recursion issue.
vyasr Mar 30, 2021
5fbde6f
Remove _AggregationFactory and just use classmethods of Aggregation.
vyasr Mar 30, 2021
f9508fe
Remove old API for aggregations, use new Cython object everywhere.
vyasr Mar 30, 2021
4c04b5f
Remove direct usage of the C++ groupby object in most of the Cython c…
vyasr Mar 30, 2021
b21dfaf
Add a docstring for the Aggregation class.
vyasr Mar 30, 2021
e3ae492
Don't use a dict as a default argument.
vyasr Mar 30, 2021
bac0de5
Use the uppercased versions of the acceptable aggregations.
vyasr Mar 30, 2021
4c7f8d2
Just define all the aggregations available on the GroupBy as methods.
vyasr Mar 30, 2021
8d64ae5
Merge branch 'branch-0.20' into refactor/cleanup_aggregation_code
vyasr Apr 1, 2021
5ebbca2
Merge branch 'branch-0.20' into refactor/cleanup_aggregation_code
vyasr Apr 1, 2021
46e0545
Fix up merge.
vyasr Apr 1, 2021
26f1016
Some minor cleanup.
vyasr Apr 1, 2021
a7148a4
Fix docstrings.
vyasr Apr 1, 2021
4dc1d79
Fix style.
vyasr Apr 1, 2021
0c18a16
Remove extra blank line.
vyasr Apr 1, 2021
794aad1
Merge branch 'branch-0.20' into refactor/cleanup_aggregation_code
vyasr Apr 1, 2021
c9892b1
Remove TODO.
vyasr Apr 1, 2021
61e84b2
Apply black formatting.
vyasr Apr 1, 2021
c51a7d4
Merge branch 'branch-0.20' of github.com:rapidsai/cudf into fea-py-gr…
karthikeyann Apr 5, 2021
35deb2f
Merge branch 'branch-0.20' into refactor/cleanup_aggregation_code
vyasr Apr 5, 2021
ae88281
Merge branch 'branch-0.20' of github.com:rapidsai/cudf into fea-py-gr…
karthikeyann Apr 12, 2021
621216d
add "cumcount" to all column types
karthikeyann Apr 12, 2021
7f35bde
cumcount specialization
karthikeyann Apr 12, 2021
351afb6
Merge branch 'branch-0.20' into refactor/cleanup_aggregation_code
vyasr Apr 12, 2021
67c6ebc
Remove use of index and just use std::vector::back.
vyasr Apr 12, 2021
9cd725e
Update error message.
vyasr Apr 12, 2021
9f01b60
Create a temporary agg variable rather than popping from the vector.
vyasr Apr 12, 2021
cb3f5a0
Update docstring.
vyasr Apr 13, 2021
e96a795
Update error message.
vyasr Apr 13, 2021
2d304ae
Merge branch 'branch-0.20' into refactor/cleanup_aggregation_code
vyasr Apr 14, 2021
f52fffb
Update docstrings.
vyasr Apr 14, 2021
affc9c9
Update python/cudf/cudf/_lib/aggregation.pyx
galipremsagar Apr 14, 2021
925f303
Merge branch 'branch-0.20' of github.com:rapidsai/cudf into fea-py-gr…
karthikeyann Apr 15, 2021
18abe66
Merge commit 'refs/pull/7818/head' of github.com:rapidsai/cudf into f…
karthikeyann Apr 15, 2021
b37b1d2
Merge branch 'branch-0.20' of github.com:rapidsai/cudf into fea-py-gr…
karthikeyann Apr 16, 2021
0240825
address review comments
karthikeyann Apr 22, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 35 additions & 0 deletions python/cudf/cudf/_lib/aggregation.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -282,3 +282,38 @@ cdef class _AggregationFactory:
libcudf_aggregation.udf_type.PTX, cpp_str, out_dtype
))
return agg

# scan aggregations
karthikeyann marked this conversation as resolved.
Show resolved Hide resolved
# TODO: update this after adding per algorithm aggregation derived types
# https://github.com/rapidsai/cudf/issues/7106
@classmethod
def cumsum(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
agg.c_obj = move(libcudf_aggregation.make_sum_aggregation())
return agg

@classmethod
def cummin(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
agg.c_obj = move(libcudf_aggregation.make_min_aggregation())
return agg

@classmethod
def cummax(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
agg.c_obj = move(libcudf_aggregation.make_max_aggregation())
return agg

@classmethod
def cumcount(cls, dropna=False):
cdef libcudf_types.null_policy c_null_handling
if dropna:
c_null_handling = libcudf_types.null_policy.EXCLUDE
else:
c_null_handling = libcudf_types.null_policy.INCLUDE

cdef Aggregation agg = Aggregation.__new__(Aggregation)
agg.c_obj = move(libcudf_aggregation.make_count_aggregation(
c_null_handling
))
return agg
7 changes: 7 additions & 0 deletions python/cudf/cudf/_lib/cpp/groupby.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -65,5 +65,12 @@ cdef extern from "cudf/groupby.hpp" \
const vector[aggregation_request]& requests,
) except +

pair[
unique_ptr[table],
vector[aggregation_result]
] scan(
const vector[aggregation_request]& requests,
) except +

groups get_groups() except +
groups get_groups(table_view values) except +
57 changes: 53 additions & 4 deletions python/cudf/cudf/_lib/groupby.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -24,11 +24,19 @@ cimport cudf._lib.cpp.aggregation as libcudf_aggregation
# The sets below define the possible aggregations that can be performed on
# different dtypes. The uppercased versions of these strings correspond to
# elements of the AggregationKind enum.
_GROUPBY_SCANS = {
"cumcount",
"cumsum",
"cummin",
"cummax",
}

_CATEGORICAL_AGGS = {
"count",
"size",
"nunique",
"unique",
"cumcount",
}

_STRING_AGGS = {
Expand All @@ -40,6 +48,7 @@ _STRING_AGGS = {
"nth",
"collect",
"unique",
"cumcount",
karthikeyann marked this conversation as resolved.
Show resolved Hide resolved
}

_LIST_AGGS = {
Expand Down Expand Up @@ -135,6 +144,7 @@ cdef class GroupBy:
cdef Column col

aggregations = _drop_unsupported_aggs(values, aggregations)
cdef bool scan = _is_all_scan_aggregate(aggregations)
vyasr marked this conversation as resolved.
Show resolved Hide resolved

for i, (col_name, aggs) in enumerate(aggregations.items()):
col = values._data[col_name]
Expand All @@ -154,11 +164,18 @@ cdef class GroupBy:

try:
with nogil:
c_result = move(
self.c_obj.get()[0].aggregate(
c_agg_requests
if scan:
c_result = move(
self.c_obj.get()[0].scan(
c_agg_requests
)
)
else:
c_result = move(
self.c_obj.get()[0].aggregate(
c_agg_requests
)
)
)
except RuntimeError as e:
# TODO: remove this try..except after
# https://github.com/rapidsai/cudf/issues/7611
Expand Down Expand Up @@ -254,3 +271,35 @@ def _drop_unsupported_aggs(Table values, aggs):
raise DataError("No numeric types to aggregate")

return result
karthikeyann marked this conversation as resolved.
Show resolved Hide resolved


def _is_all_scan_aggregate(aggs):
"""
Returns true if all are scan aggregations.
Raises
------
NotImplementedError
If both reduction aggregations and scan aggregations are present.
vyasr marked this conversation as resolved.
Show resolved Hide resolved
"""

def get_name(agg):
return agg.__name__ if callable(agg) else agg

all_scan = all(
all(
get_name(agg_name) in _GROUPBY_SCANS for agg_name in aggs[col_name]
)
for col_name in aggs
)
any_scan = any(
any(
get_name(agg_name) in _GROUPBY_SCANS for agg_name in aggs[col_name]
)
for col_name in aggs
)

if not all_scan and any_scan:
raise NotImplementedError(
"Cannot perform both aggregation and scan in one operation"
)
return all_scan and any_scan
4 changes: 4 additions & 0 deletions python/cudf/cudf/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -588,6 +588,10 @@ def rolling(self, *args, **kwargs):
"nunique",
"collect",
"unique",
"cumcount",
"cumsum",
"cummin",
"cummax",
}


Expand Down
33 changes: 32 additions & 1 deletion python/cudf/cudf/tests/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@
_tomorrow = _now + np.timedelta64(1, "D")
_now = np.int64(_now.astype("datetime64[ns]"))
_tomorrow = np.int64(_tomorrow.astype("datetime64[ns]"))
_index_type_aggs = {"count", "idxmin", "idxmax"}
_index_type_aggs = {"count", "idxmin", "idxmax", "cumcount"}


def assert_groupby_results_equal(expect, got, sort=True, **kwargs):
Expand Down Expand Up @@ -1588,3 +1588,34 @@ def test_groupby_unique(by, data, dtype):
expect = pdf.groupby("by")["data"].unique()
got = gdf.groupby("by")["data"].unique()
assert_groupby_results_equal(expect, got)


@pytest.mark.parametrize("nelem", [2, 3, 100, 1000])
@pytest.mark.parametrize("func", ["cummin", "cummax", "cumcount", "cumsum"])
def test_groupby_2keys_scan(nelem, func):
pdf = make_frame(pd.DataFrame, nelem=nelem)
expect_df = pdf.groupby(["x", "y"], sort=True).agg(func)
got_df = (
make_frame(DataFrame, nelem=nelem)
.groupby(["x", "y"], sort=True)
.agg(func)
)
# pd.groupby.cumcount returns a series.
if isinstance(expect_df, pd.Series):
expect_df = expect_df.to_frame("val")
Comment on lines +1637 to +1639
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something that we should be matching the behavior of Pandas on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas gives cumcount for entire groupby object. Not on individual columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be groupby::count() instead?

Copy link
Contributor

@shwina shwina Apr 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few cases to consider here:

  1. df.groupby(...).cumcount()
  2. df.groupby(...).agg('cumcount')
  3. df.groupby(...).agg(['cumcount'])

Pandas returns a Series for (1) and (2), and a DataFrame for (3). We currently return a DataFrame in all cases.

(1) is easy for us to do: we define a cumcount() method and make sure we return a Series. (2) is less trivial for us: one of the first things we do is normalize the argument passed to agg() into a dictionary. Thus, we lose information about whether the user passed a string, a list or a dictionary to agg().

Handling (2) is also not necessarily relevant to this PR: we have the exact same problem for the size aggreegation as well.

My suggestion is let's make sure we handle (1) correctly for now, and not worry about (2) in this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you raise an issue about (2)?

Agreed lets tackle it in a follow up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shwina did you already write up the issue? If so, this conversation can be resolved.

expect_df = expect_df.set_index([pdf["x"], pdf["y"]]).sort_index()

check_dtype = False if func in _index_type_aggs else True
assert_groupby_results_equal(got_df, expect_df, check_dtype=check_dtype)


def test_groupby_mix_agg_scan():
err_msg = "Cannot perform both aggregation and scan in one operation"
func = ["cumsum", "sum"]
gb = make_frame(DataFrame, nelem=10).groupby(["x", "y"], sort=True)

gb.agg(func[0])
gb.agg(func[1])
gb.agg(func[1:])
with pytest.raises(NotImplementedError, match=err_msg):
gb.agg(func)