Refactor Python and Cython internals for groupby aggregation #7818

Merged
59 commits
26bafd0
Don't identify decimals as strings.
vyasr Mar 24, 2021
babcdfc
Reject all extension types as string types.
vyasr Mar 25, 2021
2b00611
Create separate lists for extension type methods.
vyasr Mar 25, 2021
76ab556
Merge branch 'branch-0.19' into fix/issue7687_part2
vyasr Mar 25, 2021
1ebde51
Enable collect for decimals.
vyasr Mar 25, 2021
4c5d876
Enable argmin and argmax.
vyasr Mar 25, 2021
4134e43
Fix variance key name.
vyasr Mar 25, 2021
43cf580
Move groupby aggregation list to groupby.py and clean up the assignme…
vyasr Mar 25, 2021
474a179
Disable aggs that are overrides of actual methods.
vyasr Mar 25, 2021
25e74ef
Move more logic out of the GroupBy class.
vyasr Mar 25, 2021
8a44827
Simplify getattr usage.
vyasr Mar 25, 2021
6b5c67f
Clearly documented unknown failures.
vyasr Mar 25, 2021
8e45ad0
Match other class groupbys to strings.
vyasr Mar 25, 2021
81ffe0a
Fix style and remove unsupported operations.
vyasr Mar 25, 2021
6d3fad3
Apply black reformattings.
vyasr Mar 25, 2021
714742d
Remove variance from obviously unsupported types.
vyasr Mar 25, 2021
ea4ed2e
Defer getattr to getitem if possible.
vyasr Mar 25, 2021
026bb4e
Make getattr safe for copying.
vyasr Mar 25, 2021
1259032
Remove support for aggregating structs.
vyasr Mar 25, 2021
6c61806
Update documented list of groupby operations.
vyasr Mar 25, 2021
a14d30f
Move function out of loop.
vyasr Mar 25, 2021
12caa06
Merge branch 'branch-0.19' into fix/issue7687_part2
vyasr Mar 26, 2021
5c71bfe
Remove redundant test, add test of decimal.
vyasr Mar 27, 2021
25811b0
Fix formatting.
vyasr Mar 27, 2021
1450f2d
Merge branch 'branch-0.19' into fix/issue7687_part2
vyasr Mar 28, 2021
4a0f27d
Simplify drop aggs logic.
vyasr Mar 28, 2021
596496f
Inline drop logic.
vyasr Mar 28, 2021
1f5dadd
Remove empty aggregations in place rather than beforehand.
vyasr Mar 29, 2021
f02231f
Reuse aggregation object.
vyasr Mar 29, 2021
0e09786
Write a new make_aggregation2 function that returns the cdef class ra…
vyasr Mar 30, 2021
63897c1
Remove constructor of Aggregation to avoid recursion issue.
vyasr Mar 30, 2021
5fbde6f
Remove _AggregationFactory and just use classmethods of Aggregation.
vyasr Mar 30, 2021
f9508fe
Remove old API for aggregations, use new Cython object everywhere.
vyasr Mar 30, 2021
4c04b5f
Remove direct usage of the C++ groupby object in most of the Cython c…
vyasr Mar 30, 2021
b21dfaf
Add a docstring for the Aggregation class.
vyasr Mar 30, 2021
e3ae492
Don't use a dict as a default argument.
vyasr Mar 30, 2021
bac0de5
Use the uppercased versions of the acceptable aggregations.
vyasr Mar 30, 2021
4c7f8d2
Just define all the aggregations available on the GroupBy as methods.
vyasr Mar 30, 2021
8d64ae5
Merge branch 'branch-0.20' into refactor/cleanup_aggregation_code
vyasr Apr 1, 2021
5ebbca2
Merge branch 'branch-0.20' into refactor/cleanup_aggregation_code
vyasr Apr 1, 2021
46e0545
Fix up merge.
vyasr Apr 1, 2021
26f1016
Some minor cleanup.
vyasr Apr 1, 2021
a7148a4
Fix docstrings.
vyasr Apr 1, 2021
4dc1d79
Fix style.
vyasr Apr 1, 2021
0c18a16
Remove extra blank line.
vyasr Apr 1, 2021
794aad1
Merge branch 'branch-0.20' into refactor/cleanup_aggregation_code
vyasr Apr 1, 2021
c9892b1
Remove TODO.
vyasr Apr 1, 2021
61e84b2
Apply black formatting.
vyasr Apr 1, 2021
35deb2f
Merge branch 'branch-0.20' into refactor/cleanup_aggregation_code
vyasr Apr 5, 2021
351afb6
Merge branch 'branch-0.20' into refactor/cleanup_aggregation_code
vyasr Apr 12, 2021
67c6ebc
Remove use of index and just use std::vector::back.
vyasr Apr 12, 2021
9cd725e
Update error message.
vyasr Apr 12, 2021
9f01b60
Create a temporary agg variable rather than popping from the vector.
vyasr Apr 12, 2021
cb3f5a0
Update docstring.
vyasr Apr 13, 2021
e96a795
Update error message.
vyasr Apr 13, 2021
2d304ae
Merge branch 'branch-0.20' into refactor/cleanup_aggregation_code
vyasr Apr 14, 2021
f52fffb
Update docstrings.
vyasr Apr 14, 2021
affc9c9
Update python/cudf/cudf/_lib/aggregation.pyx
galipremsagar Apr 14, 2021
4bc9601
Merge branch 'branch-0.20' into refactor/cleanup_aggregation_code
vyasr Apr 15, 2021
4 changes: 2 additions & 2 deletions python/cudf/cudf/_lib/aggregation.pxd
@@ -4,7 +4,7 @@ from libcpp.memory cimport unique_ptr
from cudf._lib.cpp.aggregation cimport aggregation


cdef unique_ptr[aggregation] make_aggregation(op, kwargs=*) except *

cdef class Aggregation:
cdef unique_ptr[aggregation] c_obj

cdef Aggregation make_aggregation(op, kwargs=*)
137 changes: 73 additions & 64 deletions python/cudf/cudf/_lib/aggregation.pyx
@@ -56,85 +56,55 @@ class AggregationKind(Enum):


cdef class Aggregation:
def __init__(self, op, **kwargs):
self.c_obj = move(make_aggregation(op, kwargs))

"""A Cython wrapper for aggregations.

**This class should never be instantiated using a standard constructor,
only using one of its many factories.** These factories handle mapping
different cudf operations to their libcudf analogs, e.g.
`cudf.DataFrame.idxmin` -> `libcudf.argmin`. Additionally, they perform
any additional configuration needed to translate Python arguments into
their corresponding C++ types (for instance, C++ enumerations used for
flag arguments). The factory approach is necessary to support operations
like `df.agg(lambda x: x.sum())`; such functions are called with this
class as an argument to generate the desired aggregation.
"""
@property
def kind(self):
return AggregationKind(self.c_obj.get()[0].kind).name.lower()


cdef unique_ptr[aggregation] make_aggregation(op, kwargs={}) except *:
"""
Parameters
----------
op : str or callable
If callable, must meet one of the following requirements:

* Is of the form lambda x: x.agg(*args, **kwargs), where
`agg` is the name of a supported aggregation. Used to
specify aggregations that take arguments, e.g.,
`lambda x: x.quantile(0.5)`.
* Is a user defined aggregation function that operates on
group values. In this case, the output dtype must be
specified in the `kwargs` dictionary.

Returns
-------
unique_ptr[aggregation]
"""
cdef Aggregation agg
if isinstance(op, str):
agg = getattr(_AggregationFactory, op)(**kwargs)
elif callable(op):
if op is list:
agg = _AggregationFactory.collect()
elif "dtype" in kwargs:
agg = _AggregationFactory.from_udf(op, **kwargs)
else:
agg = op(_AggregationFactory)
else:
raise TypeError("Unknown aggregation {}".format(op))
return move(agg.c_obj)

# The Cython pattern below enables us to create an Aggregation
# without ever calling its `__init__` method, which would otherwise
# result in a RecursionError.
cdef class _AggregationFactory:
return AggregationKind(self.c_obj.get()[0].kind).name

@classmethod
def sum(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_sum_aggregation())
return agg

@classmethod
def min(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_min_aggregation())
return agg

@classmethod
def max(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_max_aggregation())
return agg

@classmethod
def idxmin(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_argmin_aggregation())
return agg

@classmethod
def idxmax(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_argmax_aggregation())
return agg

@classmethod
def mean(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_mean_aggregation())
return agg

@@ -146,15 +116,15 @@ cdef class _AggregationFactory:
else:
c_null_handling = libcudf_types.null_policy.INCLUDE

cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_count_aggregation(
c_null_handling
))
return agg

@classmethod
def size(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_count_aggregation(
<libcudf_types.null_policy><underlying_type_t_null_policy>(
NullHandling.INCLUDE
@@ -164,63 +134,63 @@ cdef class _AggregationFactory:

@classmethod
def nunique(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_nunique_aggregation())
return agg

@classmethod
def nth(cls, libcudf_types.size_type size):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(
libcudf_aggregation.make_nth_element_aggregation(size)
)
return agg

@classmethod
def any(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_any_aggregation())
return agg

@classmethod
def all(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_all_aggregation())
return agg

@classmethod
def product(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_product_aggregation())
return agg

@classmethod
def sum_of_squares(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_sum_of_squares_aggregation())
return agg

@classmethod
def var(cls, ddof=1):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_variance_aggregation(ddof))
return agg

@classmethod
def std(cls, ddof=1):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_std_aggregation(ddof))
return agg

@classmethod
def median(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_median_aggregation())
return agg

@classmethod
def quantile(cls, q=0.5, interpolation="linear"):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()

if not pd.api.types.is_list_like(q):
q = [q]
@@ -240,19 +210,19 @@ cdef class _AggregationFactory:

@classmethod
def collect(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_collect_list_aggregation())
return agg

@classmethod
def unique(cls):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()
agg.c_obj = move(libcudf_aggregation.make_collect_set_aggregation())
return agg

@classmethod
def from_udf(cls, op, *args, **kwargs):
cdef Aggregation agg = Aggregation.__new__(Aggregation)
cdef Aggregation agg = cls()

cdef libcudf_types.type_id tid
cdef libcudf_types.data_type out_dtype
@@ -282,3 +252,42 @@ cdef class _AggregationFactory:
libcudf_aggregation.udf_type.PTX, cpp_str, out_dtype
))
return agg
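
The factory-only construction described in the `Aggregation` docstring can be sketched in plain Python. This is only an illustration under stated assumptions: `Agg` and its `_payload` attribute are hypothetical stand-ins for the Cython class and its `c_obj` member, and nothing here calls libcudf.

```python
class Agg:
    """Hypothetical pure-Python stand-in for the Cython Aggregation class."""

    def __init__(self):
        # No constructor arguments: an instance starts empty and is
        # populated by one of the classmethod factories below.
        self._payload = None

    @property
    def kind(self):
        # Mirrors the `kind` property: report which aggregation this is.
        return self._payload[0]

    @classmethod
    def sum(cls):
        agg = cls()
        agg._payload = ("SUM", {})
        return agg

    @classmethod
    def var(cls, ddof=1):
        agg = cls()
        agg._payload = ("VARIANCE", {"ddof": ddof})
        return agg


# Lookup by name resolves to a factory on the class.
by_name = getattr(Agg, "sum")()

# A lambda is called with the class itself, so `x.var(ddof=0)`
# also resolves to a factory and can carry arguments.
by_lambda = (lambda x: x.var(ddof=0))(Agg)
```

Because every factory is a classmethod, the same class can be handed to a user-supplied lambda, which is what makes the `lambda x: x.quantile(0.5)` form mentioned in the docstring workable.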


cdef Aggregation make_aggregation(op, kwargs=None):
r"""
Parameters
----------
op : str or callable
If callable, must meet one of the following requirements:

* Is of the form lambda x: x.agg(*args, **kwargs), where
`agg` is the name of a supported aggregation. Used to
specify aggregations that take arguments, e.g.,
`lambda x: x.quantile(0.5)`.
* Is a user defined aggregation function that operates on
group values. In this case, the output dtype must be
specified in the `kwargs` dictionary.
\*\*kwargs : dict, optional
Any keyword arguments to be passed to the op.

Returns
-------
Aggregation
"""
if kwargs is None:
kwargs = {}

cdef Aggregation agg
if isinstance(op, str):
agg = getattr(Aggregation, op)(**kwargs)
elif callable(op):
if op is list:
agg = Aggregation.collect()
elif "dtype" in kwargs:
agg = Aggregation.from_udf(op, **kwargs)
else:
agg = op(Aggregation)
else:
raise TypeError(f"Unknown aggregation {op}")
return agg
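
The dispatch in `make_aggregation` can be tried out without building cudf using a pure-Python sketch. The names `FakeAggregation` and `make_agg` are hypothetical and exist only for this illustration; the branching follows the function above, including the `kwargs=None` default that replaces the earlier `kwargs={}`.

```python
class FakeAggregation:
    """Hypothetical stand-in that records which factory produced it."""

    def __init__(self, name, **params):
        self.name = name
        self.params = params

    @classmethod
    def sum(cls):
        return cls("sum")

    @classmethod
    def quantile(cls, q=0.5):
        return cls("quantile", q=q)

    @classmethod
    def collect(cls):
        return cls("collect")

    @classmethod
    def from_udf(cls, op, dtype):
        return cls("udf", op=op, dtype=dtype)


def make_agg(op, kwargs=None):
    # Use None as the default so that no dict is shared between calls.
    if kwargs is None:
        kwargs = {}

    if isinstance(op, str):
        # A string names a factory classmethod.
        return getattr(FakeAggregation, op)(**kwargs)
    if callable(op):
        if op is list:
            # The builtin `list` is shorthand for collecting group values.
            return FakeAggregation.collect()
        if "dtype" in kwargs:
            # A user-defined function must come with an output dtype.
            return FakeAggregation.from_udf(op, **kwargs)
        # Otherwise treat op as `lambda x: x.agg(...)` and pass the class.
        return op(FakeAggregation)
    raise TypeError(f"Unknown aggregation {op}")
```

For instance, `make_agg("sum")`, `make_agg(list)`, and `make_agg(lambda x: x.quantile(0.9))` each go through a different branch, while a non-string, non-callable argument such as `make_agg(42)` raises `TypeError`.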