[REVIEW] Create agg() function for dataframes #6483

Merged
merged 44 commits into from
Dec 4, 2020
Changes from 40 commits
41866c7
Renamed skip_rows parameter to skiprows
Sep 29, 2020
19f5260
Merge branch 'branch-0.16' of https://github.com/rapidsai/cudf into b…
skirui-source Oct 7, 2020
ecb8e61
Merge branch 'branch-0.16' of https://github.com/rapidsai/cudf into b…
skirui-source Oct 9, 2020
374b572
Added new agg() function and test for dataframes
skirui-source Oct 9, 2020
9c8fe97
Improved the agg function and unit test
skirui-source Oct 13, 2020
2b395fb
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Oct 14, 2020
45affb0
Made improvements to agg function and test cases
skirui-source Oct 17, 2020
366fe39
Made dtype conversion improvements to agg function
skirui-source Oct 21, 2020
494a5cd
Made improvements to the agg function
skirui-source Oct 21, 2020
25b69da
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Oct 21, 2020
62d9a22
Made edits to agg function
skirui-source Oct 28, 2020
8968e37
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Oct 28, 2020
2e883d0
Addressed review comments on PR 6483
skirui-source Nov 3, 2020
170db82
fixed conflict in test_dataframe.py
skirui-source Nov 3, 2020
bd6e75b
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 3, 2020
8050441
Addressed review comments by Michael
skirui-source Nov 7, 2020
0eaef5f
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 7, 2020
3d56cb3
Addressed Michael's review comments
skirui-source Nov 11, 2020
2fa0694
Addressed Michael's review comments
skirui-source Nov 12, 2020
4d578f5
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 12, 2020
2473e69
Fixed bad find and replace
skirui-source Nov 18, 2020
7c01498
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 18, 2020
680df2c
Fixed bad find and replace
skirui-source Nov 18, 2020
2e0fbf9
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 19, 2020
0689eae
split into three tests for exceptions
skirui-source Nov 19, 2020
4ccca2f
split unit tests, all tests passing
skirui-source Nov 21, 2020
e26ec64
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 21, 2020
f149443
Addressed typecasting problem-call with Keith and Ashwin
skirui-source Nov 25, 2020
f13c030
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 25, 2020
ba99f32
Updated CHANGELOD.md and fixed style issues
skirui-source Nov 25, 2020
625eb06
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 25, 2020
ce1089f
Addressed Michael's reviews- fixed doctring issues
skirui-source Dec 1, 2020
9c6ac0a
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Dec 1, 2020
65b1384
Update CHANGELOG.md to address review comments
isVoid Dec 1, 2020
14bc311
Update CHANGELOG.md
isVoid Dec 1, 2020
97cfe14
fixed style issues
skirui-source Dec 1, 2020
4235db2
Merge branch 'aggfordataframe' of https://github.com/skirui-source/cu…
skirui-source Dec 3, 2020
2d58da8
Merge branch 'branch-0.17' into aggfordataframe
skirui-source Dec 3, 2020
fe41d20
fix docs
galipremsagar Dec 3, 2020
90540bb
Merge branch 'branch-0.17' into aggfordataframe
galipremsagar Dec 3, 2020
87306bb
don't run function again after checking exception
kkraus14 Dec 3, 2020
027e1c6
fix style issues
kkraus14 Dec 3, 2020
d470bac
Pre-emptively raise when the frame contains string columns
shwina Dec 4, 2020
5a8fbc1
Escape match regex
shwina Dec 4, 2020
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -35,6 +35,7 @@
- PR #6765 Cupy fallback for __array_function__ and __array_ufunc__ for cudf.Series
- PR #6817 Add support for scatter() on lists-of-struct columns
- PR #6805 Implement `cudf::detail::copy_if` for `decimal32` and `decimal64`
- PR #6483 Add `agg` function to aggregate dataframe using one or more operations
- PR #6726 Support selecting different hash functions in hash_partition
- PR #6619 Improve Dockerfile
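
For orientation, the entry for PR #6483 describes aggregating a dataframe with one or more operations. The sketch below tries the three input forms (a single name, a collection of names, a per-column dict) against pandas, which the PR's tests use as the reference. Using pandas instead of cudf is an assumption made so the snippet runs without a GPU; the frame and variable names are illustrative only, and cudf results may differ in dtype.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [3.0, 4.0, 5.0]})

# A single string name returns a Series indexed by column label.
single = df.agg("sum")

# A list of names returns a DataFrame with one row per aggregation.
several = df.agg(["min", "max"])

# A dict applies a different aggregation to each named column.
per_column = df.agg({"a": "sum", "b": "min"})
```

The return type follows the shape of the request: one aggregation collapses each column to a scalar, while several aggregations need a second axis to label them.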

129 changes: 127 additions & 2 deletions python/cudf/cudf/core/dataframe.py
@@ -1,5 +1,5 @@
# Copyright (c) 2018-2020, NVIDIA CORPORATION.
-from __future__ import division, print_function
+from __future__ import division

import inspect
import itertools
@@ -8,7 +8,7 @@
import sys
import warnings
from collections import OrderedDict, defaultdict
-from collections.abc import Mapping, Sequence
+from collections.abc import Iterable, Mapping, Sequence

import cupy
import numpy as np
@@ -3728,6 +3728,131 @@ def sort_values(
keep_index=not ignore_index,
)

    def agg(self, aggs, axis=None):
        """
        Aggregate using one or more operations over the specified axis.

        Parameters
        ----------
        aggs : Iterable (set, list, string, tuple or dict)
            Function to use for aggregating data. Accepted types are:
             * string name, e.g. ``"sum"``
             * list of functions, e.g. ``["sum", "min", "max"]``
             * dict of column labels specifying operations per column,
               e.g. ``{"a": "sum"}``

        axis : not yet supported

        Returns
        -------
        Aggregation Result : ``Series`` or ``DataFrame``
            When ``DataFrame.agg`` is called with a single agg,
            a ``Series`` is returned.
            When ``DataFrame.agg`` is called with several aggs,
            a ``DataFrame`` is returned.

        Notes
        -----
        Difference from pandas:
          * Not supporting: ``axis``, ``*args``, ``**kwargs``

        """
        # TODO: Remove the typecasting below once issue #6846 is fixed
        # link <https://github.com/rapidsai/cudf/issues/6846>
        dtypes = [self[col].dtype for col in self._column_names]
        common_dtype = cudf.utils.dtypes.find_common_type(dtypes)
        df_normalized = self.astype(common_dtype)

        # Any explicit axis, including 0, is unsupported for now.
        if axis is not None:
            raise NotImplementedError("axis not implemented yet")

        if isinstance(aggs, Iterable) and not isinstance(aggs, (str, dict)):
            result = cudf.DataFrame()
            # TODO : Allow simultaneous pass for multi-aggregation as
            # a future optimization
            for agg in aggs:
                result[agg] = getattr(df_normalized, agg)()
            return result.T.sort_index(axis=1, ascending=True)

        elif isinstance(aggs, str):
            if not hasattr(df_normalized, aggs):
                raise AttributeError(
                    f"{aggs} is not a valid function for "
                    f"'DataFrame' object"
                )
            result = cudf.DataFrame()
            result[aggs] = getattr(df_normalized, aggs)()
            result = result.iloc[:, 0]
            result.name = None
            return result

        elif isinstance(aggs, dict):
            cols = aggs.keys()
            if any([callable(val) for val in aggs.values()]):
                raise NotImplementedError(
                    "callable parameter is not implemented yet"
                )
            elif all([isinstance(val, str) for val in aggs.values()]):
                result = cudf.Series(index=cols)
                for key, value in aggs.items():
                    col = df_normalized[key]
                    if not hasattr(col, value):
                        raise AttributeError(
                            f"{value} is not a valid function for "
                            f"'Series' object"
                        )
                    result[key] = getattr(col, value)()
            elif all([isinstance(val, Iterable) for val in aggs.values()]):
                idxs = set()
                for val in aggs.values():
                    # Check str first: a str is itself Iterable, and
                    # update() would add its individual characters.
                    if isinstance(val, str):
                        idxs.add(val)
                    elif isinstance(val, Iterable):
                        idxs.update(val)
                # Reject callables before sorting the aggregation names.
                for agg in idxs:
                    if callable(agg):
                        raise NotImplementedError(
                            "callable parameter is not implemented yet"
                        )
                idxs = sorted(list(idxs))
                result = cudf.DataFrame(index=idxs, columns=cols)
                for key in aggs.keys():
                    col = df_normalized[key]
                    col_empty = column_empty(
                        len(idxs), dtype=col.dtype, masked=True
                    )
                    ans = cudf.Series(data=col_empty, index=idxs)
                    if isinstance(aggs.get(key), str):
                        if not hasattr(col, aggs.get(key)):
                            raise AttributeError(
                                f"{aggs.get(key)} is not a valid function for "
                                f"'Series' object"
                            )
                        # Use this column's own aggregation name, not a
                        # loop variable left over from an earlier loop.
                        ans[aggs.get(key)] = getattr(col, aggs.get(key))()
                    elif isinstance(aggs.get(key), Iterable):
                        # TODO : Allow simultaneous pass for multi-aggregation
                        # as a future optimization
                        for agg in aggs.get(key):
                            if not hasattr(col, agg):
                                raise AttributeError(
                                    f"{agg} is not a valid function for "
                                    f"'Series' object"
                                )
                            ans[agg] = getattr(col, agg)()
                    result[key] = ans
            else:
                raise ValueError("values of dict must be a string or list")

            return result

        elif callable(aggs):
            raise NotImplementedError(
                "callable parameter is not implemented yet"
            )

        else:
            raise ValueError("argument must be a string, list or dict")
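
The branching in `agg` depends on a few Python details that are easy to trip over: `str` and `dict` both count as `Iterable`, parentheses around a single string do not create a tuple, and `x is callable` is not the same as `callable(x)`. The sketch below isolates that dispatch in plain Python. `classify_aggs` is a hypothetical helper written for illustration and is not part of cudf; the branch order is also rearranged compared with the method above.

```python
from collections.abc import Iterable


def classify_aggs(aggs):
    """Name the branch a given ``aggs`` value would fall into.

    Hypothetical helper for illustration; it is not part of cudf.
    """
    # Strings and dicts are Iterable too, so they are tested first.
    if isinstance(aggs, str):
        return "single name"
    if isinstance(aggs, dict):
        return "per-column mapping"
    if callable(aggs):
        return "callable"
    if isinstance(aggs, Iterable):
        return "collection of names"
    return "invalid"


# Parentheses alone do not make a tuple: ("sum") is just the string "sum".
kinds = {
    "string": classify_aggs("sum"),
    "parenthesized": classify_aggs(("sum")),
    "one_tuple": classify_aggs(("sum",)),
    "set": classify_aggs({"min", "max"}),
    "dict": classify_aggs({"a": "sum"}),
    "function": classify_aggs(sum),
}

# callable(x) asks whether x can be called; `x is callable` only compares
# x with the builtin function object itself.
identity_check = sum is callable
call_check = callable(sum)
```

Testing the narrow types before the generic `Iterable` check keeps a string from being iterated character by character.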

def nlargest(self, n, columns, keep="first"):
"""Get the rows of the DataFrame sorted by the n largest value of *columns*

131 changes: 131 additions & 0 deletions python/cudf/cudf/tests/test_dataframe.py
@@ -7995,3 +7995,134 @@ def test_dataframe_from_pandas_duplicate_columns():
ValueError, match="Duplicate column names are not allowed"
):
gd.from_pandas(pdf)


@pytest.mark.parametrize(
    "data",
    [
        {"a": [1, 2, 3], "b": [3.0, 4.0, 5.0], "c": [True, True, False]},
        {"a": [1.0, 2.0, 3.0], "b": [3.0, 4.0, 5.0], "c": [True, True, False]},
        {"a": [1, 2, 3], "b": [3, 4, 5], "c": [True, True, False]},
        {"a": [1, 2, 3], "b": [True, True, False], "c": [False, True, False]},
        {
            "a": [1.0, 2.0, 3.0],
            "b": [True, True, False],
            "c": [False, True, False],
        },
        {"a": [1, 2, 3], "b": [3, 4, 5], "c": [2.0, 3.0, 4.0]},
        {"a": [1, 2, 3], "b": [2.0, 3.0, 4.0], "c": [5.0, 6.0, 4.0]},
    ],
)
@pytest.mark.parametrize(
    "aggs",
    [
        ["min", "sum", "max"],
        ("min", "sum", "max"),
        {"min", "sum", "max"},
        "sum",
        {"a": "sum", "b": "min", "c": "max"},
        {"a": ["sum"], "b": ["min"], "c": ["max"]},
        {"a": ("sum"), "b": ("min"), "c": ("max")},
        {"a": {"sum"}, "b": {"min"}, "c": {"max"}},
        {"a": ["sum", "min"], "b": ["sum", "max"], "c": ["min", "max"]},
        {"a": ("sum", "min"), "b": ("sum", "max"), "c": ("min", "max")},
        {"a": {"sum", "min"}, "b": {"sum", "max"}, "c": {"min", "max"}},
    ],
)
def test_agg_for_dataframes(data, aggs):
    pdf = pd.DataFrame(data)
    gdf = gd.DataFrame(data)

    expect = pdf.agg(aggs)
    got = gdf.agg(aggs)

    assert_eq(expect, got, check_dtype=False)
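
One likely reason the comparison uses `check_dtype=False` is that `agg` casts every column to a common dtype before aggregating (the TODO in `dataframe.py` that references issue #6846), so result dtypes can differ from pandas even when the values agree. The sketch below illustrates that kind of promotion with NumPy's `np.result_type` as a stand-in; that `cudf.utils.dtypes.find_common_type` follows the same promotion rules is an assumption, not something this diff confirms.

```python
import numpy as np

# Column dtypes as in the first parametrized frame, assuming the usual
# 64-bit defaults: int64, float64 and bool.
dtypes = [np.dtype("int64"), np.dtype("float64"), np.dtype("bool")]

# NumPy promotes the mix to a single dtype able to hold every value.
common = np.result_type(*dtypes)

# After the cast, a boolean column sums as floats rather than as integers.
bool_col = np.array([True, True, False])
total_after_cast = bool_col.astype(common).sum()
total_native = bool_col.sum()
```

Both totals compare equal as numbers, but they carry different dtypes, which is the kind of mismatch a strict dtype check would flag.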


@pytest.mark.parametrize("aggs", [{"a": np.sum, "b": np.min, "c": np.max}])
def test_agg_for_unsupported_function(aggs):
    gdf = gd.DataFrame(
        {"a": [1, 2, 3], "b": [3.0, 4.0, 5.0], "c": [True, True, False]}
    )

    # The call is expected to raise, so it is not repeated afterwards.
    with pytest.raises(NotImplementedError):
        gdf.agg(aggs)


@pytest.mark.parametrize("aggs", ["asdf"])
def test_agg_for_dataframe_with_invalid_function(aggs):
    gdf = gd.DataFrame(
        {"a": [1, 2, 3], "b": [3.0, 4.0, 5.0], "c": [True, True, False]}
    )

    # The call is expected to raise, so it is not repeated afterwards.
    with pytest.raises(
        AttributeError,
        match=f"{aggs} is not a valid function for 'DataFrame' object",
    ):
        gdf.agg(aggs)


@pytest.mark.parametrize("aggs", [{"a": "asdf"}])
def test_agg_for_series_with_invalid_function(aggs):
    gdf = gd.DataFrame(
        {"a": [1, 2, 3], "b": [3.0, 4.0, 5.0], "c": [True, True, False]}
    )

    # The call is expected to raise, so it is not repeated afterwards.
    with pytest.raises(
        AttributeError,
        match=f"{aggs['a']} is not a valid function for 'Series' object",
    ):
        gdf.agg(aggs)


@pytest.mark.parametrize(
    "aggs",
    [
        "sum",
        ["min", "sum", "max"],
        {"a": {"sum", "min"}, "b": {"sum", "max"}, "c": {"min", "max"}},
    ],
)
def test_agg_for_dataframe_with_string_columns(aggs):
    gdf = gd.DataFrame(
        {"a": ["m", "n", "o"], "b": ["t", "u", "v"], "c": ["x", "y", "z"]},
        index=["a", "b", "c"],
    )

    # The call is expected to raise, so it is not repeated afterwards.
    with pytest.raises(
        NotImplementedError, match="Cannot transpose string columns",
    ):
        gdf.agg(aggs)
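
The `match=` argument of `pytest.raises` is applied as a regular expression search against the exception text, which is why one of the later commits is titled "Escape match regex". The sketch below shows the failure mode with the standard `re` module alone. The message string is hypothetical, chosen only because it contains regex metacharacters; it is not taken from the cudf source.

```python
import re

# Hypothetical error message; the dot and parentheses are regex
# metacharacters.
message = "DataFrame.agg() is not supported for string columns"

# Used as a raw pattern, "()" is an empty group that matches nothing, so the
# pattern expects a space right after "agg" and never matches the literal
# "()" in the text.
unescaped = re.search(message, message)

# re.escape turns every metacharacter into a literal, so the text now
# matches itself.
escaped = re.search(re.escape(message), message)
```

Passing `match=re.escape(expected_message)` in a test keeps the assertion about the literal text rather than about a pattern that merely resembles it.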