[REVIEW] Create agg() function for dataframes #6483

Merged 44 commits on Dec 4, 2020
Changes from 10 commits
Commits
41866c7
Renamed skip_rows parameter to skiprows
Sep 29, 2020
19f5260
Merge branch 'branch-0.16' of https://github.com/rapidsai/cudf into b…
skirui-source Oct 7, 2020
ecb8e61
Merge branch 'branch-0.16' of https://github.com/rapidsai/cudf into b…
skirui-source Oct 9, 2020
374b572
Added new agg() function and test for dataframes
skirui-source Oct 9, 2020
9c8fe97
Improved the agg function and unit test
skirui-source Oct 13, 2020
2b395fb
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Oct 14, 2020
45affb0
Made improvements to agg function and test cases
skirui-source Oct 17, 2020
366fe39
Made dtype conversion improvements to agg function
skirui-source Oct 21, 2020
494a5cd
Made improvements to the agg function
skirui-source Oct 21, 2020
25b69da
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Oct 21, 2020
62d9a22
Made edits to agg function
skirui-source Oct 28, 2020
8968e37
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Oct 28, 2020
2e883d0
Addressed review comments on PR 6483
skirui-source Nov 3, 2020
170db82
fixed conflict in test_dataframe.py
skirui-source Nov 3, 2020
bd6e75b
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 3, 2020
8050441
Addressed review comments by Michael
skirui-source Nov 7, 2020
0eaef5f
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 7, 2020
3d56cb3
Addressed Michael's review comments
skirui-source Nov 11, 2020
2fa0694
Addressed Michael's review comments
skirui-source Nov 12, 2020
4d578f5
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 12, 2020
2473e69
Fixed bad find and replace
skirui-source Nov 18, 2020
7c01498
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 18, 2020
680df2c
Fixed bad find and replace
skirui-source Nov 18, 2020
2e0fbf9
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 19, 2020
0689eae
split into three tests for exceptions
skirui-source Nov 19, 2020
4ccca2f
split unit tests, all tests passing
skirui-source Nov 21, 2020
e26ec64
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 21, 2020
f149443
Addressed typecasting problem-call with Keith and Ashwin
skirui-source Nov 25, 2020
f13c030
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 25, 2020
ba99f32
Updated CHANGELOG.md and fixed style issues
skirui-source Nov 25, 2020
625eb06
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Nov 25, 2020
ce1089f
Addressed Michael's reviews: fixed docstring issues
skirui-source Dec 1, 2020
9c6ac0a
Merge branch 'branch-0.17' of https://github.com/rapidsai/cudf into a…
skirui-source Dec 1, 2020
65b1384
Update CHANGELOG.md to address review comments
isVoid Dec 1, 2020
14bc311
Update CHANGELOG.md
isVoid Dec 1, 2020
97cfe14
fixed style issues
skirui-source Dec 1, 2020
4235db2
Merge branch 'aggfordataframe' of https://github.com/skirui-source/cu…
skirui-source Dec 3, 2020
2d58da8
Merge branch 'branch-0.17' into aggfordataframe
skirui-source Dec 3, 2020
fe41d20
fix docs
galipremsagar Dec 3, 2020
90540bb
Merge branch 'branch-0.17' into aggfordataframe
galipremsagar Dec 3, 2020
87306bb
don't run function again after checking exception
kkraus14 Dec 3, 2020
027e1c6
fix style issues
kkraus14 Dec 3, 2020
d470bac
Pre-emptively raise when the frame contains string columns
shwina Dec 4, 2020
5a8fbc1
Escape match regex
shwina Dec 4, 2020
43 changes: 43 additions & 0 deletions python/cudf/cudf/core/dataframe.py
@@ -3722,6 +3722,49 @@ def sort_values(
            keep_index=not ignore_index,
        )

    def agg(self, aggs):
        dtypes = [self[col].dtype for col in self._column_names]
        common_dtype = np.find_common_type(dtypes, [])
        df_normalized = self.astype(common_dtype)

        if isinstance(aggs, list):
            result = cudf.DataFrame()
            for agg in aggs:
                result[agg] = getattr(df_normalized, agg)()

        elif isinstance(aggs, str):
            result = cudf.DataFrame()
            result[aggs] = getattr(df_normalized, aggs)()
            result = result.T.loc[aggs]
            result.name = None
            return result

        elif isinstance(aggs, dict):
            cols = aggs.keys()
            idxs = set()
            for agg_l in aggs.values():
                for agg in agg_l:
                    idxs.add(agg)
            idxs = sorted(list(idxs))
            result = cudf.DataFrame(index=idxs, columns=cols)
            for key in aggs.keys():
                col = df_normalized[key]
                ans = cudf.Series(
                    [None] * len(idxs), index=idxs, dtype=col.dtype
                )
                for agg in aggs.get(key):
                    ans[agg] = getattr(col, agg)()
                result[key] = ans
            return result
        elif callable(aggs):
            raise NotImplementedError(
                "callable parameter is not implemented yet"
            )

        else:
            raise ValueError("argument must be a string or list")
        return result.T.sort_index(axis=1, ascending=True)

    def nlargest(self, n, columns, keep="first"):
        """Get the rows of the DataFrame sorted by the n largest value of *columns*

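For readers following the `agg` hunk above without a GPU at hand, the dispatch on the type of `aggs` can be modelled in plain Python. This is only a rough sketch and not the cuDF code: `agg_sketch` and its `reducers` table are hypothetical names, the "frame" is a dict of equal-length lists, only `min`, `max` and `sum` are supported, and the common-dtype cast done in the diff is not modelled.

```python
def agg_sketch(frame, aggs):
    """Toy model of the branch structure in the diff (hypothetical helper).

    `frame` is a dict mapping column name -> list of values.
    """
    # Hypothetical stand-in for looking up a reduction by name via getattr.
    reducers = {"min": min, "max": max, "sum": sum}

    if isinstance(aggs, str):
        # One aggregation name: flat mapping column -> value (Series-like).
        return {col: reducers[aggs](values) for col, values in frame.items()}

    if isinstance(aggs, list):
        # Several names: one row per aggregation, one entry per column.
        return {
            name: {col: reducers[name](values) for col, values in frame.items()}
            for name in aggs
        }

    if isinstance(aggs, dict):
        # Per-column lists: rows are the sorted union of all requested names;
        # a (row, column) pair that was not requested stays None.
        rows = sorted({name for names in aggs.values() for name in names})
        return {
            name: {
                col: reducers[name](frame[col]) if name in names else None
                for col, names in aggs.items()
            }
            for name in rows
        }

    if callable(aggs):
        raise NotImplementedError("callable parameter is not implemented yet")

    raise ValueError("argument must be a string, list or dict")


frame = {"a": [1, 2, 3], "b": [3.0, 4.0, 5.0]}
print(agg_sketch(frame, "sum"))
print(agg_sketch(frame, {"a": ["sum", "min"], "b": ["min"]}))
```

The dict branch is the one worth noticing: as in the diff, every column gets a slot for every requested aggregation name, and the slots nobody asked for are left empty (`None` here, nulls in the cuDF version).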
35 changes: 35 additions & 0 deletions python/cudf/cudf/tests/test_dataframe.py
@@ -7684,3 +7684,38 @@ def test_dataframe_error_equality(df1, df2, op):
    gdf2 = gd.from_pandas(df2)

    assert_exceptions_equal(op, op, ([df1, df2],), ([gdf1, gdf2],))


@pytest.mark.parametrize(
    "data",
    [
        {"a": [1, 2, 3], "b": [3.0, 4.0, 5.0], "c": [True, True, False]},
        {"a": [1.0, 2.0, 3.0], "b": [3.0, 4.0, 5.0], "c": [True, True, False]},
        {"a": [1, 2, 3], "b": [3, 4, 5], "c": [True, True, False]},
        {"a": [1, 2, 3], "b": [True, True, False], "c": [False, True, False]},
        {
            "a": [1.0, 2.0, 3.0],
            "b": [True, True, False],
            "c": [False, True, False],
        },
        {"a": [1, 2, 3], "b": [3, 4, 5], "c": [2.0, 3.0, 4.0]},
        {"a": [1, 2, 3], "b": [2.0, 3.0, 4.0], "c": [5.0, 6.0, 4.0]},
    ],
)
@pytest.mark.parametrize(
    "aggs",
    [
        ["min", "sum", "max"],
        "sum",
        "min",
        {"a": ["sum", "min"], "b": ["min"], "c": ["max"]},
    ],
)
def test_agg_for_dataframes(data, aggs):
    pdf = pd.DataFrame(data)
    gdf = gd.DataFrame(data)

    expect = pdf.agg(aggs)
    got = gdf.agg(aggs)

    assert_eq(expect, got, check_dtype=False)
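A note on coverage: stacked `pytest.mark.parametrize` decorators run the test once per combination of their arguments, so the test above expands to the cross product of the `data` and `aggs` lists. The sketch below only enumerates that grid with `itertools.product`; it does not run pandas or cuDF. The names `data_cases`, `agg_cases` and `cases` are made up for illustration.

```python
from itertools import product

# Same parameter lists as in the test above, copied verbatim.
data_cases = [
    {"a": [1, 2, 3], "b": [3.0, 4.0, 5.0], "c": [True, True, False]},
    {"a": [1.0, 2.0, 3.0], "b": [3.0, 4.0, 5.0], "c": [True, True, False]},
    {"a": [1, 2, 3], "b": [3, 4, 5], "c": [True, True, False]},
    {"a": [1, 2, 3], "b": [True, True, False], "c": [False, True, False]},
    {"a": [1.0, 2.0, 3.0], "b": [True, True, False], "c": [False, True, False]},
    {"a": [1, 2, 3], "b": [3, 4, 5], "c": [2.0, 3.0, 4.0]},
    {"a": [1, 2, 3], "b": [2.0, 3.0, 4.0], "c": [5.0, 6.0, 4.0]},
]
agg_cases = [
    ["min", "sum", "max"],
    "sum",
    "min",
    {"a": ["sum", "min"], "b": ["min"], "c": ["max"]},
]

# Every (data, aggs) pairing that the stacked decorators would generate.
cases = list(product(data_cases, agg_cases))
print(len(cases))  # 7 data sets x 4 agg specs = 28

# Sanity check: the dict spec only names columns present in every data set.
dict_specs_ok = all(
    set(aggs) <= set(data) for data, aggs in cases if isinstance(aggs, dict)
)
print(dict_specs_ok)
```

All seven data sets mix only bool, int and float columns, so string columns and the callable or invalid-argument branches are not exercised by this particular test.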