[REVIEW] Enable groupby list aggregation for strings #6914

Merged (13 commits), Dec 4, 2020
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -127,6 +127,7 @@
- PR #6837 Avoid gather when copying strings view from start of strings column
- PR #6859 Move align_ptr_for_type() from cuda.cuh to alignment.hpp
- PR #6807 Refactor `std::array` usage in row group index writing in ORC
- PR #6914 Enable groupby `list` aggregation for strings
- PR #6908 Parquet option for strictly decimal reading

## Bug Fixes
1 change: 1 addition & 0 deletions python/cudf/cudf/_lib/groupby.pyx
@@ -51,6 +51,7 @@ _STRING_AGGS = {
    "min",
    "nunique",
    "nth",
    "collect"
}

_LIST_AGGS = {
20 changes: 20 additions & 0 deletions python/cudf/cudf/tests/test_groupby.py
@@ -1268,6 +1268,26 @@ def test_groupby_list_single_element(list_agg):
    )


@pytest.mark.parametrize(
    "agg", [list, [list, "count"], {"b": list, "c": "sum"}]
)
def test_groupby_list_strings(agg):
    pdf = pd.DataFrame(
        {
            "a": [1, 1, 1, 2, 2],
            "b": ["b", "a", None, "e", "d"],
            "c": [1, 2, 3, 4, 5],
        }
    )
    gdf = cudf.from_pandas(pdf)

    assert_eq(
        pdf.groupby("a").agg(agg),
        gdf.groupby("a").agg(agg),
        check_dtype=False,
    )
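
For readers unfamiliar with the `list` aggregation, the result the test above compares could be pictured with a plain-Python sketch. This is only an illustration of the expected semantics on the test data, not how cuDF or pandas implement it; `group_collect` is a hypothetical helper, and keeping `None` inside the collected lists is an assumption based on the test comparing against pandas.

```python
from collections import defaultdict


def group_collect(keys, values):
    """Hypothetical helper: gather values into one list per key.

    Row order within each group is preserved, and null (None) entries
    are kept rather than dropped (an assumption for this sketch).
    """
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return dict(groups)


# Same "a" and "b" columns as in test_groupby_list_strings.
a = [1, 1, 1, 2, 2]
b = ["b", "a", None, "e", "d"]

collected = group_collect(a, b)
print(collected)
```

Each key in `a` maps to the list of strings from `b` that share that key, which is the shape the groupby `list` aggregation is meant to produce for string columns.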


def test_groupby_list_columns_excluded():
    pdf = pd.DataFrame(
        {