Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Groupby collect on struct columns losing field name #8474

Closed
ayushdg opened this issue Jun 9, 2021 · 2 comments · Fixed by #8499
Closed

[BUG] Groupby collect on struct columns losing field name #8474

ayushdg opened this issue Jun 9, 2021 · 2 comments · Fixed by #8499
Assignees
Labels
3 - Ready for Review Ready for review by team bug Something isn't working Python Affects Python cuDF API.

Comments

@ayushdg
Copy link
Member

ayushdg commented Jun 9, 2021

Describe the bug
Groupby collect aggregation on a struct columns causes loss of the struct field names in the output

Steps/Code to reproduce bug

df = cudf.DataFrame(
    {
        'a':['aa','aa','cc'],
        'd':[{"b": '1', "c": "one"}, {"b": '2', "c": "two"}, {"b": '3', "c": "one"}]
     }
)

df
	a	d
0	aa	{'b': '1', 'c': 'one'}
1	aa	{'b': '2', 'c': 'two'}
2	cc	{'b': '3', 'c': 'one'}

df.groupby('a').collect()

	d
a	
aa	[{'0': '1', '1': 'one'}, {'0': '2', '1': 'two'}]
cc	[{'0': '3', '1': 'one'}]

Expected behavior

	d
a	
aa	[{'b': '1', 'c': 'one'}, {'b': '2', 'c': 'two'}]
cc	[{'b': '3', 'c': 'one'}]

Environment overview (please complete the following information)

  • Docker nightly (21.08)
@ayushdg ayushdg added bug Something isn't working Needs Triage Need team to review and classify labels Jun 9, 2021
@charlesbluca
Copy link
Member

charlesbluca commented Jun 11, 2021

Are we expecting libcudf's groupby agg to return the underlying struct column with its metadata? If not, the fix here might be to reapply the source StructColumn's metadata to the resulting column's elements before returning.

That being said, I think the underlying issue here is that we shouldn't be allowing for a collect agg on StructColumns in the first place; looking in groupby.pyx, we can see that no aggs should be possible for StructColumns:

# The sets below define the possible aggregations that can be performed on
# different dtypes. These strings must be elements of the AggregationKind enum.
_CATEGORICAL_AGGS = {"COUNT", "SIZE", "NUNIQUE", "UNIQUE"}
_STRING_AGGS = {"COUNT", "SIZE", "MAX", "MIN", "NUNIQUE", "NTH", "COLLECT",
"UNIQUE"}
_LIST_AGGS = {"COLLECT"}
_STRUCT_AGGS = set()
_INTERVAL_AGGS = set()
_DECIMAL_AGGS = {"COUNT", "SUM", "ARGMIN", "ARGMAX", "MIN", "MAX", "NUNIQUE",
"NTH", "COLLECT"}

But we erroneously use _STRING_AGGS for struct dtypes:

else _STRING_AGGS if is_struct_dtype(dtype)

@charlesbluca charlesbluca self-assigned this Jun 11, 2021
@charlesbluca charlesbluca added 3 - Ready for Review Ready for review by team and removed Needs Triage Need team to review and classify labels Jun 11, 2021
@charlesbluca
Copy link
Member

Opened up an issue to address the underlying bug #8499; though in theory, we should have been able to fix this bug by simply applying the input column's metadata to the resulting list's elements. This might be worth exploring if we want to add the collect agg later on.

@charlesbluca charlesbluca added the Python Affects Python cuDF API. label Jun 11, 2021
rapids-bot bot pushed a commit that referenced this issue Jun 14, 2021
Closes #8474 

We were erroneously using the `_STRING_AGGS` set of allowed aggregations for struct dtypes in `groupby.pyx`, which allowed us to perform erroneous groupbys on `StructColumns`;  in @ayushdg's example:

```python
df = cudf.DataFrame(
    {
        'a':['aa','aa','cc'],
        'd':[{"b": '1', "c": "one"}, {"b": '2', "c": "two"}, {"b": '3', "c": "one"}]
     }
)

df
	a	d
0	aa	{'b': '1', 'c': 'one'}
1	aa	{'b': '2', 'c': 'two'}
2	cc	{'b': '3', 'c': 'one'}

df.groupby('a').collect()

	d
a	
aa	[{'0': '1', '1': 'one'}, {'0': '2', '1': 'two'}]
cc	[{'0': '3', '1': 'one'}]
```

This change corrects this error, which should now prevent groupby operations on `StructColumns`.

Authors:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
  - Marlene  (https://github.com/marlenezw)
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #8499
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants