[BUG] Groupby collect on struct columns losing field name #8474
Comments
Are we expecting libcudf's groupby agg to return the underlying struct column with its metadata? If not, the fix here might be to reapply the source […].

That being said, I think the underlying issue here is that we shouldn't be allowing a collect agg on […] (cudf/python/cudf/cudf/_lib/groupby.pyx, lines 42 to 51 in 0a4e8a1), but we erroneously use […] (cudf/python/cudf/cudf/_lib/groupby.pyx, line 139 in 0a4e8a1).
Opened up an issue to address the underlying bug #8499; though in theory, we should have been able to fix this bug by simply applying the input column's metadata to the resulting list's elements. This might be worth exploring if we want to add the collect agg later on.
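As a rough illustration of the "apply the input column's metadata" idea, here is a minimal pure-Python sketch that does not use cudf at all. It models the collected result as nested lists of dicts whose keys were replaced by positional names, and restores the original field names by position. The helper name `reapply_field_names` is hypothetical and is not part of cudf or libcudf.

```python
def reapply_field_names(collected, field_names):
    """Rename each struct's keys by position using the source field names.

    `collected` is a list of groups; each group is a list of dicts whose
    keys are positional placeholders ('0', '1', ...). This relies on dicts
    preserving insertion order (Python 3.7+).
    """
    return [
        [dict(zip(field_names, row.values())) for row in group]
        for group in collected
    ]


# Shape of the buggy output reported in the issue, as plain Python data.
collected = [
    [{"0": "1", "1": "one"}, {"0": "2", "1": "two"}],
    [{"0": "3", "1": "one"}],
]

# Field names taken from the source struct column in the issue's example.
restored = reapply_field_names(collected, ["b", "c"])
```

In this sketch `restored` carries the keys `b` and `c` again; whether the real fix would be this simple depends on how cudf stores struct metadata on list children, which this example does not model.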
Closes #8474

We were erroneously using the `_STRING_AGGS` set of allowed aggregations for struct dtypes in `groupby.pyx`, which allowed us to perform erroneous groupbys on `StructColumns`; in @ayushdg's example:

```python
df = cudf.DataFrame(
    {
        'a':['aa','aa','cc'],
        'd':[{"b": '1', "c": "one"}, {"b": '2', "c": "two"}, {"b": '3', "c": "one"}]
    }
)

df
    a                       d
0  aa  {'b': '1', 'c': 'one'}
1  aa  {'b': '2', 'c': 'two'}
2  cc  {'b': '3', 'c': 'one'}

df.groupby('a').collect()
                                                   d
a
aa  [{'0': '1', '1': 'one'}, {'0': '2', '1': 'two'}]
cc                          [{'0': '3', '1': 'one'}]
```

This change corrects this error, which should now prevent groupby operations on `StructColumns`.

Authors:
- Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
- Marlene (https://github.com/marlenezw)
- Benjamin Zaitlen (https://github.com/quasiben)

URL: #8499
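The kind of check described in this change can be sketched in plain Python as a per-dtype allow-list lookup. This is only an illustration under stated assumptions: the set contents, the `ALLOWED_AGGS_BY_KIND` mapping, and the `validate_agg` helper are all hypothetical, and the actual sets and dispatch logic in `groupby.pyx` may look different.

```python
# Hypothetical allow-lists; the real sets in groupby.pyx may differ.
STRING_AGGS = {"count", "max", "min", "nunique", "collect"}
STRUCT_AGGS = set()  # assumption: no aggregations are permitted on struct values

# Each dtype kind gets its own allow-list, so struct columns no longer
# fall through to the string set by mistake.
ALLOWED_AGGS_BY_KIND = {
    "string": STRING_AGGS,
    "struct": STRUCT_AGGS,
}


def validate_agg(kind, agg):
    """Return `agg` if it is allowed for `kind`, otherwise raise TypeError."""
    allowed = ALLOWED_AGGS_BY_KIND[kind]
    if agg not in allowed:
        raise TypeError(
            f"{agg!r} aggregation is not supported for {kind} columns"
        )
    return agg
```

With this sketch, `validate_agg("string", "collect")` passes, while `validate_agg("struct", "collect")` raises a `TypeError` instead of silently producing structs with positional field names.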
Describe the bug
A groupby collect aggregation on a struct column loses the struct field names in the output.
Steps/Code to reproduce bug
Expected behavior
Environment overview (please complete the following information)