Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Explode struct column into multiple columns with Dask-cuDF #8660

Closed
beckernick opened this issue Jul 6, 2021 · 5 comments · Fixed by #9086
Closed

[FEA] Explode struct column into multiple columns with Dask-cuDF #8660

beckernick opened this issue Jul 6, 2021 · 5 comments · Fixed by #9086
Assignees
Labels
dask Dask issue feature request New feature or request Python Affects Python cuDF API.

Comments

@beckernick
Copy link
Member

beckernick commented Jul 6, 2021

A common operation in big data processing frameworks is to "explode" struct columns into multiple columns in a single command. I'd like to be able to do this with Dask-cuDF struct columns with a single command, rather than the code shown below. This would be analogous to something like a LATERAL VIEW explode(col) in Hive.

Can't manipulate struct columns in Dask cuDF yet (#8657 ), nor can I use a struct accessor to get the fields (#8658), so the following example uses cuDF to illustrate the desired behavior with Dask. For example, given:

import cudfs = cudf.Series([
    {"a":5, "b":10},
    {"a":3, "b":7},
    {"a":-3, "b":11}
    ]
)
print(s)
0     {'a': 5, 'b': 10}
1      {'a': 3, 'b': 7}
2    {'a': -3, 'b': 11}
dtype: struct

I'd like to create the following dataframe without explicitly looping through every field, which I can do today with:

results = []
for key in s.dtype.fields:
    results.append(s.struct.field(key))
​
out = cudf.concat(results, axis=1)
out.columns = s.dtype.fields
print(out)
   a   b
0  5  10
1  3   7
2 -3  11

We currently have an explode operator, but for now it appears to be a pass-through on struct columns. The explode docstring indicates it's designed for list-like columns. Perhaps this might be an area to explore for this.

s.explode()
0     {'a': 5, 'b': 10}
1      {'a': 3, 'b': 7}
2    {'a': -3, 'b': 11}
dtype: struct
@beckernick beckernick added feature request New feature or request Python Affects Python cuDF API. dask Dask issue labels Jul 6, 2021
@shwina shwina self-assigned this Jul 13, 2021
rapids-bot bot pushed a commit that referenced this issue Jul 20, 2021
Part of #8660. Note that the issue is asking for this feature in _dask-cudf_, which this PR does not implement.

Depends on: #8306

Authors:
  - Ashwin Srinath (https://github.com/shwina)

Approvers:
  - https://github.com/brandon-b-miller
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #8729
@jakirkham
Copy link
Member

jakirkham commented Jul 21, 2021

Are there still things to do here? See PR ( #8729 ), which appears related, was merged recently. If there are other things to do, what is left?

@shwina
Copy link
Contributor

shwina commented Jul 22, 2021

Yes -- that added explode() for struct columns on the cuDF side, but not the Dask side. I'm unassigning myself for now.

@shwina shwina removed their assignment Jul 22, 2021
@jakirkham
Copy link
Member

cc @rjzamora

@sarahyurick sarahyurick self-assigned this Jul 26, 2021
@rjzamora
Copy link
Member

Feel free to ping me if you run into any issues/questions here @sarahyurick - I'll be happy to help however I can :)

@sarahyurick
Copy link
Contributor

I've been caught up with other projects, so I'll unassign myself for now.

@sarahyurick sarahyurick removed their assignment Aug 4, 2021
@isVoid isVoid self-assigned this Aug 19, 2021
rapids-bot bot pushed a commit that referenced this issue Sep 22, 2021
Closes #8660 

Per discussions in thread #8872 , this PR adds a struct-accessor member function to provide a lateral view to a struct type series.

Example: 
```python
>>> import cudf, dask_cudf as dgd
>>> ds = dgd.from_cudf(cudf.Series(
...     [{'a': 42, 'b': 'str1', 'c': [-1]},
...      {'a': 0,  'b': 'str2', 'c': [400, 500]},
...      {'a': 7,  'b': '',     'c': []}]), npartitions=2)
>>> ds.struct.explode().compute()
    a     b           c
0  42  str1        [-1]
1   0  str2  [400, 500]
2   7                []
```

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #9086
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask Dask issue feature request New feature or request Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants