[FEA] Explode struct column into multiple columns with Dask-cuDF #8660
Comments
Part of #8660. Note that the issue is asking for this feature in _dask-cudf_, which this PR does not implement.

Depends on: #8306

Authors:
- Ashwin Srinath (https://github.com/shwina)

Approvers:
- https://github.com/brandon-b-miller
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #8729
Are there still things to do here? PR #8729, which appears related, was merged recently. If there are other things to do, what is left?
Yes -- that added
cc @rjzamora
Feel free to ping me if you run into any issues/questions here @sarahyurick - I'll be happy to help however I can :)
I've been caught up with other projects, so I'll unassign myself for now.
Closes #8660

Per discussions in thread #8872, this PR adds a struct-accessor member function to provide a lateral view to a struct type series.

Example:

```python
>>> import cudf, dask_cudf as dgd
>>> ds = dgd.from_cudf(cudf.Series(
...     [{'a': 42, 'b': 'str1', 'c': [-1]},
...      {'a': 0, 'b': 'str2', 'c': [400, 500]},
...      {'a': 7, 'b': '', 'c': []}]), npartitions=2)
>>> ds.struct.explode().compute()
    a     b           c
0  42  str1        [-1]
1   0  str2  [400, 500]
2   7                []
```

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #9086
A common operation in big data processing frameworks is to "explode" a struct column into multiple columns. I'd like to be able to do this with Dask-cuDF struct columns in a single command, rather than with the code shown below. This would be analogous to something like a `LATERAL VIEW explode(col)` in Hive.

I can't manipulate struct columns in Dask-cuDF yet (#8657), nor can I use a `struct` accessor to get the fields (#8658), so the following example uses cuDF to illustrate the desired behavior with Dask.

For example, given:

I'd like to create the following dataframe without explicitly looping through every field, which I can do today with:
We currently have an `explode` operator, but for now it appears to be a pass-through on struct columns. The `explode` docstring indicates it's designed for list-like columns. Perhaps this might be an area to explore for this.
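To make the requested semantics concrete, here is a minimal pure-Python sketch in which plain dicts stand in for struct rows. It is not cuDF or Dask-cuDF code, and `explode_struct` is a hypothetical helper named only for this illustration. It assumes each struct field should become its own column and that a field missing from a row becomes a null (`None`).

```python
def explode_struct(rows):
    """Turn a list of struct-like dicts into a dict of column lists.

    Hypothetical illustration only: each key becomes a column, and rows
    that lack a key (or are None) get None in that column.
    """
    columns = {}
    for i, row in enumerate(rows):
        for field, value in (row or {}).items():
            # A field first seen at row i is back-filled with i nulls.
            columns.setdefault(field, [None] * i).append(value)
        # Pad any column that this row did not supply a value for.
        for col in columns.values():
            if len(col) == i:
                col.append(None)
    return columns


# Same sample data as the struct series used in this thread.
rows = [
    {"a": 42, "b": "str1", "c": [-1]},
    {"a": 0, "b": "str2", "c": [400, 500]},
    {"a": 7, "b": "", "c": []},
]
exploded = explode_struct(rows)
for name, values in exploded.items():
    print(name, values)
```

The point of the request is to get this one-column-per-field result from a single call on the struct column, instead of writing the per-field loop by hand.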