ENH: Add Series method to explode a list-like column #27267
This is actually much simpler on Series; I did this in Cython, so it should be pretty performant.
Nice PR
The primary use-case in the original PR/issues is to end up with an expanded dataframe. Making this a This (and the doc example) demands: df.drop(['col1','col2'],axis=1).join([df['col1'].explode(),df['col2'].explode()])[df.columns] What would make users happy is:
How about adding some sugar? Making it available is good, but making it usable is as important.
Very nice change. Minor nit / comment around nested list-likes but otherwise lgtm
@pilkibun I don't think we want to add that here, but it could be a follow-up if it's something you want to tackle.
That's what the original PR did, it's what the original issue requested, and what the SO questions were asking for. So why on earth would you not include it? And why then would a follow-up PR by me make any difference?
@cpcloud how does this compare to the semantics of unnest in various backends (Postgres)?
@jreback, would you please add a DataFrame method for convenience? I think users will find this feature cumbersome without it.
I'd be -0 on adding a
Neither of my previous comments around style and nested lists are blockers so this lgtm ex other's comments. Certainly can address nested lists here if you want but follow up OK as well. Thanks Jeff
@mroeschke, yes, but it's very verbose. See #27267 (comment). All the user requests specifically care about making the dataframe case painless.
API questions:
- For an empty list, do we want a NaN? The other option could be no entry in the result. (Just wondering if this was discussed, or if there is prior art; I didn't really think through the use cases.)
- Do we want to return a MultiIndex keeping track of a counter for the values coming from the same original row (or have an option to enable this)? That would actually make it possible to combine this with unstack to produce multiple columns (a function which is linked as related).
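The counter idea in the second question can be approximated with existing operations. The following is only a sketch, assuming a pandas version that includes `Series.explode`; the sample data and the level names `row` and `pos` are made up for illustration and are not part of the PR.

```python
import pandas as pd

# Illustrative data: two rows holding lists of different lengths.
s = pd.Series([[1, 2, 3], [4, 5]], index=["a", "b"])
exploded = s.explode()

# Position of each value within its original row: 0, 1, 2, 0, 1.
pos = exploded.groupby(level=0).cumcount()

# Attach the counter as a second index level (hypothetical level names).
exploded.index = pd.MultiIndex.from_arrays(
    [exploded.index, pos.to_numpy()], names=["row", "pos"]
)

# With a unique (row, pos) index, unstack spreads the values over columns;
# shorter rows are padded with NaN.
wide = exploded.unstack("pos")
print(wide)
```

Here the counter is computed after the fact with `groupby(level=0).cumcount()`; whether `explode` itself should return such an index is the open API question.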
df[['keys']].join(df['values'].explode())
:meth:`Series.explode` will replace empty lists with ``np.nan`` and preserve scalar entries. The dtype of the resulting ``Series`` is always ``object``.
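A minimal sketch of the behaviour this doc line describes, assuming a pandas version that includes `Series.explode`. The sample values are invented for illustration.

```python
import pandas as pd

# Illustrative data: a list, an empty list, and a scalar.
s = pd.Series([[1, 2], [], 3])

exploded = s.explode()

# Each list element gets its own row and repeats the original index label;
# the empty list becomes a single NaN row and the scalar passes through.
print(exploded)
print(exploded.index.tolist())
print(exploded.dtype)
```

Because the index labels are repeated rather than renumbered, the result can still be aligned with the other columns of the originating frame.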
Other option for empty lists would be to have no entry in the result?
I am +1 on adding the DataFrame.explode method (so have both DataFrame and Series method), as the original PR did. I agree with @changhiskhan (https://github.com/pandas-dev/pandas/pull/24366/files#r243373508) that this is the most typical use case, so I don't see why we would not provide this convenience.
cc @jorisvandenbossche I even support multi-column exploding, though this is likely not very performant (as it needs a recursive merge); but it's there. This may be a bit of overkill and we should just allow a single column explosion.
Both result in empty
@icexelloss oh so the row is then excluded, ok since we propagate NaN I think that is equivalent.
@TomAugspurger ready to go here.
Looks good on a quick skim. I don't have any opinion on whether an empty list or NaN is preferable. I suspect we'll add a keyword in the future, but NaN seems like a fine default.
yep
This suddenly cropped up. This commit obviously didn't cause it, but not sure what triggered this... Can anyone else replicate this?
@gfyoung have you rebuilt recently?
I just did, and it seems fine now...strange. Looks like a Cython hiccup.
replaces #24366
closes #16538
closes #10511
Sometimes a values column is presented with list-like values on one row.
Instead we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it's possible
to chain together existing pandas operations (in fact that's exactly
what this implementation is) to do this, the sequence of operations
is not obvious. By contrast, this is available as a built-in operation
in, say, Spark, and is a fairly common use case.
provides a nice inversion here
Dataframe exploding
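The row-splitting described above can be sketched with the join idiom quoted earlier in the review. This assumes a pandas version that includes `Series.explode`; the frame contents are invented for illustration.

```python
import pandas as pd

# Illustrative frame: one key column and one list-like values column.
df = pd.DataFrame({"keys": ["a", "b", "c"], "values": [[1, 2], [3], []]})

# Explode the list-like column, then join it back to the key column.
# The join aligns on the index, so each key repeats once per list element.
result = df[["keys"]].join(df["values"].explode())
print(result)
```

The key for the empty list is kept, with NaN as its value, which matches the default discussed in the thread.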