ENH: Add Series method to explode a list-like column #27267

Merged: 34 commits merged into pandas-dev:master on Jul 18, 2019

Conversation

@jreback (Contributor) commented Jul 6, 2019

replaces #24366
closes #16538
closes #10511

Sometimes a values column is presented with list-like values in a single row.
Instead, we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it is possible
to chain together existing pandas operations to do this (in fact, that is
exactly what this implementation does), the sequence of operations is not
obvious. By contrast, this is available as a built-in operation in, say,
Spark, and is a fairly common use case.
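
For reference, one such chain of existing operations, sketched on a toy
Series (this snippet is illustrative only and is not taken from the PR):

import pandas as pd

s = pd.Series([[1, 2, 3], [4, 5]], index=["a", "b"])

# spread the list-likes into columns, stack back into a single column,
# then drop the per-list position level; note that the NaN padding for
# the shorter list upcasts the values to float
exploded = s.apply(pd.Series).stack().reset_index(level=-1, drop=True)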

explode also provides a nice inversion here, round-tripping back to the original extension dtype:

In [31]: s = pd.DataFrame({'a': pd.date_range('20190101', periods=3, tz='UTC')}).apply(lambda x: x.array, axis=1)                                                                                                                                                             

In [32]: s                                                                                                                                                                                                                                                                    
Out[32]: 
0    [2019-01-01 00:00:00+00:00]
1    [2019-01-02 00:00:00+00:00]
2    [2019-01-03 00:00:00+00:00]
dtype: object

In [33]: s.iloc[0]                                                                                                                                                                                                                                                            
Out[33]: 
<DatetimeArray>
['2019-01-01 00:00:00+00:00']
Length: 1, dtype: datetime64[ns, UTC]

In [34]: s.explode()                                                                                                                                                                                                                                                          
Out[34]: 
0   2019-01-01 00:00:00+00:00
1   2019-01-02 00:00:00+00:00
2   2019-01-03 00:00:00+00:00
dtype: datetime64[ns, UTC]

Exploding a larger Series of 100,000 rows:

In [4]: s = pd.Series([np.arange(np.random.randint(100)) for _ in range(100000)])

In [5]: s                                                                                                                                                                                                                                                    
Out[5]: 
0        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
1        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
3                                          [0, 1, 2, 3, 4]
4        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
                               ...                        
99995    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
99996    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
99997    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
99998                                         [0, 1, 2, 3]
99999    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
Length: 100000, dtype: object

In [6]: s.explode()                                                                                                                                                                                                                                          
Out[6]: 
0         0
0         1
0         2
0         3
0         4
         ..
99999    43
99999    44
99999    45
99999    46
99999    47
Length: 4950205, dtype: object

In [7]: %timeit s.explode()                                                                                                                                                                                                                                  
584 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DataFrame exploding:

In [1]: df = pd.DataFrame( 
   ...:      [[11, range(5), 10], [22, range(3), 20]], columns=["A", "B", "C"] 
   ...: ).set_index("C") 
   ...: df                                                                                                                                                                                                                                                   
Out[1]: 
     A                B
C                      
10  11  (0, 1, 2, 3, 4)
20  22        (0, 1, 2)

In [2]: df.explode(['B'])                                                                                                                                                                                                                                    
Out[2]: 
     A  B
C        
10  11  0
10  11  1
10  11  2
10  11  3
10  11  4
20  22  0
20  22  1
20  22  2

@jreback jreback added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jul 6, 2019
@jreback jreback added this to the 0.25.0 milestone Jul 6, 2019
@pep8speaks commented Jul 6, 2019

Hello @jreback! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-17 19:24:32 UTC

@jreback (Contributor, Author) commented Jul 6, 2019

cc @changhiskhan

This is actually much simpler on a Series; I implemented it in Cython, so it should be pretty performant.
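
Conceptually, the Cython routine amounts to something like the rough
pure-Python sketch below (the explode_sketch name is made up for
illustration; this is not the actual reshape.pyx code):

import numpy as np
import pandas as pd

def explode_sketch(s: pd.Series) -> pd.Series:
    # first pass: how many output rows each input row produces
    # (empty list-likes still contribute one row, which becomes NaN)
    counts = [max(len(v), 1) if isinstance(v, (list, tuple, np.ndarray)) else 1
              for v in s]

    # second pass: flatten the values, substituting NaN for empty
    # list-likes and passing scalars through unchanged
    values = []
    for v in s:
        if isinstance(v, (list, tuple, np.ndarray)):
            values.extend(v if len(v) else [np.nan])
        else:
            values.append(v)

    # repeat each original index label once per emitted value
    return pd.Series(values, index=s.index.repeat(counts), dtype=object)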

@WillAyd (Member) left a comment

Nice PR

@ghost commented Jul 7, 2019

The primary use-case in the original PR/issues is to end up with an expanded dataframe.

Making this a Series method is cleaner code-wise, but makes it far more cumbersome to use.

With only the Series method, this (and the doc example) demands:

df.drop(['col1','col2'],axis=1).join([df['col1'].explode(),df['col2'].explode()])[df.columns]

What would make users happy is:

df.explode(['col1','col2'])

How about adding some sugar? Making it available is good, but making it usable is as important.
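
For illustration, the requested sugar could be a thin wrapper over the
Series method. This is a hypothetical sketch (the explode_columns name
is made up); it is not what the PR implements:

import pandas as pd

def explode_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    # drop the list-like columns, explode each one with the Series
    # method, and join the pieces back on the index, restoring the
    # original column order
    out = df.drop(columns=columns)
    for col in columns:
        out = out.join(df[col].explode())
    return out[df.columns]

Like the one-liner above, joining independently exploded columns pairs
every value of col1 with every value of col2 from the same row (a cross
product), which may or may not be the semantics users expect.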

@WillAyd (Member) left a comment
Very nice change. Minor nit / comment around nested list-likes but otherwise lgtm

@WillAyd (Member) commented Jul 7, 2019

@pilkibun I don't think we want to add that here, but it could be a follow-up if it's something you want to tackle.

@ghost commented Jul 7, 2019

I don't think we want to add that here

That's what the original PR did, it's what the original issue requested, and what the SO questions were asking for. So why on earth would you not include it? And why then would a follow-up PR by me make any difference?

@jreback (Contributor, Author) commented Jul 8, 2019

@cpcloud how does this compare to the semantics of unnest in various backends (e.g. Postgres)?

@ghost commented Jul 8, 2019

@jreback, would you please add a DataFrame method for convenience? I think users will find this feature cumbersome without it.

@mroeschke (Member) commented:
I'd be -0 on adding a DataFrame.explode since the result can idiomatically be obtained by chaining other pandas methods.

@WillAyd (Member) left a comment

Neither of my previous comments around style and nested lists is a blocker, so this lgtm modulo others' comments. Nested lists can certainly be addressed here if you want, but a follow-up is OK as well. Thanks Jeff

@ghost commented Jul 8, 2019

I'd be -0 on adding a DataFrame.explode since the result can idiomatically be obtained by chaining
other pandas methods.

@mroeschke, yes, but it's very verbose. See #27267 (comment). All the user requests specifically care about making the DataFrame case painless.

@jorisvandenbossche (Member) left a comment

API questions:

  • For an empty list, do we want a NaN? Another option could be to have no entry in the result at all. (Just wondering if this was discussed, or if there is prior art; I didn't really think through the use cases.)

  • Do we want to return a MultiIndex that keeps a counter for the values coming from the same original row (or have an option to enable this)? That would make it possible to combine this with unstack to produce multiple columns (a function that is linked as related); see the sketch below.
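
Such a counter can also be recovered after the fact with
groupby/cumcount; a minimal sketch on a toy Series, assuming the
NaN-for-empty behaviour in this PR:

import pandas as pd

s = pd.Series([[1, 2, 3], [4, 5]])
exploded = s.explode()

# number the values coming from the same original row
counter = exploded.groupby(level=0).cumcount()

# use the counter as a second index level and pivot it into columns
exploded.index = pd.MultiIndex.from_arrays([exploded.index, counter])
wide = exploded.unstack()
# wide has columns 0, 1, 2: row 0 -> 1, 2, 3 and row 1 -> 4, 5, NaN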


df[['keys']].join(df['values'].explode())

:meth:`Series.explode` will replace empty lists with ``np.nan`` and preserve scalar entries. The dtype of the resulting ``Series`` is always ``object``.
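
A small illustration of that behaviour on toy data (not output taken from the PR itself):

>>> import pandas as pd
>>> pd.Series([[1, 2], [], 3]).explode()
0      1
0      2
1    NaN
2      3
dtype: object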

A reply from a reviewer (Member):

Another option for empty lists would be to have no entry in the result?

@jorisvandenbossche (Member) commented:
I am +1 on adding the DataFrame.explode method (so having both DataFrame and Series methods), as the original PR did. I agree with @changhiskhan (https://github.com/pandas-dev/pandas/pull/24366/files#r243373508) that this is the most typical use case, so I don't see why we would not provide this convenience.

@jreback (Contributor, Author) commented Jul 11, 2019

cc @jorisvandenbossche
I added df.explode(subset=list-of-columns) (see the top of the PR); I haven't updated the docs or responded to your comments yet.

I even support multi-column exploding, though this is likely not very performant (as it needs a recursive merge), but it's there. This may be a bit of overkill, and we could just allow a single-column explosion.

@icexelloss (Contributor) commented:
For the empty-list discussion, using Will's example from above, Postgres gives 0 rows for an empty array (so no NULL):

test_db=# select a FROM unnest(array[]::integer[]) a;
 a 
---
(0 rows)

But a NULL in the original data is also dropped when unnesting it, so both give the same result here; not sure this is a good reference point.
Does somebody know what Spark does?

@ljin ?

>>> df.show()
+---+-----+------+
| id|array|array2|
+---+-----+------+
|  0|   []|  null|
|  1|   []|  null|
|  2|   []|  null|
|  3|   []|  null|
|  4|   []|  null|
|  5|   []|  null|
|  6|   []|  null|
|  7|   []|  null|
|  8|   []|  null|
|  9|   []|  null|
+---+-----+------+

>>> df.withColumn('exploded', F.explode(df['array'])).show()
+---+-----+------+--------+
| id|array|array2|exploded|
+---+-----+------+--------+
+---+-----+------+--------+

>>> df.withColumn('exploded', F.explode(df['array2'])).show()
+---+-----+------+--------+
| id|array|array2|exploded|
+---+-----+------+--------+
+---+-----+------+--------+

Both result in an empty frame.

@jreback (Contributor, Author) commented Jul 17, 2019

@icexelloss oh, so the row is then excluded. OK; since we propagate NaN, I think that is equivalent.

@jreback (Contributor, Author) commented Jul 17, 2019

@TomAugspurger ready to go here.

@TomAugspurger (Contributor) left a comment
Looks good on a quick skim. I don't have any opinion on whether an empty list or NaN is preferable. I suspect we'll add a keyword in the future, but NaN seems like a fine default.

@jreback (Contributor, Author) commented Jul 17, 2019

Looks good on a quick skim. I don't have any opinion on whether an empty list or NaN is preferable. I suspect we'll add a keyword in the future, but NaN seems like a fine default.

yep

@jreback jreback merged commit f2886c1 into pandas-dev:master Jul 18, 2019
@gfyoung (Member) commented Aug 25, 2019

>>> import pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/pandas/pandas/__init__.py", line 55, in <module>
    from pandas.core.api import (
  File "/home/user/pandas/pandas/core/api.py", line 5, in <module>
    from pandas.core.arrays.integer import (
  File "/home/user/pandas/pandas/core/arrays/__init__.py", line 1, in <module>
    from .base import (  # noqa: F401
  File "/home/user/pandas/pandas/core/arrays/base.py", line 14, in <module>
    from pandas.compat.numpy import function as nv
  File "/home/user/Repositories/pandas/pandas/compat/numpy/function.py", line 28, in <module>
    from pandas.util._validators import (
  File "/home/user/pandas/pandas/util/__init__.py", line 3, in <module>
    from pandas.core.util.hashing import hash_array, hash_pandas_object  # noqa
  File "/home/user/pandas/pandas/core/util/hashing.py", line 11, in <module>
    from pandas.core.dtypes.cast import infer_dtype_from_scalar
  File "/home/user/pandas/pandas/core/dtypes/cast.py", line 9, in <module>
    from pandas.util._validators import validate_bool_kwarg
  File "/home/user/pandas/pandas/util/_validators.py", line 7, in <module>
    from pandas.core.dtypes.common import is_bool
  File "/home/user/Repositories/pandas/pandas/core/dtypes/common.py", line 11, in <module>
    from pandas.core.dtypes.dtypes import (
  File "/home/user/pandas/pandas/core/dtypes/dtypes.py", line 17, in <module>
    from .inference import is_bool, is_list_like
  File "/home/user/pandas/pandas/core/dtypes/inference.py", line 26, in <module>
    is_list_like = lib.is_list_like
AttributeError: module 'pandas._libs.lib' has no attribute 'is_list_like'

This suddenly cropped up. This commit obviously didn't cause it, but I'm not sure what triggered it...

Can anyone else replicate this?

@jbrockmendel (Member) commented:
@gfyoung have you rebuilt recently?

@gfyoung (Member) commented Aug 26, 2019

I just did, and it seems fine now...strange. Looks like a Cython hiccup.
