ENH: Add Series method to explode a list-like column #27267

Merged: 34 commits merged into pandas-dev:master on Jul 18, 2019

Conversation

@jreback (Contributor) commented Jul 6, 2019

replaces #24366
closes #16538
closes #10511

Sometimes a values column is presented with list-like values in a single row.
Instead, we may want to split each individual value onto its own row,
keeping the same mapping to the other key columns. While it is possible
to chain together existing pandas operations to do this (in fact, that is
exactly what this implementation does), the sequence of operations is not
obvious. By contrast, this is available as a built-in operation in, say,
Spark, and is a fairly common use case.
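
For reference, one such chain of existing operations, sketched on a toy
Series (this snippet is illustrative only and is not taken from the PR):

import pandas as pd

s = pd.Series([[1, 2, 3], [4, 5]], index=["a", "b"])

# spread the list-likes into columns, stack back into a single column,
# then drop the per-list position level; note that the NaN padding for
# the shorter list upcasts the values to float
exploded = s.apply(pd.Series).stack().reset_index(level=-1, drop=True)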

explode also provides a nice inversion here, round-tripping back to the original extension dtype:

In [31]: s = pd.DataFrame({'a': pd.date_range('20190101', periods=3, tz='UTC')}).apply(lambda x: x.array, axis=1)                                                                                                                                                             

In [32]: s                                                                                                                                                                                                                                                                    
Out[32]: 
0    [2019-01-01 00:00:00+00:00]
1    [2019-01-02 00:00:00+00:00]
2    [2019-01-03 00:00:00+00:00]
dtype: object

In [33]: s.iloc[0]                                                                                                                                                                                                                                                            
Out[33]: 
<DatetimeArray>
['2019-01-01 00:00:00+00:00']
Length: 1, dtype: datetime64[ns, UTC]

In [34]: s.explode()                                                                                                                                                                                                                                                          
Out[34]: 
0   2019-01-01 00:00:00+00:00
1   2019-01-02 00:00:00+00:00
2   2019-01-03 00:00:00+00:00
dtype: datetime64[ns, UTC]

Exploding a larger Series of 100,000 rows:

In [4]: s = pd.Series([np.arange(np.random.randint(100)) for _ in range(100000)])

In [5]: s                                                                                                                                                                                                                                                    
Out[5]: 
0        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
1        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
2        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
3                                          [0, 1, 2, 3, 4]
4        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
                               ...                        
99995    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
99996    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
99997    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
99998                                         [0, 1, 2, 3]
99999    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
Length: 100000, dtype: object

In [6]: s.explode()                                                                                                                                                                                                                                          
Out[6]: 
0         0
0         1
0         2
0         3
0         4
         ..
99999    43
99999    44
99999    45
99999    46
99999    47
Length: 4950205, dtype: object

In [7]: %timeit s.explode()                                                                                                                                                                                                                                  
584 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

DataFrame exploding:

In [1]: df = pd.DataFrame( 
   ...:      [[11, range(5), 10], [22, range(3), 20]], columns=["A", "B", "C"] 
   ...: ).set_index("C") 
   ...: df                                                                                                                                                                                                                                                   
Out[1]: 
     A                B
C                      
10  11  (0, 1, 2, 3, 4)
20  22        (0, 1, 2)

In [2]: df.explode(['B'])                                                                                                                                                                                                                                    
Out[2]: 
     A  B
C        
10  11  0
10  11  1
10  11  2
10  11  3
10  11  4
20  22  0
20  22  1
20  22  2

@jreback jreback added Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jul 6, 2019
@jreback jreback added this to the 0.25.0 milestone Jul 6, 2019
@pep8speaks commented Jul 6, 2019

Hello @jreback! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-17 19:24:32 UTC

@jreback (Contributor, Author) commented Jul 6, 2019

cc @changhiskhan

This is actually much simpler on a Series; I implemented it in Cython, so it should be pretty performant.
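
Conceptually, the Cython routine amounts to something like the rough
pure-Python sketch below (the explode_sketch name is made up for
illustration; this is not the actual reshape.pyx code):

import numpy as np
import pandas as pd

def explode_sketch(s: pd.Series) -> pd.Series:
    # first pass: how many output rows each input row produces
    # (empty list-likes still contribute one row, which becomes NaN)
    counts = [max(len(v), 1) if isinstance(v, (list, tuple, np.ndarray)) else 1
              for v in s]

    # second pass: flatten the values, substituting NaN for empty
    # list-likes and passing scalars through unchanged
    values = []
    for v in s:
        if isinstance(v, (list, tuple, np.ndarray)):
            values.extend(v if len(v) else [np.nan])
        else:
            values.append(v)

    # repeat each original index label once per emitted value
    return pd.Series(values, index=s.index.repeat(counts), dtype=object)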

@WillAyd (Member) left a comment

Nice PR

@ghost commented Jul 7, 2019

The primary use-case in the original PR/issues is to end up with an expanded dataframe.

Making this a Series method is cleaner code-wise, but makes it far more cumbersome to use.

With only the Series method, this (and the doc example) demands:

df.drop(['col1','col2'],axis=1).join([df['col1'].explode(),df['col2'].explode()])[df.columns]

What would make users happy is:

df.explode(['col1','col2'])

How about adding some sugar? Making it available is good, but making it usable is as important.
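
For illustration, the requested sugar could be a thin wrapper over the
Series method. This is a hypothetical sketch (the explode_columns name
is made up); it is not what the PR implements:

import pandas as pd

def explode_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    # drop the list-like columns, explode each one with the Series
    # method, and join the pieces back on the index, restoring the
    # original column order
    out = df.drop(columns=columns)
    for col in columns:
        out = out.join(df[col].explode())
    return out[df.columns]

Like the one-liner above, joining independently exploded columns pairs
every value of col1 with every value of col2 from the same row (a cross
product), which may or may not be the semantics users expect.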

@WillAyd (Member) left a comment
Very nice change. Minor nit / comment around nested list-likes but otherwise lgtm

@WillAyd (Member) commented Jul 7, 2019

@pilkibun I don't think we want to add that here, but it could be a follow-up if it's something you want to tackle.

@ghost commented Jul 7, 2019

I don't think we want to add that here

That's what the original PR did, it's what the original issue requested, and what the SO questions were asking for. So why on earth would you not include it? And why then would a follow-up PR by me make any difference?

@jreback (Contributor, Author) commented Jul 8, 2019

@cpcloud how does this compare to the semantics of unnest in various backends (e.g. Postgres)?

@ghost commented Jul 8, 2019

@jreback, would you please add a DataFrame method for convenience? I think users will find this feature cumbersome without it.

@mroeschke (Member) commented:
I'd be -0 on adding a DataFrame.explode since the result can idiomatically be obtained by chaining other pandas methods.

@WillAyd (Member) left a comment

Neither of my previous comments around style and nested lists is a blocker, so this lgtm modulo others' comments. Nested lists can certainly be addressed here if you want, but a follow-up is OK as well. Thanks Jeff

@ghost commented Jul 8, 2019

I'd be -0 on adding a DataFrame.explode since the result can idiomatically be obtained by chaining
other pandas methods.

@mroeschke, yes, but it's very verbose. See #27267 (comment). All the user requests specifically care about making the DataFrame case painless.

@jorisvandenbossche (Member) left a comment

API questions:

  • For an empty list, do we want a NaN? Another option could be to have no entry in the result at all. (Just wondering if this was discussed, or if there is prior art; I didn't really think through the use cases.)

  • Do we want to return a MultiIndex that keeps a counter for the values coming from the same original row (or have an option to enable this)? That would make it possible to combine this with unstack to produce multiple columns (a function that is linked as related); see the sketch below.
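
Such a counter can also be recovered after the fact with
groupby/cumcount; a minimal sketch on a toy Series, assuming the
NaN-for-empty behaviour in this PR:

import pandas as pd

s = pd.Series([[1, 2, 3], [4, 5]])
exploded = s.explode()

# number the values coming from the same original row
counter = exploded.groupby(level=0).cumcount()

# use the counter as a second index level and pivot it into columns
exploded.index = pd.MultiIndex.from_arrays([exploded.index, counter])
wide = exploded.unstack()
# wide has columns 0, 1, 2: row 0 -> 1, 2, 3 and row 1 -> 4, 5, NaN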


df[['keys']].join(df['values'].explode())

:meth:`Series.explode` will replace empty lists with ``np.nan`` and preserve scalar entries. The dtype of the resulting ``Series`` is always ``object``.
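
A small illustration of that behaviour on toy data (not output taken from the PR itself):

>>> import pandas as pd
>>> pd.Series([[1, 2], [], 3]).explode()
0      1
0      2
1    NaN
2      3
dtype: object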

A reply from a reviewer (Member):

Another option for empty lists would be to have no entry in the result?

@jorisvandenbossche (Member) commented:
I am +1 on adding the DataFrame.explode method (so having both DataFrame and Series methods), as the original PR did. I agree with @changhiskhan (https://github.com/pandas-dev/pandas/pull/24366/files#r243373508) that this is the most typical use case, so I don't see why we would not provide this convenience.

@jreback (Contributor, Author) commented Jul 11, 2019

cc @jorisvandenbossche
I added df.explode(subset=list-of-columns) (see the top of the PR); I haven't updated the docs or responded to your comments yet.

I even support multi-column exploding, though this is likely not very performant (as it needs a recursive merge), but it's there. This may be a bit of overkill, and we could just allow a single-column explosion.

@icexelloss (Contributor) commented:
For the empty-list discussion, using Will's example from above, Postgres gives 0 rows for an empty array (so no NULL):

test_db=# select a FROM unnest(array[]::integer[]) a;
 a 
---
(0 rows)

But a NULL in the original data is also dropped when unnesting it, so both give the same result here; not sure this is a good reference point.
Does somebody know what Spark does?

@ljin ?

>>> df.show()
+---+-----+------+
| id|array|array2|
+---+-----+------+
|  0|   []|  null|
|  1|   []|  null|
|  2|   []|  null|
|  3|   []|  null|
|  4|   []|  null|
|  5|   []|  null|
|  6|   []|  null|
|  7|   []|  null|
|  8|   []|  null|
|  9|   []|  null|
+---+-----+------+

>>> df.withColumn('exploded', F.explode(df['array'])).show()
+---+-----+------+--------+
| id|array|array2|exploded|
+---+-----+------+--------+
+---+-----+------+--------+

>>> df.withColumn('exploded', F.explode(df['array2'])).show()
+---+-----+------+--------+
| id|array|array2|exploded|
+---+-----+------+--------+
+---+-----+------+--------+

Both result in an empty frame.

@jreback (Contributor, Author) commented Jul 17, 2019

@icexelloss oh, so the row is then excluded. OK; since we propagate NaN, I think that is equivalent.

@jreback (Contributor, Author) commented Jul 17, 2019

@TomAugspurger ready to go here.

@TomAugspurger (Contributor) left a comment
Looks good on a quick skim. I don't have any opinion on whether an empty list or NaN is preferable. I suspect we'll add a keyword in the future, but NaN seems like a fine default.

@jreback (Contributor, Author) commented Jul 17, 2019

Looks good on a quick skim. I don't have any opinion on whether an empty list or NaN is preferable. I suspect we'll add a keyword in the future, but NaN seems like a fine default.

yep

@jreback jreback merged commit f2886c1 into pandas-dev:master Jul 18, 2019
@gfyoung (Member) commented Aug 25, 2019

>>> import pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/pandas/pandas/__init__.py", line 55, in <module>
    from pandas.core.api import (
  File "/home/user/pandas/pandas/core/api.py", line 5, in <module>
    from pandas.core.arrays.integer import (
  File "/home/user/pandas/pandas/core/arrays/__init__.py", line 1, in <module>
    from .base import (  # noqa: F401
  File "/home/user/pandas/pandas/core/arrays/base.py", line 14, in <module>
    from pandas.compat.numpy import function as nv
  File "/home/user/Repositories/pandas/pandas/compat/numpy/function.py", line 28, in <module>
    from pandas.util._validators import (
  File "/home/user/pandas/pandas/util/__init__.py", line 3, in <module>
    from pandas.core.util.hashing import hash_array, hash_pandas_object  # noqa
  File "/home/user/pandas/pandas/core/util/hashing.py", line 11, in <module>
    from pandas.core.dtypes.cast import infer_dtype_from_scalar
  File "/home/user/pandas/pandas/core/dtypes/cast.py", line 9, in <module>
    from pandas.util._validators import validate_bool_kwarg
  File "/home/user/pandas/pandas/util/_validators.py", line 7, in <module>
    from pandas.core.dtypes.common import is_bool
  File "/home/user/Repositories/pandas/pandas/core/dtypes/common.py", line 11, in <module>
    from pandas.core.dtypes.dtypes import (
  File "/home/user/pandas/pandas/core/dtypes/dtypes.py", line 17, in <module>
    from .inference import is_bool, is_list_like
  File "/home/user/pandas/pandas/core/dtypes/inference.py", line 26, in <module>
    is_list_like = lib.is_list_like
AttributeError: module 'pandas._libs.lib' has no attribute 'is_list_like'

This suddenly cropped up. This commit obviously didn't cause it, but I'm not sure what triggered it...

Can anyone else replicate this?

@jbrockmendel (Member) commented:
@gfyoung have you rebuilt recently?

@gfyoung (Member) commented Aug 26, 2019

I just did, and it seems fine now...strange. Looks like a Cython hiccup.
