[ENH] Add DataFrame method to explode a list-like column (GH #16538) #24366

18 changes: 18 additions & 0 deletions asv_bench/benchmarks/reshape.py
@@ -228,4 +228,22 @@ def time_qcut_datetime(self, bins):
        pd.qcut(self.datetime_series, bins)


class Explode(object):
    param_names = ['n_rows', 'max_list_length']
    params = [[100, 1000, 10000], [3, 5, 10]]

    def setup(self, n_rows, max_list_length):
        import string
Contributor commented:

imports at the top of the file

        num_letters = np.random.randint(0, max_list_length, n_rows)
        key_column = [','.join([np.random.choice(list(string.ascii_letters))
                                for _ in range(k)])
                      for k in num_letters]
        value_column = np.random.randn(n_rows)
        self.frame = pd.DataFrame({'key': key_column,
                                   'value': value_column})

    def time_explode(self, n_rows, max_list_length):
        self.frame.explode('key', sep=',')


from .pandas_vb_common import setup  # noqa: F401
31 changes: 31 additions & 0 deletions doc/source/user_guide/reshaping.rst
@@ -801,3 +801,34 @@ Note to subdivide over multiple columns we can pass in a list to the

   df.pivot_table(
       values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])

.. _reshaping.explode:

Exploding a List-like Column
----------------------------

Sometimes the values in a column are list-like:

.. ipython:: python

   keys = ['panda1', 'panda2', 'panda3']
   values = [['eats', 'shoots'], ['shoots', 'leaves'], ['eats', 'leaves']]
   df = pd.DataFrame({'keys': keys, 'values': values})
   df

To put each value onto its own row, use ``DataFrame.explode``:

.. ipython:: python

   df.explode('values')

For convenience, the optional keyword ``sep`` splits a string column on the
given separator before exploding:

.. ipython:: python

   values = ['eats,shoots', 'shoots,leaves', 'eats,shoots,leaves']
   df2 = pd.DataFrame({'keys': keys, 'values': values})
   df2
   df2.explode('values', sep=',')
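
For comparison, a similar result can be sketched without the ``sep`` keyword proposed in this diff. The snippet below assumes a pandas release that already ships `DataFrame.explode(column)` (0.25 or later; this is an assumption about the reader's environment, not something this PR provides) and splits the strings explicitly with `Series.str.split` first.

```python
import pandas as pd

keys = ['panda1', 'panda2', 'panda3']
values = ['eats,shoots', 'shoots,leaves', 'eats,shoots,leaves']
df2 = pd.DataFrame({'keys': keys, 'values': values})

# Split each string into a list, then give every list element its own row.
# The original index label is repeated for rows that came from the same cell.
split_frame = df2.assign(values=df2['values'].str.split(','))
exploded = split_frame.explode('values')
print(exploded)
```

Keeping the split as a separate step leaves the choice of separator, regex, or maximum number of splits to `Series.str.split` rather than to the explode call.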
44 changes: 44 additions & 0 deletions doc/source/whatsnew/v0.24.0.rst
@@ -15,7 +15,51 @@ This is a major release from 0.23.4 and includes a number of API changes, new
features, enhancements, and performance improvements along with a large number
of bug fixes.

<<<<<<< HEAD
These are the changes in pandas 0.24.0. See :ref:`release` for a full changelog
including other versions of pandas.

.. _whatsnew_0240.enhancements:

New features
~~~~~~~~~~~~
- :func:`merge` now directly allows merge between objects of type ``DataFrame`` and named ``Series``, without the need to convert the ``Series`` object into a ``DataFrame`` beforehand (:issue:`21220`)
- ``ExcelWriter`` now accepts ``mode`` as a keyword argument, enabling append to existing workbooks when using the ``openpyxl`` engine (:issue:`3441`)
- ``FrozenList`` has gained the ``.union()`` and ``.difference()`` methods. This functionality greatly simplifies groupby's that rely on explicitly excluding certain columns. See :ref:`Splitting an object into groups <groupby.split>` for more information (:issue:`15475`, :issue:`15506`).
- :func:`DataFrame.to_parquet` now accepts ``index`` as an argument, allowing
  the user to override the engine's default behavior to include or omit the
  dataframe's indexes from the resulting Parquet file. (:issue:`20768`)
- :meth:`DataFrame.corr` and :meth:`Series.corr` now accept a callable for generic calculation methods of correlation, e.g. histogram intersection (:issue:`22684`)
- :func:`DataFrame.to_string` now accepts ``decimal`` as an argument, allowing the user to specify which decimal separator should be used in the output. (:issue:`23614`)
- :func:`read_feather` now accepts ``columns`` as an argument, allowing the user to specify which columns should be read. (:issue:`24025`)
- :func:`DataFrame.to_html` now accepts ``render_links`` as an argument, allowing the user to generate HTML with links to any URLs that appear in the DataFrame.
  See the :ref:`section on writing HTML <io.html>` in the IO docs for example usage. (:issue:`2679`)
- :meth:`DataFrame.explode` has been added to split list-like values onto individual rows. See the :ref:`section on exploding a list-like column <reshaping.explode>` in the docs for more information (:issue:`16538`)
Contributor commented:

will need a sub-section here, show a mini-example and also point to the docs (as you are doing)


.. _whatsnew_0240.values_api:

Accessing the values in a Series or Index
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:attr:`Series.array` and :attr:`Index.array` have been added for extracting the array backing a
``Series`` or ``Index``. (:issue:`19954`, :issue:`23623`)

.. ipython:: python

   idx = pd.period_range('2000', periods=4)
   idx.array
   pd.Series(idx).array

Historically, this would have been done with ``series.values``, but with
``.values`` it was unclear whether the returned value would be the actual array,
some transformation of it, or one of pandas custom arrays (like
``Categorical``). For example, with :class:`PeriodIndex`, ``.values`` generates
a new ndarray of period objects each time.

.. ipython:: python
=======
Highlights include:
>>>>>>> master

* :ref:`Optional Integer NA Support <whatsnew_0240.enhancements.intna>`
* :ref:`New APIs for accessing the array backing a Series or Index <whatsnew_0240.values_api>`
51 changes: 51 additions & 0 deletions pandas/core/frame.py
@@ -6165,6 +6165,57 @@ def melt(self, id_vars=None, value_vars=None, var_name=None,
                    var_name=var_name, value_name=value_name,
                    col_level=col_level)

    def explode(self, col_name, sep=None, dtype=None):
Member commented:

Hmm would this be better as a Series method? Requiring col_name as a parameter makes it so it only operates as such anyway, no?

Member commented:

Started reviewing before I saw your above comment. Still think this is better served as a Series method instead of a frame method with a required col_name argument.

I think this would fail in cases where col_name is not unique

Contributor (author) commented:

At least in use cases I've seen, you'd want to join it back to the rest of the data right away. I wouldn't be opposed to having both if people ask for it, but only having it as a Series method is less useful IMO.

Member commented:

I get your point though I think it would be better to simply return an object that a user can join themselves rather than try to take care of the merging within the method.

It's entirely reasonable to expect this against a Series object, so not offering that I think makes for a more confusing API.

Contributor (author) commented:

I agree it is an entirely plausible/reasonable scenario. However I would prefer to wait until we see people asking about Series.explode on github/mailing-list/stackoverflow to add that to Series. Otherwise if people only actually ever reach for explode in the context of a DataFrame then why bother having it in Series?

Note that this is also consistent with SQL / Spark APIs so I think it's unlikely for a lot of confusion to arise

Contributor commented:

I agree this should be a Series method.
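
The alternative discussed in this thread, a Series-level operation whose result the caller joins back, might look like the sketch below. It assumes a pandas version that provides `Series.explode` (0.25 or later, which is not part of this diff); the frame and column names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'k': ['a,b', 'c,d'], 'v': [0, 1]})

# Series-level step: split the strings, then one row per list element.
exploded_k = df['k'].str.split(',').explode()

# Caller-side step: join the exploded Series back onto the other columns.
# The join aligns on the index, so each original row is repeated as needed.
joined = df.drop(columns='k').join(exploded_k)
print(joined)
```

The trade-off raised by the author is visible here: the Series method is more general, but the common DataFrame use case needs the extra `drop` and `join`, and the exploded column ends up last unless the caller reorders the columns.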

        """
        Create new DataFrame expanding a list-like column.

        .. versionadded:: 0.24.0

        Parameters
        ----------
        col_name : str
Contributor commented:

I think we've been moving towards always using the full column instead of col for parameter names.

Contributor (author) commented:

how would we distinguish between the string name of the column and the column's data?

Contributor (author) commented:

I guess there's enough context here. I'll change to column for consistency then. But in general I'm still curious what the conclusion was for the question above.

            Name of the column to be exploded.
        sep : str, default None
            If given, split the string values of `col_name` on this
            separator before exploding.
        dtype : str or dtype, default None
            Optionally coerce the dtype of the exploded column.

        Returns
        -------
        exploded : DataFrame

        See Also
        --------
        Series.str.split : Split string values on specified separator.
        Series.str.extract : Extract groups from the first regex match.

        Examples
Contributor commented:

Perhaps add a See Also linking to Series.str.split, Series.str.extract? Maybe others?

Are we interested in implementing the inverse operation (what dplyr calls unite: https://tidyr.tidyverse.org/reference/unite.html)?

Contributor (author) commented:

Good idea. I'll add Series.str.split and Series.str.extract here. Which other ones do you think would be relevant?

unite: so it would be like a groupby.agg(list/concat) type of operation? I'm not opposed to it but I think there's no urgency since we haven't had much user demand. My guess is because it maps to groupby so it's more natural to think about than the reverse.
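
The inverse operation mentioned here can be approximated with a group-by aggregation. This is only a sketch, under the assumption that rows sharing an index label came from the same original row; nothing in it is an API proposed by this PR.

```python
import pandas as pd

# An already-exploded frame: rows from the same source row share a label.
exploded = pd.DataFrame({'k': ['a', 'b', 'c', 'd'], 'v': [0, 0, 1, 1]},
                        index=[0, 0, 1, 1])

# Group rows by their (repeated) index label and collapse each group:
# join the exploded strings back together and keep one copy of 'v'.
united = exploded.groupby(level=0).agg({'k': lambda s: ','.join(s),
                                        'v': 'first'})
print(united)
```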

        --------
        >>> df = pd.DataFrame({'k': ['a,b', 'c,d'], 'v': [0, 1]})
        >>> df.explode('k', sep=',')
           k  v
        0  a  0
        0  b  0
        1  c  1
        1  d  1
        """
        col = self[col_name]
        if len(self) == 0:
            return self.copy()
        # Expand each cell into one column per element.
        if sep is not None:
            col_expanded = col.str.split(sep, expand=True)
        else:
            col_expanded = col.apply(Series)
        # Stack the element columns into rows, keeping the original index.
        col_stacked = (col_expanded
                       .stack()
                       .reset_index(level=-1, drop=True)
                       .rename(col_name))
        if dtype is not None:
            col_stacked = col_stacked.astype(dtype)
        # Join the remaining columns back and restore the column order.
        return (col_stacked.to_frame()
                .join(self.drop(col_name, axis=1))
                .reindex(self.columns, axis=1))
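
To make the intended behaviour easier to follow outside of pandas internals, here is a minimal standalone sketch of the same idea. `explode_sketch` is a hypothetical helper written for illustration, not the method from the diff: it swaps the `apply(Series)`/`stack` approach for an explicit loop and omits the `dtype` argument, while keeping the conventions the tests in this PR describe (scalars treated as single elements, missing values and empty lists omitted).

```python
import pandas as pd


def explode_sketch(frame, column, sep=None):
    """Hypothetical standalone helper; not the method from the diff."""
    values = frame[column]
    if sep is not None:
        values = values.str.split(sep)

    positions = []  # positional row number each output element came from
    elements = []   # the exploded values themselves
    for position, cell in enumerate(values):
        # Treat scalars as one-element lists, as the diff's tests expect.
        items = cell if isinstance(cell, (list, tuple)) else [cell]
        for item in items:
            # Skip missing values, mirroring the NaN-dropping in the diff.
            if pd.api.types.is_scalar(item) and pd.isna(item):
                continue
            positions.append(position)
            elements.append(item)

    # Repeat the source rows, then overwrite the exploded column.
    result = frame.iloc[positions].copy()
    result[column] = elements
    return result


df = pd.DataFrame({'k': ['a,b', 'c,d'], 'v': [0, 1]})
out = explode_sketch(df, 'k', sep=',')
print(out)
```

Selecting rows by position rather than by label means the sketch does not depend on the index being unique.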

# ----------------------------------------------------------------------
# Time series-related

95 changes: 95 additions & 0 deletions pandas/tests/frame/test_reshape.py
@@ -900,6 +900,101 @@ def test_unstack_swaplevel_sortlevel(self, level):
        tm.assert_frame_equal(result, expected)


class TestDataFrameExplode(object):
    # GH 16538
    columns = ['a', 'b', 'c']

    def test_sep(self):
        # Automatically do str.split
        df = pd.DataFrame([['foo,bar', 'x', 42],
                           ['fizz,buzz', 'y', 43]],
                          columns=self.columns)
        rs = df.explode('a', sep=',')
        xp = pd.DataFrame({'a': ['foo', 'bar', 'fizz', 'buzz'],
                           'b': ['x', 'x', 'y', 'y'],
                           'c': [42, 42, 43, 43]},
                          index=[0, 0, 1, 1])
        tm.assert_frame_equal(rs, xp)

    def test_dtype(self):
        # Coerce dtype
        df = pd.DataFrame([[[0, 1, 4], 'x', 42],
                           [[2, 3], 'y', 43]],
                          columns=self.columns)
        rs = df.explode('a', dtype='int')
        xp = pd.DataFrame({'a': np.array([0, 1, 4, 2, 3], dtype='int'),
                           'b': ['x', 'x', 'x', 'y', 'y'],
                           'c': [42, 42, 42, 43, 43]},
                          index=[0, 0, 0, 1, 1])
        tm.assert_frame_equal(rs, xp)

    def test_na(self):
        # NaN's and empty lists are omitted
        # TODO: option to preserve explicit NAs instead
        df = pd.DataFrame([[[], 'x', 42],
                           [[2.0, np.nan], 'y', 43]],
                          columns=self.columns)
        rs = df.explode('a')
        xp = pd.DataFrame({'a': [2.0],
                           'b': ['y'],
                           'c': [43]},
                          index=[1])
        tm.assert_frame_equal(rs, xp)

    def test_nonuniform_type(self):
        # Not everything is a list
        df = pd.DataFrame([[[0, 1, 4], 'x', 42],
                           [3, 'y', 43]],
                          columns=self.columns)
        rs = df.explode('a', dtype='int')
        xp = pd.DataFrame({'a': np.array([0, 1, 4, 3], dtype='int'),
                           'b': ['x', 'x', 'x', 'y'],
                           'c': [42, 42, 42, 43]},
                          index=[0, 0, 0, 1])
        tm.assert_frame_equal(rs, xp)

    def test_all_scalars(self):
        # Nothing is a list
        df = pd.DataFrame([[0, 'x', 42],
                           [3, 'y', 43]],
                          columns=self.columns)
        rs = df.explode('a')
        xp = pd.DataFrame({'a': [0, 3],
                           'b': ['x', 'y'],
                           'c': [42, 43]},
                          index=[0, 1])
        tm.assert_frame_equal(rs, xp)

    def test_empty(self):
        # Empty frame
        rs = pd.DataFrame(columns=['a', 'b']).explode('a')
        xp = pd.DataFrame(columns=['a', 'b'])
        tm.assert_frame_equal(rs, xp)

    def test_missing_column(self):
        # Bad column name
        df = pd.DataFrame([[0, 'x', 42],
                           [3, 'y', 43]],
                          columns=self.columns)
        pytest.raises(KeyError, df.explode, 'badcolumnname')

    def test_multi_index(self):
        # Multi-index
        idx = pd.MultiIndex.from_tuples([(0, 'a'), (1, 'b')])
        df = pd.DataFrame([['foo,bar', 'x', 42],
                           ['fizz,buzz', 'y', 43]],
                          columns=self.columns,
                          index=idx)
        rs = df.explode('a', sep=',')
        idx = pd.MultiIndex.from_tuples(
            [(0, 'a'), (0, 'a'), (1, 'b'), (1, 'b')])
        xp = pd.DataFrame({'a': ['foo', 'bar', 'fizz', 'buzz'],
                           'b': ['x', 'x', 'y', 'y'],
                           'c': [42, 42, 43, 43]},
                          index=idx)
        tm.assert_frame_equal(rs, xp)
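
As a point of comparison for `test_na`: the `DataFrame.explode` that ships in pandas 0.25 and later (a different implementation from this diff) keeps empty lists and missing elements as NaN rows instead of omitting them. The sketch below assumes such a release and shows how a follow-up `dropna` would reproduce the result this test expects.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [[], [2.0, np.nan]],
                   'b': ['x', 'y'],
                   'c': [42, 43]})

# Released explode: the empty list and the NaN element each yield a NaN row.
kept = df.explode('a')

# Dropping missing values afterwards matches the expectation in test_na.
dropped = kept.dropna(subset=['a'])
print(dropped)
```

This relates to the TODO in the test: preserving explicit NAs by default and letting the caller drop them is one way to offer both behaviours.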


def test_unstack_fill_frame_object():
    # GH12815 Test unstacking with object.
    data = pd.Series(['a', 'b', 'c', 'a'], dtype='object')