[ENH] Add DataFrame method to explode a list-like column (GH #16538) #24366
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,7 +15,51 @@ This is a major release from 0.23.4 and includes a number of API changes, new | |
features, enhancements, and performance improvements along with a large number | ||
of bug fixes. | ||
|
||
<<<<<<< HEAD | ||
These are the changes in pandas 0.24.0. See :ref:`release` for a full changelog | ||
including other versions of pandas. | ||
|
||
.. _whatsnew_0240.enhancements: | ||
|
||
New features | ||
~~~~~~~~~~~~ | ||
- :func:`merge` now directly allows merge between objects of type ``DataFrame`` and named ``Series``, without the need to convert the ``Series`` object into a ``DataFrame`` beforehand (:issue:`21220`) | ||
- ``ExcelWriter`` now accepts ``mode`` as a keyword argument, enabling append to existing workbooks when using the ``openpyxl`` engine (:issue:`3441`) | ||
- ``FrozenList`` has gained the ``.union()`` and ``.difference()`` methods. This functionality greatly simplifies groupby's that rely on explicitly excluding certain columns. See :ref:`Splitting an object into groups <groupby.split>` for more information (:issue:`15475`, :issue:`15506`). | ||
- :func:`DataFrame.to_parquet` now accepts ``index`` as an argument, allowing | ||
the user to override the engine's default behavior to include or omit the | ||
dataframe's indexes from the resulting Parquet file. (:issue:`20768`) | ||
- :meth:`DataFrame.corr` and :meth:`Series.corr` now accept a callable for generic calculation methods of correlation, e.g. histogram intersection (:issue:`22684`) | ||
- :func:`DataFrame.to_string` now accepts ``decimal`` as an argument, allowing the user to specify which decimal separator should be used in the output. (:issue:`23614`) | ||
- :func:`read_feather` now accepts ``columns`` as an argument, allowing the user to specify which columns should be read. (:issue:`24025`) | ||
- :func:`DataFrame.to_html` now accepts ``render_links`` as an argument, allowing the user to generate HTML with links to any URLs that appear in the DataFrame. | ||
See the :ref:`section on writing HTML <io.html>` in the IO docs for example usage. (:issue:`2679`) | ||
- :func:`DataFrame.explode` to split list-like values onto individual rows. See :ref:`section on Exploding list-like column <reshaping.html>` in docs for more information (:issue:`16538`) | ||
Review comment: Will need a sub-section here, show a mini-example and also point to the docs (as you are doing).
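A mini-example of the kind requested here could look roughly like the sketch below. It is not the code from this PR: it assumes a pandas release from 0.25.0 onward, where `DataFrame.explode(column)` exists without the `sep` and `dtype` parameters proposed in this diff, so the string split is done explicitly with `Series.str.split` first. The frame contents are copied from the docstring example in this PR.

```python
import pandas as pd

# Same data as the docstring example in this PR.
df = pd.DataFrame({'k': ['a,b', 'c,d'], 'v': [0, 1]})

# Assumption: pandas >= 0.25, where explode() takes only the column name.
# Split the strings into lists first, then put each element on its own row.
out = df.assign(k=df['k'].str.split(',')).explode('k')
print(out)
```

The original index labels are repeated for every element that came from the same row, which matches the output shown in the proposed docstring.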
||
|
||
.. _whatsnew_0240.values_api: | ||
|
||
Accessing the values in a Series or Index | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
:attr:`Series.array` and :attr:`Index.array` have been added for extracting the array backing a | ||
``Series`` or ``Index``. (:issue:`19954`, :issue:`23623`) | ||
|
||
.. ipython:: python | ||
|
||
idx = pd.period_range('2000', periods=4) | ||
idx.array | ||
pd.Series(idx).array | ||
|
||
Historically, this would have been done with ``series.values``, but with | ||
``.values`` it was unclear whether the returned value would be the actual array, | ||
some transformation of it, or one of pandas custom arrays (like | ||
``Categorical``). For example, with :class:`PeriodIndex`, ``.values`` generates | ||
a new ndarray of period objects each time. | ||
|
||
.. ipython:: python | ||
======= | ||
Highlights include: | ||
>>>>>>> master | ||
|
||
* :ref:`Optional Integer NA Support <whatsnew_0240.enhancements.intna>` | ||
* :ref:`New APIs for accessing the array backing a Series or Index <whatsnew_0240.values_api>` | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6165,6 +6165,57 @@ def melt(self, id_vars=None, value_vars=None, var_name=None, | |
var_name=var_name, value_name=value_name, | ||
col_level=col_level) | ||
|
||
def explode(self, col_name, sep=None, dtype=None): | ||
Review comment: Hmm would this be better as a Series method? Requiring
Review comment: Started reviewing before I saw your above comment. Still think this is better served as a Series method instead of a frame method with a required col_name argument. I think this would fail in cases where
Review comment: At least in use cases I've seen, you'd want to join it back to the rest of the data right away. I wouldn't be opposed to having both if people ask for it, but only having it as a Series method is less useful IMO.
Review comment: I get your point though I think it would be better to simply return an object that a user can join themselves rather than try to take care of the merging within the method. It's entirely reasonable to expect this against a Series object, so not offering that I think makes for a more confusing API.
Review comment: I agree it is an entirely plausible/reasonable scenario. However I would prefer to wait until we see people asking about Note that this is also consistent with SQL / Spark APIs so I think it's unlikely for a lot of confusion to arise
Review comment: I agree this should be a Series method.
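To make the alternative debated in this thread concrete, here is a minimal sketch of a Series-level helper that returns the exploded values and leaves the join to the caller. The name `explode_series` is hypothetical and does not appear in the PR; it only illustrates the shape of the API the reviewers describe.

```python
import pandas as pd

def explode_series(s):
    """Hypothetical Series-level helper: one output row per list element."""
    # Repeat each index label once per element of its list.
    index = [label for label, items in s.items() for _ in items]
    # Flatten the list values in the same order.
    values = [item for items in s for item in items]
    return pd.Series(values, index=index, name=s.name)

df = pd.DataFrame({'k': [['a', 'b'], ['c']], 'v': [0, 1]})
exploded = explode_series(df['k'])

# The caller joins the result back to the remaining columns on the index.
joined = df.drop(columns='k').join(exploded)
print(joined)
```

With this design the join becomes a one-line step for the user, at the cost of the column order changing unless the caller reindexes the columns afterwards.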
||
""" | ||
Create new DataFrame expanding a list-like column. | ||
|
||
.. versionadded:: 0.24.0 | ||
|
||
Parameters | ||
---------- | ||
col_name : str | ||
Review comment: I think we've been moving towards always using the full
Review comment: how would we distinguish between the string name of the column and the column's data?
Review comment: I guess there's enough context here. I'll change to
||
Name of the column to be exploded. | ||
sep : str, default None | ||
Convenience to split a string `col_name` before exploding. | ||
dtype : str or dtype, default None | ||
Optionally coerce the dtype of exploded column. | ||
|
||
Returns | ||
------- | ||
exploded: DataFrame | ||
|
||
See Also | ||
-------- | ||
Series.str.split: Split string values on specified separator. | ||
Series.str.extract: Extract groups from the first regex match. | ||
|
||
Examples | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps add a See Also linking to Are we interested in implementing the inverse operation (what dplyr calls There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good idea. I'll add Series.str.split and Series.str.extract here. Which other ones do you think would be relevant?
|
||
-------- | ||
>>> df = pd.DataFrame({'k': ['a,b', 'c,d'], 'v': [0, 1]}) | ||
>>> df.explode('k', sep=',') | ||
k v | ||
0 a 0 | ||
0 b 0 | ||
1 c 1 | ||
1 d 1 | ||
""" | ||
col = self[col_name] | ||
if len(self) == 0: | ||
return self.copy() | ||
if sep: | ||
col_expanded = col.str.split(sep, expand=True) | ||
else: | ||
col_expanded = col.apply(Series) | ||
col_stacked = (col_expanded | ||
.stack() | ||
.reset_index(level=-1, drop=True) | ||
.rename(col_name)) | ||
if dtype: | ||
col_stacked = col_stacked.astype(dtype) | ||
return (col_stacked.to_frame() | ||
.join(self.drop(col_name, axis=1)) | ||
.reindex(self.columns, axis=1)) | ||
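The expand, stack and join steps of the method body above can be tried outside of pandas internals with a standalone sketch like the following. It is an approximation, not the PR's code: `explode_sketch` is a hypothetical name, the `sep` and `dtype` handling is omitted, and `pd.DataFrame(col.tolist(), ...)` is swapped in for `col.apply(Series)`. The input lists are kept at equal length so that no padding values are introduced before stacking.

```python
import pandas as pd

def explode_sketch(df, col_name):
    """Hypothetical stand-in for the proposed method, without sep/dtype."""
    if len(df) == 0:
        return df.copy()
    col = df[col_name]
    # One column per list position; the original row index is preserved.
    expanded = pd.DataFrame(col.tolist(), index=col.index)
    # Stack positions into rows, then drop the position level of the index.
    stacked = (expanded
               .stack()
               .reset_index(level=-1, drop=True)
               .rename(col_name))
    # Join the other columns back on the index and restore column order.
    return (stacked.to_frame()
            .join(df.drop(col_name, axis=1))
            .reindex(df.columns, axis=1))

df = pd.DataFrame({'k': [[1, 2], [3, 4]], 'v': ['x', 'y']})
result = explode_sketch(df, 'k')
print(result)
```

Because the join is done on the index, each value of `v` is repeated once per element of the list it was paired with.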
|
||
# ---------------------------------------------------------------------- | ||
# Time series-related | ||
|
||
|
Review comment: imports at the top of the file