DOC: Enhancing pivot / reshape docs #21038

Merged
merged 15 commits into from
Nov 12, 2018
Changes from 14 commits
110 changes: 104 additions & 6 deletions doc/source/reshaping.rst
@@ -17,6 +17,8 @@ Reshaping and Pivot Tables
Reshaping by pivoting DataFrame objects
---------------------------------------

.. image:: _static/reshaping_pivot.png

.. ipython::
:suppress:

@@ -33,8 +35,7 @@ Reshaping by pivoting DataFrame objects

In [3]: df = unpivot(tm.makeTimeDataFrame())

Data is often stored in CSV files or databases in so-called "stacked" or
"record" format:
Data is often stored in so-called "stacked" or "record" format:

.. ipython:: python

@@ -60,8 +61,6 @@ To select out everything for variable ``A`` we could do:

df[df['variable'] == 'A']

.. image:: _static/reshaping_pivot.png

But suppose we wish to do time series operations with the variables. A better
representation would be where the ``columns`` are the unique variables and an
``index`` of dates identifies individual observations. To reshape the data into
@@ -81,7 +80,7 @@ column:
.. ipython:: python

df['value2'] = df['value'] * 2
pivoted = df.pivot('date', 'variable')
pivoted = df.pivot(index='date', columns='variable')
pivoted

You can then select subsets from the pivoted ``DataFrame``:
@@ -93,6 +92,12 @@ You can then select subsets from the pivoted ``DataFrame``:
Note that this returns a view on the underlying data in the case where the data
are homogeneously-typed.

.. note::
:func:`~pandas.pivot` will error with a ``ValueError: Index contains duplicate
Review comment (Member): Does this render? Might need a space after directive

entries, cannot reshape`` if the index/column pair is not unique. In this
case, consider using :func:`~pandas.pivot_table` which is a generalization
of pivot that can handle duplicate values for one index/column pair.
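
As a rough illustration of the note above, the sketch below builds a small hypothetical frame (``dup``, not part of the PR) in which one date/variable pair occurs twice. ``pivot`` raises, while ``pivot_table`` falls back to its default mean aggregation.

```python
import pandas as pd

# Hypothetical data: the ('2018-01-01', 'A') pair appears twice.
dup = pd.DataFrame({
    'date': ['2018-01-01', '2018-01-01', '2018-01-02'],
    'variable': ['A', 'A', 'A'],
    'value': [1.0, 3.0, 5.0],
})

try:
    dup.pivot(index='date', columns='variable', values='value')
    pivot_failed = False
except ValueError as err:
    # pivot cannot decide which of the duplicate values to keep.
    pivot_failed = True
    print(err)

# pivot_table aggregates the duplicates (mean by default) instead of raising.
table = dup.pivot_table(index='date', columns='variable', values='value')
print(table)
```

Which aggregation is appropriate for the duplicates depends on the data; the mean is simply what ``pivot_table`` uses when ``aggfunc`` is not given.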

.. _reshaping.stacking:

Reshaping by stacking and unstacking
@@ -698,10 +703,103 @@ handling of NaN:
In [3]: np.unique(x, return_inverse=True)[::-1]
Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))


.. note::
If you just want to handle one column as a categorical variable (like R's factor),
you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
see the :ref:`Categorical introduction <categorical>` and the
:ref:`API documentation <api.categorical>`.
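
The two spellings mentioned in this note can be compared on a toy frame. This is a minimal sketch; ``df_cat`` and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical frame with one string column to convert.
df_cat = pd.DataFrame({'col': ['a', 'b', 'a', 'c']})

# Both spellings from the note produce a categorical column.
df_cat['cat_col'] = pd.Categorical(df_cat['col'])
df_cat['cat_col2'] = df_cat['col'].astype('category')

print(df_cat.dtypes)
print(df_cat['cat_col'].cat.categories)  # the distinct labels, sorted
print(df_cat['cat_col'].cat.codes)       # integer code of each row's label
```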

Examples
--------

In this section, we will review frequently asked questions and examples. The
column names and relevant column values are named to correspond with how this
DataFrame will be pivoted in the answers below.

.. ipython:: python

np.random.seed([3, 1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
Review comment (Contributor, @jreback, May 15, 2018): you can just do

In [12]: cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
Out[12]: 
       0     1      2     3
0   key1  row3  item2  col0
1   key0  row2  item1  col4
2   key1  row1  item0  col2
3   key1  row1  item0  col1
4   key0  row3  item1  col2
5   key1  row0  item2  col4
6   key2  row2  item0  col3
7   key2  row0  item2  col2
8   key1  row1  item0  col1
9   key0  row4  item0  col4
10  key0  row0  item1  col2
11  key0  row4  item1  col4
12  key0  row4  item2  col1
13  key1  row1  item1  col1
14  key1  row0  item2  col4
15  key2  row2  item1  col0
16  key2  row2  item2  col0
17  key0  row3  item0  col2
18  key1  row0  item1  col4
19  key0  row3  item1  col2

Review comment (Contributor Author): Thanks! Refactored a bit.

df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
Review comment (Member): Minor nit but can add the columns to the constructor and get rid of the line below

df.columns = cols
df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))
Review comment (Member): Stylistic nit but I think it would be better to use pd.concat instead of join here


df
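
The two review nits on this block (passing the column names to the constructor, and using ``pd.concat`` rather than ``join``) could be combined roughly as follows. This is only a sketch, not the code that was merged: the ``RandomState(0)`` seed, the ``labels``/``vals``/``example`` names and the ``apply``-based prefixing are assumptions made for illustration, so the generated values differ from the ``df`` shown in the PR.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # arbitrary seed, chosen for this sketch only
n = 20
cols = ['key', 'row', 'item', 'col']

# Integer codes per column; the floor division limits 'key' and 'item' to 0..2.
codes = rng.randint(5, size=(n, 4)) // [2, 1, 2, 1]

# Column names go straight into the constructor (first review nit).
labels = pd.DataFrame(codes.astype(str), columns=cols)
# Prefix every value with its column name, e.g. '1' -> 'key1'.
labels = labels.apply(lambda s: s.name + s)

vals = pd.DataFrame(rng.rand(n, 2).round(2)).add_prefix('val')

# pd.concat along the columns instead of join (second review nit).
example = pd.concat([labels, vals], axis=1)
print(example.head())
```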

Review comment (Contributor): you don't need to have these as Question, rather just make an informative title.

Pivoting with Single Aggregations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Suppose we wanted to pivot ``df`` such that the ``col`` values are columns,
``row`` values are the index, and the mean of ``val0`` are the values? In
Review comment (Member): This isn't a question, so replace ? with .

particular, the resulting DataFrame should look like:

.. code-block:: ipython

col   col0   col1   col2   col3  col4
Review comment (Contributor): this should be a ipython block
row
row0  0.77  0.605    NaN  0.860  0.65
row2  0.13    NaN  0.395  0.500  0.25
row3   NaN  0.310    NaN  0.545   NaN
row4   NaN  0.100  0.395  0.760  0.24

This solution uses :func:`~pandas.pivot_table`. Also note that
``aggfunc='mean'`` is the default. It is included here to be explicit.

.. ipython:: python

df.pivot_table(
    values='val0', index='row', columns='col', aggfunc='mean')
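
To make the "mean is the default" remark concrete, here is a small sketch on hypothetical data (``toy`` is not the ``df`` built in the PR). It also spells out roughly the same reshaping with ``groupby`` and ``unstack``, which is one way to think about what ``pivot_table`` does.

```python
import pandas as pd

# Hypothetical toy data: the (r0, c0) pair occurs twice.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val': [1.0, 3.0, 5.0],
})

explicit = toy.pivot_table(
    values='val', index='row', columns='col', aggfunc='mean')
default = toy.pivot_table(values='val', index='row', columns='col')

# Roughly the same reshaping written by hand.
by_hand = toy.groupby(['row', 'col'])['val'].mean().unstack('col')

print(explicit)
print(explicit.equals(default))
```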

Note that we can also replace the missing values by using the ``fill_value``
parameter.

.. ipython:: python

df.pivot_table(
    values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)

Also note that we can pass in other aggregation functions as well. For example,
we can also pass in ``sum``.

.. ipython:: python

df.pivot_table(
    values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)
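
The effect of ``fill_value`` is easiest to see side by side. The sketch below uses a hypothetical ``toy`` frame in which two row/column combinations never occur.

```python
import pandas as pd

# Hypothetical toy data: (r0, c1) and (r1, c0) have no observations.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val': [1.0, 3.0, 5.0],
})

# Unobserved combinations show up as NaN ...
with_nan = toy.pivot_table(
    values='val', index='row', columns='col', aggfunc='sum')

# ... unless a replacement is supplied.
filled = toy.pivot_table(
    values='val', index='row', columns='col', aggfunc='sum', fill_value=0)

print(with_nan)
print(filled)
```

Note that a zero in ``filled`` can mean either "no observations" or "observations that sum to zero", so filling is best reserved for cases where that distinction does not matter.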

Another aggregation we can do is calculate the frequency in which the columns
and rows occur together a.k.a. "cross tabulation". To do this, we can pass
``size`` to the ``aggfunc`` parameter.

.. ipython:: python

df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
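
For comparison, pandas also provides ``pd.crosstab`` for this kind of frequency table. The sketch below, on a hypothetical ``toy`` frame, suggests the two approaches give the same counts; it is an illustration, not part of the PR.

```python
import pandas as pd

# Hypothetical toy data; 'val' is only there so the frame has a value column.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1', 'r2'],
    'col': ['c0', 'c1', 'c0', 'c0', 'c1'],
    'val': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Count co-occurrences with pivot_table ...
by_pivot = toy.pivot_table(
    index='row', columns='col', fill_value=0, aggfunc='size')

# ... and with the dedicated helper.
by_crosstab = pd.crosstab(toy['row'], toy['col'])

print(by_pivot)
print(by_crosstab)
```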

Pivoting with Multiple Aggregations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can also perform multiple aggregations. For example, to perform both a
``sum`` and ``mean``, we can pass in a list to the ``aggfunc`` argument.

.. ipython:: python

df.pivot_table(
    values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])
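
Passing a list to ``aggfunc`` adds an outer column level named after each function. A minimal sketch on hypothetical data (``toy`` is made up for the example) shows how to pull one aggregation back out.

```python
import pandas as pd

# Hypothetical toy data: the (r0, c0) pair occurs twice.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val': [1.0, 3.0, 5.0],
})

table = toy.pivot_table(
    values='val', index='row', columns='col', aggfunc=['mean', 'sum'])

print(table.columns.tolist())  # tuples of (aggregation name, 'col' value)

# Selecting the outer level returns the table for one aggregation.
sums = table['sum']
means = table['mean']
```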

Note to aggregate over multiple value columns, we can pass in a list to the
``values`` parameter.

.. ipython:: python

df.pivot_table(
    values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])
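
With a list for both ``values`` and ``aggfunc``, the columns gain one level per list. The sketch below (hypothetical ``toy`` frame) shows the resulting level order and one way to select a single value column.

```python
import pandas as pd

# Hypothetical toy data with two value columns.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val0': [1.0, 3.0, 5.0],
    'val1': [10.0, 30.0, 50.0],
})

table = toy.pivot_table(
    values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])

# Column levels are (aggregation, value column, 'col' value).
print(table.columns.tolist())

# Drill down to the mean of val1 only.
only_val1 = table['mean']['val1']
print(only_val1)
```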

Note to subdivide over multiple columns we can pass in a list to the
Review comment (Member): Just for readability we don't need to start each of these with "Note"

``columns`` parameter.

.. ipython:: python

df.pivot_table(
    values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])
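
When ``columns`` is a list, each entry becomes its own column level beneath the aggregation and value levels. The following sketch uses a hypothetical ``toy`` frame and ``DataFrame.xs`` to slice out a single ``item``.

```python
import pandas as pd

# Hypothetical toy data with two grouping columns for the column axis.
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1'],
    'item': ['i0', 'i1', 'i0', 'i0'],
    'col': ['c0', 'c0', 'c1', 'c1'],
    'val0': [1.0, 3.0, 5.0, 7.0],
})

table = toy.pivot_table(
    values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])

# Levels: aggregation, value column, 'item', 'col'.
print(table.columns.names)
print(table)

# Cross-section: keep only the columns for item 'i0'.
item0 = table.xs('i0', axis=1, level='item')
print(item0)
```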
72 changes: 47 additions & 25 deletions pandas/core/frame.py
@@ -5484,50 +5484,72 @@ def pivot(self, index=None, columns=None, values=None):
... "C": ["small", "large", "large", "small",
... "small", "large", "small", "small",
... "large"],
... "D": [1, 2, 2, 3, 3, 4, 5, 6, 7]})
... "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
... "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
>>> df
     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7
     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9

This first example aggregates values by taking the sum.

>>> table = pivot_table(df, values='D', index=['A', 'B'],
... columns=['C'], aggfunc=np.sum)
>>> table
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0
bar one      4      5
    two      7      6
foo one      4      1
    two    NaN      6

We can also fill missing values using the `fill_value` parameter.
Review comment (Member): Worth calling out in this example that providing the fill_value has preserved the int dtype, instead of casting to float as np.nan would


>>> table = pivot_table(df, values='D', index=['A', 'B'],
... columns=['C'], aggfunc=np.sum)
... columns=['C'], aggfunc=np.sum, fill_value=0)
>>> table
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0
bar one      4      5
    two      7      6
foo one      4      1
    two      0      6
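
The docstring example can be reproduced outside the diff to inspect the dtype point raised in the review. This sketch reuses the data from the docstring and prints the dtypes rather than asserting them, since dtype handling may vary between pandas versions; ``'sum'`` is passed as a string here instead of ``np.sum``, which is an assumption of the sketch rather than what the docstring uses.

```python
import pandas as pd

# Same data as the docstring example in the diff.
df = pd.DataFrame({
    "A": ["foo"] * 5 + ["bar"] * 4,
    "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
    "C": ["small", "large", "large", "small", "small",
          "large", "small", "small", "large"],
    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9],
})

plain = pd.pivot_table(df, values='D', index=['A', 'B'],
                       columns=['C'], aggfunc='sum')
filled = pd.pivot_table(df, values='D', index=['A', 'B'],
                        columns=['C'], aggfunc='sum', fill_value=0)

# ('foo', 'two') has no 'large' rows: NaN in one table, 0 in the other.
print(plain)
print(filled)

# Compare the column dtypes of the two results.
print(plain.dtypes)
print(filled.dtypes)
```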

The next example aggregates by taking the mean across multiple columns.

>>> table = pivot_table(df, values=['D', 'E'], index=['A', 'C'],
... aggfunc={'D': np.mean,
... 'E': np.mean})
>>> table
                  D         E
               mean      mean
A   C
bar large  5.500000  7.500000
    small  5.500000  8.500000
foo large  2.000000  4.500000
    small  2.333333  4.333333

We can also calculate multiple types of aggregations for any given
value column.

>>> table = pivot_table(df, values=['D', 'E'], index=['A', 'C'],
... aggfunc={'D': np.mean,
... 'E': [min, max, np.mean]})
>>> table
                  D   E
               mean max median min
               mean max      mean min
A   C
bar large  5.500000  16   14.5  13
    small  5.500000  15   14.5  14
foo large  2.000000  10    9.5   9
    small  2.333333  12   11.0   8
bar large  5.500000   9  7.500000   6
    small  5.500000   9  8.500000   8
foo large  2.000000   5  4.500000   4
    small  2.333333   6  4.333333   2
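
A dictionary ``aggfunc`` can be checked against the raw data in the same way. The sketch below reuses the docstring data and passes the aggregations as strings (an assumption of this sketch; the docstring passes ``np.mean``, ``min`` and ``max`` objects), then looks up individual cells by their ``(value column, aggregation)`` key.

```python
import pandas as pd

# Same data as the docstring example in the diff.
df = pd.DataFrame({
    "A": ["foo"] * 5 + ["bar"] * 4,
    "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
    "C": ["small", "large", "large", "small", "small",
          "large", "small", "small", "large"],
    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9],
})

# One aggregation for D, several for E.
table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
                       aggfunc={'D': 'mean', 'E': ['min', 'max', 'mean']})
print(table)

# Individual results are addressed by (value column, aggregation).
d_mean = table[('D', 'mean')]
e_max = table[('E', 'max')]
e_min = table[('E', 'min')]
```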

Returns
-------