ENH: Implement DataFrame.astype('category')
jschendel committed Feb 26, 2018
1 parent 92dbc78 commit 4c51064
Showing 4 changed files with 139 additions and 28 deletions.
101 changes: 83 additions & 18 deletions doc/source/categorical.rst
@@ -44,11 +44,26 @@ The categorical data type is useful in the following cases:
* As a signal to other Python libraries that this column should be treated as a categorical
variable (e.g. to use suitable statistical methods or plot types).

.. note::

In contrast to R's `factor` function, categorical data does not convert input values to
strings; categories end up with the same data type as the original values.

.. note::

In contrast to R's `factor` function, there is currently no way to assign/change labels at
creation time. Use `categories` to change the categories after creation time.

See also the :ref:`API docs on categoricals<api.categorical>`.

.. _categorical.objectcreation:

Object Creation
---------------

Series Creation
~~~~~~~~~~~~~~~

Categorical ``Series`` or columns in a ``DataFrame`` can be created in several ways:

By specifying ``dtype="category"`` when constructing a ``Series``:
@@ -77,7 +92,7 @@ discrete bins. See the :ref:`example on tiling <reshaping.tile.cut>` in the docs
df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df.head(10)
By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to a `DataFrame`.
By passing a :class:`pandas.Categorical` object to a ``Series`` or assigning it to a ``DataFrame``.

.. ipython:: python
@@ -89,6 +104,56 @@ By passing a :class:`pandas.Categorical` object to a `Series` or assigning it to
df["B"] = raw_cat
df
Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:

.. ipython:: python

    df.dtypes

DataFrame Creation
~~~~~~~~~~~~~~~~~~

Columns in a ``DataFrame`` can be batch converted to categorical, either at the time of construction
or after construction. The conversion to categorical is done on a column by column basis; labels present
in one column will not be carried over and used as categories in another column.

Columns can be batch converted by specifying ``dtype="category"`` when constructing a ``DataFrame``:

.. ipython:: python

    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')}, dtype="category")
    df.dtypes

Note that the categories present in each column differ; since the conversion is done on a column by column
basis, only labels present in a given column are categories:

.. ipython:: python

    df['A']
    df['B']

.. versionadded:: 0.23.0

Similarly, columns in an existing ``DataFrame`` can be batch converted using :meth:`DataFrame.astype`:

.. ipython:: python

    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
    df_cat = df.astype('category')
    df_cat.dtypes

This conversion is likewise done on a column by column basis:

.. ipython:: python

    df_cat['A']
    df_cat['B']

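As a minimal sketch of the column-by-column behavior described above (assuming a pandas version in which ``DataFrame.astype('category')`` is supported, i.e. 0.23.0 or later), the categories of each column can be inspected directly. The variable names ``cats_a`` and ``cats_b`` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
df_cat = df.astype("category")

# Each column infers its categories from its own values only;
# nothing is shared between columns.
cats_a = list(df_cat["A"].cat.categories)
cats_b = list(df_cat["B"].cat.categories)

print(cats_a)  # ['a', 'b', 'c']
print(cats_b)  # ['b', 'c', 'd']
```

Column ``A`` never sees the label ``'d'`` and column ``B`` never sees ``'a'``, so the two resulting dtypes differ.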
Controlling Behavior
~~~~~~~~~~~~~~~~~~~~

In the examples above where we passed ``dtype='category'``, we used the default
behavior:

@@ -108,21 +173,30 @@ of :class:`~pandas.api.types.CategoricalDtype`.
s_cat = s.astype(cat_type)
s_cat
Categorical data has a specific ``category`` :ref:`dtype <basics.dtypes>`:
Similarly, a ``CategoricalDtype`` can be used with a ``DataFrame`` to ensure that categories
are consistent among all columns.

.. ipython:: python
df.dtypes
df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
cat_type = CategoricalDtype(categories=list('abcd'),
ordered=True)
df_cat = df.astype(cat_type)
df_cat['A']
df_cat['B']
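One practical consequence of sharing a single ordered ``CategoricalDtype`` across columns is that the columns become directly comparable. The following is a hedged sketch, not part of the commit; ``a_before_b`` is an illustrative name.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
df_cat = df.astype(cat_type)

# Both columns share one ordered dtype, so elementwise ordering
# comparisons between them are well defined.
a_before_b = (df_cat["A"] < df_cat["B"]).tolist()
print(a_before_b)  # [True, True, False, True]
```

With the default ``'category'`` conversion the two columns would carry different categories, and pandas rejects this kind of comparison between categoricals whose categories do not match.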
.. note::
If you already have `codes` and `categories`, you can use the
:func:`~pandas.Categorical.from_codes` constructor to save the factorize step
during normal constructor mode:

In contrast to R's `factor` function, categorical data is not converting input values to
strings and categories will end up the same data type as the original values.
.. ipython:: python
.. note::
splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
In contrast to R's `factor` function, there is currently no way to assign/change labels at
creation time. Use `categories` to change the categories after creation time.
Regaining Original Data
~~~~~~~~~~~~~~~~~~~~~~~

To get back to the original ``Series`` or NumPy array, use
``Series.astype(original_dtype)`` or ``np.asarray(categorical)``:
@@ -136,15 +210,6 @@ To get back to the original ``Series`` or NumPy array, use
s2.astype(str)
np.asarray(s2)
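Since the definition of ``s2`` falls outside the lines shown in this hunk, here is a self-contained sketch of the same round trip. The series ``s`` and its values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Cast back to plain strings; the result no longer has a categorical dtype.
as_str = s.astype(str)

# Or materialize the values as a NumPy array.
as_array = np.asarray(s)

print(as_str.tolist())
print(as_array.tolist())
```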
If you already have `codes` and `categories`, you can use the
:func:`~pandas.Categorical.from_codes` constructor to save the factorize step
during normal constructor mode:

.. ipython:: python
splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
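The example above draws random codes, so its output varies between runs. A deterministic sketch of the same ``from_codes`` call, with a fixed and purely illustrative list of codes, makes the mapping from codes to categories easier to see:

```python
import pandas as pd

# Each code is a position in `categories`: 0 -> "train", 1 -> "test".
codes = [0, 1, 1, 0, 1]
s = pd.Series(pd.Categorical.from_codes(codes, categories=["train", "test"]))

print(s.tolist())  # ['train', 'test', 'test', 'train', 'test']
```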
.. _categorical.categoricaldtype:

CategoricalDtype
32 changes: 32 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -268,6 +268,38 @@ The :func:`DataFrame.assign` now accepts dependent keyword arguments for python

df.assign(A=df.A+1, C= lambda df: df.A* -1)


.. _whatsnew_0230.enhancements.astype_category:

``DataFrame.astype`` performs columnwise conversion to ``Categorical``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:meth:`DataFrame.astype` can now perform columnwise conversion to ``Categorical`` by supplying the string ``'category'`` or a :class:`~pandas.api.types.CategoricalDtype`.
Previously, attempting this would raise a ``NotImplementedError``. (:issue:`18099`)

Supplying the string ``'category'`` performs columnwise conversion, with only labels appearing in a given column set as categories:

.. ipython:: python

    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
    df = df.astype('category')
    df['A'].dtype
    df['B'].dtype


Supplying a ``CategoricalDtype`` will make the categories in each column consistent with the supplied dtype:

.. ipython:: python

    from pandas.api.types import CategoricalDtype
    df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')})
    cdt = CategoricalDtype(categories=list('abcd'), ordered=True)
    df = df.astype(cdt)
    df['A'].dtype
    df['B'].dtype

See the :ref:`categorical.objectcreation` section of the documentation for more details and examples.
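The difference between the two forms can be checked by comparing the resulting column dtypes. This is a small sketch under the assumption that the enhancement is available; the names ``inferred`` and ``shared`` are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# 'category': each column infers its own categories.
inferred = df.astype("category")
inferred_match = inferred["A"].dtype == inferred["B"].dtype

# Explicit CategoricalDtype: every column receives the same dtype.
shared = df.astype(CategoricalDtype(categories=list("abcd"), ordered=True))
shared_match = shared["A"].dtype == shared["B"].dtype

print(inferred_match)  # False
print(shared_match)    # True
```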

.. _whatsnew_0230.enhancements.other:

Other Enhancements
9 changes: 7 additions & 2 deletions pandas/core/generic.py
@@ -18,6 +18,7 @@
is_number,
is_integer, is_bool,
is_bool_dtype,
is_categorical_dtype,
is_numeric_dtype,
is_datetime64_dtype,
is_timedelta64_dtype,
@@ -4429,14 +4430,18 @@ def astype(self, dtype, copy=True, errors='raise', **kwargs):
                if col_name not in self:
                    raise KeyError('Only a column name can be used for the '
                                   'key in a dtype mappings argument.')
            from pandas import concat
            results = []
            for col_name, col in self.iteritems():
                if col_name in dtype:
                    results.append(col.astype(dtype[col_name], copy=copy))
                else:
                    results.append(results.append(col.copy() if copy else col))
            return concat(results, axis=1, copy=False)
            return pd.concat(results, axis=1, copy=False)

        elif is_categorical_dtype(dtype) and self.ndim > 1:
            # GH 18099: columnwise conversion to categorical
            results = (self[col].astype(dtype, copy=copy) for col in self)
            return pd.concat(results, axis=1, copy=False)

        # else, only a single dtype is given
        new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
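The new ``elif`` branch converts each column separately and concatenates the results. A standalone sketch of that idea, written against the public pandas API, might look as follows. The helper ``astype_category_columnwise`` is hypothetical and exists only for illustration; it omits the ``copy`` handling of the real method and assumes unique column names.

```python
import pandas as pd


def astype_category_columnwise(frame, dtype="category"):
    """Hypothetical helper mirroring the columnwise branch in this diff."""
    # Convert every column on its own, then stitch the columns back together.
    results = [frame[col].astype(dtype) for col in frame]
    return pd.concat(results, axis=1)


df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
out = astype_category_columnwise(df)
print(out.dtypes)
```

Converting one column at a time is what keeps the categories of one column from leaking into another when the string ``'category'`` is supplied.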
25 changes: 17 additions & 8 deletions pandas/tests/frame/test_dtypes.py
@@ -8,11 +8,11 @@

import numpy as np
from pandas import (DataFrame, Series, date_range, Timedelta, Timestamp,
compat, concat, option_context)
Categorical, compat, concat, option_context)
from pandas.compat import u
from pandas import _np_version_under1p14

from pandas.core.dtypes.dtypes import DatetimeTZDtype
from pandas.core.dtypes.dtypes import DatetimeTZDtype, CategoricalDtype
from pandas.tests.frame.common import TestData
from pandas.util.testing import (assert_series_equal,
assert_frame_equal,
@@ -619,12 +619,21 @@ def test_astype_duplicate_col(self):
        expected = concat([a1_str, b, a2_str], axis=1)
        assert_frame_equal(result, expected)

    @pytest.mark.parametrize('columns', [['x'], ['x', 'y'], ['x', 'y', 'z']])
    def test_categorical_astype_ndim_raises(self, columns):
        # GH 18004
        msg = '> 1 ndim Categorical are not supported at this time'
        with tm.assert_raises_regex(NotImplementedError, msg):
            DataFrame(columns=columns).astype('category')
    @pytest.mark.parametrize('dtype', [
        'category',
        CategoricalDtype(),
        CategoricalDtype(ordered=True),
        CategoricalDtype(ordered=False),
        CategoricalDtype(categories=list('abcdef')),
        CategoricalDtype(categories=list('edba'), ordered=False),
        CategoricalDtype(categories=list('edcb'), ordered=True)], ids=repr)
    def test_astype_categorical(self, dtype):
        # GH 18099
        d = {'A': list('abbc'), 'B': list('bccd'), 'C': list('cdde')}
        df = DataFrame(d)
        result = df.astype(dtype)
        expected = DataFrame({k: Categorical(d[k], dtype=dtype) for k in d})
        tm.assert_frame_equal(result, expected)

    @pytest.mark.parametrize("cls", [
        pd.api.types.CategoricalDtype,
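The property checked by ``test_astype_categorical`` can be reproduced outside the test suite using only public imports. This sketch covers a single dtype case and picks a category list that contains every value in the data, which is an assumption made here to keep the example simple; it is not one of the parametrized cases above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

d = {"A": list("abbc"), "B": list("bccd"), "C": list("cdde")}
dtype = CategoricalDtype(categories=list("abcde"), ordered=True)

# Converting the whole frame at once ...
result = pd.DataFrame(d).astype(dtype)

# ... should match building each column as a Categorical with that dtype.
expected = pd.DataFrame({k: pd.Categorical(d[k], dtype=dtype) for k in d})

pd.testing.assert_frame_equal(result, expected)
```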
