ENH: Allow storing ExtensionArrays in containers (#19520)
* ENH: non-interval changes

* COMPAT: py2 Super

* BUG: Use original object for extension array

* Consistent boxing / unboxing

NumPy compat

* 32-bit compat

* Add a test array

* linting

* Default __iter__

* Tests for value_counts

* Implement value_counts

* Py2 compat

* Fixed dropna

* Test fixups

* Started setitem

* REF/Clean: Internal / External values

* Move to index base

* Setitem tests, decimal example

* Compat

* Fixed extension block tests.

The only "API change" was that you can't just inherit from
NonConsolidatableMixin, which is OK since

1. it's a mixin
2. geopandas also inherits from Block

* Clarify binop tests

Make it clearer which bit might raise

* TST: Removed ops tests

* Cleanup unique handling

* Simplify object concat

* Use values for intersection

I think eventually we'll want to use ndarray_values for this, but it'll
require a bit more work to support. Currently, using ndarray_values
causes occasional failures on categorical.

* hmm

* More failing tests

* remove bad test

* better setitem

* Dropna works.

* Restore xfail test

* Test Categorical

* Xfail setitem tests

* TST: Skip JSON tests on py2

* Additional testing

* More tests

* ndarray_values

* API: Default ExtensionArray.astype

(cherry picked from commit 943a915562b72bed147c857de927afa0daf31c1a)
(cherry picked from commit fbf0a06)

* Simplify concat_as_object

* Py2 compat

(cherry picked from commit b20e12c)

* Set-ops ugliness

* better docstrings

* tolist

* linting

* Moved dtypes

(cherry picked from commit d136227)

* clean

* cleanup

* NumPy compat

* Use base _values for CategoricalIndex

* Update dev docs

* cleanup

* cleanup

(cherry picked from commit 2425621)

* cleanup

* Linting

* Precision in tests

* Linting

* Move to extension

* Push _ndarray_values to ExtensionArray

Now IndexOpsMixin._ndarray_values will dispatch all the way down to the EA.
Subclasses like Categorical can override it as they see fit.

* Clean up tolist

* Move test locations

* Fixed test

* REF: Update per comments

* lint

* REF: Use _values for size and shape

* PERF: Implement size, shape for IntervalIndex

* PERF: Avoid materializing values for PeriodIndex shape, size

* Cleanup

* Override nbytes

* Remove unused change

* Docs

* Test cleanup

* Always set PANDAS_TESTING_MODE

* Revert "Always set PANDAS_TESTING_MODE"

This reverts commit a312ba5.

* Explicitly catch warnings or not

* fastparquet warnings

* Unicode literals strike again.

Only catch fp warning for newer numpy

* Restore circle env var

* More parquet test catching

* No stacklevel

* Lower bound on FP

* Exact bound for FP

* Don't use fastpath for ExtensionBlock make_block

* Consistently use _values

* TST: Additional constructor tests

* CLN: de-nested a bit

* _fill_value handling

* Handle user provided dtype in constructors.

When the dtype matches, we allow it to proceed.

When the dtype would require coercion, we raise.

* Document ExtensionBlock._maybe_coerce_values

Also changes to use _values as we should

* Created ABCExtensionArray

* TST: Tests for is_object_dtype and is_string_dtype and EAs

* fixup! Handle user provided dtype in constructors.

* Doc for setitem

* Split base tests

* Revert test_parquet changes

* API: Removed _fill_value from the interface

* Push coercion to extension dtype till later

* Linting

* ERR: Better error message for coercion to 3rd party dtypes

* CLN: Make take_nd EA aware

* Revert sparse changes

* Other _typ for ABCExtensionArray

* Test cleanup and expansion.

Tests for concating and aligning frames

* Copy if copy

* TST: remove self param for fixture

* Remove unnecessary EA handling in Series ctor

* API: Removed value_counts

Moved setitem notes to comment

* More doc notes

* Handle expanding a DataFrame with an EA

* Added ExtensionDtype.__eq__

Support for astype

* linting

* REF: is_dtype_equal refactor

Moved from PandasExtensionDtype to ExtensionDtype with one modification:
catch TypeError explicitly.

* Remove reference to dtype being a class

* move

* Moved sparse check to take_nd

* Docstring

* Split tests

* Revert index change

* Copy changes

* Simplify EA implementation names

comments for object vs. str missing values

* Linting
TomAugspurger authored Feb 23, 2018
1 parent 0176f6e commit 01e99de
Showing 32 changed files with 1,276 additions and 130 deletions.
26 changes: 19 additions & 7 deletions pandas/core/algorithms.py
@@ -15,11 +15,12 @@
is_unsigned_integer_dtype, is_signed_integer_dtype,
is_integer_dtype, is_complex_dtype,
is_object_dtype,
is_extension_array_dtype,
is_categorical_dtype, is_sparse,
is_period_dtype,
is_numeric_dtype, is_float_dtype,
is_bool_dtype, needs_i8_conversion,
is_categorical, is_datetimetz,
is_datetimetz,
is_datetime64_any_dtype, is_datetime64tz_dtype,
is_timedelta64_dtype, is_interval_dtype,
is_scalar, is_list_like,
@@ -547,7 +548,7 @@ def value_counts(values, sort=True, ascending=False, normalize=False,
if is_categorical_dtype(values) or is_sparse(values):

# handle Categorical and sparse,
result = Series(values).values.value_counts(dropna=dropna)
result = Series(values)._values.value_counts(dropna=dropna)
result.name = name
counts = result.values

@@ -1292,10 +1293,13 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None,
"""
Specialized Cython take which sets NaN values in one pass
This dispatches to ``take`` defined on ExtensionArrays. It does not
currently dispatch to ``SparseArray.take`` for sparse ``arr``.
Parameters
----------
arr : ndarray
Input array
arr : array-like
Input array.
indexer : ndarray
1-D array of indices to take, subarrays corresponding to -1 value
indices are filled with fill_value
@@ -1315,17 +1319,25 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None,
If False, indexer is assumed to contain no -1 values so no filling
will be done. This short-circuits computation of a mask. Result is
undefined if allow_fill == False and -1 is present in indexer.
Returns
-------
subarray : array-like
May be the same type as the input, or cast to an ndarray.
"""

# TODO(EA): Remove these if / elifs as datetimeTZ, interval, become EAs
# dispatch to internal type takes
if is_categorical(arr):
return arr.take_nd(indexer, fill_value=fill_value,
allow_fill=allow_fill)
if is_extension_array_dtype(arr):
return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
elif is_datetimetz(arr):
return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
elif is_interval_dtype(arr):
return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)

if is_sparse(arr):
arr = arr.get_values()

if indexer is None:
indexer = np.arange(arr.shape[axis], dtype=np.int64)
dtype, fill_value = arr.dtype, arr.dtype.type()
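A rough, hypothetical illustration of the ``allow_fill`` / ``-1`` semantics described in the ``take_nd`` docstring above, using plain NumPy rather than pandas internals:

    import numpy as np

    arr = np.array([10.0, 20.0, 30.0])
    indexer = np.array([0, -1, 2])

    result = arr.take(indexer)  # plain ndarray.take treats -1 as "last element"
    mask = indexer == -1
    result[mask] = np.nan       # emulate allow_fill=True with fill_value=np.nan
    # result is now array([10., nan, 30.])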
90 changes: 69 additions & 21 deletions pandas/core/arrays/base.py
@@ -25,14 +25,13 @@ class ExtensionArray(object):
* isna
* take
* copy
* _formatting_values
* _concat_same_type
Some additional methods are required to satisfy pandas' internal, private
Some additional methods are available to satisfy pandas' internal, private
block API.
* _concat_same_type
* _can_hold_na
* _formatting_values
This class does not inherit from 'abc.ABCMeta' for performance reasons.
Methods and properties required by the interface raise
Expand All @@ -53,13 +52,14 @@ class ExtensionArray(object):
Extension arrays should be able to be constructed with instances of
the class, i.e. ``ExtensionArray(extension_array)`` should return
an instance, not error.
Additionally, certain methods and interfaces are required for this array
to be properly stored inside a ``DataFrame`` or ``Series``.
"""
# '_typ' is for pandas.core.dtypes.generic.ABCExtensionArray.
# Don't override this.
_typ = 'extension'
# ------------------------------------------------------------------------
# Must be a Sequence
# ------------------------------------------------------------------------

def __getitem__(self, item):
# type (Any) -> Any
"""Select a subset of self.
@@ -92,7 +92,46 @@ def __getitem__(self, item):
raise AbstractMethodError(self)

def __setitem__(self, key, value):
# type: (Any, Any) -> None
# type: (Union[int, np.ndarray], Any) -> None
"""Set one or more values inplace.
This method is not required to satisfy the pandas extension array
interface.
Parameters
----------
key : int, ndarray, or slice
When called from, e.g. ``Series.__setitem__``, ``key`` will be
one of
* scalar int
* ndarray of integers.
* boolean ndarray
* slice object
value : ExtensionDtype.type, Sequence[ExtensionDtype.type], or object
value or values to be set at ``key``.
Returns
-------
None
"""
# Some notes to the ExtensionArray implementor who may have ended up
# here. While this method is not required for the interface, if you
# *do* choose to implement __setitem__, then some semantics should be
# observed:
#
# * Setting multiple values : ExtensionArrays should support setting
# multiple values at once, 'key' will be a sequence of integers and
# 'value' will be a same-length sequence.
#
# * Broadcasting : For a sequence 'key' and a scalar 'value',
# each position in 'key' should be set to 'value'.
#
# * Coercion : Most users will expect basic coercion to work. For
# example, a string like '2018-01-01' is coerced to a datetime
# when setting on a datetime64ns array. In general, if the
# __init__ method coerces that value, then so should __setitem__
raise NotImplementedError(_not_implemented_message.format(
type(self), '__setitem__')
)
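The notes above spell out the expected semantics. As a minimal sketch, assuming a hypothetical array whose values live in a NumPy ndarray stored as ``self.data`` (the class and attribute names are illustrative, not part of the interface):

    import numpy as np

    class NumPyBackedArray(object):
        """Hypothetical container used only to illustrate __setitem__."""

        def __init__(self, values):
            self.data = np.asarray(values)

        def __setitem__(self, key, value):
            # NumPy assignment already covers the cases listed above: scalar
            # int, integer ndarray, boolean ndarray, or slice keys; broadcasting
            # a scalar value; and coercion of the value to the backing dtype.
            self.data[key] = value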
@@ -107,6 +146,16 @@ def __len__(self):
# type: () -> int
raise AbstractMethodError(self)

def __iter__(self):
"""Iterate over elements of the array.
"""
# This needs to be implemented so that pandas recognizes extension
# arrays as list-like. The default implementation makes successive
# calls to ``__getitem__``, which may be slower than necessary.
for i in range(len(self)):
yield self[i]

# ------------------------------------------------------------------------
# Required attributes
# ------------------------------------------------------------------------
@@ -132,9 +181,9 @@ def nbytes(self):
# type: () -> int
"""The number of bytes needed to store this object in memory.
If this is expensive to compute, return an approximate lower bound
on the number of bytes needed.
"""
# If this is expensive to compute, return an approximate lower bound
# on the number of bytes needed.
raise AbstractMethodError(self)

# ------------------------------------------------------------------------
@@ -184,8 +233,8 @@ def take(self, indexer, allow_fill=True, fill_value=None):
will be done. This short-circuits computation of a mask. Result is
undefined if allow_fill == False and -1 is present in indexer.
fill_value : any, default None
Fill value to replace -1 values with. By default, this uses
the missing value sentinel for this type, ``self._fill_value``.
Fill value to replace -1 values with. If applicable, this should
use the sentinel missing value for this type.
Notes
-----
@@ -198,17 +247,20 @@ def take(self, indexer, allow_fill=True, fill_value=None):
Examples
--------
Suppose the extension array somehow backed by a NumPy structured array
and that the underlying structured array is stored as ``self.data``.
Then ``take`` may be written as
Suppose the extension array is backed by a NumPy array stored as
``self.data``. Then ``take`` may be written as
.. code-block:: python
def take(self, indexer, allow_fill=True, fill_value=None):
mask = indexer == -1
result = self.data.take(indexer)
result[mask] = self._fill_value
result[mask] = np.nan # NA for this type
return type(self)(result)
See Also
--------
numpy.take
"""
raise AbstractMethodError(self)

@@ -230,17 +282,12 @@ def copy(self, deep=False):
# ------------------------------------------------------------------------
# Block-related methods
# ------------------------------------------------------------------------
@property
def _fill_value(self):
# type: () -> Any
"""The missing value for this type, e.g. np.nan"""
return None

def _formatting_values(self):
# type: () -> np.ndarray
# At the moment, this has to be an array since we use result.dtype
"""An array of values to be printed in, e.g. the Series repr"""
raise AbstractMethodError(self)
return np.array(self)

@classmethod
def _concat_same_type(cls, to_concat):
@@ -257,6 +304,7 @@ def _concat_same_type(cls, to_concat):
"""
raise AbstractMethodError(cls)

@property
def _can_hold_na(self):
# type: () -> bool
"""Whether your array can hold missing values. True by default.
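To make the interface above concrete, here is a condensed, hypothetical implementation. ``ToyDtype`` and ``ToyArray`` are not part of pandas, the import paths are assumed from this commit, and a real array would need more care around missing values and validation:

    import numpy as np

    from pandas.core.arrays import ExtensionArray
    from pandas.core.dtypes.base import ExtensionDtype


    class ToyDtype(ExtensionDtype):
        """Hypothetical extension dtype backed by float64 values."""
        name = 'toy'
        type = float

        @classmethod
        def construct_from_string(cls, string):
            if string == cls.name:
                return cls()
            raise TypeError("Cannot construct a 'ToyDtype' from "
                            "'{}'".format(string))


    class ToyArray(ExtensionArray):
        """Hypothetical array implementing the abstract methods listed above."""

        def __init__(self, values):
            self.data = np.asarray(values, dtype='float64')

        def __getitem__(self, item):
            if isinstance(item, (int, np.integer)):
                return self.data[item]          # scalar
            return type(self)(self.data[item])  # slice, mask, or fancy index

        def __len__(self):
            return len(self.data)

        @property
        def dtype(self):
            return ToyDtype()

        @property
        def nbytes(self):
            return self.data.nbytes

        def isna(self):
            return np.isnan(self.data)

        def take(self, indexer, allow_fill=True, fill_value=None):
            indexer = np.asarray(indexer)
            result = self.data.take(indexer)
            if allow_fill:
                # NaN is the missing-value sentinel for this toy type
                result[indexer == -1] = np.nan if fill_value is None else fill_value
            return type(self)(result)

        def copy(self, deep=False):
            return type(self)(self.data.copy())

        @classmethod
        def _concat_same_type(cls, to_concat):
            return cls(np.concatenate([arr.data for arr in to_concat]))

With these defined, the changes in this commit should let ``pd.Series(ToyArray([1.0, np.nan]))`` hold the array as-is instead of coercing it to an object ndarray.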
57 changes: 47 additions & 10 deletions pandas/core/dtypes/base.py
@@ -1,4 +1,7 @@
"""Extend pandas with custom array types"""
import numpy as np

from pandas import compat
from pandas.errors import AbstractMethodError


@@ -23,6 +26,32 @@ class ExtensionDtype(object):
def __str__(self):
return self.name

def __eq__(self, other):
"""Check whether 'other' is equal to self.
By default, 'other' is considered equal if
* it's a string matching 'self.name'.
* it's an instance of this type.
Parameters
----------
other : Any
Returns
-------
bool
"""
if isinstance(other, compat.string_types):
return other == self.name
elif isinstance(other, type(self)):
return True
else:
return False

def __ne__(self, other):
return not self.__eq__(other)
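Using the hypothetical ``ToyDtype`` from the earlier sketch, the default ``__eq__`` would be expected to behave like:

    >>> dtype = ToyDtype()
    >>> dtype == 'toy'        # string matching ``self.name``
    True
    >>> dtype == ToyDtype()   # instance of the same type
    True
    >>> dtype == 'int64'
    False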

@property
def type(self):
# type: () -> type
@@ -102,11 +131,12 @@ def construct_from_string(cls, string):

@classmethod
def is_dtype(cls, dtype):
"""Check if we match 'dtype'
"""Check if we match 'dtype'.
Parameters
----------
dtype : str or dtype
dtype : object
The object to check.
Returns
-------
@@ -118,12 +148,19 @@ def is_dtype(cls, dtype):
1. ``cls.construct_from_string(dtype)`` is an instance
of ``cls``.
2. 'dtype' is ``cls`` or a subclass of ``cls``.
2. ``dtype`` is an object and is an instance of ``cls``
3. ``dtype`` has a ``dtype`` attribute, and any of the above
conditions is true for ``dtype.dtype``.
"""
if isinstance(dtype, str):
try:
return isinstance(cls.construct_from_string(dtype), cls)
except TypeError:
return False
else:
return issubclass(dtype, cls)
dtype = getattr(dtype, 'dtype', dtype)

if isinstance(dtype, np.dtype):
return False
elif dtype is None:
return False
elif isinstance(dtype, cls):
return True
try:
return cls.construct_from_string(dtype) is not None
except TypeError:
return False
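Again with the hypothetical ``ToyDtype``, the rewritten ``is_dtype`` would be expected to behave like:

    >>> import numpy as np
    >>> ToyDtype.is_dtype('toy')              # via construct_from_string
    True
    >>> ToyDtype.is_dtype(ToyDtype())         # an instance of cls
    True
    >>> ToyDtype.is_dtype(np.dtype('int64'))  # NumPy dtypes are rejected early
    False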
2 changes: 1 addition & 1 deletion pandas/core/dtypes/common.py
@@ -1708,9 +1708,9 @@ def is_extension_array_dtype(arr_or_dtype):
"""
from pandas.core.arrays import ExtensionArray

# we want to unpack series, anything else?
if isinstance(arr_or_dtype, (ABCIndexClass, ABCSeries)):
arr_or_dtype = arr_or_dtype._values

return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))


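A quick check of the behavior with the hypothetical ``ToyArray`` from the earlier sketch:

    >>> from pandas.core.dtypes.common import is_extension_array_dtype
    >>> is_extension_array_dtype(ToyArray([1.0, 2.0]))
    True
    >>> is_extension_array_dtype([1.0, 2.0])
    False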
25 changes: 0 additions & 25 deletions pandas/core/dtypes/dtypes.py
@@ -66,13 +66,6 @@ def __hash__(self):
raise NotImplementedError("sub-classes should implement an __hash__ "
"method")

def __eq__(self, other):
raise NotImplementedError("sub-classes should implement an __eq__ "
"method")

def __ne__(self, other):
return not self.__eq__(other)

def __getstate__(self):
# pickle support; we don't want to pickle the cache
return {k: getattr(self, k, None) for k in self._metadata}
@@ -82,24 +75,6 @@ def reset_cache(cls):
""" clear the cache """
cls._cache = {}

@classmethod
def is_dtype(cls, dtype):
""" Return a boolean if the passed type is an actual dtype that
we can match (via string or type)
"""
if hasattr(dtype, 'dtype'):
dtype = dtype.dtype
if isinstance(dtype, np.dtype):
return False
elif dtype is None:
return False
elif isinstance(dtype, cls):
return True
try:
return cls.construct_from_string(dtype) is not None
except:
return False


class CategoricalDtypeType(type):
"""
2 changes: 2 additions & 0 deletions pandas/core/dtypes/generic.py
@@ -57,6 +57,8 @@ def _check(cls, inst):
ABCDateOffset = create_pandas_abc_type("ABCDateOffset", "_typ",
("dateoffset",))
ABCInterval = create_pandas_abc_type("ABCInterval", "_typ", ("interval", ))
ABCExtensionArray = create_pandas_abc_type("ABCExtensionArray", "_typ",
("extension", "categorical",))


class _ABCGeneric(type):
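These ABCs check the ``_typ`` attribute rather than real inheritance, so any object advertising a registered value passes the isinstance check; a hypothetical illustration:

    >>> from pandas.core.dtypes.generic import ABCExtensionArray
    >>> class Duck(object):
    ...     _typ = 'extension'
    ...
    >>> isinstance(Duck(), ABCExtensionArray)
    True
    >>> isinstance(object(), ABCExtensionArray)
    False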