ENH: Allow storing ExtensionArrays in containers (#19520)
* ENH: non-interval changes

* COMPAT: py2 Super

* BUG: Use original object for extension array

* Consistent boxing / unboxing

NumPy compat

* 32-bit compat

* Add a test array

* linting

* Default __iter__

* Tests for value_counts

* Implement value_counts

* Py2 compat

* Fixed dropna

* Test fixups

* Started setitem

* REF/Clean: Internal / External values

* Move to index base

* Setitem tests, decimal example

* Compat

* Fixed extension block tests.

The only "API change" was that you can't just inherit from
NonConsolidatableMixin, which is OK since

1. it's a mixin
2. geopandas also inherits from Block

* Clarify binop tests

Make it clearer which bit might raise

* TST: Removed ops tests

* Cleanup unique handling

* Simplify object concat

* Use values for intersection

I think eventually we'll want to use ndarray_values for this, but it'll
require a bit more work to support. Currently, using ndarray_values
causes occasional failures on categorical.

* hmm

* More failing tests

* remove bad test

* better setitem

* Dropna works.

* Restore xfail test

* Test Categorical

* Xfail setitem tests

* TST: Skip JSON tests on py2

* Additional testing

* More tests

* ndarray_values

* API: Default ExtensionArray.astype

(cherry picked from commit 943a915562b72bed147c857de927afa0daf31c1a)
(cherry picked from commit fbf0a06)

* Simplify concat_as_object

* Py2 compat

(cherry picked from commit b20e12c)

* Set-ops ugliness

* better docstrings

* tolist

* linting

* Moved dtypes

(cherry picked from commit d136227)

* clean

* cleanup

* NumPy compat

* Use base _values for CategoricalIndex

* Update dev docs

* cleanup

* cleanup

(cherry picked from commit 2425621)

* cleanup

* Linting

* Precision in tests

* Linting

* Move to extension

* Push _ndarray_values to ExtensionArray

Now IndexOpsMixin._ndarray_values will dispatch all the way down to the EA.
Subclasses like Categorical can override it as they see fit.

* Clean up tolist

* Move test locations

* Fixed test

* REF: Update per comments

* lint

* REF: Use _values for size and shape

* PERF: Implement size, shape for IntervalIndex

* PERF: Avoid materializing values for PeriodIndex shape, size

* Cleanup

* Override nbytes

* Remove unused change

* Docs

* Test cleanup

* Always set PANDAS_TESTING_MODE

* Revert "Always set PANDAS_TESTING_MODE"

This reverts commit a312ba5.

* Explicitly catch warnings or not

* fastparquet warnings

* Unicode literals strike again.

Only catch fp warning for newer numpy

* Restore circle env var

* More parquet test catching

* No stacklevel

* Lower bound on FP

* Exact bound for FP

* Don't use fastpath for ExtensionBlock make_block

* Consistently use _values

* TST: Additional constructor tests

* CLN: de-nested a bit

* _fill_value handling

* Handle user provided dtype in constructors.

When the dtype matches, we allow it to proceed.

When the dtype would require coercion, we raise.

* Document ExtensionBlock._maybe_coerce_values

Also changes to use _values as we should

* Created ABCExtensionArray

* TST: Tests for is_object_dtype and is_string_dtype and EAs

* fixup! Handle user provided dtype in constructors.

* Doc for setitem

* Split base tests

* Revert test_parquet changes

* API: Removed _fill_value from the interface

* Push coercion to extension dtype till later

* Linting

* ERR: Better error message for coercion to 3rd party dtypes

* CLN: Make take_nd EA aware

* Revert sparse changes

* Other _typ for ABCExtensionArray

* Test cleanup and expansion.

Tests for concating and aligning frames

* Copy if copy

* TST: remove self param for fixture

* Remove unnecessary EA handling in Series ctor

* API: Removed value_counts

Moved setitem notes to comment

* More doc notes

* Handle expanding a DataFrame with an EA

* Added ExtensionDtype.__eq__

Support for astype

* linting

* REF: is_dtype_equal refactor

Moved from PandasExtensionDtype to ExtensionDtype with one modification:
catch TypeError explicitly.

* Remove reference to dtype being a class

* move

* Moved sparse check to take_nd

* Docstring

* Split tests

* Revert index change

* Copy changes

* Simplify EA implementation names

comments for object vs. str missing values

* Linting
TomAugspurger authored Feb 23, 2018
1 parent 0176f6e commit 01e99de
Showing 32 changed files with 1,276 additions and 130 deletions.
26 changes: 19 additions & 7 deletions pandas/core/algorithms.py
@@ -15,11 +15,12 @@
is_unsigned_integer_dtype, is_signed_integer_dtype,
is_integer_dtype, is_complex_dtype,
is_object_dtype,
is_extension_array_dtype,
is_categorical_dtype, is_sparse,
is_period_dtype,
is_numeric_dtype, is_float_dtype,
is_bool_dtype, needs_i8_conversion,
is_categorical, is_datetimetz,
is_datetimetz,
is_datetime64_any_dtype, is_datetime64tz_dtype,
is_timedelta64_dtype, is_interval_dtype,
is_scalar, is_list_like,
@@ -547,7 +548,7 @@ def value_counts(values, sort=True, ascending=False, normalize=False,
if is_categorical_dtype(values) or is_sparse(values):

# handle Categorical and sparse,
result = Series(values).values.value_counts(dropna=dropna)
result = Series(values)._values.value_counts(dropna=dropna)
result.name = name
counts = result.values

@@ -1292,10 +1293,13 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None,
"""
Specialized Cython take which sets NaN values in one pass
This dispatches to ``take`` defined on ExtensionArrays. It does not
currently dispatch to ``SparseArray.take`` for sparse ``arr``.
Parameters
----------
arr : ndarray
Input array
arr : array-like
Input array.
indexer : ndarray
1-D array of indices to take, subarrays corresponding to -1 value
indices are filled with fill_value
@@ -1315,17 +1319,25 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None,
If False, indexer is assumed to contain no -1 values so no filling
will be done. This short-circuits computation of a mask. Result is
undefined if allow_fill == False and -1 is present in indexer.
Returns
-------
subarray : array-like
May be the same type as the input, or cast to an ndarray.
"""

# TODO(EA): Remove these if / elifs as datetimeTZ, interval, become EAs
# dispatch to internal type takes
if is_categorical(arr):
return arr.take_nd(indexer, fill_value=fill_value,
allow_fill=allow_fill)
if is_extension_array_dtype(arr):
return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
elif is_datetimetz(arr):
return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
elif is_interval_dtype(arr):
return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)

if is_sparse(arr):
arr = arr.get_values()

if indexer is None:
indexer = np.arange(arr.shape[axis], dtype=np.int64)
dtype, fill_value = arr.dtype, arr.dtype.type()
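A rough, hypothetical illustration of the ``allow_fill`` / ``-1`` semantics described in the ``take_nd`` docstring above, using plain NumPy rather than pandas internals:

    import numpy as np

    arr = np.array([10.0, 20.0, 30.0])
    indexer = np.array([0, -1, 2])

    result = arr.take(indexer)  # plain ndarray.take treats -1 as "last element"
    mask = indexer == -1
    result[mask] = np.nan       # emulate allow_fill=True with fill_value=np.nan
    # result is now array([10., nan, 30.])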
90 changes: 69 additions & 21 deletions pandas/core/arrays/base.py
@@ -25,14 +25,13 @@ class ExtensionArray(object):
* isna
* take
* copy
* _formatting_values
* _concat_same_type
Some additional methods are required to satisfy pandas' internal, private
Some additional methods are available to satisfy pandas' internal, private
block API.
* _concat_same_type
* _can_hold_na
* _formatting_values
This class does not inherit from 'abc.ABCMeta' for performance reasons.
Methods and properties required by the interface raise
Expand All @@ -53,13 +52,14 @@ class ExtensionArray(object):
Extension arrays should be able to be constructed with instances of
the class, i.e. ``ExtensionArray(extension_array)`` should return
an instance, not error.
Additionally, certain methods and interfaces are required for this array
to be properly stored inside a ``DataFrame`` or ``Series``.
"""
# '_typ' is for pandas.core.dtypes.generic.ABCExtensionArray.
# Don't override this.
_typ = 'extension'
# ------------------------------------------------------------------------
# Must be a Sequence
# ------------------------------------------------------------------------

def __getitem__(self, item):
# type (Any) -> Any
"""Select a subset of self.
@@ -92,7 +92,46 @@ def __getitem__(self, item):
raise AbstractMethodError(self)

def __setitem__(self, key, value):
# type: (Any, Any) -> None
# type: (Union[int, np.ndarray], Any) -> None
"""Set one or more values inplace.
This method is not required to satisfy the pandas extension array
interface.
Parameters
----------
key : int, ndarray, or slice
When called from, e.g. ``Series.__setitem__``, ``key`` will be
one of
* scalar int
* ndarray of integers.
* boolean ndarray
* slice object
value : ExtensionDtype.type, Sequence[ExtensionDtype.type], or object
value or values to be set at ``key``.
Returns
-------
None
"""
# Some notes to the ExtensionArray implementor who may have ended up
# here. While this method is not required for the interface, if you
# *do* choose to implement __setitem__, then some semantics should be
# observed:
#
# * Setting multiple values : ExtensionArrays should support setting
# multiple values at once, 'key' will be a sequence of integers and
# 'value' will be a same-length sequence.
#
# * Broadcasting : For a sequence 'key' and a scalar 'value',
# each position in 'key' should be set to 'value'.
#
# * Coercion : Most users will expect basic coercion to work. For
# example, a string like '2018-01-01' is coerced to a datetime
# when setting on a datetime64ns array. In general, if the
# __init__ method coerces that value, then so should __setitem__
raise NotImplementedError(_not_implemented_message.format(
type(self), '__setitem__')
)
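The notes above spell out the expected semantics. As a minimal sketch, assuming a hypothetical array whose values live in a NumPy ndarray stored as ``self.data`` (the class and attribute names are illustrative, not part of the interface):

    import numpy as np

    class NumPyBackedArray(object):
        """Hypothetical container used only to illustrate __setitem__."""

        def __init__(self, values):
            self.data = np.asarray(values)

        def __setitem__(self, key, value):
            # NumPy assignment already covers the cases listed above: scalar
            # int, integer ndarray, boolean ndarray, or slice keys; broadcasting
            # a scalar value; and coercion of the value to the backing dtype.
            self.data[key] = value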
@@ -107,6 +146,16 @@ def __len__(self):
# type: () -> int
raise AbstractMethodError(self)

def __iter__(self):
"""Iterate over elements of the array.
"""
# This needs to be implemented so that pandas recognizes extension
# arrays as list-like. The default implementation makes successive
# calls to ``__getitem__``, which may be slower than necessary.
for i in range(len(self)):
yield self[i]

# ------------------------------------------------------------------------
# Required attributes
# ------------------------------------------------------------------------
@@ -132,9 +181,9 @@ def nbytes(self):
# type: () -> int
"""The number of bytes needed to store this object in memory.
If this is expensive to compute, return an approximate lower bound
on the number of bytes needed.
"""
# If this is expensive to compute, return an approximate lower bound
# on the number of bytes needed.
raise AbstractMethodError(self)

# ------------------------------------------------------------------------
@@ -184,8 +233,8 @@ def take(self, indexer, allow_fill=True, fill_value=None):
will be done. This short-circuits computation of a mask. Result is
undefined if allow_fill == False and -1 is present in indexer.
fill_value : any, default None
Fill value to replace -1 values with. By default, this uses
the missing value sentinel for this type, ``self._fill_value``.
Fill value to replace -1 values with. If applicable, this should
use the sentinel missing value for this type.
Notes
-----
@@ -198,17 +247,20 @@ def take(self, indexer, allow_fill=True, fill_value=None):
Examples
--------
Suppose the extension array somehow backed by a NumPy structured array
and that the underlying structured array is stored as ``self.data``.
Then ``take`` may be written as
Suppose the extension array is backed by a NumPy array stored as
``self.data``. Then ``take`` may be written as
.. code-block:: python
def take(self, indexer, allow_fill=True, fill_value=None):
mask = indexer == -1
result = self.data.take(indexer)
result[mask] = self._fill_value
result[mask] = np.nan # NA for this type
return type(self)(result)
See Also
--------
numpy.take
"""
raise AbstractMethodError(self)

@@ -230,17 +282,12 @@ def copy(self, deep=False):
# ------------------------------------------------------------------------
# Block-related methods
# ------------------------------------------------------------------------
@property
def _fill_value(self):
# type: () -> Any
"""The missing value for this type, e.g. np.nan"""
return None

def _formatting_values(self):
# type: () -> np.ndarray
# At the moment, this has to be an array since we use result.dtype
"""An array of values to be printed in, e.g. the Series repr"""
raise AbstractMethodError(self)
return np.array(self)

@classmethod
def _concat_same_type(cls, to_concat):
@@ -257,6 +304,7 @@ def _concat_same_type(cls, to_concat):
"""
raise AbstractMethodError(cls)

@property
def _can_hold_na(self):
# type: () -> bool
"""Whether your array can hold missing values. True by default.
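To make the interface above concrete, here is a condensed, hypothetical implementation. ``ToyDtype`` and ``ToyArray`` are not part of pandas, the import paths are assumed from this commit, and a real array would need more care around missing values and validation:

    import numpy as np

    from pandas.core.arrays import ExtensionArray
    from pandas.core.dtypes.base import ExtensionDtype


    class ToyDtype(ExtensionDtype):
        """Hypothetical extension dtype backed by float64 values."""
        name = 'toy'
        type = float

        @classmethod
        def construct_from_string(cls, string):
            if string == cls.name:
                return cls()
            raise TypeError("Cannot construct a 'ToyDtype' from "
                            "'{}'".format(string))


    class ToyArray(ExtensionArray):
        """Hypothetical array implementing the abstract methods listed above."""

        def __init__(self, values):
            self.data = np.asarray(values, dtype='float64')

        def __getitem__(self, item):
            if isinstance(item, (int, np.integer)):
                return self.data[item]          # scalar
            return type(self)(self.data[item])  # slice, mask, or fancy index

        def __len__(self):
            return len(self.data)

        @property
        def dtype(self):
            return ToyDtype()

        @property
        def nbytes(self):
            return self.data.nbytes

        def isna(self):
            return np.isnan(self.data)

        def take(self, indexer, allow_fill=True, fill_value=None):
            indexer = np.asarray(indexer)
            result = self.data.take(indexer)
            if allow_fill:
                # NaN is the missing-value sentinel for this toy type
                result[indexer == -1] = np.nan if fill_value is None else fill_value
            return type(self)(result)

        def copy(self, deep=False):
            return type(self)(self.data.copy())

        @classmethod
        def _concat_same_type(cls, to_concat):
            return cls(np.concatenate([arr.data for arr in to_concat]))

With these defined, the changes in this commit should let ``pd.Series(ToyArray([1.0, np.nan]))`` hold the array as-is instead of coercing it to an object ndarray.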
57 changes: 47 additions & 10 deletions pandas/core/dtypes/base.py
@@ -1,4 +1,7 @@
"""Extend pandas with custom array types"""
import numpy as np

from pandas import compat
from pandas.errors import AbstractMethodError


@@ -23,6 +26,32 @@ class ExtensionDtype(object):
def __str__(self):
return self.name

def __eq__(self, other):
"""Check whether 'other' is equal to self.
By default, 'other' is considered equal if
* it's a string matching 'self.name'.
* it's an instance of this type.
Parameters
----------
other : Any
Returns
-------
bool
"""
if isinstance(other, compat.string_types):
return other == self.name
elif isinstance(other, type(self)):
return True
else:
return False

def __ne__(self, other):
return not self.__eq__(other)
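Using the hypothetical ``ToyDtype`` from the earlier sketch, the default ``__eq__`` would be expected to behave like:

    >>> dtype = ToyDtype()
    >>> dtype == 'toy'        # string matching ``self.name``
    True
    >>> dtype == ToyDtype()   # instance of the same type
    True
    >>> dtype == 'int64'
    False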

@property
def type(self):
# type: () -> type
@@ -102,11 +131,12 @@ def construct_from_string(cls, string):

@classmethod
def is_dtype(cls, dtype):
"""Check if we match 'dtype'
"""Check if we match 'dtype'.
Parameters
----------
dtype : str or dtype
dtype : object
The object to check.
Returns
-------
@@ -118,12 +148,19 @@ def is_dtype(cls, dtype):
1. ``cls.construct_from_string(dtype)`` is an instance
of ``cls``.
2. 'dtype' is ``cls`` or a subclass of ``cls``.
2. ``dtype`` is an object and is an instance of ``cls``
3. ``dtype`` has a ``dtype`` attribute, and any of the above
conditions is true for ``dtype.dtype``.
"""
if isinstance(dtype, str):
try:
return isinstance(cls.construct_from_string(dtype), cls)
except TypeError:
return False
else:
return issubclass(dtype, cls)
dtype = getattr(dtype, 'dtype', dtype)

if isinstance(dtype, np.dtype):
return False
elif dtype is None:
return False
elif isinstance(dtype, cls):
return True
try:
return cls.construct_from_string(dtype) is not None
except TypeError:
return False
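Again with the hypothetical ``ToyDtype``, the rewritten ``is_dtype`` would be expected to behave like:

    >>> import numpy as np
    >>> ToyDtype.is_dtype('toy')              # via construct_from_string
    True
    >>> ToyDtype.is_dtype(ToyDtype())         # an instance of cls
    True
    >>> ToyDtype.is_dtype(np.dtype('int64'))  # NumPy dtypes are rejected early
    False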
2 changes: 1 addition & 1 deletion pandas/core/dtypes/common.py
@@ -1708,9 +1708,9 @@ def is_extension_array_dtype(arr_or_dtype):
"""
from pandas.core.arrays import ExtensionArray

# we want to unpack series, anything else?
if isinstance(arr_or_dtype, (ABCIndexClass, ABCSeries)):
arr_or_dtype = arr_or_dtype._values

return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))


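A quick check of the behavior with the hypothetical ``ToyArray`` from the earlier sketch:

    >>> from pandas.core.dtypes.common import is_extension_array_dtype
    >>> is_extension_array_dtype(ToyArray([1.0, 2.0]))
    True
    >>> is_extension_array_dtype([1.0, 2.0])
    False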
25 changes: 0 additions & 25 deletions pandas/core/dtypes/dtypes.py
@@ -66,13 +66,6 @@ def __hash__(self):
raise NotImplementedError("sub-classes should implement an __hash__ "
"method")

def __eq__(self, other):
raise NotImplementedError("sub-classes should implement an __eq__ "
"method")

def __ne__(self, other):
return not self.__eq__(other)

def __getstate__(self):
# pickle support; we don't want to pickle the cache
return {k: getattr(self, k, None) for k in self._metadata}
@@ -82,24 +75,6 @@ def reset_cache(cls):
""" clear the cache """
cls._cache = {}

@classmethod
def is_dtype(cls, dtype):
""" Return a boolean if the passed type is an actual dtype that
we can match (via string or type)
"""
if hasattr(dtype, 'dtype'):
dtype = dtype.dtype
if isinstance(dtype, np.dtype):
return False
elif dtype is None:
return False
elif isinstance(dtype, cls):
return True
try:
return cls.construct_from_string(dtype) is not None
except:
return False


class CategoricalDtypeType(type):
"""
2 changes: 2 additions & 0 deletions pandas/core/dtypes/generic.py
@@ -57,6 +57,8 @@ def _check(cls, inst):
ABCDateOffset = create_pandas_abc_type("ABCDateOffset", "_typ",
("dateoffset",))
ABCInterval = create_pandas_abc_type("ABCInterval", "_typ", ("interval", ))
ABCExtensionArray = create_pandas_abc_type("ABCExtensionArray", "_typ",
("extension", "categorical",))


class _ABCGeneric(type):
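These ABCs check the ``_typ`` attribute rather than real inheritance, so any object advertising a registered value passes the isinstance check; a hypothetical illustration:

    >>> from pandas.core.dtypes.generic import ABCExtensionArray
    >>> class Duck(object):
    ...     _typ = 'extension'
    ...
    >>> isinstance(Duck(), ABCExtensionArray)
    True
    >>> isinstance(object(), ABCExtensionArray)
    False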