Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow storing ExtensionArrays in containers #19520

Merged
merged 139 commits into from
Feb 23, 2018
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
139 commits
Select commit Hold shift + click to select a range
b7069ef
ENH: non-interval changes
TomAugspurger Feb 2, 2018
9cd92c7
COMPAT: py2 Super
TomAugspurger Feb 3, 2018
9211bbd
BUG: Use original object for extension array
TomAugspurger Feb 3, 2018
80f83a6
Consistent boxing / unboxing
TomAugspurger Feb 3, 2018
ca004d8
32-bit compat
TomAugspurger Feb 3, 2018
5d4a686
Add a test array
TomAugspurger Feb 5, 2018
9f4ad42
linting
TomAugspurger Feb 5, 2018
b1db4e8
Default __iter__
TomAugspurger Feb 5, 2018
00d6bb3
Tests for value_counts
TomAugspurger Feb 5, 2018
1608e3d
Implement value_counts
TomAugspurger Feb 5, 2018
52e2180
Py2 compat
TomAugspurger Feb 5, 2018
e6d06e2
Fixed dropna
TomAugspurger Feb 5, 2018
d356f19
Test fixups
TomAugspurger Feb 6, 2018
a6ae340
Started setitem
TomAugspurger Feb 6, 2018
41f09d8
REF/Clean: Internal / External values
TomAugspurger Feb 3, 2018
29cfd7c
Move to index base
TomAugspurger Feb 6, 2018
05eb9f3
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 6, 2018
82fd0c6
Setitem tests, decimal example
TomAugspurger Feb 7, 2018
8b1e7d6
Compat
TomAugspurger Feb 7, 2018
10af4b6
Fixed extension block tests.
TomAugspurger Feb 7, 2018
cd5f1eb
Clarify binop tests
TomAugspurger Feb 7, 2018
0a9d9fd
TST: Removed ops tests
TomAugspurger Feb 7, 2018
3185f4e
Cleanup unique handling
TomAugspurger Feb 7, 2018
ffa888a
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 7, 2018
5a59591
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 7, 2018
476f75d
Simplify object concat
TomAugspurger Feb 7, 2018
b15ee5a
Use values for intersection
TomAugspurger Feb 7, 2018
659073f
hmm
TomAugspurger Feb 7, 2018
b15ecac
More failing tests
TomAugspurger Feb 7, 2018
88b8f4f
remove bad test
TomAugspurger Feb 7, 2018
349ac1a
better setitem
TomAugspurger Feb 8, 2018
27ab045
Dropna works.
TomAugspurger Feb 8, 2018
8358fb1
Restore xfail test
TomAugspurger Feb 8, 2018
8ef34a9
Test Categorical
TomAugspurger Feb 8, 2018
340d11b
Xfail setitem tests
TomAugspurger Feb 8, 2018
8297888
TST: Skip JSON tests on py2
TomAugspurger Feb 8, 2018
7accb67
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 8, 2018
9b8d2a5
Additional testing
TomAugspurger Feb 8, 2018
9fbac29
More tests
TomAugspurger Feb 8, 2018
55305dc
ndarray_values
TomAugspurger Feb 8, 2018
0e63708
API: Default ExtensionArray.astype
TomAugspurger Feb 8, 2018
fbbbc8a
Simplify concat_as_object
TomAugspurger Feb 8, 2018
46a0a49
Py2 compat
TomAugspurger Feb 8, 2018
2c4445a
Set-ops ugliness
TomAugspurger Feb 8, 2018
5612cda
better docstrings
TomAugspurger Feb 8, 2018
b012c19
tolist
TomAugspurger Feb 8, 2018
d49e6aa
linting
TomAugspurger Feb 8, 2018
d7d31ee
Moved dtypes
TomAugspurger Feb 9, 2018
7b89f1b
clean
TomAugspurger Feb 9, 2018
b0dbffd
cleanup
TomAugspurger Feb 9, 2018
66b936f
NumPy compat
TomAugspurger Feb 9, 2018
32ee0ef
Use base _values for CategoricalIndex
TomAugspurger Feb 9, 2018
a9882e2
Update dev docs
TomAugspurger Feb 9, 2018
f53652a
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
2425621
cleanup
TomAugspurger Feb 9, 2018
fc337de
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 9, 2018
6abe9da
cleanup
TomAugspurger Feb 9, 2018
512fb89
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
c5db5da
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 9, 2018
0b112f2
cleanup
TomAugspurger Feb 9, 2018
170d0c7
Linting
TomAugspurger Feb 9, 2018
402620f
Precision in tests
TomAugspurger Feb 9, 2018
b556f83
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 9, 2018
268aabc
Linting
TomAugspurger Feb 9, 2018
d671259
Move to extension
TomAugspurger Feb 9, 2018
cb740ed
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 9, 2018
d9e8dd6
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
815d202
Push _ndarray_values to ExtensionArray
TomAugspurger Feb 11, 2018
a727b21
Clean up tolist
TomAugspurger Feb 11, 2018
f368c29
Move test locations
TomAugspurger Feb 11, 2018
d74c5c9
Fixed test
TomAugspurger Feb 12, 2018
0ec6958
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 12, 2018
8104ee5
REF: Update per comments
TomAugspurger Feb 12, 2018
4d08218
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 12, 2018
f8e29b9
lint
TomAugspurger Feb 12, 2018
0cd9faa
REF: Use _values for size and shape
TomAugspurger Feb 12, 2018
8fcdb70
PERF: Implement size, shape for IntervalIndex
TomAugspurger Feb 12, 2018
34a6a22
PERF: Avoid materializing values for PeriodIndex shape, size
TomAugspurger Feb 12, 2018
b4de5c4
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 12, 2018
c233c28
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 13, 2018
d6e8051
Cleanup
TomAugspurger Feb 13, 2018
3af8a21
Override nbytes
TomAugspurger Feb 13, 2018
3a5118b
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 13, 2018
1e8e87e
Remove unused change
TomAugspurger Feb 13, 2018
0f5e4f0
Docs
TomAugspurger Feb 13, 2018
1436d1d
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 13, 2018
c4dab88
Test cleanpu
TomAugspurger Feb 13, 2018
a312ba5
Always set PANDAS_TESTING_MODE
TomAugspurger Feb 13, 2018
758689f
Revert "Always set PANDAS_TESTING_MODE"
TomAugspurger Feb 13, 2018
02c3d40
Explicitly catch warnings or not
TomAugspurger Feb 13, 2018
9e17037
fastparquet warnings
TomAugspurger Feb 13, 2018
4599db4
Unicode literals strikes again.
TomAugspurger Feb 14, 2018
d34d9ca
Restore circle env var
TomAugspurger Feb 14, 2018
29d2528
More parquet test catching
TomAugspurger Feb 14, 2018
905d72e
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 14, 2018
412c951
No stacklevel
TomAugspurger Feb 14, 2018
78834f1
Lower bound on FP
TomAugspurger Feb 15, 2018
f8eac55
Exact bound for FP
TomAugspurger Feb 15, 2018
f09c863
Don't use fastpath for ExtensionBlock make_block
TomAugspurger Feb 15, 2018
cedb63d
Consistently use _values
TomAugspurger Feb 16, 2018
cae2c26
TST: Additional constructor tests
TomAugspurger Feb 16, 2018
8088096
CLN: de-nested a bit
TomAugspurger Feb 16, 2018
8aed325
_fill_value handling
TomAugspurger Feb 16, 2018
453728a
Handle user provided dtype in constructors.
TomAugspurger Feb 16, 2018
cc13c8d
Document ExtensionBlock._maybe_coerce_values
TomAugspurger Feb 16, 2018
f90ac07
Created ABCExtensionArray
TomAugspurger Feb 16, 2018
4a03b26
TST: Tests for is_object_dtype and is_string_dtype and EAs
TomAugspurger Feb 16, 2018
635223f
fixup! Handle user provided dtype in constructors.
TomAugspurger Feb 17, 2018
cf423a7
Doc for setitem
TomAugspurger Feb 17, 2018
2d1a66c
Split base tests
TomAugspurger Feb 17, 2018
c849865
Revert test_parquet changes
TomAugspurger Feb 17, 2018
c3ec822
API: Removed _fill_value from the interface
TomAugspurger Feb 17, 2018
f4cf45c
Push coercion to extension dtype till later
TomAugspurger Feb 17, 2018
9c5d479
Linting
TomAugspurger Feb 17, 2018
1175c0d
ERR: Better error message for coercion to 3rd party dtypes
TomAugspurger Feb 17, 2018
c816d99
CLN: Make take_nd EA aware
TomAugspurger Feb 17, 2018
9c9f59e
Revert sparse changes
TomAugspurger Feb 17, 2018
08af9a3
Other _typ for ABCExtensionArray
TomAugspurger Feb 17, 2018
2e992f7
Test cleanup and expansion.
TomAugspurger Feb 17, 2018
cc5cc3e
Copy if copy
TomAugspurger Feb 17, 2018
704ee67
TST: remove self param for fixture
TomAugspurger Feb 18, 2018
c262968
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 19, 2018
8bf0334
Remove unnescessary EA handling in Series ctor
TomAugspurger Feb 20, 2018
c8d88da
API: Removed value_counts
TomAugspurger Feb 20, 2018
24f3b60
More doc notes
TomAugspurger Feb 20, 2018
50bd5dd
Handle expanding a DataFrame with an EA
TomAugspurger Feb 20, 2018
879bc84
Added ExtensionDtype.__eq__
TomAugspurger Feb 20, 2018
33c9d1f
linting
TomAugspurger Feb 21, 2018
f07c166
REF: is_dtype_equal refactor
TomAugspurger Feb 21, 2018
79d43b1
Remove reference to dtype being a class
TomAugspurger Feb 21, 2018
91c629b
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 21, 2018
a1ebf53
move
TomAugspurger Feb 22, 2018
aa57cad
Moved sparse check to take_nd
TomAugspurger Feb 22, 2018
c82748c
Docstring
TomAugspurger Feb 22, 2018
e919343
Split tests
TomAugspurger Feb 22, 2018
1ea74da
Revert index change
TomAugspurger Feb 22, 2018
0c41a34
Copy changes
TomAugspurger Feb 22, 2018
009bece
Simplify EA implementation names
TomAugspurger Feb 22, 2018
ea5562b
Linting
TomAugspurger Feb 22, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 9 additions & 9 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,6 @@
"""An interface for extending pandas with custom arrays."""
import numpy as np

from pandas.errors import AbstractMethodError

_not_implemented_message = "{} does not implement {}."
Expand All @@ -18,19 +20,20 @@ class ExtensionArray(object):

* __getitem__
* __len__
* __iter__
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche we removed this earlier since collections implementing __len__ and __getitem__ are iterable. But (strangely?) isinstance(arr, collections.Iterable) is False, so these aren't considered "list-like" unless they actually implement __iter__.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do we use this somewhere? (the expectation that isinstance(arr, collections.Iterable) is True)

As the docs https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable says this is not a correct way to test that something is iterable, exactly for the reason you just mentioned.

Anyway, if we add it, I would add a default implementation (return iter(self) ?) and not require subclasses to provide one manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But do we use this somewhere?

Yes, sorry. We use it in Series.__init__, via is_list_like(data).

Will add the default implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, iter(self) will cause infinite recursion, and we can't assume anything about how the data is stored. That's unfortunate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could fall back to

for i in range(len(self)):
   yield self[i]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry. We use it in Series.init, via is_list_like(data).

Or we could fix is_list_like ..

Wait, iter(self) will cause infinite recursion, and we can't assume anything about how the data is stored.

Why would it cause infinite recursion?
With iter(self) we don't assume anything about how the data is stored, I don't know how iter exactly works under the hood, but I assume it just uses __getitem__ ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we could fix is_list_like

Prior to #17654 we did that, but it was fragile to things like str (the class, not an instance) being considered iterable. Not sure if changing it back is an improvement.

Why would it cause infinite recursion?

def __iter__(self):
    return iter(self)

will call self.__iter__, which calls self.__iter__, ...

Copy link
Member

@jorisvandenbossche jorisvandenbossche Feb 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, yes of course :-)
Anyhow, the other way is a fine default implementation

* dtype
* nbytes
* isna
* take
* copy
* _formatting_values
* _concat_same_type

Some additional methods are required to satisfy pandas' internal, private
block API.

* _concat_same_type
* _can_hold_na
* _formatting_values
* _fill_value

This class does not inherit from 'abc.ABCMeta' for performance reasons.
Methods and properties required by the interface raise
Expand All @@ -51,9 +54,6 @@ class ExtensionArray(object):
Extension arrays should be able to be constructed with instances of
the class, i.e. ``ExtensionArray(extension_array)`` should return
an instance, not error.

Additionally, certain methods and interfaces are required for proper
this array to be properly stored inside a ``DataFrame`` or ``Series``.
"""
# ------------------------------------------------------------------------
# Must be a Sequence
Expand Down Expand Up @@ -177,9 +177,9 @@ def take(self, indexer, allow_fill=True, fill_value=None):

Examples
--------
Suppose the extension array somehow backed by a NumPy structured array
and that the underlying structured array is stored as ``self.data``.
Then ``take`` may be written as
Suppose the extension array somehow backed by a NumPy array and that
the underlying structured array is stored as ``self.data``. Then
``take`` may be written as

.. code-block:: python

Expand Down Expand Up @@ -219,7 +219,7 @@ def _formatting_values(self):
# type: () -> np.ndarray
# At the moment, this has to be an array since we use result.dtype
"""An array of values to be printed in, e.g. the Series repr"""
raise AbstractMethodError(self)
return np.array(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NO, this should be implemented only by subclasses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very sensible default IMO, subclasses can always override if needed


@classmethod
def _concat_same_type(cls, to_concat):
Expand Down
12 changes: 9 additions & 3 deletions pandas/core/dtypes/base.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,6 @@
"""Extend pandas with custom array types"""
import inspect

from pandas.errors import AbstractMethodError


Expand Down Expand Up @@ -106,7 +108,8 @@ def is_dtype(cls, dtype):

Parameters
----------
dtype : str or dtype
dtype : str, object, or type
The dtype to check.

Returns
-------
Expand All @@ -118,12 +121,15 @@ def is_dtype(cls, dtype):

1. ``cls.construct_from_string(dtype)`` is an instance
of ``cls``.
2. 'dtype' is ``cls`` or a subclass of ``cls``.
2. ``dtype`` is an object and is an instance of ``cls``
3. 'dtype' is a class and is ``cls`` or a subclass of ``cls``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this have double-ticks?

"""
if isinstance(dtype, str):
try:
return isinstance(cls.construct_from_string(dtype), cls)
except TypeError:
return False
else:
elif inspect.isclass(dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have 2 conditions the same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issubclass vs. isinstance :) Want to be able to handle .is_dtype(ExtensionDtype) and .is_dtype(ExtensionDtype()).

return issubclass(dtype, cls)
else:
return isinstance(dtype, cls)
3 changes: 2 additions & 1 deletion pandas/core/dtypes/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -1708,9 +1708,10 @@ def is_extension_array_dtype(arr_or_dtype):
"""
from pandas.core.arrays import ExtensionArray

# we want to unpack series, anything else?
if isinstance(arr_or_dtype, ABCSeries):
arr_or_dtype = arr_or_dtype._values
elif isinstance(arr_or_dtype, ABCIndexClass):
arr_or_dtype = arr_or_dtype._as_best_array()
return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))


Expand Down
32 changes: 21 additions & 11 deletions pandas/core/dtypes/missing.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,9 +10,10 @@
is_datetimelike_v_numeric, is_float_dtype,
is_datetime64_dtype, is_datetime64tz_dtype,
is_timedelta64_dtype, is_interval_dtype,
is_complex_dtype, is_categorical_dtype,
is_complex_dtype,
is_string_like_dtype, is_bool_dtype,
is_integer_dtype, is_dtype_equal,
is_extension_array_dtype,
needs_i8_conversion, _ensure_object,
pandas_dtype,
is_scalar,
Expand Down Expand Up @@ -52,12 +53,15 @@ def isna(obj):


def _isna_new(obj):
from ..arrays import ExtensionArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls don't do this. rather define an ABCExtensionArray. instead. This is the purpose of these, to avoid smell like this.


if is_scalar(obj):
return libmissing.checknull(obj)
# hack (for now) because MI registers as ndarray
elif isinstance(obj, ABCMultiIndex):
raise NotImplementedError("isna is not defined for MultiIndex")
elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass)):
elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass,
ExtensionArray)):
return _isna_ndarraylike(obj)
elif isinstance(obj, ABCGeneric):
return obj._constructor(obj._data.isna(func=isna))
Expand Down Expand Up @@ -124,17 +128,20 @@ def _use_inf_as_na(key):


def _isna_ndarraylike(obj):

values = getattr(obj, 'values', obj)
dtype = values.dtype

if is_string_dtype(dtype):
if is_categorical_dtype(values):
from pandas import Categorical
if not isinstance(values, Categorical):
values = values.values
result = values.isna()
elif is_interval_dtype(values):
if is_extension_array_dtype(obj):
if isinstance(obj, ABCIndexClass):
values = obj._as_best_array()
elif isinstance(obj, ABCSeries):
values = obj._values
else:
values = obj
result = values.isna()
elif is_string_dtype(dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it was already there, but I find the use of is_string_dtype rather confusing here, as intervals are not strings at all (but it just checks for string or object dtype ..).
But maybe can do

elif is_interval_dtype(values):
    ...
elif is string_dtype(values):
    ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of this, pls fix is_string_dtype (to exclude interval as well as period dtypes (you might be able to say object and not extension_type).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fix itself once IntervalArray is a proper extension type. Added tests to confirm that extension types are not True for is_string_dtype and is_object_dtype.

if is_interval_dtype(values):
# TODO(IntervalArray): remove this if block
from pandas import IntervalIndex
result = IntervalIndex(obj).isna()
else:
Expand Down Expand Up @@ -406,4 +413,7 @@ def remove_na_arraylike(arr):
"""
Return array-like containing only true/non-NaN values, possibly empty.
"""
return arr[notna(lib.values_from_object(arr))]
if is_extension_array_dtype(arr):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ultimately values_from_object just calls .get_values() on the object if it exists. This smells like something is missing from the EA interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this if / elif can be avoided. values_from_object gets and calls arr.get_values if defined, which returns an ndarray. But we need to call notna on the EA itself, since it has special missing value handling.

return arr[notna(arr)]
else:
return arr[notna(lib.values_from_object(arr))]
20 changes: 9 additions & 11 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@
is_categorical_dtype,
is_object_dtype,
is_extension_type,
is_extension_array_dtype,
is_datetimetz,
is_datetime64_any_dtype,
is_datetime64tz_dtype,
Expand Down Expand Up @@ -71,7 +72,7 @@
create_block_manager_from_arrays,
create_block_manager_from_blocks)
from pandas.core.series import Series
from pandas.core.arrays import Categorical
from pandas.core.arrays import Categorical, ExtensionArray
import pandas.core.algorithms as algorithms
from pandas.compat import (range, map, zip, lrange, lmap, lzip, StringIO, u,
OrderedDict, raise_with_traceback)
Expand Down Expand Up @@ -511,7 +512,7 @@ def _get_axes(N, K, index=index, columns=columns):
index, columns = _get_axes(len(values), 1)
return _arrays_to_mgr([values], columns, index, columns,
dtype=dtype)
elif is_datetimetz(values):
elif (is_datetimetz(values) or is_extension_array_dtype(values)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the dtype is an extension dtype? (like the check above for categorical)
But certainly fine with not supporting it for now.

Similar for Series, we need to either define what pd.Series(values, dtype=SomeExtensionDtype()) does, or either error on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [7]: pd.Series([0, 1, 2], dtype=cyberpandas.IPType())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-7fc01964da1d> in <module>()
----> 1 pd.Series([0, 1, 2], dtype=cyberpandas.IPType())

~/sandbox/pandas-ip/pandas/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    246             else:
    247                 data = _sanitize_array(data, index, dtype, copy,
--> 248                                        raise_cast_failure=True)
    249
    250                 data = SingleBlockManager(data, index, fastpath=True)

~/sandbox/pandas-ip/pandas/pandas/core/series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
   3214         if dtype is not None:
   3215             try:
-> 3216                 subarr = _try_cast(data, False)
   3217             except Exception:
   3218                 if raise_cast_failure:  # pragma: no cover

~/sandbox/pandas-ip/pandas/pandas/core/series.py in _try_cast(arr, take_fast_path)
   3168             subarr = maybe_cast_to_datetime(arr, dtype)
   3169             if not is_extension_type(subarr):
-> 3170                 subarr = np.array(subarr, dtype=dtype, copy=copy)
   3171         except (ValueError, TypeError):
   3172             if is_categorical_dtype(dtype):

TypeError: data type not understood

How should we handle this? I'm fine with raising with a better error message. We don't know how to cast from (arbitrary array, extension_dtype) -> extension array.

Although... maybe we could support Series(extension_array, dtype=extension_dtype).

Oh hey, that "works"

In [10]: pd.Series(cyberpandas.IPArray([0, 1, 2]), dtype=cyberpandas.IPType())
Out[10]:
0    0.0.0.0
1    0.0.0.1
2    0.0.0.2
dtype: ip

But it only works since we ignore the dtype entirely :)

I will raise if data.dtype is not dtype equal with dtype I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also for things like astype(SomeExtensionDtype) we will need an informative error message

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't a dateimetz an extension type now? why can't you remove the first part of the clauase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, not yet. Just Categorical so far. I have Interval and Period started, but no PRs yet.

# GH19157
if columns is None:
columns = [0]
Expand Down Expand Up @@ -2796,15 +2797,15 @@ def reindexer(value):
# now align rows
value = reindexer(value).T

elif isinstance(value, Categorical):
elif isinstance(value, ExtensionArray):
value = value.copy()

elif isinstance(value, Index) or is_sequence(value):
from pandas.core.series import _sanitize_index

# turn me into an ndarray
value = _sanitize_index(value, self.index, copy=False)
if not isinstance(value, (np.ndarray, Index)):
if not isinstance(value, (np.ndarray, Index, ExtensionArray)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use is_array_like here (and maybe in other places)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rust remove this change until we support array-backed indexes.

if isinstance(value, list) and len(value) > 0:
value = maybe_convert_platform(value)
else:
Expand All @@ -2826,7 +2827,7 @@ def reindexer(value):
value = maybe_cast_to_datetime(value, value.dtype)

# return internal types directly
if is_extension_type(value):
if is_extension_type(value) or is_extension_array_dtype(value):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unfortunate intermediate stage where some of our internal extension types (sparse, datetime w/ tz) are not yet actual extension array types. We'll be able to clean this up this eventually.

return value

# broadcast across multiple columns if necessary
Expand Down Expand Up @@ -3355,12 +3356,9 @@ class max type
new_obj = self.copy()

def _maybe_casted_values(index, labels=None):
if isinstance(index, PeriodIndex):
values = index.astype(object).values
elif isinstance(index, DatetimeIndex) and index.tz is not None:
values = index
else:
values = index.values
values = index._as_best_array()
# TODO: Check if nescessary...
if not isinstance(index, (PeriodIndex, DatetimeIndex)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you trying to do here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is identical to what happened before in the else branch. Seems like it's doing some type inference / coercion.

if values.dtype == np.object_:
values = lib.maybe_convert_objects(values)

Expand Down
34 changes: 33 additions & 1 deletion pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
from pandas import compat

from pandas.core.accessor import CachedAccessor
from pandas.core.arrays import ExtensionArray
from pandas.core.dtypes.generic import (
ABCSeries, ABCDataFrame,
ABCMultiIndex,
Expand Down Expand Up @@ -1038,6 +1039,31 @@ def _to_embed(self, keep_tz=False, dtype=None):

return self.values.copy()

def _as_best_array(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a need for a method like this. It may be good to add to Series as well, and maybe make public.

Essentially, we want a clear way to get the "best" .values. For Indexes with ExtensionArray types (Categorical for now, Interval, Period soon) that's the extension array. Otherwise, it's an ndarray (or I think `DatetimeIndex for datetime w/ tz right now).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this will also depend on decision if we will change return value for .values for series with period and interval objects)
For Series we already have something like ._values. Won't that be the equivalent there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. Maybe I should re-use that convention.

Unfortunately, DatetimeIndex[tz]._values returns an ndarray, w/o the TZ, where it "should" return a DatetimeIndex (and eventually an ExtensionArray). I'll see how badly changing that breaks things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes ths should be equiv of _values. DTI should be fixed as a pre-cursor PR. no problem making a different method name for this, but then simply rename _values to that name (I don't personally like _as_best_array), maybe (_as_array) is better. Should again be a pre-cursor independent PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #19548

# type: () -> Union[ExtensionArray, ndarary]
"""Return the underlying values as the best array type.

Indexes backed by ExtensionArrays will return the ExtensionArray.
Otherwise, an ndarray is returned.

Examples
--------
>>> pd.Index([0, 1, 2])._as_best_array()
array([0, 1, 2])

>>> pd.CategoricalIndex(['a', 'a', 'b'])._as_best_array()
[a, a, b]
Categories (2, object): [a, b]

>>> pd.IntervalIndex.from_breaks([0, 1, 2])._as_best_array()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntervalArray([(0, 1], (1, 2]])
"""
# We need this since CategoricalIndex.values -> Categorical
# but IntervalIndex.values -> ndarray[object]
# TODO: IntervalIndex defines _array_values. Would be nice to
# have an unambiguous way of getting an ndarray (or just use asarray?)
return self.values

_index_shared_docs['astype'] = """
Create an Index with values cast to dtypes. The class of a new Index
is determined by dtype. When conversion is impossible, a ValueError
Expand Down Expand Up @@ -1946,6 +1972,12 @@ def _format_with_header(self, header, na_rep='NaN', **kwargs):

if is_categorical_dtype(values.dtype):
values = np.array(values)

elif isinstance(values, ExtensionArray):
# This is still un-exercised within pandas, since all our
# extension dtypes have custom indexes.
values = values._formatting_values()

elif is_object_dtype(values.dtype):
values = lib.maybe_convert_objects(values, safe=1)

Expand Down Expand Up @@ -2525,7 +2557,7 @@ def get_value(self, series, key):
# if we have something that is Index-like, then
# use this, e.g. DatetimeIndex
s = getattr(series, '_values', None)
if isinstance(s, Index) and is_scalar(key):
if isinstance(s, (ExtensionArray, Index)) and is_scalar(key):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_array_like might work here

try:
return s[key]
except (IndexError, ValueError):
Expand Down
3 changes: 3 additions & 0 deletions pandas/core/indexes/category.py
Original file line number Diff line number Diff line change
Expand Up @@ -297,6 +297,9 @@ def get_values(self):
""" return the underlying data as an ndarray """
return self._data.get_values()

def _as_best_array(self):
return self._data

def tolist(self):
return self._data.tolist()

Expand Down
7 changes: 7 additions & 0 deletions pandas/core/indexes/datetimes.py
Original file line number Diff line number Diff line change
Expand Up @@ -1034,6 +1034,13 @@ def _to_embed(self, keep_tz=False, dtype=None):

return self.values.copy()

def _as_best_array(self):
# no-tz -> ndarray
# tz -> DatetimeIndex (for now)
if self.tz is not None:
return self
return self.values

def to_pydatetime(self):
"""
Return DatetimeIndex as object ndarray of datetime.datetime objects
Expand Down
29 changes: 22 additions & 7 deletions pandas/core/internals.py
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,10 @@
is_null_datelike_scalar)
import pandas.core.dtypes.concat as _concat

from pandas.core.dtypes.generic import ABCSeries, ABCDatetimeIndex
from pandas.core.dtypes.generic import (
ABCSeries,
ABCDatetimeIndex,
ABCIndexClass)
import pandas.core.common as com
import pandas.core.algorithms as algos

Expand Down Expand Up @@ -1854,6 +1857,20 @@ class ExtensionBlock(NonConsolidatableMixIn, Block):

ExtensionArrays are limited to 1-D.
"""

def __init__(self, values, placement, ndim=None):
values = self._maybe_coerce_values(values)
super(ExtensionBlock, self).__init__(values, placement, ndim)

def _maybe_coerce_values(self, values):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref #19492

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this called anywhere else? (e.g. IOW I think can eliminate some code if maybe this was defined in the superclass)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is only called in __init__, and was made into a separate method on your question. It's not defined in a superclass, so cannot be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#19492

Eventually Block.init will call self._maybe_coerce_values and we can remove every subclasses's init. But there's some other work to be done first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

# Unboxes Series / Index
# Doesn't change any underlying dtypes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this not just _as_best_array() (or new name), this kind of if logic is polluting (yes its all over the place, a major goal of this ExtensionArray is to clean this up, I would argue this is THE goal).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make a real doc-string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, since there's no parent.

if isinstance(values, ABCSeries):
values = values.values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values._values ?
(is this actually hit?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, via pd.Series(pd.Series(extension_array)). Will make sure we have tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it's hit in DataFrame(series[extension_array]), but I didn't start that test with test_. Fixing now.

elif isinstance(values, ABCIndexClass):
values = values._as_best_array()
return values

@property
def _holder(self):
# For extension blocks, the holder is values-dependent.
Expand Down Expand Up @@ -4101,7 +4118,8 @@ def set(self, item, value, check=False):
# FIXME: refactor, clearly separate broadcasting & zip-like assignment
# can prob also fix the various if tests for sparse/categorical

value_is_extension_type = is_extension_type(value)
value_is_extension_type = (is_extension_type(value) or
is_extension_array_dtype(value))

# categorical/spares/datetimetz
if value_is_extension_type:
Expand Down Expand Up @@ -4834,13 +4852,10 @@ def form_blocks(arrays, names, axes):
if len(items_dict['ExtensionBlock']):

external_blocks = []

for i, _, array in items_dict['ExtensionBlock']:
if isinstance(array, ABCSeries):
array = array.values
# Allow our internal arrays to chose their block type.
block_type = getattr(array, '_block_type', ExtensionBlock)
external_blocks.append(
make_block(array, klass=block_type,
make_block(array, klass=ExtensionBlock,
fastpath=True, placement=[i]))
blocks.extend(external_blocks)

Expand Down
Loading