
Array Interface and Categorical internals Refactor #19268

Merged

Commits (43):
2ef5216
REF: Define extension base classes
TomAugspurger Jan 15, 2018
57e8b0f
Updated for comments
TomAugspurger Jan 18, 2018
01bd42f
Remove metaclasses from PeriodDtype and IntervalDtype
TomAugspurger Jan 18, 2018
ce81706
Fixup form_blocks rebase
TomAugspurger Jan 18, 2018
87a70e3
Restore concat casting cat -> object
TomAugspurger Jan 18, 2018
8c61886
Remove _slice, clarify semantics around __getitem__
TomAugspurger Jan 19, 2018
cb41803
Document and use take.
TomAugspurger Jan 19, 2018
65d5a61
Clarify type, kind, init
TomAugspurger Jan 19, 2018
57c749b
Remove base
TomAugspurger Jan 19, 2018
6736b0f
API: Remove unused __iter__ and get_values
TomAugspurger Jan 21, 2018
e4acb59
API: Implement repr and str
TomAugspurger Jan 21, 2018
0e9337b
Merge remote-tracking branch 'upstream/master' into pandas-array-inte…
TomAugspurger Jan 26, 2018
df68f3b
Remove default value_counts for now
TomAugspurger Jan 26, 2018
2746a43
Fixed merge conflicts
TomAugspurger Jan 27, 2018
34d2b99
Remove implementation of construct_from_string
TomAugspurger Jan 27, 2018
a484d61
Example implementation of take
TomAugspurger Jan 27, 2018
04b2e72
Cleanup ExtensionBlock
TomAugspurger Jan 27, 2018
df0fa12
Merge remote-tracking branch 'upstream/master' into pandas-array-inte…
TomAugspurger Jan 27, 2018
e778053
Pass through ndim
TomAugspurger Jan 27, 2018
d15a722
Use series._values
TomAugspurger Jan 27, 2018
b5f736d
Removed repr, updated take doc
TomAugspurger Jan 27, 2018
240e8f6
Various cleanups
TomAugspurger Jan 28, 2018
f9b0b49
Handle get_values, to_dense, is_view
TomAugspurger Jan 28, 2018
7913186
Docs
TomAugspurger Jan 30, 2018
df18c3b
Remove is_extension, is_bool
TomAugspurger Jan 30, 2018
ab2f045
Sparse formatter
TomAugspurger Jan 30, 2018
520876f
Revert "Sparse formatter"
TomAugspurger Jan 30, 2018
4dfa39c
Unbox SparseSeries
TomAugspurger Jan 30, 2018
e252103
Added test for sparse consolidation
TomAugspurger Jan 30, 2018
7110b2a
Docs
TomAugspurger Jan 30, 2018
c59dca0
Merge remote-tracking branch 'upstream/master' into pandas-array-inte…
TomAugspurger Jan 31, 2018
fc688a5
Moved to errors
TomAugspurger Jan 31, 2018
fbc8466
Handle classmethods, properties
TomAugspurger Jan 31, 2018
030bb19
Use our AbstractMethodError
TomAugspurger Jan 31, 2018
0f4c2d7
Lint
TomAugspurger Jan 31, 2018
f9316e0
Cleanup
TomAugspurger Feb 1, 2018
9c06b13
Move ndim validation to a method.
TomAugspurger Feb 1, 2018
7d2cf9c
Try this
TomAugspurger Feb 1, 2018
afae8ae
Make ExtensionBlock._holder a property
TomAugspurger Feb 1, 2018
cd0997e
Make _holder a property for all
TomAugspurger Feb 1, 2018
1d6eb04
Refactored validate_ndim
TomAugspurger Feb 1, 2018
92aed49
fixup! Refactored validate_ndim
TomAugspurger Feb 1, 2018
34134f2
lint
TomAugspurger Feb 1, 2018
1 change: 1 addition & 0 deletions pandas/core/arrays/__init__.py
@@ -1 +1,2 @@
from .base import ExtensionArray # noqa
from .categorical import Categorical # noqa
194 changes: 194 additions & 0 deletions pandas/core/arrays/base.py
@@ -0,0 +1,194 @@
"""An interface for extending pandas with custom arrays."""
Member:

I'd advocate leaving base.py open for (near-)future usage as pandas-internal base and putting the "use this if you want to write your own" file in e.g. extension.py

Contributor Author:

I have a slight preference for base.py since it's a base class for all extension arrays. I don't think that having ExtensionArray in arrays.base precludes having a pandas-internal base there as well.

Member:

It will be publicly exposed through pd.api.extensions anyway I think

Contributor Author:

Adding stuff to the public API is waiting on #19304

import abc

import numpy as np

from pandas.compat import add_metaclass


_not_implemented_message = "{} does not implement {}."


@add_metaclass(abc.ABCMeta)
class ExtensionArray(object):
Member:

Are there any expected requirements for the constructor __init__?

Contributor Author (TomAugspurger, Jan 18, 2018):

Yeah, we should figure out what those are and document them. At the very least, we expect ExtensionArray(extension_array) to work correctly. I'll look for other assumptions we make. Or that could be pushed to another classmethod.

Contributor Author:

We also expect that ExtensionArray(), with no arguments, works so that subclasses don't have to implement construct_from_string.

Rather than imposing that on subclasses, we could require some kind of .empty alternative constructor.
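The explicit `.empty` alternative floated above could look like the following minimal sketch (all class and method names here are hypothetical, not part of the PR):

```python
import abc


class ExtensionArray(abc.ABC):
    """Stand-in for the base class under discussion."""

    @classmethod
    def empty(cls):
        # Explicit alternative to requiring that ``cls()`` work with
        # no arguments: subclasses that cannot support a zero-argument
        # constructor override this classmethod instead.
        return cls([])


class MyArray(ExtensionArray):
    def __init__(self, values):
        self.values = list(values)

    def __len__(self):
        return len(self.values)


arr = MyArray.empty()  # a length-zero MyArray
```

This keeps the constructor contract out of `__init__` entirely, which is the design concern raised in the thread.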

"""Abstract base class for custom array types

Notes
-----
pandas will recognize instances of this class as proper arrays
with a custom type and will not attempt to coerce them to objects.
Contributor:

needs much more detail about what is expected here (or at least some examples), e.g. 1-D array-like or whatever

Member:

I think we can leave general docs until this is actually working (the above sentence is at the moment not yet true), which will only be in follow-up PRs

Contributor Author:

Added a bit about 1-D and some high-level examples.


**Restrictions on your class constructor**

* Your class should be able to be constructed with no arguments,
i.e. ``ExtensionArray()`` returns an instance.
TODO: See comment in ``ExtensionDtype.construct_from_string``
Member:

What exactly is ExtensionArray() supposed to return? A length zero array? This should be clarified.

I don't understand the comment about construct_from_string. construct_from_string currently returns an instance of the dtype object, not an array.

Also, in my opinion it is somewhat poor API design for an empty constructor to automatically correspond to a certain result. I would rather have an explicit empty() class method.

Contributor Author:

Ignore my comments about ExtensionArray() . That was apparently vestigial from an earlier version of construct_from_string. Agreed that this would have been an awful design decision to force on subclasses :)

* Your class should be able to be constructed with instances of
our class, i.e. ``ExtensionArray(extension_array)`` should return
Member:

Should "our class" be "your class" ? Or should it be able to handle any ExtensionArray subclass (the first would be better IMO)

an instance.
"""
# ------------------------------------------------------------------------
# Must be a Sequence
# ------------------------------------------------------------------------
@abc.abstractmethod
def __getitem__(self, item):
# type: (Any) -> Any
"""Select a subset of self

Notes
-----
As a sequence, __getitem__ should expect integer or slice ``key``.
Member:

also boolean mask?


For slice ``key``, you should return an instance of yourself, even
if the slice is length 0 or 1.

For scalar ``key``, you may return a scalar suitable for your type.
The scalar need not be an instance or subclass of your array type.
Member:

Is "need not be" enough? (compared to "should not be")
I mean, we won't run into problems in the internals in pandas by seeing arrays where we expect scalars?

Contributor Author:

I'll clarify this to say

    For scalar ``item``, you should return a scalar value suitable
    for your type. This should be an instance of ``self.dtype.type``.

My earlier phrasing was to explain that the return value for scalars needn't be the type of item that's actually stored in your array. E.g. for my IPAddress example, the array holds two uint64s, but a scalar slice returns an ipaddress.IPv4Address instance.
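The IPAddress behavior described here can be sketched as follows. This toy class stores each address as a single uint32 (the real pandas-ip layout uses two uint64s), so the storage details are illustrative only:

```python
import ipaddress

import numpy as np


class IPArray:
    """Toy array whose storage type differs from its scalar type."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype="uint32")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            # Scalar key: return an instance of the logical scalar
            # type, not of the underlying storage type.
            return ipaddress.IPv4Address(int(self.data[item]))
        # Slice (or mask) key: return an instance of the array type,
        # even if the result has length 0 or 1.
        return type(self)(self.data[item])


arr = IPArray([0, 3232235777])   # 3232235777 == 192.168.1.1
scalar = arr[1]                  # an ipaddress.IPv4Address
sliced = arr[0:1]                # an IPArray of length 1
```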

"""

def __setitem__(self, key, value):
# type: (Any, Any) -> None
Contributor:

we already use AbstractMethodError elsewhere in the code base, use instead

Contributor:

setitem must be defined (it certainly does not need to actually set inplace), but since we use it, it must appear as a mutable object

Member:

The point of having it here as NotImplementedError is because an ExtensionArray does not necessarily need to support setting elements (being mutable), at least that's one possible decision (leave the decision up to the extension author).
The error will then just bubble up to the user if he/she tries to set something.

Contributor Author:

Yep, I don't think we can assume that all extension arrays will implement it.

raise NotImplementedError(_not_implemented_message.format(
type(self), '__setitem__')
)

@abc.abstractmethod
def __iter__(self):
# type: () -> Iterator
pass

@abc.abstractmethod
def __len__(self):
# type: () -> int
Contributor:

document

pass

# ------------------------------------------------------------------------
# Required attributes
# ------------------------------------------------------------------------
@property
def base(self):
"""The base array I am a view of. None by default."""
Member:

Can you give an example here?

Perhaps it would also help to explain how is this used by pandas?

Contributor Author:

It's used in Block.is_view, which AFAICT is only used for chained assignment?

If that's correct, then I think we're OK with saying this purely for compatibility with NumPy arrays, and has no effect. I've currently defined ExtensionArray.is_view to always be False, so I don't even make use of it in the changes so far.

Member:

If that is the case, I would remove this for now (we can always later extend the interface if it turns out to be needed for something).

However, was just wondering: your ExtensionArray could be a view on another ExtensionArray (eg by slicing). Is this something we need to consider?

Member:

I would also remove this. NumPy doesn't always maintain this properly, so it can't actually be essential.


@property
@abc.abstractmethod
def dtype(self):
# type: () -> ExtensionDtype
Contributor:

examples and document

Contributor Author:

What would an example even look like here? Defining an ExtensionDtype in the docstring?

"""An instance of 'ExtensionDtype'."""

@property
def shape(self):
# type: () -> Tuple[int, ...]
return (len(self),)

@property
def ndim(self):
# type: () -> int
"""Extension Arrays are only allowed to be 1-dimensional."""
return 1
Contributor:

this should be tested on registration of the sub-type

Contributor Author:

What do you mean "registration"? We could override ABC.register, but I don't think there's an (easy) way to validate this if they just subclass ExtensionArray.

If people want to mess with this, that's fine, their stuff just won't work with pandas.

Contributor:

what I mean is that when you register things, we should actually test that the interface is respected. If we had final methods this would not be necessary, but if someone overrides ndim this is a problem.
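One way to catch such an override at class-definition time is Python 3's `__init_subclass__` hook (the PR itself still supported Python 2, so it could not rely on this; `ndim` is a plain class attribute here rather than a property, purely to keep the sketch short):

```python
class ExtensionArray:
    """Sketch of validating the interface when a subclass is defined."""

    ndim = 1

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Reject subclasses that tamper with the 1-D requirement as
        # soon as the class statement executes.
        if cls.ndim != 1:
            raise TypeError(
                "ExtensionArray subclasses must be 1-dimensional, "
                "got ndim={}".format(cls.ndim))


class GoodArray(ExtensionArray):  # inherits ndim == 1, accepted
    pass


try:
    class BadArray(ExtensionArray):
        ndim = 2                  # rejected when the class is created
except TypeError as exc:
    print(exc)
```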


@property
@abc.abstractmethod
def nbytes(self):
# type: () -> int
Member:

Type comments come before the docstring: http://mypy.readthedocs.io/en/latest/python2.html

"""The number of bytes needed to store this object in memory."""
Member:

What should a user do if this is expensive or otherwise difficult to calculate properly? For example, if it's a numpy array with dtype=object.

We should probably note that it's OK for this to be an approximate answer (a lower bound) on the number of required bytes, and consider adding another method for the benefit of memory_usage().
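For instance, the difference between a cheap lower bound and a deep estimate for object-backed storage might look like this (`approx_nbytes` is a hypothetical helper, not part of the PR):

```python
import sys

import numpy as np


def approx_nbytes(arr):
    """Return (lower_bound, deep_estimate) for an ndarray's memory use.

    For object arrays, ``arr.nbytes`` counts only the pointers; the
    deep estimate adds the payload sizes, and is still approximate
    (shared objects are counted once per reference).
    """
    shallow = arr.nbytes
    if arr.dtype == object:
        deep = shallow + sum(sys.getsizeof(x) for x in arr.ravel())
        return shallow, deep
    return shallow, shallow


ints = np.arange(4, dtype="int64")
objs = np.array(["a", "bb"], dtype=object)
print(approx_nbytes(ints))  # shallow == deep for fixed-width dtypes
print(approx_nbytes(objs))  # deep estimate exceeds the pointer count
```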


# ------------------------------------------------------------------------
# Additional Methods
# ------------------------------------------------------------------------
@abc.abstractmethod
def isna(self):
# type: () -> np.ndarray
"""Boolean NumPy array indicating if each value is missing."""
Contributor:

same length as self


# ------------------------------------------------------------------------
# Indexing methods
# ------------------------------------------------------------------------
@abc.abstractmethod
def take(self, indexer, allow_fill=True, fill_value=None):
# type: (Sequence, bool, Optional[Any]) -> ExtensionArray
"""For slicing"""
Member:

This should clarify what valid values of indexer are. Does -1 indicate a fill value?
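One plausible reading of those semantics, sketched as a plain NumPy helper (`take` here is a hypothetical free function illustrating the -1-means-fill convention, not pandas' implementation):

```python
import numpy as np


def take(values, indexer, allow_fill=True, fill_value=np.nan):
    """When ``allow_fill`` is True, an index of -1 means "insert
    ``fill_value``"; otherwise -1 is ordinary end-relative indexing."""
    indexer = np.asarray(indexer)
    if not allow_fill:
        return values.take(indexer)
    # Temporarily map -1 to a valid position, then overwrite with the
    # fill value (float result so NaN can be represented).
    result = values.take(np.where(indexer == -1, 0, indexer)).astype(float)
    result[indexer == -1] = fill_value
    return result


data = np.array([10, 20, 30])
print(take(data, [0, -1, 2]))                    # [10., nan, 30.]
print(take(data, [0, -1, 2], allow_fill=False))  # [10, 30, 30]
```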


@abc.abstractmethod
def copy(self, deep=False):
# type: (bool) -> ExtensionArray
"""Return a copy of the array."""
Contributor:

document.


# ------------------------------------------------------------------------
# Block-related methods
# ------------------------------------------------------------------------
@property
def _fill_value(self):
Contributor:

need to list this in the very top doc-string

# type: () -> Any
"""The missing value for this type, e.g. np.nan"""
return None

@abc.abstractmethod
def _formatting_values(self):
# type: () -> np.ndarray
# At the moment, this has to be an array since we use result.dtype
"""An array of values to be printed in, e.g. the Series repr"""

@classmethod
@abc.abstractmethod
def _concat_same_type(cls, to_concat):
# type: (Sequence[ExtensionArray]) -> ExtensionArray
"""Concatenate multiple arrays

Parameters
----------
to_concat : sequence of this type

Returns
-------
ExtensionArray
"""

@abc.abstractmethod
def get_values(self):
# type: () -> np.ndarray
"""A NumPy array representing your data.
Member:

Let's clarify:

  • This is used for Series.values, but otherwise is not used internally by pandas (This is true, right?).
  • If I've implemented a custom scalar type, should this be an object array of containing all elements of that type? Or should I prefer something possibly lossy but faster? e.g., for IntervalArray, should I return an object array of Interval objects, or a structured array with two elements? It's OK if both options are fine depending on the circumstances, let's just give a little guidance.

Contributor Author (TomAugspurger, Jan 19, 2018):

This is not used by Series.values. Series.values currently returns the array itself, e.g. pd.Series(categorical).values is the Categorical. get_values seems to be used in

  1. indexing, via _values_from_object, which seems undesirable
  2. nanops, via _values_from_object, again undesirable

So let's hold off on discussing this until I can understand it better. We could potentially remove it. And if extension arrays are iterable, then we could just as well use np.array(extension_array, dtype='object') whenever we absolutely need an ndarray of objects.
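That fallback can be sketched with a hypothetical iterable array (`DecimalArray` is invented for illustration; only `__len__` and `__iter__` are assumed):

```python
from decimal import Decimal

import numpy as np


class DecimalArray:
    """Extension-style array that only promises to be iterable;
    iterating it gives the object-ndarray fallback discussed above
    without a dedicated get_values method."""

    def __init__(self, values):
        self.values = list(values)

    def __len__(self):
        return len(self.values)

    def __iter__(self):
        return iter(self.values)


arr = DecimalArray([Decimal("1.5"), Decimal("2.5")])
objs = np.array(list(arr), dtype=object)  # 1-D object-dtype fallback
print(objs.dtype, objs.shape)
```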

Member:

It is also used for eg concatting that ends up in object dtype, eg in case of concatting different types (a rather corner case).
And it is also similar to Series.get_values.

I think in general this is the "fallback numpy array" in case we need one and we need to be able to infer what to do with it (so typically the object array of custom scalars).

It could be an option to use this to enable (or enable good error message) operations. For example that those values are used when the user calls for the sum. And if you then have an object array of custom scalars, this will be handled gracefully.
Although (thinking while I am writing), this is in general also not ideal as we would want: a) the ability to let pandas operations dispatch to efficient implementations of the array itself (eg in hypothetical case you have a sum function on your custom array that does not need to construct object array of scalars) and b) the ability to disable numerical operations without the need to crystallize a full object array (as eg in geopandas this is very costly)

Contributor Author:

I think in general this is the "fallback numpy array" in case we need one and we need to be able to infer what to do with it (so typically the object array of custom scalars).

Yep. That sounds right.

let pandas operations dispatch to efficient implementations of the array itself [...] and b) the ability to disable numerical operations without the need to crystallize a full object array

Agreed on both accounts, though I think getting reductions, cumulative ops, and numeric ops to all work properly on extension arrays will require some careful thought and quite a bit of work. At the moment, I'm hoping to just avoid breaking things like df.mean() when you have a mix of numeric and extension columns.

Contributor Author:

See #19318 for notes about get_values being called during indexing, specifically Series.__getitem__. I think it can be avoided.

"""

def _can_hold_na(self):
Contributor:

need to list this in the very top doc-string

# type: () -> bool
"""Whether your array can hold missing values. True by default.

Notes
-----
Setting this to false will optimize some operations like fillna.
"""
return True

def _slice(self, slicer):
# type: (Union[tuple, Sequence, int]) -> 'ExtensionArray'
"""Return a new array sliced by `slicer`.

Parameters
----------
slicer : slice or np.ndarray
If an array, it should just be a boolean mask

Returns
-------
array : ExtensionArray
Should return an ExtensionArray, even if ``self[slicer]``
would return a scalar.
"""
return type(self)(self[slicer])
Member:

This default implementation is likely to fail for some obvious implementations. Perhaps we can have a constructor method _from_scalar() instead that converts a scalar into a length 1 array?

Contributor Author:

Let me see if I can verify that this is always called with a slice object. In that case, __getitem__ will return an ExtensionArray, and we don't have to worry about the scalar case. Unless I'm missing something.

Member:

Yes, I would try to get rid of this if possible, and just ask that __getitem__ can deal with this (of course, alternative is to add separate methods for different __getitem__ functionalities like _slice, but then also _mask, but I don't really see the advantage of this).


def value_counts(self, dropna=True):
"""Optional method for computing the histogram of the counts.

Parameters
----------
dropna : bool, default True
whether to exclude missing values from the computation

Returns
-------
counts : Series
"""
from pandas.core.algorithms import value_counts
mask = ~np.asarray(self.isna())
values = self[mask] # XXX: this imposes boolean indexing
Member:

We should have a dedicated method for boolean indexing or document this as part of the expected interface for __getitem__.

return value_counts(np.asarray(values), dropna=dropna)
19 changes: 18 additions & 1 deletion pandas/core/arrays/categorical.py
@@ -44,6 +44,8 @@
from pandas.util._validators import validate_bool_kwarg
from pandas.core.config import get_option

from .base import ExtensionArray


def _cat_compare_op(op):
def f(self, other):
@@ -149,7 +151,7 @@ def _maybe_to_categorical(array):
"""


class Categorical(PandasObject):
class Categorical(ExtensionArray, PandasObject):
Member:

By having our internal arrays inherit from PandasObject, they also get a _constructor method. So we should either make sure this is never used (apart from in methods inside the array itself), or add this to the interface (my preference would be the first)

Contributor:

yeah, the methods in PandasObject need to be ABC in the ExtensionArray

Member:

On the other hand, I don't think all methods/attributes of PandasObject should be added to the public ExtensionArray (to keep those internal + to not clutter the ExtensionArray API)

Contributor Author:

FYI, I'm consistently testing these changes against

  1. An implementation of IntervalArray: https://github.com/TomAugspurger/pandas/compare/pandas-array-interface-3...TomAugspurger:pandas-array-upstream+interval?expand=1
  2. A branch on pandas-ip: https://github.com/ContinuumIO/pandas-ip/tree/pandas-array-upstream-compat

Neither inherit from PandasObject at the moment, so we're OK.

"""
Represents a categorical variable in classic R / S-plus fashion

@@ -2131,6 +2133,21 @@ def repeat(self, repeats, *args, **kwargs):
return self._constructor(values=codes, categories=self.categories,
ordered=self.ordered, fastpath=True)

# Interface things
# can_hold_na, concat_same_type, formatting_values
@property
def _can_hold_na(self):
return True

@classmethod
def _concat_same_type(self, to_concat):
from pandas.core.dtypes.concat import _concat_categorical

return _concat_categorical(to_concat)

def _formatting_values(self):
return self

# The Series.cat accessor


101 changes: 101 additions & 0 deletions pandas/core/dtypes/base.py
@@ -0,0 +1,101 @@
"""Extend pandas with custom array types"""
import abc

from pandas.compat import add_metaclass


@add_metaclass(abc.ABCMeta)
class ExtensionDtype(object):
"""A custom data type for your array.
"""
@property
@abc.abstractmethod
def type(self):
# type: () -> type
"""The scalar type for your array, e.g. ``int`` or ``object``."""
Member:

Let's not encourage using object for user defined types :)


@property
def kind(self):
# type: () -> str
"""A character code (one of 'biufcmMOSUV'), default 'O'
Member:

This should clarify how it's used. How is this useful?

Perhaps "This should match dtype.kind when arrays with this dtype are cast to numpy arrays"?


This should match the NumPy dtype used when your array is
converted to an ndarray, which is probably 'O' for object.
Member:

Clarify: (if your data type does not correspond to a built-in numpy dtype)


See Also
--------
numpy.dtype.kind
"""
return 'O'

@property
@abc.abstractmethod
def name(self):
# type: () -> str
"""A string identifying the data type.

Will be used for display in, e.g. ``Series.dtype``
"""

@property
def names(self):
# type: () -> Optional[List[str]]
Contributor:

what is this for?

Member:

Numpy has this for structured dtypes. Not sure if this is needed in the interface though (I don't think we will use it internally, on the other hand this is compatible with numpy)

Contributor Author:

We use it in DataFrame.__init__

        elif isinstance(data, (np.ndarray, Series, Index)):
            if data.dtype.names:

I could of course modify that, but perhaps as a followup. For now having a non-abstract names property seems OK.

Contributor:

add an example here (or maybe more explanation)

"""Ordered list of field names, or None if there are no fields."""
return None

@classmethod
def construct_from_string(cls, string):
"""Attempt to construct this type from a string.

Parameters
----------
string : str

Returns
-------
self : instance of 'cls'

Raises
------
TypeError

Notes
-----
The default implementation checks if 'string' matches your
type's name. If so, it calls your class with no arguments.
"""
if string == cls.name:
# XXX: Better to mandate a ``.from_empty`` classmethod
# rather than imposing this on the constructor?
return cls()
Member:

At the very least, this requirement for the constructor should be documented.

Member:

We still need to document this.

Contributor Author:

Perhaps it's best to just remove the default implementation and make it abstract? I'll document current implementation as a possible default.

else:
raise TypeError("Cannot construct a '{}' from "
"'{}'".format(cls, string))
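For a parametrized dtype, the name-equality default is too strict: the parameters have to be parsed out of the string, which is why making the method abstract (with the current body documented as a possible default) is attractive. A hypothetical example (`MyIntDtype` and its `myint[N]` syntax are invented for illustration):

```python
import re


class MyIntDtype:
    """Hypothetical parametrized dtype: 'myint[8]', 'myint[16]', ..."""

    def __init__(self, size=8):
        self.size = size

    @property
    def name(self):
        return "myint[{}]".format(self.size)

    @classmethod
    def construct_from_string(cls, string):
        # Recover the parameter from the string form; the default
        # name-equality check could never do this.
        match = re.match(r"^myint\[(\d+)\]$", string)
        if match:
            return cls(size=int(match.group(1)))
        raise TypeError("Cannot construct a '{}' from "
                        "'{}'".format(cls.__name__, string))


dtype = MyIntDtype.construct_from_string("myint[16]")
print(dtype.name)  # myint[16]
```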

@classmethod
def is_dtype(cls, dtype):
"""Check if we match 'dtype'

Parameters
----------
dtype : str or dtype

Returns
-------
is_dtype : bool

Notes
-----
The default implementation is True if

1. 'dtype' is a string that returns true for
Member:

"returns true" -> "does not raise" ?

``cls.construct_from_string``
2. 'dtype' is ``cls`` or a subclass of ``cls``.
"""
if isinstance(dtype, str):
try:
return isinstance(cls.construct_from_string(dtype), cls)
except TypeError:
return False
else:
return issubclass(dtype, cls)
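The default described in the docstring can be sketched standalone as follows, with one defensive tweak labeled in the comments: the `issubclass` call is guarded so that non-class inputs return False instead of raising (the diff above would raise TypeError for, say, an instance):

```python
class ExtensionDtype:
    """Sketch of the is_dtype default: strings go through
    construct_from_string, everything else is a subclass check."""

    name = "mytype"

    @classmethod
    def construct_from_string(cls, string):
        if string == cls.name:
            return cls()
        raise TypeError("Cannot construct a '{}' from "
                        "'{}'".format(cls, string))

    @classmethod
    def is_dtype(cls, dtype):
        if isinstance(dtype, str):
            try:
                return isinstance(cls.construct_from_string(dtype), cls)
            except TypeError:
                return False
        # Guard added for this sketch: issubclass raises on non-classes.
        return isinstance(dtype, type) and issubclass(dtype, cls)


print(ExtensionDtype.is_dtype("mytype"))        # True
print(ExtensionDtype.is_dtype("other"))         # False
print(ExtensionDtype.is_dtype(ExtensionDtype))  # True
```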
32 changes: 32 additions & 0 deletions pandas/core/dtypes/common.py
@@ -1685,6 +1685,38 @@ def is_extension_type(arr):
return False


def is_extension_array_dtype(arr_or_dtype):
"""Check if an object is a pandas extension array type

Parameters
----------
arr_or_dtype : object

Returns
-------
bool

Notes
-----
This checks whether an object implements the pandas extension
array interface. In pandas, this includes:

* Categorical
* PeriodArray
* IntervalArray
* SparseArray

Third-party libraries may implement arrays or types satisfying
this interface as well.
"""
from pandas.core.arrays import ExtensionArray

# we want to unpack series, anything else?
if isinstance(arr_or_dtype, ABCSeries):
arr_or_dtype = arr_or_dtype.values
Member:

This will only work if .values will return such a PeriodArray or IntervalArray, and I am not sure we already decided on that?

Contributor Author:

This is Series.values, what else would it return? An object-typed NumPy array? I think the ship has sailed on Series.values always being a NumPy array.

Member:

Let's use Series._values for now? That gets the values of the block, and is certainly an ExtensionArray in case the series holds one.
Then we can postpone the decision on what .values returns?

return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))
Contributor:

you need to call _get_dtype_type here, this can only have a result of ExtensionDtype and NOT ExtensionArray, which doesn't make any sense.

Member:

If you pass this function an ExtensionArray subclass, you will get that here?

Contributor Author:

this can only have a result of ExtensionDtype and NOT ExtensionArray, which doesn't make any sense.

The result is just True or False. The argument can be either an array or dtype.

I'm not sure that _get_dtype_or_type does what we want here. That grabs arr.dtype.type, which is a scalar type like str or Interval or ipaddress.IPv4Address. What would we do with that?

Contributor:

This is not following what we do in all other cases, and that's my point. Please use _get_dtype_or_type.

Contributor Author (@TomAugspurger, Feb 1, 2018):

this is not following what we do in all other cases that's my point.

That won't work unfortunately.

In [1]: import pandas as pd; import pandas_ip as ip

In [2]: arr = ip.IPAddress([1, 2, 3])

In [3]: pd.core.dtypes.common._get_dtype_type(arr)
Out[3]: pandas_ip.block.IPv4v6Base

IPv4v6Base isn't an instance of ExtensionType. It's the type scalars belong to.

In [4]: isinstance(arr[0], ip.block.IPv4v6Base)
Out[4]: True

In [5]: issubclass(ip.block.IPv4v6Base, pd.core.dtypes.base.ExtensionDtype)
Out[5]: False

_get_dtype_or_type works for our extension types, since if we get a CategoricalDtypeType we can say "this is a categorical". But we can't do that for arbitrary 3rd party types.

Contributor:

then _get_dtype_or_type needs adjustment. This is the point of compatibility, there shouldn't be the need to have special cases.

Member:

But that would make get_dtype_or_type inconsistent, as it would no longer always return a dtype type, but in certain cases a dtype.

Contributor Author:

_get_dtype_type does exactly what it's supposed to do, values.dtype.type. But that's not useful here!

What's the issue with the function as defined? I need a way to tell if an array or dtype is an ExtensionArray or ExtensionDtype. Someday, when Categorical, SparseArray, IntervalArray, PeriodArray, datetimetz, etc are extension arrays then all the special cases currently in _get_dtype_type and friends can be removed, but we aren't there yet. We're doing things in small steps.
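For reference, the check under discussion can be exercised in isolation with stub classes standing in for pandas' `ExtensionDtype`, `ExtensionArray`, and `Series` (all names below are stand-ins for the real pandas objects, and `IPArray` is a hypothetical third-party array in the spirit of the pandas_ip example):

```python
class ExtensionDtype:
    """Stub for pandas.core.dtypes.base.ExtensionDtype."""


class ExtensionArray:
    """Stub for pandas.core.arrays.ExtensionArray."""


class Series:
    """Stub: only the .values attribute matters for this check."""
    def __init__(self, values):
        self.values = values


def is_extension_array_dtype(arr_or_dtype):
    # Unpack a Series to its underlying values, then check whether
    # what remains is an extension array or an extension dtype.
    if isinstance(arr_or_dtype, Series):
        arr_or_dtype = arr_or_dtype.values
    return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))


class IPArray(ExtensionArray):
    """Hypothetical third-party extension array."""


print(is_extension_array_dtype(IPArray()))          # True
print(is_extension_array_dtype(Series(IPArray())))  # True
print(is_extension_array_dtype([1, 2, 3]))          # False
```

This also illustrates the point made above: the check is a plain isinstance test against the two base classes, so it works for arbitrary third-party subclasses without any per-type special casing.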



def is_complex_dtype(arr_or_dtype):
"""
Check whether the provided array or dtype is of a complex dtype.