Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array Interface and Categorical internals Refactor #19268

Merged

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Jan 16, 2018

(edit post categorical-move)

Rebased on master. Summary of the changes from master:

Added the ExtensionArray class
Categorical subclasses ExtensionArray

Implements the new methods for the interface (all private. No public API
changes)

Adapted the ExtensionDtype class to be the public ABC
a. Subclass that with PandasExtensionClass that does non-interface things
like reprs, caching, etc.
b. All our custom dtypes inherit from PandasExtensionClass, so they implement
the interface.
Internals Changes:
a. Added an ExtensionBlock. This will be a parent for our current custom
blocks, and the block type for all 3rd-party extension arrays.
Added a new is_extension_array_dtype method. I think this is nescessary
for now, until we've handled DatetimeTZ.

This isn't really a test of whether extension arrays work yet, since we're
still using Categorical for everything. I have a followup PR that implements
an IntervalArray that requires additional changes to, e.g., the constructors
so that things work. But all the changes from core/internals.py required to
make that work are present here.


  1. New class hierarchy in internals

Old:

class CategoricalBlock(NonConsolidatableMixin, ObjectBlock):
    pass

new:

class ExtensionBlock(NonConsolidatableMixin, Block):
   pass

class CategoricalBlock(ExtensionBlock):
    pass

Figuring out which methods of ObjectBlock were required on CategoricalBlock
wasn't trivial for me. I probably messed some up.

I think that eventually we can remove NonConsolidatableMixin, with the idea
that all non-consolidatable blocks are blocks for extension dtypes? That's true
today anyway.

Followup PRs:

  1. Making core/arrays/period.py and refactoring PeriodIndex
  2. Making core/arrays/interval.py and refactoring IntervalIndex
  3. Adding docs and generic tests like https://github.com/pandas-dev/pandas/pull/19174/files#diff-e448fe09dbe8aed468d89a4c90e65cff for our interface (once it's stabilized a bit).

@TomAugspurger TomAugspurger added Enhancement Internals Related to non-user accessible pandas implementation labels Jan 16, 2018
@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Jan 16, 2018
@pep8speaks
Copy link

pep8speaks commented Jan 16, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 01, 2018 at 20:55 Hours UTC

@TomAugspurger TomAugspurger changed the title Pandas array interface 3 Array Interface and Categorical internals Refactor Jan 16, 2018
Should return an ExtensionArray, even if ``self[slicer]``
would return a scalar.
"""
# XXX: We could get rid of this *if* we require that
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can almost get rid of _slice with a default implementation noted in the comments, but I think dimensionality reduction from array -> scalar could break things.

Copy link
Member

@jorisvandenbossche jorisvandenbossche Jan 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming slicer is a slice, in what cases can it give a scalar?
In principle this could just be handled in getitem ?

Copy link
Contributor Author

@TomAugspurger TomAugspurger Jan 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slice(0) or slice(0, 1) hit` this.

Ahh, but I didn't realize that NumPy always returned arrays from slices:

In [5]: np.array([0])[slice(0, 1)]
Out[5]: array([0])

So if we assume that as well, then we could get rid of _slice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we can assume that if __getitem__ gets a slice object, it should return an instance of itself.

@property
def type(self):
"""Typically a metaclass inheriting from 'type' with no methods."""
return type(self.name, (), {})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really sure what array.dtype.type is used for. This passes the test suite, but may break things like

array1.dtype.type is array2.dtype.type

since the object IDs will be different (I think).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In NumPy, dtype.type is the corresponding scalar type, e.g.,

>>> np.dtype(np.float64).type
numpy.float64

I don't know where "Typically a metaclass inheriting from 'type' with no methods." comes from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense. Would you say that object is a good default here? I'll work through a test array that uses something meaningful like numbers.Real or int. We do use values.dtype.type in a few places, like figuring out which block type to use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's a good default value here. Generally the right choice is to return the corresponding scalar type, e.g., Interval for IntervalDtype.

@@ -108,14 +109,15 @@ class Block(PandasObject):
def __init__(self, values, placement, ndim=None, fastpath=False):
if ndim is None:
ndim = values.ndim
elif values.ndim != ndim:
elif self._validate_ndim and values.ndim != ndim:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback would be curious to hear your thoughts here. Needed this so that ExtensionBlock could inherit from Block and call super().__init__. We could avoid this by setting .values, .mgr_locs, .ndim directly in ExtensionBlock.__init__, but I think it's best practice to always call your parent's init.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, see my above comment

# Placement must be converted to BlockPlacement so that we can check
# its length
if not isinstance(placement, BlockPlacement):
placement = BlockPlacement(placement)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it much clearer to just copy these two lines rather than calling mgr_locs.setter, but can revert to that if desired.

@@ -2360,23 +2443,13 @@ def is_view(self):
def to_dense(self):
return self.values.to_dense().view()

def convert(self, copy=True, **kwargs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the ExtensionBlock now.

@property
def array_dtype(self):
""" the dtype to return if I want to construct this block as an
array
"""
return np.object_

def _slice(self, slicer):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ExtensionBlock now.

@@ -2468,7 +2541,8 @@ class DatetimeBlock(DatetimeLikeBlockMixin, Block):
_can_hold_na = True

def __init__(self, values, placement, fastpath=False, **kwargs):
if values.dtype != _NS_DTYPE:
if values.dtype != _NS_DTYPE and values.dtype.base != _NS_DTYPE:
# not datetime64 or datetime64tz
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was so NonConsolidatableMixin could call __init__. I think it's harmless.


# Methods we can (probably) ignore and just use Block's:

# * replace / replace_single
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any things on this @jreback? CategoricalBlock.replace used to be ObjectBlock.replace, but now it's just Block.replace, which is much simpler.

values = values.reshape((1,) + values.shape)
return values

def _can_hold_element(self, element):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And @jreback I'll defer to you on this one, since you know more about when this is called. I don't think we can just do return isinstance(element, self._holder), since element may need to be coerced. For an IP address, we want

ser = pd.Series(ip.IPAddress(['192.168.1.1']))
ser[0] = '192.168.1.0'

to work, and (I think) element would just be the str there.

@jorisvandenbossche
Copy link
Member

@TomAugspurger for reviewing, did you need to change a lot to Categorical implementation itself? (because you moved it hard to see ..)

@jreback
Copy link
Contributor

jreback commented Jan 16, 2018

i would much prefer that moves occur separately from changes

@TomAugspurger
Copy link
Contributor Author

I did keep the move in 4b06ae4

All the non-move changes to categorical are in a9e0972#diff-f3b2ea15ba728b55cab4a1acd97d996d

@TomAugspurger
Copy link
Contributor Author

But I can split the move into its own PR as well.

One thing: do we have code for pickle compat on module changes? The old pickle files expect a pandas.core..categorical.Categorical. I threw a shim in there, but would be nice to remove it.

@TomAugspurger
Copy link
Contributor Author

Splitting these PRs now (I'll make a new PR for the move and then rebase this one)

@TomAugspurger
Copy link
Contributor Author

#19269 for the move.

@TomAugspurger TomAugspurger force-pushed the pandas-array-interface-3 branch from c4ff28f to 2ef5216 Compare January 18, 2018 18:24
@codecov
Copy link

codecov bot commented Jan 18, 2018

Codecov Report

Merging #19268 into master will increase coverage by <.01%.
The diff coverage is 76%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19268      +/-   ##
==========================================
+ Coverage   91.62%   91.62%   +<.01%     
==========================================
  Files         150      150              
  Lines       48726    48672      -54     
==========================================
- Hits        44643    44598      -45     
+ Misses       4083     4074       -9
Flag Coverage Δ
#multiple 89.99% <76%> (ø) ⬆️
#single 41.74% <62.85%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/common.py 93.41% <ø> (+0.49%) ⬆️
pandas/core/dtypes/dtypes.py 96.08% <100%> (-0.04%) ⬇️
pandas/core/arrays/__init__.py 100% <100%> (ø) ⬆️
pandas/core/dtypes/common.py 95.37% <100%> (+0.05%) ⬆️
pandas/core/dtypes/base.py 47.61% <47.61%> (ø)
pandas/core/arrays/base.py 56.66% <56.66%> (ø)
pandas/core/arrays/categorical.py 94.74% <66.66%> (-1.04%) ⬇️
pandas/core/internals.py 95.05% <85.39%> (-0.42%) ⬇️
pandas/errors/__init__.py 92.3% <85.71%> (-7.7%) ⬇️
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb3b237...34134f2. Read the comment docs.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jan 18, 2018

Rebased on master. Summary of the changes from master:

  1. Added the ExtensionArray class
  2. Categorical subclasses ExtensionArray
  • Implements the new methods for the interface (all private. No public API
    changes)
  1. Adapted the ExtensionDtype class to be the public ABC
    a. Subclass that with PandasExtensionClass that does non-interface things
    like reprs, caching, etc.
    b. All our custom dtypes inherit from PandasExtensionClass, so they implement
    the interface.
  2. Internals Changes:
    a. Added an ExtensionBlock. This will be a parent for our current custom
    blocks, and the block type for all 3rd-party extension arrays.
  3. Added a new is_extension_array_dtype method. I think this is nescessary
    for now, until we've handled DatetimeTZ.

This isn't really a test of whether extension arrays work yet, since we're
still using CategoricalBlock for everything. I have a followup PR that implements
an IntervalArray that requires additional changes to, e.g., the constructors
so that things work. But all the changes from core/internals.py required to
make that work are present here.

@TomAugspurger
Copy link
Contributor Author

@shoyer, @chris-b1, @wesm if you're interested in the interface, this is the PR it's being defined in, though we may have to make refinements later.

I have followup PRs ready for IntervalArray and PeriodArray that build on this. Docs and such will come in the IntervalArray PR, once we have publicly visible changes.

# ------------------------------------------------------------------------
@property
def base(self):
"""The base array I am a view of. None by default."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give an example here?

Perhaps it would also help to explain how is this used by pandas?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used in Block.is_view, which AFAICT is only used for chained assignment?

If that's correct, then I think we're OK with saying this purely for compatibility with NumPy arrays, and has no effect. I've currently defined ExtensionArray.is_view to always be False, so I don't even make use of it in the changes so far.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is the case, I would remove this for now (we can always later extend the interface if it turns out to be needed for something).

However, was just wondering: your ExtensionArray could be a view on another ExtensionArray (eg by slicing). Is this something we need to consider?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also remove this. NumPy doesn't always maintain this properly, so it can't actually be essential.

@abc.abstractmethod
def take(self, indexer, allow_fill=True, fill_value=None):
# type: (Sequence, bool, Optional[Any]) -> ExtensionArray
"""For slicing"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should clarify what valid values of indexer are. Does -1 indicate a fill value?


def take_nd(self, indexer, allow_fill=True, fill_value=None):
"""For slicing"""
# TODO: this isn't really nescessary for 1-D
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this


@property
def is_sparse(self):
"""Whether your array is sparse. True by default."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: False by default :)

This should clarify what it means to be a sparse array. How does pandas treat sparse arrays differently?

I would consider dropping this if it isn't strictly necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's unnecessary.

Should return an ExtensionArray, even if ``self[slicer]``
would return a scalar.
"""
return type(self)(self[slicer])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This default implementation is likely to fail for some obvious implementations. Perhaps we can have a constructor method _from_scalar() instead that converts a scalar into a length 1 array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me see if I can verify that this is always called with a slice object. In that case, __getitem__ will return an ExtensionArray, and we don't have to worry about the scalar case. Unless I'm missing something.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would try to get rid of this if possible, and just ask that __getitem__ can deal with this (of course, alternative is to add separate methods for different __getitem__ functionalities like _slice, but then also _mask, but I don't really see the advantage of this).



@add_metaclass(abc.ABCMeta)
class ExtensionArray(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any expected requirements for the constructor __init__?

Copy link
Contributor Author

@TomAugspurger TomAugspurger Jan 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we should figure out what those are and document them. At the very least, we expected ExtensionArray(extension_array) to work correctly. I'll look for other assumptions we make. Or that could be pushed to another classmethod.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also expect that ExtensionArray(), with no arguments, works so that subclasses don't have to implement construct_from_string.

Rather than imposing that on subclasses, we could require some kind of .empty alternative constructor.

def dtype(self):
"""An instance of 'ExtensionDtype'."""
# type: () -> ExtensionDtype
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please drop pass from all these methods. It's not needed (docstrings alone suffice).

@abc.abstractmethod
def nbytes(self):
"""The number of bytes needed to store this object in memory."""
# type: () -> int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type comments come before the docstring: http://mypy.readthedocs.io/en/latest/python2.html

@property
def type(self):
"""Typically a metaclass inheriting from 'type' with no methods."""
return type(self.name, (), {})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In NumPy, dtype.type is the corresponding scalar type, e.g.,

>>> np.dtype(np.float64).type
numpy.float64

I don't know where "Typically a metaclass inheriting from 'type' with no methods." comes from.


@property
def kind(self):
"""A character code (one of 'biufcmMOSUV'), default 'O'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should clarify how it's used. How is this useful?

Perhaps "This should match dtype.kind when arrays with this dtype are cast to numpy arrays"?


class ExtensionDtype(object):

class PandasExtensionDtype(ExtensionDtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would this not be just ExtensionDtype

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We define methods like __repr__, caching, etc. that are not part of the interface.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repr is reasonable to be part of the interface, caching I suppose is ok here

@@ -95,6 +96,7 @@ class Block(PandasObject):
is_object = False
is_categorical = False
is_sparse = False
is_extension = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you wouldn't do it like this. rather you would inherit from an ExtensionBlock

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at what these is_ properties are used for, but all the other blocks had them so I added it for consistency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not add things like this until / unless necessary

@@ -108,14 +109,15 @@ class Block(PandasObject):
def __init__(self, values, placement, ndim=None, fastpath=False):
if ndim is None:
ndim = values.ndim
elif values.ndim != ndim:
elif self._validate_ndim and values.ndim != ndim:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, see my above comment

@@ -1821,6 +1824,130 @@ def _unstack(self, unstacker_func, new_columns):
return blocks, mask


class ExtensionBlock(NonConsolidatableMixIn, Block):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should NOT be in this file, make a dedicated namespace. pandas.core.internals.block or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would it not be in this file? All the other blocks are. Note that ExtensionBlock is not ever going to be public, only ExtensionArray and ExtensionDtype.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this should be in this file (if we find internals.py too long we can split it in multiple files, but let's do that in a separate PR to make reviewing here not harder).
@jreback ExtensionBlock will just the Block internally used for our own extension types (a bit like NonConsolidatableBlock is not the base class for those)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is exactly the point, this diff is already way too big. let's do a pre-cursor PR to split Block and manger (IOW internals into 2 files).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By splitting the file first, the diff here will not be smaller

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to test this out with GeoPandas one of the coming days, to give some more feedback

i.e. ``ExtensionArray()`` returns an instance.
TODO: See comment in ``ExtensionDtype.construct_from_string``
* Your class should be able to be constructed with instances of
our class, i.e. ``ExtensionArray(extension_array)`` should returns
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should "our class" be "your class" ? Or should it be able to handle any ExtensionArray subclass (the first would be better IMO)


Notes
-----
As a sequence, __getitem__ should expect integer or slice ``key``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also boolean mask?

if the slice is length 0 or 1.

For scalar ``key``, you may return a scalar suitable for your type.
The scalar need not be an instance or subclass of your array type.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "need not be" enough? (compared to "should not be")
I mean, we won't run into problems in the internals in pandas by seeing arrays where we expect scalars?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll clarify this to say

    For scalar ``item``, you should return a scalar value suitable
    for your type. This should be an instance of ``self.dtype.type``.

My earlier phrasing was to explain that the return value for scalars needn't be the type of item that's actually stored in your array. E.g. for my IPAddress example, the array holds two uint64s, but a scalar slice returns an ipaddress.IPv4Address instance.

# ------------------------------------------------------------------------
@property
def base(self):
"""The base array I am a view of. None by default."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is the case, I would remove this for now (we can always later extend the interface if it turns out to be needed for something).

However, was just wondering: your ExtensionArray could be a view on another ExtensionArray (eg by slicing). Is this something we need to consider?

Should return an ExtensionArray, even if ``self[slicer]``
would return a scalar.
"""
return type(self)(self[slicer])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would try to get rid of this if possible, and just ask that __getitem__ can deal with this (of course, alternative is to add separate methods for different __getitem__ functionalities like _slice, but then also _mask, but I don't really see the advantage of this).

-----
The default implementation is True if

1. 'dtype' is a string that returns true for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"returns true" -> "does not raise" ?


# we want to unpack series, anything else?
if isinstance(arr_or_dtype, ABCSeries):
arr_or_dtype = arr_or_dtype.values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will only work if .values will return such a PeriodArray or IntervalArray, and I am not sure we already decided on that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is Series.values, what else would it return? An object-typed NumPy array? I think the ship has sailed on Series.values always being a NumPy array.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use Series._values for now? That gets the values of the block, and is certainly an ExtensionArray in case the series holds one.
The we can postpone the decision on what .values returns?


Returns
-------
IntervalArray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Block?

@@ -1821,6 +1824,130 @@ def _unstack(self, unstacker_func, new_columns):
return blocks, mask


class ExtensionBlock(NonConsolidatableMixIn, Block):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this should be in this file (if we find internals.py too long we can split it in multiple files, but let's do that in a separate PR to make reviewing here not harder).
@jreback ExtensionBlock will just the Block internally used for our own extension types (a bit like NonConsolidatableBlock is not the base class for those)

# ExtensionArrays must be iterable, so this works.
values = np.asarray(self.values)
if values.ndim == self.ndim - 1:
values = values.reshape((1,) + values.shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? I know it currently like that in NonConsolidatableBlock, but do we ever expect the result to be a 2D array if this is holded in an DataFrame.
Eg datetimetz block returns here a DatetimeIndex for both .values and .get_values(). On the other hand, a categorical block does this reshaping and returns Categorical vs 2d object numpy array.

Further, do we need to do something with dtype arg?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is for sparse. which is a step-child ATM. has to be dealt with

@jreback
Copy link
Contributor

jreback commented Jan 19, 2018

I think that eventually we can remove NonConsolidatableMixin, with the idea
that all non-consolidatable blocks are blocks for extension dtypes? That's true
today anyway.

yes you should folk NonConsolidatedBlock into the ExtensionBlock infrastructure. maybe be slightly tricky right now as Sparse has some special casing.

@@ -149,7 +151,7 @@ def _maybe_to_categorical(array):
"""


class Categorical(PandasObject):
class Categorical(ExtensionArray, PandasObject):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, the methods in PandasObject needs to be ABC in the ExtensionArray


class ExtensionDtype(object):

class PandasExtensionDtype(ExtensionDtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repr is reasonable to be part of the interface, caching I suppose is ok here

@@ -1821,6 +1824,130 @@ def _unstack(self, unstacker_func, new_columns):
return blocks, mask


class ExtensionBlock(NonConsolidatableMixIn, Block):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is exactly the point, this diff is already way too big. let's do a pre-cursor PR to split Block and manger (IOW internals into 2 files).

# ExtensionArrays must be iterable, so this works.
values = np.asarray(self.values)
if values.ndim == self.ndim - 1:
values = values.reshape((1,) + values.shape)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is for sparse. which is a step-child ATM. has to be dealt with

@@ -2410,29 +2525,6 @@ def shift(self, periods, axis=0, mgr=None):
return self.make_block_same_class(values=self.values.shift(periods),
placement=self.mgr_locs)

def take_nd(self, indexer, axis=0, new_mgr_locs=None, fill_tuple=None):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so as I said above, this PR is doing way way too much. pls just create the ExtensionBlock and just move things. Then in another PR you can do the changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is just moved a bit up in the file (to the parent class) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really is doing the bare minimum changes to core/internals.py. This won't even allow, e.g. pd.Series(MyExtensionArray()) to work, yet, though I have a follow PR ready to go that does that (with IntervalArray as a test case).

@jbrockmendel
Copy link
Member

this is exactly the point, this diff is already way too big. let's do a pre-cursor PR to split Block and manger (IOW internals into 2 files).

Some overnight-rebasing notwithstanding, I've got a branch ready that splits internals into internals.blocks, internals.managers, internals.joins (and except for import updates is pure cut/paste). Would it be helpful to push that now? I had planned to wait because it will make a ton of rebasing necessary in other PRs.

@jorisvandenbossche
Copy link
Member

The practical consequence of no longer using abc.ABCMeta, is it only that you now cannot register a class but need to actually subclass? (which doesn't seem a big problem?)

@TomAugspurger
Copy link
Contributor Author

That, and library author-wise, it's now a bit harder to validate that your extension array implements everything. Now it's possible to instantiate an ExtensionArray that doesn't implement the interface. With an ABC that would raise an exception.

# we want to unpack series, anything else?
if isinstance(arr_or_dtype, ABCSeries):
arr_or_dtype = arr_or_dtype._values
return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then _get_dtype_or_type needs adjustment. This is the point of compatibility, there shouldn't be the need to have special cases.

"""
A np.dtype duck-typed class, suitable for holding a custom dtype.

THIS IS NOT A REAL NUMPY DTYPE
"""
name = None
names = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some docs about these attributes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're documented in ExtensionDtype.

raise ValueError(
'Wrong number of items passed {val}, placement implies '
'{mgr}'.format(val=len(self.values), mgr=len(self.mgr_locs)))

def _maybe_validate_ndim(self, values, ndim):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. much nicer.

ExtensionArrays are limited to 1-D.
"""
def __init__(self, values, placement, ndim=None):
self._holder = type(values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think _holder can/should be a property of the Block itself (it can be a cached property)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Block defines _holder as a class attribute. Why would caching it be helpful? It's not reused among multiple ExtensionBlocks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its just 1 more thing to do in the init which is not needed

else:
fill_value = fill_tuple[0]

# axis doesn't matter; we are really a single-dim object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a doc-string would help here

@@ -2437,7 +2507,8 @@ class DatetimeBlock(DatetimeLikeBlockMixin, Block):
_can_hold_na = True

def __init__(self, values, placement, ndim=None):
if values.dtype != _NS_DTYPE:
if values.dtype != _NS_DTYPE and values.dtype.base != _NS_DTYPE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something is wrong if you have to change this.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Feb 1, 2018 via email

@@ -108,12 +108,7 @@ class Block(PandasObject):
_concatenator = staticmethod(np.concatenate)

def __init__(self, values, placement, ndim=None):
if ndim is None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @jbrockmendel. Pretty sure this is what you had in mind. Agreed it's cleaner.

Copy link
Contributor Author

@TomAugspurger TomAugspurger Feb 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(needed a followup commit to use self.ndim a few lines down.)

@TomAugspurger
Copy link
Contributor Author

CI is all green.

@TomAugspurger
Copy link
Contributor Author

Just to make sure, @jorisvandenbossche, @shoyer are you +1 on the changes here?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, +1 to merge this now, to be able to proceed with the other parts. We will probably need to make minor tweaks to the interface, but that will only become clear by doing next PRs to actually use this.

Added some minor doc comments. And two comments on the interface (take fill_value kwarg and formatting_values), but that can also go in a follow-up.

* take
* copy
* _formatting_values
* _concat_same_type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two lines above can be removed here and mentioned in the list below (actually only formatting_values, as concat_same_type is already there)

an instance, not error.

Additionally, certain methods and interfaces are required for proper
this array to be properly stored inside a ``DataFrame`` or ``Series``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is repetitive with above (the list of methods that are required), and there is also a typo in "for proper this array to be properly"


Examples
--------
Suppose the extension array somehow backed by a NumPy structured array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make this just "NumPy array", as this is not specific to structured arrays

def take(self, indexer, allow_fill=True, fill_value=None):
mask = indexer == -1
result = self.data.take(indexer)
result[mask] = self._fill_value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question here is: should the keyword argument fill_value actually be honored? (like if fill_value is None: fill_value = self._fill_value)
Are there case where pandas will actually pass a certain value?

In any case some clarification would be helpful, also if it is just in the signature for compatibility but may be ignored (maybe in a follow-up).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re fill_value, I based this off Categorical.take. It does assert isna(fill_value), but otherwise ignores it.

I think that since ExtensionArray.take returns an ExtensionArray, most implementations will just ignore fill_value. I'll clarify it in the docs.

# type: () -> np.ndarray
# At the moment, this has to be an array since we use result.dtype
"""An array of values to be printed in, e.g. the Series repr"""
raise AbstractMethodError(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can provide a default implementation of return np.asarray(self) ? (so a densified object array). That is what I do in geopandas, and I suppose would also work for the IPadresses ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I suppose that'll be OK for many implementations.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Feb 2, 2018

Thanks @jorisvandenbossche, I've fixed up the docs in my followup branch.

And I think you're right about a default implementation for _formatting_values, writing tests for that as well.

@shoyer
Copy link
Member

shoyer commented Feb 2, 2018 via email

@TomAugspurger
Copy link
Contributor Author

Yes, I'm pretty sure I'm +1 on the current state, though right now GitHub
is crashing when I try to load the page with this PR :).

This is partly why I'm hoping to push further changes into a followup :)

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Feb 2, 2018 via email

@jreback
Copy link
Contributor

jreback commented Feb 2, 2018

i haven’t had a look
but i guess it’s ok

@TomAugspurger TomAugspurger merged commit e8620ab into pandas-dev:master Feb 2, 2018
@TomAugspurger
Copy link
Contributor Author

Alrighty, thanks! Followup PRs inbound :)

@TomAugspurger TomAugspurger deleted the pandas-array-interface-3 branch February 2, 2018 21:34
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Feb 2, 2018 via email

@TomAugspurger TomAugspurger mentioned this pull request Feb 14, 2018
15 tasks
harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018
…9268)

* REF: Define extension base classes

* Updated for comments

* removed take_nd
* Changed to_dense to return get_values
* Fixed docstrings, types
* Removed is_sparse

* Remove metaclasses from PeriodDtype and IntervalDtype

* Fixup form_blocks rebase

* Restore concat casting cat -> object

* Remove _slice, clarify semantics around __getitem__

* Document and use take.

* Clarify type, kind, init

* Remove base

* API: Remove unused __iter__ and get_values

* API: Implement repr and str

* Remove default value_counts for now

* Fixed merge conflicts

* Remove implementation of construct_from_string

* Example implementation of take

* Cleanup ExtensionBlock

* Pass through ndim

* Use series._values

* Removed repr, updated take doc

* Various cleanups

* Handle get_values, to_dense, is_view

* Docs

* Remove is_extension, is_bool

Remove inherited convert

* Sparse formatter

* Revert "Sparse formatter"

This reverts commit ab2f045.

* Unbox SparseSeries

* Added test for sparse consolidation

* Docs

* Moved to errors

* Handle classmethods, properties

* Use our AbstractMethodError

* Lint

* Cleanup

* Move ndim validation to a method.

* Try this

* Make ExtensionBlock._holder a property

Removed ExtensionBlock.__init__

* Make _holder a property for all

* Refactored validate_ndim

* fixup! Refactored validate_ndim

* lint
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants