-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Array Interface and Categorical internals Refactor #19268
Array Interface and Categorical internals Refactor #19268
Conversation
Hello @TomAugspurger! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on February 01, 2018 at 20:55 Hours UTC |
pandas/core/arrays/base.py
Outdated
Should return an ExtensionArray, even if ``self[slicer]`` | ||
would return a scalar. | ||
""" | ||
# XXX: We could get rid of this *if* we require that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can almost get rid of _slice
with a default implementation noted in the comments, but I think dimensionality reduction from array -> scalar could break things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming slicer
is a slice, in what cases can it give a scalar?
In principle this could just be handled in getitem
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slice(0)
or slice(0, 1)
hit` this.
Ahh, but I didn't realize that NumPy always returned arrays from slices:
In [5]: np.array([0])[slice(0, 1)]
Out[5]: array([0])
So if we assume that as well, then we could get rid of _slice
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we can assume that if __getitem__
gets a slice
object, it should return an instance of itself.
pandas/core/dtypes/base.py
Outdated
@property | ||
def type(self): | ||
"""Typically a metaclass inheriting from 'type' with no methods.""" | ||
return type(self.name, (), {}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really sure what array.dtype.type
is used for. This passes the test suite, but may break things like
array1.dtype.type is array2.dtype.type
since the object IDs will be different (I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In NumPy, dtype.type
is the corresponding scalar type, e.g.,
>>> np.dtype(np.float64).type
numpy.float64
I don't know where "Typically a metaclass inheriting from 'type' with no methods." comes from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that makes sense. Would you say that object
is a good default here? I'll work through a test array that uses something meaningful like numbers.Real
or int
. We do use values.dtype.type
in a few places, like figuring out which block type to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's a good default value here. Generally the right choice is to return the corresponding scalar type, e.g., Interval
for IntervalDtype
.
pandas/core/internals.py
Outdated
@@ -108,14 +109,15 @@ class Block(PandasObject): | |||
def __init__(self, values, placement, ndim=None, fastpath=False): | |||
if ndim is None: | |||
ndim = values.ndim | |||
elif values.ndim != ndim: | |||
elif self._validate_ndim and values.ndim != ndim: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback would be curious to hear your thoughts here. Needed this so that ExtensionBlock
could inherit from Block
and call super().__init__
. We could avoid this by setting .values
, .mgr_locs
, .ndim
directly in ExtensionBlock.__init__
, but I think it's best practice to always call your parent's init.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, see my above comment
pandas/core/internals.py
Outdated
# Placement must be converted to BlockPlacement so that we can check | ||
# its length | ||
if not isinstance(placement, BlockPlacement): | ||
placement = BlockPlacement(placement) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it much clearer to just copy these two lines rather than calling mgr_locs.setter
, but can revert to that if desired.
pandas/core/internals.py
Outdated
@@ -2360,23 +2443,13 @@ def is_view(self): | |||
def to_dense(self): | |||
return self.values.to_dense().view() | |||
|
|||
def convert(self, copy=True, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the ExtensionBlock now.
pandas/core/internals.py
Outdated
@property | ||
def array_dtype(self): | ||
""" the dtype to return if I want to construct this block as an | ||
array | ||
""" | ||
return np.object_ | ||
|
||
def _slice(self, slicer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ExtensionBlock now.
pandas/core/internals.py
Outdated
@@ -2468,7 +2541,8 @@ class DatetimeBlock(DatetimeLikeBlockMixin, Block): | |||
_can_hold_na = True | |||
|
|||
def __init__(self, values, placement, fastpath=False, **kwargs): | |||
if values.dtype != _NS_DTYPE: | |||
if values.dtype != _NS_DTYPE and values.dtype.base != _NS_DTYPE: | |||
# not datetime64 or datetime64tz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was so NonConsolidatableMixin
could call __init__
. I think it's harmless.
pandas/core/internals.py
Outdated
|
||
# Methods we can (probably) ignore and just use Block's: | ||
|
||
# * replace / replace_single |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any things on this @jreback? CategoricalBlock.replace used to be ObjectBlock.replace
, but now it's just Block.replace
, which is much simpler.
pandas/core/internals.py
Outdated
values = values.reshape((1,) + values.shape) | ||
return values | ||
|
||
def _can_hold_element(self, element): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And @jreback I'll defer to you on this one, since you know more about when this is called. I don't think we can just do return isinstance(element, self._holder)
, since element
may need to be coerced. For an IP address, we want
ser = pd.Series(ip.IPAddress(['192.168.1.1']))
ser[0] = '192.168.1.0'
to work, and (I think) element
would just be the str there.
@TomAugspurger for reviewing, did you need to change a lot to |
i would much prefer that moves occur separately from changes |
I did keep the move in 4b06ae4 All the non-move changes to categorical are in a9e0972#diff-f3b2ea15ba728b55cab4a1acd97d996d |
But I can split the move into its own PR as well. One thing: do we have code for pickle compat on module changes? The old pickle files expect a |
Splitting these PRs now (I'll make a new PR for the move and then rebase this one) |
#19269 for the move. |
c4ff28f
to
2ef5216
Compare
Codecov Report
@@ Coverage Diff @@
## master #19268 +/- ##
==========================================
+ Coverage 91.62% 91.62% +<.01%
==========================================
Files 150 150
Lines 48726 48672 -54
==========================================
- Hits 44643 44598 -45
+ Misses 4083 4074 -9
Continue to review full report at Codecov.
|
Rebased on master. Summary of the changes from master:
This isn't really a test of whether extension arrays work yet, since we're |
@shoyer, @chris-b1, @wesm if you're interested in the interface, this is the PR it's being defined in, though we may have to make refinements later. I have followup PRs ready for |
pandas/core/arrays/base.py
Outdated
# ------------------------------------------------------------------------ | ||
@property | ||
def base(self): | ||
"""The base array I am a view of. None by default.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give an example here?
Perhaps it would also help to explain how is this used by pandas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used in Block.is_view
, which AFAICT is only used for chained assignment?
If that's correct, then I think we're OK with saying this purely for compatibility with NumPy arrays, and has no effect. I've currently defined ExtensionArray.is_view
to always be False
, so I don't even make use of it in the changes so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that is the case, I would remove this for now (we can always later extend the interface if it turns out to be needed for something).
However, was just wondering: your ExtensionArray could be a view on another ExtensionArray (eg by slicing). Is this something we need to consider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also remove this. NumPy doesn't always maintain this properly, so it can't actually be essential.
pandas/core/arrays/base.py
Outdated
@abc.abstractmethod | ||
def take(self, indexer, allow_fill=True, fill_value=None): | ||
# type: (Sequence, bool, Optional[Any]) -> ExtensionArray | ||
"""For slicing""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should clarify what valid values of indexer
are. Does -1 indicate a fill value?
pandas/core/arrays/base.py
Outdated
|
||
def take_nd(self, indexer, allow_fill=True, fill_value=None): | ||
"""For slicing""" | ||
# TODO: this isn't really nescessary for 1-D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this
pandas/core/arrays/base.py
Outdated
|
||
@property | ||
def is_sparse(self): | ||
"""Whether your array is sparse. True by default.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correction: False by default :)
This should clarify what it means to be a sparse array. How does pandas treat sparse arrays differently?
I would consider dropping this if it isn't strictly necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's unnecessary.
pandas/core/arrays/base.py
Outdated
Should return an ExtensionArray, even if ``self[slicer]`` | ||
would return a scalar. | ||
""" | ||
return type(self)(self[slicer]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This default implementation is likely to fail for some obvious implementations. Perhaps we can have a constructor method _from_scalar()
instead that converts a scalar into a length 1 array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me see if I can verify that this is always called with a slice
object. In that case, __getitem__
will return an ExtensionArray
, and we don't have to worry about the scalar case. Unless I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would try to get rid of this if possible, and just ask that __getitem__
can deal with this (of course, alternative is to add separate methods for different __getitem__
functionalities like _slice
, but then also _mask
, but I don't really see the advantage of this).
pandas/core/arrays/base.py
Outdated
|
||
|
||
@add_metaclass(abc.ABCMeta) | ||
class ExtensionArray(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any expected requirements for the constructor __init__
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we should figure out what those are and document them. At the very least, we expected ExtensionArray(extension_array)
to work correctly. I'll look for other assumptions we make. Or that could be pushed to another classmethod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also expect that ExtensionArray()
, with no arguments, works so that subclasses don't have to implement construct_from_string
.
Rather than imposing that on subclasses, we could require some kind of .empty
alternative constructor.
pandas/core/arrays/base.py
Outdated
def dtype(self): | ||
"""An instance of 'ExtensionDtype'.""" | ||
# type: () -> ExtensionDtype | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please drop pass
from all these methods. It's not needed (docstrings alone suffice).
pandas/core/arrays/base.py
Outdated
@abc.abstractmethod | ||
def nbytes(self): | ||
"""The number of bytes needed to store this object in memory.""" | ||
# type: () -> int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type comments come before the docstring: http://mypy.readthedocs.io/en/latest/python2.html
pandas/core/dtypes/base.py
Outdated
@property | ||
def type(self): | ||
"""Typically a metaclass inheriting from 'type' with no methods.""" | ||
return type(self.name, (), {}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In NumPy, dtype.type
is the corresponding scalar type, e.g.,
>>> np.dtype(np.float64).type
numpy.float64
I don't know where "Typically a metaclass inheriting from 'type' with no methods." comes from.
pandas/core/dtypes/base.py
Outdated
|
||
@property | ||
def kind(self): | ||
"""A character code (one of 'biufcmMOSUV'), default 'O' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should clarify how it's used. How is this useful?
Perhaps "This should match dtype.kind
when arrays with this dtype are cast to numpy arrays"?
* removed take_nd * Changed to_dense to return get_values * Fixed docstrings, types * Removed is_sparse
|
||
class ExtensionDtype(object): | ||
|
||
class PandasExtensionDtype(ExtensionDtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would this not be just ExtensionDtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We define methods like __repr__
, caching, etc. that are not part of the interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
repr is reasonable to be part of the interface, caching I suppose is ok here
pandas/core/internals.py
Outdated
@@ -95,6 +96,7 @@ class Block(PandasObject): | |||
is_object = False | |||
is_categorical = False | |||
is_sparse = False | |||
is_extension = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you wouldn't do it like this. rather you would inherit from an ExtensionBlock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked at what these is_
properties are used for, but all the other blocks had them so I added it for consistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not add things like this until / unless necessary
pandas/core/internals.py
Outdated
@@ -108,14 +109,15 @@ class Block(PandasObject): | |||
def __init__(self, values, placement, ndim=None, fastpath=False): | |||
if ndim is None: | |||
ndim = values.ndim | |||
elif values.ndim != ndim: | |||
elif self._validate_ndim and values.ndim != ndim: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, see my above comment
@@ -1821,6 +1824,130 @@ def _unstack(self, unstacker_func, new_columns): | |||
return blocks, mask | |||
|
|||
|
|||
class ExtensionBlock(NonConsolidatableMixIn, Block): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should NOT be in this file, make a dedicated namespace. pandas.core.internals.block or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would it not be in this file? All the other blocks are. Note that ExtensionBlock
is not ever going to be public, only ExtensionArray
and ExtensionDtype
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this should be in this file (if we find internals.py
too long we can split it in multiple files, but let's do that in a separate PR to make reviewing here not harder).
@jreback ExtensionBlock
will just the Block internally used for our own extension types (a bit like NonConsolidatableBlock
is not the base class for those)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is exactly the point, this diff is already way too big. let's do a pre-cursor PR to split Block and manger (IOW internals into 2 files).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By splitting the file first, the diff here will not be smaller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to test this out with GeoPandas one of the coming days, to give some more feedback
pandas/core/arrays/base.py
Outdated
i.e. ``ExtensionArray()`` returns an instance. | ||
TODO: See comment in ``ExtensionDtype.construct_from_string`` | ||
* Your class should be able to be constructed with instances of | ||
our class, i.e. ``ExtensionArray(extension_array)`` should returns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should "our class" be "your class" ? Or should it be able to handle any ExtensionArray subclass (the first would be better IMO)
pandas/core/arrays/base.py
Outdated
|
||
Notes | ||
----- | ||
As a sequence, __getitem__ should expect integer or slice ``key``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also boolean mask?
pandas/core/arrays/base.py
Outdated
if the slice is length 0 or 1. | ||
|
||
For scalar ``key``, you may return a scalar suitable for your type. | ||
The scalar need not be an instance or subclass of your array type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "need not be" enough? (compared to "should not be")
I mean, we won't run into problems in the internals in pandas by seeing arrays where we expect scalars?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll clarify this to say
For scalar ``item``, you should return a scalar value suitable
for your type. This should be an instance of ``self.dtype.type``.
My earlier phrasing was to explain that the return value for scalars needn't be the type of item that's actually stored in your array. E.g. for my IPAddress example, the array holds two uint64s, but a scalar slice returns an ipaddress.IPv4Address
instance.
pandas/core/arrays/base.py
Outdated
# ------------------------------------------------------------------------ | ||
@property | ||
def base(self): | ||
"""The base array I am a view of. None by default.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that is the case, I would remove this for now (we can always later extend the interface if it turns out to be needed for something).
However, was just wondering: your ExtensionArray could be a view on another ExtensionArray (eg by slicing). Is this something we need to consider?
pandas/core/arrays/base.py
Outdated
Should return an ExtensionArray, even if ``self[slicer]`` | ||
would return a scalar. | ||
""" | ||
return type(self)(self[slicer]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I would try to get rid of this if possible, and just ask that __getitem__
can deal with this (of course, alternative is to add separate methods for different __getitem__
functionalities like _slice
, but then also _mask
, but I don't really see the advantage of this).
pandas/core/dtypes/base.py
Outdated
----- | ||
The default implementation is True if | ||
|
||
1. 'dtype' is a string that returns true for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"returns true" -> "does not raise" ?
pandas/core/dtypes/common.py
Outdated
|
||
# we want to unpack series, anything else? | ||
if isinstance(arr_or_dtype, ABCSeries): | ||
arr_or_dtype = arr_or_dtype.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will only work if .values
will return such a PeriodArray or IntervalArray, and I am not sure we already decided on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is Series.values
, what else would it return? An object-typed NumPy array? I think the ship has sailed on Series.values
always being a NumPy array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use Series._values
for now? That gets the values of the block, and is certainly an ExtensionArray in case the series holds one.
The we can postpone the decision on what .values
returns?
pandas/core/internals.py
Outdated
|
||
Returns | ||
------- | ||
IntervalArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Block?
@@ -1821,6 +1824,130 @@ def _unstack(self, unstacker_func, new_columns): | |||
return blocks, mask | |||
|
|||
|
|||
class ExtensionBlock(NonConsolidatableMixIn, Block): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this should be in this file (if we find internals.py
too long we can split it in multiple files, but let's do that in a separate PR to make reviewing here not harder).
@jreback ExtensionBlock
will just the Block internally used for our own extension types (a bit like NonConsolidatableBlock
is not the base class for those)
# ExtensionArrays must be iterable, so this works. | ||
values = np.asarray(self.values) | ||
if values.ndim == self.ndim - 1: | ||
values = values.reshape((1,) + values.shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed? I know it currently like that in NonConsolidatableBlock
, but do we ever expect the result to be a 2D array if this is holded in an DataFrame.
Eg datetimetz block returns here a DatetimeIndex for both .values
and .get_values()
. On the other hand, a categorical block does this reshaping and returns Categorical vs 2d object numpy array.
Further, do we need to do something with dtype
arg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is for sparse. which is a step-child ATM. has to be dealt with
yes you should folk NonConsolidatedBlock into the ExtensionBlock infrastructure. maybe be slightly tricky right now as Sparse has some special casing. |
@@ -149,7 +151,7 @@ def _maybe_to_categorical(array): | |||
""" | |||
|
|||
|
|||
class Categorical(PandasObject): | |||
class Categorical(ExtensionArray, PandasObject): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, the methods in PandasObject needs to be ABC in the ExtensionArray
|
||
class ExtensionDtype(object): | ||
|
||
class PandasExtensionDtype(ExtensionDtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
repr is reasonable to be part of the interface, caching I suppose is ok here
@@ -1821,6 +1824,130 @@ def _unstack(self, unstacker_func, new_columns): | |||
return blocks, mask | |||
|
|||
|
|||
class ExtensionBlock(NonConsolidatableMixIn, Block): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is exactly the point, this diff is already way too big. let's do a pre-cursor PR to split Block and manger (IOW internals into 2 files).
# ExtensionArrays must be iterable, so this works. | ||
values = np.asarray(self.values) | ||
if values.ndim == self.ndim - 1: | ||
values = values.reshape((1,) + values.shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is for sparse. which is a step-child ATM. has to be dealt with
@@ -2410,29 +2525,6 @@ def shift(self, periods, axis=0, mgr=None): | |||
return self.make_block_same_class(values=self.values.shift(periods), | |||
placement=self.mgr_locs) | |||
|
|||
def take_nd(self, indexer, axis=0, new_mgr_locs=None, fill_tuple=None): | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so as I said above, this PR is doing way way too much. pls just create the ExtensionBlock and just move things. Then in another PR you can do the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is just moved a bit up in the file (to the parent class) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really is doing the bare minimum changes to core/internals.py
. This won't even allow, e.g. pd.Series(MyExtensionArray())
to work, yet, though I have a follow PR ready to go that does that (with IntervalArray
as a test case).
Some overnight-rebasing notwithstanding, I've got a branch ready that splits |
The practical consequence of no longer using abc.ABCMeta, is it only that you now cannot register a class but need to actually subclass? (which doesn't seem a big problem?) |
That, and library author-wise, it's now a bit harder to validate that your extension array implements everything. Now it's possible to instantiate an |
# we want to unpack series, anything else? | ||
if isinstance(arr_or_dtype, ABCSeries): | ||
arr_or_dtype = arr_or_dtype._values | ||
return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then _get_dtype_or_type
needs adjustment. This is the point of compatibility, there shouldn't be the need to have special cases.
""" | ||
A np.dtype duck-typed class, suitable for holding a custom dtype. | ||
|
||
THIS IS NOT A REAL NUMPY DTYPE | ||
""" | ||
name = None | ||
names = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some docs about these attributes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're documented in ExtensionDtype
.
pandas/core/internals.py
Outdated
raise ValueError( | ||
'Wrong number of items passed {val}, placement implies ' | ||
'{mgr}'.format(val=len(self.values), mgr=len(self.mgr_locs))) | ||
|
||
def _maybe_validate_ndim(self, values, ndim): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks. much nicer.
pandas/core/internals.py
Outdated
ExtensionArrays are limited to 1-D. | ||
""" | ||
def __init__(self, values, placement, ndim=None): | ||
self._holder = type(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think _holder can/should be a property of the Block itself (it can be a cached property)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Block
defines _holder
as a class attribute. Why would caching it be helpful? It's not reused among multiple ExtensionBlocks
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its just 1 more thing to do in the init which is not needed
else: | ||
fill_value = fill_tuple[0] | ||
|
||
# axis doesn't matter; we are really a single-dim object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a doc-string would help here
pandas/core/internals.py
Outdated
@@ -2437,7 +2507,8 @@ class DatetimeBlock(DatetimeLikeBlockMixin, Block): | |||
_can_hold_na = True | |||
|
|||
def __init__(self, values, placement, ndim=None): | |||
if values.dtype != _NS_DTYPE: | |||
if values.dtype != _NS_DTYPE and values.dtype.base != _NS_DTYPE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something is wrong if you have to change this.
Removed ExtensionBlock.__init__
Done in
cd0997e
if you could take a look.
I did it for every block, which seemed better than a mix of attributes and
properties.
…On Thu, Feb 1, 2018 at 1:07 PM, Jeff Reback ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/core/internals.py
<#19268 (comment)>:
> @@ -1800,6 +1820,91 @@ def _unstack(self, unstacker_func, new_columns):
return blocks, mask
+class ExtensionBlock(NonConsolidatableMixIn, Block):
+ """Block for holding extension types.
+
+ Notes
+ -----
+ This holds all 3rd-party extension array types. It's also the immediate
+ parent class for our internal extension types' blocks, CategoricalBlock.
+
+ ExtensionArrays are limited to 1-D.
+ """
+ def __init__(self, values, placement, ndim=None):
+ self._holder = type(values)
its just 1 more thing to do in the *init* which is not needed
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19268 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIoHyKRRVAeZqotWc_9Qq56iVKPs_ks5tQbctgaJpZM4Rf8hq>
.
|
@@ -108,12 +108,7 @@ class Block(PandasObject): | |||
_concatenator = staticmethod(np.concatenate) | |||
|
|||
def __init__(self, values, placement, ndim=None): | |||
if ndim is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @jbrockmendel. Pretty sure this is what you had in mind. Agreed it's cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(needed a followup commit to use self.ndim
a few lines down.)
CI is all green. |
Just to make sure, @jorisvandenbossche, @shoyer are you +1 on the changes here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, +1 to merge this now, to be able to proceed with the other parts. We will probably need to make minor tweaks to the interface, but that will only become clear by doing next PRs to actually use this.
Added some minor doc comments. And two comments on the interface (take fill_value kwarg and formatting_values), but that can also go in a follow-up.
* take | ||
* copy | ||
* _formatting_values | ||
* _concat_same_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two lines above can be removed here and mentioned in the list below (actually only formatting_values, as concat_same_type is already there)
an instance, not error. | ||
|
||
Additionally, certain methods and interfaces are required for proper | ||
this array to be properly stored inside a ``DataFrame`` or ``Series``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is repetitive with above (the list of methods that are required), and there is also a typo in "for proper this array to be properly"
|
||
Examples | ||
-------- | ||
Suppose the extension array somehow backed by a NumPy structured array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make this just "NumPy array", as this is not specific to structured arrays
def take(self, indexer, allow_fill=True, fill_value=None): | ||
mask = indexer == -1 | ||
result = self.data.take(indexer) | ||
result[mask] = self._fill_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question here is: should the keyword argument fill_value
actually be honored? (like if fill_value is None: fill_value = self._fill_value
)
Are there case where pandas will actually pass a certain value?
In any case some clarification would be helpful, also if it is just in the signature for compatibility but may be ignored (maybe in a follow-up).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re fill_value
, I based this off Categorical.take
. It does assert isna(fill_value)
, but otherwise ignores it.
I think that since ExtensionArray.take returns an ExtensionArray, most implementations will just ignore fill_value. I'll clarify it in the docs.
# type: () -> np.ndarray | ||
# At the moment, this has to be an array since we use result.dtype | ||
"""An array of values to be printed in, e.g. the Series repr""" | ||
raise AbstractMethodError(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can provide a default implementation of return np.asarray(self)
? (so a densified object array). That is what I do in geopandas, and I suppose would also work for the IPadresses ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I suppose that'll be OK for many implementations.
Thanks @jorisvandenbossche, I've fixed up the docs in my followup branch. And I think you're right about a default implementation for |
Yes, I'm pretty sure I'm +1 on the current state, though right now GitHub
is crashing when I try to load the page with this PR :).
…On Fri, Feb 2, 2018 at 11:13 AM Tom Augspurger ***@***.***> wrote:
Just to make sure, @jorisvandenbossche
<https://github.com/jorisvandenbossche>, @shoyer
<https://github.com/shoyer> are you +1 on the changes here?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19268 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1pAcUdeZ9GkAMr2bUAfuH7Fs0hZeks5tQ15DgaJpZM4Rf8hq>
.
|
This is partly why I'm hoping to push further changes into a followup :) |
Yes, for me the same. I think the page is only loading in 1 out of 5 tries
or so .. Another reason to merge :-)
2018-02-02 20:57 GMT+01:00 Stephan Hoyer <[email protected]>:
… Yes, I'm pretty sure I'm +1 on the current state, though right now GitHub
is crashing when I try to load the page with this PR :).
On Fri, Feb 2, 2018 at 11:13 AM Tom Augspurger ***@***.***>
wrote:
> Just to make sure, @jorisvandenbossche
> <https://github.com/jorisvandenbossche>, @shoyer
> <https://github.com/shoyer> are you +1 on the changes here?
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#19268 (comment)
>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/
ABKS1pAcUdeZ9GkAMr2bUAfuH7Fs0hZeks5tQ15DgaJpZM4Rf8hq>
> .
>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19268 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AA-SUJgmb4nGDrSBl0URDRGzsEeR9p_8ks5tQ2iPgaJpZM4Rf8hq>
.
|
i haven’t had a look |
Alrighty, thanks! Followup PRs inbound :) |
i haven’t had a look
but i guess it’s ok
don't hesitate to still add comments here, they can then always be dealt
with in follow-up prs
2018-02-02 22:34 GMT+01:00 Tom Augspurger <[email protected]>:
… Alrighty, thanks! Followup PRs inbound :)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19268 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AA-SUMR-MY81ioq-2REvrqJaRP6eN4BAks5tQ39zgaJpZM4Rf8hq>
.
|
…9268) * REF: Define extension base classes * Updated for comments * removed take_nd * Changed to_dense to return get_values * Fixed docstrings, types * Removed is_sparse * Remove metaclasses from PeriodDtype and IntervalDtype * Fixup form_blocks rebase * Restore concat casting cat -> object * Remove _slice, clarify semantics around __getitem__ * Document and use take. * Clarify type, kind, init * Remove base * API: Remove unused __iter__ and get_values * API: Implement repr and str * Remove default value_counts for now * Fixed merge conflicts * Remove implementation of construct_from_string * Example implementation of take * Cleanup ExtensionBlock * Pass through ndim * Use series._values * Removed repr, updated take doc * Various cleanups * Handle get_values, to_dense, is_view * Docs * Remove is_extension, is_bool Remove inherited convert * Sparse formatter * Revert "Sparse formatter" This reverts commit ab2f045. * Unbox SparseSeries * Added test for sparse consolidation * Docs * Moved to errors * Handle classmethods, properties * Use our AbstractMethodError * Lint * Cleanup * Move ndim validation to a method. * Try this * Make ExtensionBlock._holder a property Removed ExtensionBlock.__init__ * Make _holder a property for all * Refactored validate_ndim * fixup! Refactored validate_ndim * lint
(edit post categorical-move)
Rebased on master. Summary of the changes from master:
This isn't really a test of whether extension arrays work yet, since we're
still using Categorical for everything. I have a followup PR that implements
an IntervalArray that requires additional changes to, e.g., the constructors
so that things work. But all the changes from core/internals.py required to
make that work are present here.
Old:
new:
Figuring out which methods of
ObjectBlock
were required onCategoricalBlock
wasn't trivial for me. I probably messed some up.
I think that eventually we can remove
NonConsolidatableMixin
, with the ideathat all non-consolidatable blocks are blocks for extension dtypes? That's true
today anyway.
Followup PRs:
core/arrays/period.py
and refactoringPeriodIndex
core/arrays/interval.py
and refactoringIntervalIndex