-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: Internal / External values #19558
Conversation
doc/source/internals.rst
Outdated
always an ``ndarray`` or ``ExtensionArray``). | ||
|
||
So, for example, ``Series[category]._values`` is a ``Categorical``, while ``Series[category]._ndarray_values`` is | ||
the underlying ndarray. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what Series[categorical]._ndarray_values
should be, categories or codes? PeriodIndex._ndarray_values
is the ordinals, so maybe codes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so a) this should remove .get_values() as well completely, adding something w/o removing the old to me is a non-starter here.
I see lots of special cases. This points to a problem in the interface. if you are doing is_......_ then it needs to be fixed in ExtensenionArray (IOW have a dispatch for that function) and overrides in the subclasses. calling out Categorical here.
doc/source/internals.rst
Outdated
Values | ||
~~~~~~ | ||
|
||
Pandas extends NumPy's type system in a few places, so we have multiple notions of "values" floating around. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first sentence is totally not clear to a new reader
doc/source/internals.rst
Outdated
|
||
* ``cls._ndarray_values`` is *always* and ``ndarray`` | ||
* ``cls._values`` refers is the "best possible" array. This could be an ``ndarray``, ``ExtensionArray``, or | ||
in ``Index`` subclass (note: we're in the process of removing the index subclasses here so that it's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are using internal (pun intended) jargon here
doc/source/internals.rst
Outdated
always an ``ndarray`` or ``ExtensionArray``). | ||
|
||
So, for example, ``Series[category]._values`` is a ``Categorical``, while ``Series[category]._ndarray_values`` is | ||
the underlying ndarray. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
pandas/core/base.py
Outdated
@@ -973,8 +999,12 @@ def value_counts(self, normalize=False, sort=True, ascending=False, | |||
@Appender(_shared_docs['unique'] % _indexops_doc_kwargs) | |||
def unique(self): | |||
values = self._values | |||
|
|||
if isinstance(values, ABCDatetimeIndex): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you special casing?
pandas/core/dtypes/concat.py
Outdated
return self._shallow_copy_with_infer(np.concatenate(to_concat), **attribs) | ||
arrays = [] | ||
for x in to_concat: | ||
if is_categorical_dtype(x): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this a special case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger don't we want np.asarray(x, dtype=object)
for all extension dtypes in this case?
So for Period and Interval _values
(used some lines below for Index types) currently still returns object array, but I think the purpose is that those will also return their own Array class, so using np.asarray(x, dtype=object)
will do the same now and is more future proof?
pandas/core/indexes/base.py
Outdated
@@ -2208,27 +2208,37 @@ def union(self, other): | |||
other = other.astype('O') | |||
return this.union(other) | |||
|
|||
if is_categorical_dtype(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are these special caes?
I think eventually we'll want to ndarray_values for this, but it'll require a bit more work to support. Currently, using ndarary_values causes occasional failures on categorical.
Codecov Report
@@ Coverage Diff @@
## master #19558 +/- ##
==========================================
+ Coverage 91.57% 91.58% +0.01%
==========================================
Files 150 150
Lines 48817 48864 +47
==========================================
+ Hits 44704 44752 +48
+ Misses 4113 4112 -1
Continue to review full report at Codecov.
|
Hello @TomAugspurger! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on February 13, 2018 at 11:54 Hours UTC |
Still a WIP, but this should hopefully pass. I'm going to make a pass through the library to remove things that are currently doing checks for I've also cherry picked Setops are even more of a mess now. I have a biggish refactor started (with nice perf boosts), but didn't finish it today. It's tricky since the current implementation was relying on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should remove .get_values() as well completely, adding something w/o removing the old to me is a non-starter here.
@jreback if you are speaking about Series.get_values and Index.get_values, then I don't think this related. Yes, ideally we can implement get_values
cleanly with one of the other ones, but get_values
is a public function that should not just be removed or changed. We can certainly discuss deprecating it, but IMO that is separate of this internal cleaning up.
pandas/core/indexes/base.py
Outdated
index | values | _values | _ndarray_values | | ||
----------------- | -------------- -| ----------- | --------------- | | ||
CategoricalIndex | Categorical | Categorical | codes | | ||
DatetimeIndex[tz] | ndarray[M8ns] | DTI[tz] | datetime@UTC | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datetime@UTC -> is the same as ndarray[M8ns] in the 'values' column ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll fix that to be consistent.
pandas/core/indexes/base.py
Outdated
index | values | _values | ndarray_values | | ||
----------------- | --------------- | ----------- | -------------- | | ||
PeriodIndex | ndarray[object] | PeriodArray | ||
IntervalIndex | IntervalArray | ndarray[Interval] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be the other way around?
and for PeriodIndex, _ndarray_values
is ndarrary[ordinals] ? (for Interval there is not really a native ndarray representation possible, so that is also object array ?
pandas/core/indexes/category.py
Outdated
@@ -293,6 +304,23 @@ def values(self): | |||
""" return the underlying data, which is a Categorical """ | |||
return self._data | |||
|
|||
@property | |||
def _values(self): | |||
return self._data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a difference between ._data
and .values
? As below you use .values
for itemsize and nbytes, but both seems to point to the underlying Categorical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably just me being inconsistent.
pandas/core/indexes/category.py
Outdated
@@ -769,7 +797,7 @@ def _evaluate_compare(self, other): | |||
|
|||
def _delegate_method(self, name, *args, **kwargs): | |||
""" method delegation to the ._values """ | |||
method = getattr(self._values, name) | |||
method = getattr(self.values, name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to be unnecessary. Reverting.
pandas/core/indexes/multi.py
Outdated
@@ -799,9 +799,11 @@ def values(self): | |||
box = hasattr(lev, '_box_values') | |||
# Try to minimize boxing. | |||
if box and len(lev) > len(lab): | |||
taken = lev._box_values(algos.take_1d(lev._values, lab)) | |||
taken = lev._box_values(algos.take_1d(lev._values, | |||
lab)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't the algos functions always assume ndarray?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, fixed (the only one that should be ._values
is the one on line 809 below).
elif isinstance(l_values, pd.Index): | ||
tm.assert_index_equal(l_values, r_values) | ||
elif pd.api.types.is_categorical(l_values): | ||
tm.assert_categorical_equal(l_values, r_values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future, we should maybe add an assert_array_equal
that handles the different extension arrays as well, then such a special case should not be needed here?
pandas/tests/test_base.py
Outdated
assert r_values.dtype == dtype | ||
|
||
|
||
def test_values_periodindex(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also tested in the test above?
|
What needs to be fixed? Or do just mean deprecating it?
That doesn't need to happen first, it can be done independently of this. I will open an issue for it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments, but in general this is looking good to me.
For Index, we now also have the _data
attribute. Is it the idea that in the future this will become the same as _values
?
pandas/core/dtypes/concat.py
Outdated
@@ -480,20 +480,22 @@ def _concat_datetimetz(to_concat, name=None): | |||
|
|||
def _concat_index_same_dtype(indexes, klass=None): | |||
klass = klass if klass is not None else indexes[0].__class__ | |||
return klass(np.concatenate([x._values for x in indexes])) | |||
return klass(np.concatenate([x._ndarray_values for x in indexes])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is only used for numeric indices, so _values
or _ndarray_values
should not matter.
I was mainly thinking that for those cases where _ndarray_values
actually differs from _values
, like a categorical, the above code will not work anyway, as klass(codes)
is not enough to reconstruct the categorical. So _values
seems safer to me.
# type: () -> np.ndarray | ||
"""Internal pandas method for lossy conversion to a NumPy ndarray. | ||
|
||
This method is not part of the pandas interface. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it a bit strange that we say this is not part of the interface, but still provide here a default implementation and say what it is.
If it is not part of the interface, we could also define it on our own subclasses without defining it here.
But on the other hand, that will raise errors when an external extension array does not have this method. Eg some attributes on Series will call into this, which makes it somehow part of the interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it a bit strange that we say this is not part of the interface, but still provide here a default implementation and say what it is.
This simplifies the implementation since
a.) we don't have to have a PandasExtensionArray
that subclasses ExtensionArray
, just to define this method.
b.) we can safely call ._values
anywhere in our code without checking whether it's an extension array or a pandas extension array.
My preference is to leave it out of the interface until someone sees an actual need for it. I suspect this could come up if / when we start allowing custom indexes with their own indexing engines, but that seems like a ways down the road...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understand those reasons, but that means it is part of the interface (in the sense that if somebody would be stupid to implement a _ndarray_values
property that returns something differently, it will break code?)
It's just a part of the interface that we ask not to implement?
To give an example, for GeoPandas, I was wondering if I would overwrite this property to return my self.data
(the integer pointers) instead of a materialized array (which can be costly).
For things like Series.strides
, Series.itemsize
, .. this will be OK (and in principle no user should call this anyway .. so maybe I should not worry about it), but if it is used in other places as well that I am not really sure about, such an implementation might actually give a wrong result (but at the same time, if this is called in certain places for my column with an extension array, I want to know that, because materializing can be costly and I want to avoid this as much as possible).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. Your GeoPandas concern is a good one. But I'm not sure how to proceeded :/ Do you have a preference for
a.) Adding it to the interface
b.) Leaving it out of the interface, with a default implementation on EA
c.) Making a pandas-specific EA that defines _ndarray_values
, so uses of _ndarray_values
will need to check for that attr.
._values
/ ._ndarray_values
are meant to be internal, and I don't want to limit future refactorings with backwards compatibility concerns. I'm also not even sure how to document it beyond what's already there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the best option is to leave it for now? :)
Ideally, the Series machinery should not use _ndarray_values
too often, as this would require knowledge of what those values mean (codes, ordinals, ...), which Series does not need to know (if there is knowledge required, it should somehow dispatch the operation to the underlying (extension) array itself).
And this is already the case I think. From a quick scan, currently in the PR, I see ndarray_values
is mainly used for:
- indexing (values used in the engine) -> for now we don't support extension arrays to be the data for Index (apart from our own), so this is a worry for later if we want to tackle that
- Index object operations (eg the setops) -> same as above
- in some specific cases where we know the parent object (eg we know it is a PeriodIndex, we just want to get the ordinals)
- for algos (eg take, unique) -> either they dispatch to EA (eg take) or for now the fallback is object array anyhow (eg value_counts), and in general this is something we still need to discuss how to handle in general (factorize, unique, ..)
- some Series attributes (shape, itemsize, strides, size) -> shape can also be derived from
_values
directly (likenbytes
) as it is part of the EA interface. And from the others, I thinksize
would be nice to not have to materialize the EA (itemsize and strides I don't care about). Can we let this returnlen(self._values)
, which is part of the interface? (I am not sure to what extent.size
and__len__
on numpy arrays are different / have different performance)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your assessment looks correct. Nothing that's series specific currently uses _ndarray_values
, so it's really just code that is Index-specific or index / series agnostic.
I can change .shape
and .size
to use ._values
instead of instead of ._ndarary
values. I'll just have to override it in DatetimeIndex
to avoid a recursion error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree with @jorisvandenbossche assessment here. maybe if we can ultimatley remove this would be good, and simply dispatch to the array object. If Index is a proper EA then this would be possible.
pandas/core/base.py
Outdated
|
||
@property | ||
def nbytes(self): | ||
""" return the number of bytes in the underlying data """ | ||
return self._values.nbytes | ||
return self._ndarray_values.nbytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should nbytes
be called on self._values
instead? (as we require a nbytes
implementation on the Extension Array, and numpy arrays already have it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this caused issues for CI, but re-running tests now with this change.
pandas/core/indexes/category.py
Outdated
@@ -293,6 +293,15 @@ def values(self): | |||
""" return the underlying data, which is a Categorical """ | |||
return self._data | |||
|
|||
@property | |||
def itemsize(self): | |||
return self.values.itemsize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a comment why you are overwriting this here (default is to use _ndarray_values
, which are codes, but Categorical defines itemsize as the itemsize of the categories).
Also, would it be more consistent to do self._values.itemsize
? (not that it would differ in this case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More consistent with the parent's implementation, yes, but I'd prefer to use the public API where possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, but I'd prefer to use the public API where possible.
But the public api is not meant to be consistent, but to maintain backwards compatibility :-) (at least for now)
pandas/core/indexes/multi.py
Outdated
@@ -1317,7 +1319,7 @@ def from_tuples(cls, tuples, sortorder=None, names=None): | |||
arrays = [[]] * len(names) | |||
elif isinstance(tuples, (np.ndarray, Index)): | |||
if isinstance(tuples, Index): | |||
tuples = tuples._values | |||
tuples = tuples._ndarray_values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why ndarray_values
in this case? Because I think this code path can only work if it is an object index that contains tuples, so should not be different in this case.
Although that maybe raises the question: in such cases, where we know both _values
and _ndarray_values
will return the same ndarray (eg in all code only for numeric indexes), which one of the two do we take? (might be good to be consistent on this)
pandas/io/formats/format.py
Outdated
@@ -1897,7 +1897,7 @@ def _format(x): | |||
|
|||
vals = self.values | |||
if isinstance(vals, Index): | |||
vals = vals._values | |||
vals = vals._ndarray_values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, an extension array will end up here (using the GenericArrayFormatter), so should this rather be np.array(vals._values)
?
(the implementation of _ndarray_values
on an ExtensionArray should not influence its formatting?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like by this point we always have a regular Index
, not an extension array, so it doesn't matter in terms of output. Changing to _values
since that's clearer.
result = arr.unique() | ||
|
||
if isinstance(expected, np.ndarray): | ||
tm.assert_numpy_array_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expected is always a DatetimeIndex in the above parametrization
(pd.DatetimeIndex(['2017-01-01T00:00:00'], tz="US/Eastern"), | ||
np.array(['2017-01-01T05:00:00'], dtype='M8[ns]')), | ||
(pd.TimedeltaIndex([10**10]), np.array([10**10], dtype='m8[ns]')), | ||
pytest.mark.xfail(reason='PeriodArray not implemented')(( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why fail? As I think _ndarray_values
already are the ordinals, and will stay that way? (it is _values
that will change from object array to PeriodArray?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an xfail
since Series[period]._ndarray_values
will return an object ndarray
with teh Periods, while PeriodIndex._ndarray_values
returns the ordinals.
Passing this now would require modifying Series._ndarray_values
, which is currently just self._internal_values
, to special case object dtype to do inference to see if we have periods. I don't think that's worth it since Periods will (hopefully) be a first-class extension dtype by the next release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah missed that the test below tested it for both Series and Index.
It hopefully wouldn't be much work to convert uses of |
Thanks for the review. I implemented your suggestions where I didn't comment in reply, and tests pass locally. |
+ 1, that only just got me thinking. If you create an index from an extension array (directly, or eg by doing |
Yes, and agreed (we hit this indirectly though |
Implemented the changes dicussed in #19558 (comment): I implemented those on master: In [6]: idx = pd.interval_range(0, 100_000)
In [8]: %timeit idx.size
308 ns ± 9.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) pr: In [3]: %timeit idx.size
1.05 µs ± 9.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) |
do 'master' and 'pr' need to be swapped? |
Not sure what happened on that timing I posted earlier. I can't reproduce it. In [1]: import pandas as pd
In [2]: idx = pd.interval_range(0, 100_000)
In [3]: %time idx.shape
CPU times: user 177 ms, sys: 10.1 ms, total: 188 ms
Wall time: 187 ms
Out[3]: (100000,) In [1]: import pandas as pd
In [2]: idx = pd.interval_range(0, 100_000)
In [3]: %time idx.shape
CPU times: user 18 µs, sys: 2 µs, total: 20 µs
Wall time: 22.9 µs
Out[3]: (100000,) |
CI is green again. |
``ndarray`` or ``ExtensionArray``). | ||
|
||
So, for example, ``Series[category]._values`` is a ``Categorical``, while | ||
``Series[category]._ndarray_values`` is the underlying codes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this section does not belong in the same document as "how to subclass", as this part is really internals for contributors to pandas.
So I would maybe split this in two: external vs internal details, where the future documentation on how to use and define ExtensionArrays would also go into the external details part (maybe "Extending Pandas" would be a good title, with then information about the different strategies: accessor, extension array, subclassing, ..)
But that's for a separate PR (I can start with that), so for here the above is fine for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal-internals.rst
and external-internals.rst
;)
I think that "How to subclass", and the eventual "extending pandas with custom array types" would be better in developer.rst, which is labeled as "This section will focus on downstream applications of pandas.".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, I didn't see that the accessor documentation is actually already there (although, I personally don't find the parquet section that fitting in there, as it is not something typical you need to know when extending pandas. I can already start with moving it to the bottom of the file :-))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some small comments. ok to merge. But before adding new things, a refactor of the existing is in order to avoid all of this special casing. When this is out of sight (meaning not in a current PR), it is easy to forget the special casing. Maybe need a TODO detector . but already:
pandas)
bash-3.2$ grep -r TODO pandas --include '*.py' |wc
279 2233 22968
maybe put a special TODO(EA) or something to differentiate these
# type: () -> np.ndarray | ||
"""Internal pandas method for lossy conversion to a NumPy ndarray. | ||
|
||
This method is not part of the pandas interface. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree with @jorisvandenbossche assessment here. maybe if we can ultimatley remove this would be good, and simply dispatch to the array object. If Index is a proper EA then this would be possible.
pandas/core/base.py
Outdated
|
||
@property | ||
def nbytes(self): | ||
""" return the number of bytes in the underlying data """ | ||
return self._values.nbytes | ||
return self.values.nbytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not ._values
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on what we do with PeriodIndex.values
and IntervalIndex.values
.
If we change the public .values
to be an ExtensionArray then there's no difference.
If we leave the public .values
as an ndarray of object, then we prefer ._values
since it doesn't materialize the ndarray.
So far we've put off that decision, but we'll need to make a choice soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to override this in DatetimeIndex (tz) then for now, this is the reason CI is failing (recursion error)
On the original comment: I think it is better for now to use ._values
instead of .values
in such cases. .values
is the public one that might not always be consistent for backwards compatibility, so let's internally use always _values
to have something certainly consistent? (and if we decide .values
to return extension arrays, it is just a public wrapper around ._values
, and if we decide not, we don't have to change anything internally)
pandas/core/indexes/base.py
Outdated
@property | ||
def _values(self): | ||
# type: () -> Union[ExtensionArray, Index] | ||
# TODO: remove index types as they become is extension arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove 'is'
|
||
index | values | _values | _ndarray_values | | ||
----------------- | --------------- | ------------ | --------------- | | ||
PeriodIndex | ndarray[object] | ndarray[obj] | ndarray[int] | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in theory these should return an Index
type for ._values
for PI and II, and it might work, but would be disruptive w/o a Block type to hold it. This is why for example we have the is_period_arraylike
methods to detect an array of Periods. The holder ATM is a ndarray[object].
pandas/core/indexes/base.py
Outdated
@@ -2228,27 +2264,37 @@ def union(self, other): | |||
other = other.astype('O') | |||
return this.union(other) | |||
|
|||
# TODO: setops-refactor, clean all this up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, let's do this next then. these if/then/else are really smelly.
pandas/core/indexes/base.py
Outdated
@@ -2228,27 +2264,37 @@ def union(self, other): | |||
other = other.astype('O') | |||
return this.union(other) | |||
|
|||
# TODO: setops-refactor, clean all this up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if might be worth it to put this in a single method now (not sure if its possible) to isolate this code, maybe
def _get_internal_type(other):
if is_period_dtype(other) ......
...
else:
...
@@ -678,13 +678,36 @@ def _assert_tzawareness_compat(self, other): | |||
raise TypeError('Cannot compare tz-naive and tz-aware ' | |||
'datetime-like objects') | |||
|
|||
@property | |||
def _values(self): | |||
# tz-naive -> ndarray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah we really ought to fix this
# So we extract the tz-naive DatetimeIndex, unique that, and wrap the | ||
# result with out TZ. | ||
if self.tz is not None: | ||
naive = type(self)(self._ndarray_values, copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of this hacking around is not great. The problem is these temporary work-around become more and more. Really need to bite the bullet and fix up the implementations to avoid this.
Agreed entirely about the hackiness, especially in the set-ops. I should be able to do it in parallel with my Interval ExtensionArray PR. |
All green. Merging. |
Great! |
@TomAugspurger this broke the build. what is going on here? |
actually looks like just a network failure ok. But before merge anything else all of the other builds need to work. |
* REF/Clean: Internal / External values * Move to index base * Cleanup unique handling * Simplify object concat * Use values for intersection I think eventually we'll want to ndarray_values for this, but it'll require a bit more work to support. Currently, using ndarary_values causes occasional failures on categorical. * hmm * Additional testing * More tests * ndarray_values * API: Default ExtensionArray.astype (cherry picked from commit 943a915562b72bed147c857de927afa0daf31c1a) (cherry picked from commit fbf0a06) * Simplify concat_as_object * Py2 compat (cherry picked from commit b20e12c) * Set-ops ugliness * better docstrings * tolist * linting * Moved dtypes (cherry picked from commit d136227) * clean * cleanup * NumPy compat * Use base _values for CategoricalIndex * Update dev docs * cleanup * Linting * Precision in tests * Push _ndarray_values to ExtensionArray Now IndexOpsMixin._ndarray_values will dispatch all the way down to the EA. Subclasses like Categorical can override it as they see fit. * Clean up tolist * Move test locations * Fixed test * REF: Update per comments * lint * REF: Use _values for size and shape * PERF: Implement size, shape for IntervalIndex * PERF: Avoid materializing values for PeriodIndex shape, size * Cleanup * Override nbytes
Closes #19548
Just changed Index so far. Need to
Series._ndarray_values