EA: revisit interface #32586
Comments
I went back through #20361, but it isn't very illuminating. My early version used
This has always been nebulous, at least to me. I think it's primarily used in indexing engines, but I may be out of date.
Joris has argued for a while now that it should only accept instances of I think my preference is to restrict this to instances of ExtensionDtype.type. It's somewhat strange for Will comment more on the "additional methods" later, but my opening thought is that we should come up with a guiding principle (e.g. "strive for minimalism" or "drop-in replacement for 1D ndarray") and apply that to cases, rather than deciding case by case. |
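To make the strict-versus-lenient distinction concrete, here is a minimal sketch in plain Python. The function names `strict_from_sequence` and `lenient_from_sequence` are hypothetical and are not pandas APIs; `scalar_type` stands in for `ExtensionDtype.type`, and `None` is assumed to be the only NA value.

```python
from decimal import Decimal


def strict_from_sequence(scalars, scalar_type):
    """Hypothetical strict constructor: only `scalar_type` instances or None pass."""
    out = []
    for i, item in enumerate(scalars):
        if item is None:
            out.append(None)  # assumed NA marker for this sketch
        elif isinstance(item, scalar_type):
            out.append(item)
        else:
            raise TypeError(
                f"element {i} is {type(item).__name__}, "
                f"expected {scalar_type.__name__}"
            )
    return out


def lenient_from_sequence(scalars, scalar_type):
    """Hypothetical lenient constructor: coerce whatever the scalar type can parse."""
    return [None if item is None else scalar_type(item) for item in scalars]


# Strict: already-typed scalars pass through unchanged.
print(strict_from_sequence([Decimal("1.5"), None], Decimal))

# Lenient: strings and ints are parsed into the scalar type.
print(lenient_from_sequence(["1.5", 2], Decimal))
```

The strict variant rejects `["1.5"]` with a `TypeError`, which is the kind of tightening discussed here; the lenient variant shows the general-construction behaviour that would need a home elsewhere.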
By my count we have 6 places left where we use
That is referenced indirectly in the _values_for_factorize docstring, should probably be made explicit. The ExtensionIndex code is making an implicit assumption about values_for_argsort that should probably be made explicit: the ordering should be preserved not just within Unless we can find a compelling reason to keep both _values_for_argsort and _values_for_factorize, I'd advocate dropping one of them.
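As a rough illustration of what an order-preserving proxy means within a single array, the sketch below uses a toy categorical-like layout in NumPy. The function `values_for_argsort` here is a stand-in written for this example, not pandas' implementation; the category names and codes are made up.

```python
import numpy as np

# Toy categorical-like data: the category order is custom, not alphabetical.
categories = np.array(["low", "mid", "high"])
codes = np.array([2, 0, 1, 0])  # high, low, mid, low


def values_for_argsort(codes):
    """Stand-in proxy: the integer codes already sort like the labels should."""
    return codes


# Sorting by the proxy must give the same order as sorting the real values.
order = np.argsort(values_for_argsort(codes), kind="stable")
sorted_labels = categories[codes[order]]
print(order.tolist())          # [1, 3, 2, 0]
print(sorted_labels.tolist())  # ['low', 'low', 'mid', 'high']
```

Sorting the labels themselves as strings would put "high" first, which is why the array has to supply the proxy.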
I'm leaning more and more in that direction, just found another problem caused by being too lenient, xref #32668. I expect DTA/TDA/PA are the worst offenders, will experiment with tightening those up and see if it breaks the world. As a heuristic, maybe something like "
+1 on "drop-in replacement for 1D ndarray"; my understanding is that the E in EA is about "extending np.ndarray" A couple other potential guiding principles (not mutually exclusive):
|
@jbrockmendel thanks for gathering all those
Some additional, assorted thoughts about this aspect:
It's certainly good to think about if we can combine the two.
Yes, I think that in many places where we now use So one option might be to keep Another option might be to let this more general conversion be handled by the new astype machinery (#22384), since you could see a general conversion from list-like as in Aside, we should probably have defined methods like
Personally, I think we should adopt the rule of It might be we need something similar like
I am not that optimistic we can actually fully get rid of it without some replacement (there are cases where we need an ndarray). Although in some cases For the "bigger" topics discussed here (like the factorize/argsort values), it might be worth starting separate discussions? |
It looks like we have a special |
Or putting this the other way around: could |
|
cc @crepererum who works on https://github.com/JDASoftwareGroup/rle-array |
Is it possible for you to define |
Can you clarify how that would be problematic? |
Yes and no, we could though specify it currently as the union of
Currently for all results of an operation on a |
And I also don't think this should be the goal / point of
That's not where |
To put it in another way: generally speaking, |
Yes |
So for the "
Are there preferences? (or other options?) From a backwards compatibility point of view, I think both are similar (in both cases you need to update a method ( The second option of course requires an update to the astype machinery (#22384), which doesn't exist today, but on the other hand is also something we need to do at some point eventually. |
I think my preference is for |
I would rather avoid adding another constructor (I'm already not wild about IIRC a big part of the reason why _from_sequence became less strict than originally intended is that DTA/TDA/PA _from_sequence took on a bunch of cases that the DTI/TDI/PI constructors handle, in particular |
We still need a "non-strict" version for general construction (so for the same reason that our internal datetime-like ones were not strict). For example, |
Maybe something like |
Well, see my option number 2 above (#32586 (comment)), which proposes to defer this to the
I don't think this is necessarily required. The values for factorize can be completely internal to the array (not to be used by users). |
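A toy sketch of the factorize round trip may help show why the factorize values can stay internal. `ToyCentsArray` and the free function `factorize` are hypothetical and written only for this example; the method names mirror the EA hooks, but the class does not subclass ExtensionArray, and missing values are ignored entirely.

```python
import numpy as np


class ToyCentsArray:
    """Hypothetical array storing money amounts as int64 cents."""

    def __init__(self, cents):
        self._cents = np.asarray(cents, dtype=np.int64)

    def _values_for_factorize(self):
        # Internal representation plus an NA value (unused: no NAs in this toy).
        return self._cents, None

    @classmethod
    def _from_factorized(cls, values, original):
        # Rebuild an array of the same type from the unique internal values.
        return cls(values)


def factorize(arr):
    """Order-of-appearance factorization over the internal values (sketch)."""
    values, _na_value = arr._values_for_factorize()
    seen = {}
    codes = np.empty(len(values), dtype=np.intp)
    for i, v in enumerate(values.tolist()):
        codes[i] = seen.setdefault(v, len(seen))
    uniques = type(arr)._from_factorized(
        np.array(list(seen), dtype=values.dtype), arr
    )
    return codes, uniques


codes, uniques = factorize(ToyCentsArray([250, 100, 250, 999]))
print(codes.tolist())           # [0, 1, 0, 2]
print(uniques._cents.tolist())  # [250, 100, 999]
```

The caller only ever sees the codes and a rebuilt array of the original type; the int64 values never leave the pair of hooks.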
Regarding the " pandas/pandas/core/arrays/base.py Lines 740 to 747 in c47e9ca
Of course Fletcher is an example of external EAs that implement |
This is a fair point, especially in light of the fletcher example. But then why do we need _values_for_factorize and _from_factorize at all? We can require
If you're referring to ExtensionIndex._get_engine_target, that uses _values_for_argsort |
Yes, that's also possible, and I think we discussed that at some point, but we opted for also having the
No, I am referring to where pandas/pandas/core/reshape/merge.py Lines 1905 to 1907 in 9130da9
Now, I think for |
So we should think about whether we can always do with |
Got it, thanks. IIRC the first draft of the PR that did that used
I think Using _values_for_argsort (_vfa) for _get_engine_target (_get) makes some implicit assumptions about _vfa that we should consider making explicit:
|
Sorry for joining this late, but here are some comments from the perspective of rle-array, which is a very simple example of how EAs could be used for transparent and efficient compression. Some of it is more a question or a wish, so please take it with a grain of salt. Also, thanks for the whole initiative :)

Types
^^^^^
It would be nice if we could indeed be a little bit more concrete in many places and/or provide some more helpers in the layer around EAs. For example:

```python
import numpy as np
from pandas.api.extensions import ExtensionArray

# `ea` is some ExtensionArray
assert isinstance(ea, ExtensionArray)
# type conversion should be handled by EA?
ea[np.array([True, False], dtype=object)]
# what about NAs?
ea[np.array([True, False, None], dtype=object)]
# and different types of ints
ea[np.array([0, 1], dtype=np.int64)]
ea[np.array([0, 1], dtype=np.uint64)]
ea[np.array([0, 1], dtype=np.int8)]
# should floats be cast if they are exact integers?
ea[np.array([0, 1], dtype=np.float64)]
# what about the massive amount of slices?
ea[:]
ea[2:15:2]
ea[-15:-1:-2]
ea[::-1]
```

Same goes for many of the other

What does E in EA mean?
^^^^^^^^^^^^^^^^^^^^^^^
For example,

```python
# `ea` is some ExtensionArray
assert isinstance(ea, ExtensionArray)
mask = ea < 20
assert isinstance(mask, ExtensionArray)
df[mask] = ...
```

I sometimes have the impression that the builtin EAs are somewhat special (e.g.

Sane defaults but powerful overwrites
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I second what @xhochy already said: It would be nice if the base EA class provided many methods that are based on a very small interface that the subclass must implement, but that could at the same time be overwritten. That said, it is important that the test suite keeps testing the actual downstream EA and not the example/default implementations in the base class.

Object Casting
^^^^^^^^^^^^^^
It used to be the case that the EA data was silently cast to object-ndarrays by some DF methods. Not sure if this is still the case, but for the RLE use case this means that users suddenly might run out of memory. In general, it would be nice if EAs could control the casting a little more strictly. rle-array issues a performance warning every time this happens.

EAs first vs second class
^^^^^^^^^^^^^^^^^^^^^^^^^
At least pre-1.0 pandas had bad groupby-aggregate performance when it came to exotic EAs, because it tried 2 different fast paths before falling back to the actual Python "slow path" (which for RLE can be quite fast). It would be great if EAs could avoid this trial-and-error and directly choose the correct code path.

Memory Shuffles
^^^^^^^^^^^^^^^
This is very special to RLE, but groupby operations in pandas use unstable sorting (even elements within the same group are re-shuffled), which works great for plain-old-numpy-arrays but is very painful for non-array-based EAs.

NumPy Interop
^^^^^^^^^^^^^
Would be great to have a test suite like this one and some help from the EA baseclass.

Special Methods (
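The "Memory Shuffles" point above is about sort stability. The NumPy-only sketch below shows what a stable sort guarantees for group keys; it does not reproduce pandas' groupby internals, and the helper `within_group_order_preserved` is a hypothetical name used only here.

```python
import numpy as np

group_keys = np.array([1, 0, 1, 0, 1])

# A stable argsort keeps rows of the same group in their original order.
# NumPy's default kind ("quicksort") does not promise this.
stable_order = np.argsort(group_keys, kind="stable")
print(stable_order.tolist())  # [1, 3, 0, 2, 4]


def within_group_order_preserved(keys, order):
    """Check that positions are increasing inside every group of sorted keys."""
    sorted_keys = keys[order]
    same_group = sorted_keys[1:] == sorted_keys[:-1]
    increasing = order[1:] > order[:-1]
    return bool(np.all(increasing[same_group]))


print(within_group_order_preserved(group_keys, stable_order))  # True
```

For a run-length encoded array, keeping the original order inside each group means existing runs are not broken up more than the grouping itself requires.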
|
One more suggestion for these: The minimal logic/methods such that we don't need |
To split off another topic from this issue, I opened a dedicated issue for the |
re
ATM DTA and PA raise on anything but case 1. Categorical allows any of these. IntervalArray requires matching I expect long-term we'll want EA to have some ability to determine how it concats with other types, which will require some form of delegation like what Tom described. More immediately I think we should clarify whether _concat_same_type is expected to handle case 2 or just case 1. |
For reference, regarding the concatenation (concat, append, etc) of ExtensionArrays, there is a new proposal -> #33607 |
This came up in #36400: I propose adding default implementations of |
Regarding |
That would be convenient, but the putmask allows for |
But still the correct length for
|
No. AFAICT it's unrestricted. numpy will either truncate or repeat |
OK, but do we have any use case for that internally to have the |
We have Index.putmask and Block.putmask both of which expect np.putmask-like mechanics |
Can you give an example where this specific numpy behaviour of truncating/repeating array-like values is used? Doing a search for putmask, I get the following usages. BlockManager.putmask is used in:
Block.putmask is used in (apart from BlockManager.putmask):
Index.putmask is used in:
Index.putmask is itself also a public method, but one we hardly use ourselves (and I have personally never used it in practice or seen someone use it in analysis code). It's also only available on Index (e.g. not on Series). I suppose it tries to mimic numpy, but it also deviates in several ways (it's a method, while a numpy array doesn't have such a method; it returns a copy instead of working inplace; it upcasts instead of raising). I think these could also be arguments for actually deprecating it as a public Index method (which is independent of the question of adding it to EA, of course). |
So
which is basically equivalent (in case the other values have the same length as the array) to: And I suppose it is this behaviour of putmask that might be used in eg update or where or replace. Now, even with that, does that make putmask basically:
If it's basically the above, then I am not sure it warrants a method in the EA interface? (would there be a reason that an EA author would like to override this?) |
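For reference, the `np.putmask` mechanics discussed in the last few comments can be checked with plain NumPy. This only illustrates NumPy's own behaviour; it says nothing about how Index.putmask or Block.putmask are implemented.

```python
import numpy as np

# Values shorter than the array are repeated to the array's length,
# then values[n] is written wherever mask[n] is True.
a = np.arange(5)
np.putmask(a, a > 1, [-33, -44])
print(a.tolist())  # [0, 1, -33, -44, -33]

# With values of the same length as the array, putmask matches
# boolean setitem using the masked values.
mask = np.array([True, False, True, False, True])
values = np.array([10, 20, 30, 40, 50])
b = np.arange(5)
c = b.copy()
np.putmask(b, mask, values)
c[mask] = values[mask]
print(np.array_equal(b, c))  # True

# Values longer than the array: the extra entries are never used.
d = np.arange(3)
np.putmask(d, np.array([True, False, True]), [10, 20, 30, 40, 50])
print(d.tolist())  # [10, 1, 30]
```

The second case is the one that reduces to ordinary masked assignment, which is the basis for asking whether a dedicated EA method is needed.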
As mentioned on the call today, there are a handful of methods missing from EA if we want to stop special-casing pandas-internal EAs.
|
I'd like to move forward on de-special-casing our internal EAs, particularly Any thoughts/objections? |
+1 on adding methods where appropriate and replace makes a lot of sense |
|
I'm actually finding that if we get a correct |
I didn't fully understand it on the call, but so my question was: doesn't (I am still wondering if a "putmask" can ever be as efficient as the current Categorical-specific "replace", though) |
I have a branch that implements what I was describing, will make a draft PR shortly for exposition.
It does, but the implementation via
Sort of. The existing * I'm assuming away intricacies of how |
AFAICT the only thing discussed here that is really up in the air is the strictness of _from_sequence, for which the discussion has moved to #33254. Closing. |
This is as good a time as any to revisit the "experimental" EA interface.
My read of the Issues and recollection of threads suggests there are three main groups of topics:
Clarification of the Interface
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
What characteristics should _ndarray_values have? Is it needed? (DEPR: _ndarray_values vs to_numpy vs __array__ #32412) _ndarray_values has been removed
4.5) Require that __iter__ return native types? API: ExtensionArrays and conversion to "native" types (eg in tolist, to_dict, iteration, ..) #29738

Ndarray Compat
^^^^^^^^^^^^^^
5) Headaches have been caused by trivial ndarray methods not being on EA
- #31199 size
- #32342 "T" (just the most recent; this has come up a lot)
- #24583 ravel
6) For arithmetic we're going to need something like either tile or broadcast_to
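To illustrate items 5 and 6, here is a small sketch. `Toy1DArray` is a hypothetical stand-in for a 1D-only container, not an ExtensionArray subclass; the second half uses NumPy's real `np.broadcast_to` and `np.tile` on a plain ndarray to show how the two differ.

```python
import numpy as np


class Toy1DArray:
    """Hypothetical 1D container; the ndarray-compat members are trivial in 1D."""

    def __init__(self, values):
        data = np.asarray(values)
        if data.ndim != 1:
            raise ValueError("Toy1DArray only supports 1D data")
        self._data = data

    def __len__(self):
        return len(self._data)

    @property
    def size(self):
        # For 1D data, size is just the length.
        return len(self)

    @property
    def T(self):
        # Transposing a 1D array is a no-op.
        return self

    def ravel(self):
        # Already flat.
        return self


arr = Toy1DArray([1, 2, 3])
print(arr.size, arr.T is arr, arr.ravel() is arr)

# The two NumPy options for stretching a row across a 2D shape:
row = np.array([1, 2, 3])
view = np.broadcast_to(row, (2, 3))  # read-only view, no data copied
tiled = np.tile(row, (2, 1))         # new array with the data repeated
print(np.array_equal(view, tiled), view.flags.writeable, tiled.flags.writeable)
```

The difference that matters for an EA is whether the repeated data has to be materialised: `tile` copies, while `broadcast_to` returns a read-only view.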
Methods Needed/Wanted For Index/Series/DataFrame/Block
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7) Suggested Methods (partial list)
- #27264 duplicated
- [x] #23437 _empty
- #28955 apply
- #23179 map
- #22680 hasnas
- [x] #27081 equals
- [x] #24144 _where
- [x] _putmask would be helpful for ExtensionIndex
I suggest we discuss these in order. Before jumping in, is there anything vital missing from this list? (this is only a small subset of the issues on the tracker)
cc @pandas-dev/pandas-core @xhochy