ENH: Support operators for ExtensionArray #20889
Codecov Report
@@ Coverage Diff @@
## master #20889 +/- ##
=========================================
Coverage ? 91.78%
=========================================
Files ? 153
Lines ? 49423
Branches ? 0
=========================================
Hits ? 45362
Misses ? 4061
Partials ? 0
Continue to review full report at Codecov.
|
just an FYI, rather than a single PR, this needs to be a series of smaller, self-consistent PRs. much easier to review and revise, otherwise this will get bogged down. |
@jreback Right now, I had to include some code from an early version of #20885. Once that is approved, that will shorten this one a little bit. I can then first submit something that deals with #20825 . And once that is approved, I can submit the things related to ops. Let me know if this plan works for you. |
@Dr-Irv There are a bunch of planned/discussed steps between the status quo and the functionality in this PR. If you're up to help with the whole transition that'd be great. If you're mostly interested in getting this specific feature in place ASAP, we can talk about workarounds that you can apply in the interim. @jreback LMK if I'm overstepping by offering to shepherd this process. It's up my alley. |
@jbrockmendel I'd like both. I need the operator functionality for |
I'd appreciate your expertise here @jbrockmendel. FWIW, I'll have some availability this week, but very little next week. |
The workaround I have in mind assumes you are mostly interested in patching Series arithmetic and logical operators. Is that accurate?
@Dr-Irv sounds good. The first thing I'd suggest is figuring out if there is anything in this PR that is not strictly necessary for the feature(s) you need or can otherwise be separated into independent PRs. Based on a very quick glance it looks like the tests for Are the changes to boolean ops logically independent of the changes to arithmetic ops? If so, I suggest focusing on the boolean ops first, since they aren't nearly so wound up with Index vs Series shenanigans. There might be something in #19795 worth resurrecting. |
Yes, that is correct. I also need the results of arithmetic and logical operators to be Series of objects. This PR has 3 things in it.
So, if you want, I can create a new PR that only works on |
FYI I have this almost all implemented for integer na support: jreback@6fc19f9 (this is a bit stale), will be updated soon. |
I think it is still worthwhile to push forward this PR to implement this separately from the integer-array PR (which might take more discussion, and is also a very big one right now). @jreback can you give more concrete feedback on the way @Dr-Irv implemented things here, based on what you did in the other PR? (for the changes in
@jbrockmendel can you elaborate on this a bit? (regarding this PR, what would you do instead) |
However, if the underlying ExtensionDtype overrides the logical
operators, then the implementer may want to have an ExtensionArray
subclass contain the result. This can be done by changing the property
_logical_result from its default value of None to the _from_sequence
Why is it needed to have this property? Can't we simply detect whether the result is a boolean numpy array or again an ExtensionArray ?
Can you explain this use-case a bit more? I think we will certainly want Series <compare> Series to always be an ndarray of booleans.
I can't speak for the author, but my assumption was that this has to do with some of the spaghetti-code in ops._bool_method_SERIES, where sometimes a bool-dtype is returned and other times an int-dtype is returned (and datetimelike are currently all broken, see #19972, #19759). Straightening out this mess independently of EA implementations is part of the plan referred to above.
Actually, in my use case, I need the boolean operators to return an object that represents the relation. I'm using pandas on top of 2 different libraries (that functionally are the same) where the operators (x <= y), (x >= y) and (x == y) are not booleans, but objects representing the relations.
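To make that use case concrete, here is a minimal sketch, assuming hypothetical `Var` and `Relation` classes that stand in for the scalar and relation types of such a library (neither name comes from the PR or from pandas). It only shows that an element-by-element comparison naturally yields objects rather than booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical stand-in for a library's relation object."""

    lhs: str
    op: str
    rhs: str


class Var:
    """Hypothetical scalar whose comparisons build Relation objects."""

    def __init__(self, name):
        self.name = name

    def __le__(self, other):
        return Relation(self.name, "<=", other.name)

    def __ge__(self, other):
        return Relation(self.name, ">=", other.name)

    def __eq__(self, other):
        return Relation(self.name, "==", other.name)

    # __eq__ no longer returns a bool, so make instances explicitly unhashable.
    __hash__ = None


xs = [Var("x1"), Var("x2")]
ys = [Var("y1"), Var("y2")]

# Element-by-element comparison yields Relation objects, not booleans.
relations = [x <= y for x, y in zip(xs, ys)]
```

With scalars like these, coercing the result to an ndarray of booleans would discard the relation, which is why an object-valued result is needed here.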
# ------------------------------------------------------------------------
# Utilities for use by subclasses
# ------------------------------------------------------------------------
def is_sequence_of_dtype(self, seq):
For what is this needed?
Indeed, this is expected to always be true. If it isn't I'd recommend making a superclass that has all the scalar types like I do in https://github.com/ContinuumIO/cyberpandas/blob/c66bbecaf5193bd284a0fddfde65395d119aad41/cyberpandas/ip_array.py#L22
Let's suppose that you don't implement the ExtensionArray operators as methods of the subclass of ExtensionArray, but you let the underlying ExtensionDtype handle the operators for you. (This is what I used for Decimal). Some of the operators will return a sequence containing all objects of ExtensionDtype. Some operators (e.g., logical ones), will not. So internally, it's useful to have a test to know whether a sequence has objects of the corresponding ExtensionDtype so that you can then return an ExtensionArray as a result, otherwise, you just let things get coerced based on the type in the sequence.
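A rough sketch of that kind of check, using `decimal.Decimal` as the scalar type. The helper name `is_sequence_of_type` and its signature are assumptions for illustration; this is not the `is_sequence_of_dtype` method from the PR.

```python
from decimal import Decimal


def is_sequence_of_type(seq, scalar_type):
    """Return True if every element of ``seq`` is an instance of ``scalar_type``.

    Hypothetical helper; the name and signature are illustrative only.
    """
    return all(isinstance(x, scalar_type) for x in seq)


left = [Decimal("1.5"), Decimal("2.5")]
right = [Decimal("1.0"), Decimal("3.0")]

# Arithmetic on Decimal scalars produces Decimal scalars.
sums = [a + b for a, b in zip(left, right)]

# Comparisons on Decimal scalars produce plain booleans.
comparisons = [a < b for a, b in zip(left, right)]

# Only the arithmetic result could be wrapped back into the extension array.
wrap_sums = is_sequence_of_type(sums, Decimal)
wrap_comparisons = is_sequence_of_type(comparisons, Decimal)
```

Here `wrap_sums` is true and `wrap_comparisons` is false, so the first result could be handed to the array constructor while the second would be left to ordinary coercion.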
@@ -990,6 +991,93 @@ def _construct_divmod_result(left, result, index, name, dtype):
)


def dispatch_to_extension_op(left, right, op_name=None, is_logical=False):
the dispatch_to_index_op uses op instead of op_name. Is there a reason for this difference? (and I mean in the actual implementation here it assumes a method name, and not an operator function that can be called)
The reason for the difference is as follows. The way I implemented this, we first look to see if the operator is defined for the ExtensionArray subclass. If not, then we use the implementation of the operator on the underlying ExtensionDtype. So if you pass op, you get the operator bound to a specific class. If you have op_name, then we can translate to either the ExtensionArray subclass implementation or the ExtensionDtype implementation.
If the op is not defined on ExtensionArray, calling it directly (op(left_values, right_values)) will raise a TypeError that you can catch (which you already do), so I don't really see the difference
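The "call op and catch TypeError" variant could look roughly like the sketch below. `PlainArray`, `AddingArray` and `dispatch` are hypothetical names for illustration only; they are not pandas classes and not the code in this PR.

```python
import operator
from decimal import Decimal


class PlainArray:
    """Hypothetical container that defines no operators of its own."""

    def __init__(self, values):
        self.values = list(values)

    def __iter__(self):
        return iter(self.values)


class AddingArray(PlainArray):
    """Hypothetical container that implements ``+`` itself."""

    def __add__(self, other):
        return AddingArray(a + b for a, b in zip(self, other))


def dispatch(op, left, right):
    """Try the array-level operator first, then go element by element."""
    try:
        return op(left, right)
    except TypeError:
        # Neither operand defines the operator at the array level,
        # so fall back to the operator defined on the scalars.
        return [op(a, b) for a, b in zip(left, right)]


a = PlainArray([Decimal("1"), Decimal("2")])
b = PlainArray([Decimal("3"), Decimal("4")])

fallback = dispatch(operator.add, a, b)                          # list of Decimal
direct = dispatch(operator.add, AddingArray(a), AddingArray(b))  # AddingArray
```

Two caveats with this approach: a TypeError raised inside an array-level implementation is indistinguishable from "operator not defined", and `==` and `!=` never raise for plain Python objects (they fall back to identity comparison), so they would need separate handling.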
except TypeError:
    pass
except Exception as e:
    raise e
Why not do op(left.values, right/right.values)? What does this manual checking/trying do that the former does not?
See above. In the code above, I'm testing to see if the ExtensionArray subclass has the operator defined. op and method are the same, as method is computed as the operator on left.values. I could change the name of the variable from method to op to make this clearer.
deflen = len(left)
excons = type(left.values)._from_sequence
exclass = type(left.values)
testseq = left.values
I'm having trouble understanding these names. (method makes sense).
excons is the constructor used to construct a result from a sequence that has all its elements of type ExtensionDtype. (so "ex" for "Extension" and "cons" for "constructor")
exclass is the underlying class of the ExtensionArray subclass.
ovalues = [parm] * deflen
return ovalues

if res is NotImplemented:
Could you explain this fallback a bit more?
If the EA doesn't define ops, then I'm perfectly fine with raising NotImplementedError at the end.
See above. The idea here is that either the EA defines the ops, or the ExtensionDtype defines the ops.
The idea here is that if you know that your underlying ExtensionDtype already has the ops defined, you don't have to implement each of the ops at the ExtensionArray level.
I used DecimalArray as an example. The operators are already defined for Decimal so there is no reason to have to implement the operators for an array of Decimal.
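As a sketch of that idea, the scalar method can be looked up by name and applied pairwise, with an explicit check for `NotImplemented`. The helper `elementwise_by_name` is a hypothetical name and is not the PR's `dispatch_to_extension_op`; it only assumes that the scalars (here `decimal.Decimal`) already define the operator.

```python
from decimal import Decimal


def elementwise_by_name(op_name, left, right):
    """Apply the scalar method named ``op_name`` to each pair of elements.

    Hypothetical helper for illustration; not the code from the PR.
    """
    results = []
    for a, b in zip(left, right):
        res = getattr(a, op_name)(b)
        if res is NotImplemented:
            # The scalar type does not know how to handle this operand.
            raise TypeError(
                "{} not supported between {} and {}".format(
                    op_name, type(a).__name__, type(b).__name__
                )
            )
        results.append(res)
    return results


left = [Decimal("1.1"), Decimal("2.0")]
right = [Decimal("2.2"), Decimal("1.0")]

added = elementwise_by_name("__add__", left, right)     # Decimal results
compared = elementwise_by_name("__lt__", left, right)   # bool results
```

Unlike calling the operator function directly, a lookup by method name skips Python's reflected-operand handling (`__radd__` and friends), which is one trade-off of dispatching on a name.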
|
@jbrockmendel but is it needed to fix all of this (make sure all of Series ops are dispatched correctly etc, resolve duplication between index and series, ...) before getting basic ops working for ExtensionArrays? |
echo the above comments, this has a lot of orthogonal changes going on. let's strip out things to the minimal changes. |
In response to @jbrockmendel:
I really disagree with this. If you already have the operators defined for the |
@jreback Since you wanted something smaller, I have split off the code that fixes #20825 into #21183 . This is to handle the issue with Note that the |
Why do the tests need the combine fix? |
@jorisvandenbossche To test the operator functionality, I am leveraging the code that tests operators for |
Implementing it on the DType is fine, I guess. It should be really easy to go from there to implementing it on the EA subclass. Once that is done, the changes to In most cases returning |
@jbrockmendel You wrote:
Are you suggesting that for someone implementing their own EA subclass, it should be really easy for them to implement the operators on the EA subclass? Or are you suggesting that pandas provide default operators for The changes in this PR that are in I can investigate whether |
To try to take a step back from the discussion:
Given the above, I think we can try to harmonize the goal of making this fallback both easy and explicit (1 and 3), and keep the ops defined on the ExtensionArray and |
Regarding the elementwise fallback to

    # in MyEASubclass
    def __lt__(self, other):
        return np.array([a < b for a, b in zip(self, other)])

likewise for the other methods.
For cyberpandas, the scalar fallback would be incorrect. We use IPAddress(0) as the NA value, so comparisons with it should be NA to match pandas behavior.

    In [3]: arr = IPArray([0, 1, 2, 3])
    In [4]: arr <= arr
    Out[4]: array([False, True, True, True])
    In [5]: arr.astype(object) <= arr.astype(object)
    Out[5]: array([ True, True, True, True])

A mixin seems like a reasonable compromise. |
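The difference between the two outputs in that session can be sketched in plain Python. This is a minimal illustration, assuming an integer `0` as a stand-in for `IPAddress(0)`; `na_aware_compare` is a hypothetical helper, not cyberpandas code.

```python
import operator

NA = 0  # hypothetical sentinel, standing in for IPAddress(0)


def na_aware_compare(op, left, right, na=NA):
    """Compare pairwise, returning False wherever either operand is NA."""
    return [
        False if (a == na or b == na) else op(a, b)
        for a, b in zip(left, right)
    ]


values = [0, 1, 2, 3]

# NA-aware comparison: the NA position compares as False.
masked = na_aware_compare(operator.le, values, values)

# Naive scalar fallback: the NA position is compared like any other value.
naive = [a <= b for a, b in zip(values, values)]
```

The `masked` result mirrors `Out[4]` and the `naive` result mirrors `Out[5]`, which is why a scalar fallback should be opt-in rather than a silent default.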
It's indeed relatively straightforward, although a bit more complicated than your small example if you want to cover all cases (scalar vs array |
@TomAugspurger You wrote:
Yes, but having to do it for all of the operators is painful. What I implemented avoids that. Take the example of I like the idea of using a mixin as proposed by @jorisvandenbossche. However, it seems that I ought to wait until @jreback finishes #21191 and #21160 and it is merged in, and then I can see where things are and make modifications as needed for my use case. |
It should be possible to write a function that generates all the different operators (which is how we often do it inside pandas). But we can use this approach for the mixin.
I don't think you need to wait with eg writing the mixin with fallback (as that use case will not be covered in #21191). |
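A rough sketch of what such a generated-operator mixin might look like. `ScalarOpsMixin`, `_make_op` and `DecimalList` are hypothetical names for illustration; this is not the pandas implementation, and a real version would wrap results in the appropriate array type instead of returning a list.

```python
import operator
from decimal import Decimal


def _make_op(op, name):
    """Build a dunder method that applies ``op`` element by element."""

    def method(self, other):
        if isinstance(other, ScalarOpsMixin):
            if len(other) != len(self):
                raise ValueError("Lengths must match")
            pairs = zip(self, other)
        else:
            # Treat anything else as a scalar and broadcast it.
            pairs = ((a, other) for a in self)
        return [op(a, b) for a, b in pairs]

    method.__name__ = "__{}__".format(name)
    return method


class ScalarOpsMixin:
    """Hypothetical opt-in mixin: operators fall back to the scalars."""


# Generate the operator methods in a loop instead of writing each by hand.
for _name in ("add", "sub", "mul", "lt", "le", "gt", "ge"):
    setattr(
        ScalarOpsMixin,
        "__{}__".format(_name),
        _make_op(getattr(operator, _name), _name),
    )


class DecimalList(ScalarOpsMixin):
    """Hypothetical array-like that opts in to the scalar fallback."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __iter__(self):
        return iter(self._values)


x = DecimalList([Decimal("1"), Decimal("2")])
y = DecimalList([Decimal("3"), Decimal("4")])

total = x + y                  # array with array
below_two = x < Decimal("2")   # array with scalar
```

Because the fallback lives in a mixin, an array type such as IPArray that needs NA-aware semantics can simply not inherit from it and define its own operators.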
Yes, this.
The element-by-element implementation in the PR is effectively the object-dtype behavior, and I'd be fine with that being a default at the BaseArray level (@TomAugspurger thoughts? It would also be reasonable for BaseArray to implement these ops to raise
I'll take a closer look at #21883, thanks. BTW, thanks for pushing forward on this. You've clearly put some thought+effort into this and it is very much appreciated.
Not at all, the steps I mentioned were very much for the "what would you do instead" question. I do think that implementing ops at the array level will minimize duplicated effort in
+1. |
My plan right now is to look at the mixin approach on May 29 or later. When I created this PR, the whole point was to inspire discussion on the design and implementation, so I think this discussion has been healthy in terms of us moving forward. |
Closing this, as there is a new proposed implementation in #21261 . |
__add__ method for json and test it, make sure others fail
test_combine method
git diff upstream/master -u -- "*.py" | flake8 --diff
Some Notes
Series with bool, they can do that. I'm testing it with my library, but once we agree on this basic implementation, I will add tests for that specific use case.
ExtensionArray class has implemented the operator. If not, then you just go element by element.