API: any/all in context of boolean dtype with missing values #29686
Comments
Agreed that the For |
So if we follow the NA behaviour for logical operations as discussed in #28778 (and implemented for pd.NA in #29597), this will mostly result in NA, but sometimes can also result in True, eg for:
since there is already one |
Should that example have a I guess for |
Yes, updated
Yes, that would be consistent with the logical op behaviour ( |
Thanks. So, just to make sure, can you verify this table, and add it to the original post if it's correct (and eventually the docs)?
|
Thanks for that overview! And yes, that seems correct based on my understanding. This also seems to be what R is doing: https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/any (except that their default "skipna" is the opposite: |
OK. I'm happy to deviate from R in the default skipna. |
And that's the case for all our reductions anyway |
Sorry trying to read through all of the related items but not seeing it - what did we ultimately decide for comparison operators? IMO |
The logical operations (and, or) are being discussed in #28778 (and implemented for pd.NA in #29597). |
Gotcha thanks! So yea I think I agree with Tom's table then - any / all should follow the logic rules of OR / AND respectively across all elements |
In the new missing values support, and especially while implementing the BooleanArray (#29555), the question comes up: what should any and all do in presence of missing values?
edit from Tom: Here's a proposed table of behavior
all([True, NA], skipna=False)
all([False, NA], skipna=False)
all([NA], skipna=False)
all([], skipna=False)
any([True, NA], skipna=False)
any([False, NA], skipna=False)
any([NA], skipna=False)
any([], skipna=False)
all([True, NA], skipna=True)
all([False, NA], skipna=True)
all([NA], skipna=True)
all([], skipna=True)
any([True, NA], skipna=True)
any([False, NA], skipna=True)
any([NA], skipna=True)
any([], skipna=True)
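A minimal sketch of how the cases in this table would evaluate if any / all follow the OR / AND rules for NA, as suggested in the comments. This is plain Python, not the pandas implementation; the NA sentinel and the helper names kleene_any / kleene_all are hypothetical and only for illustration.

```python
from functools import reduce

NA = object()  # hypothetical stand-in for pd.NA (an "unknown" value)


def kleene_or(a, b):
    # True wins over everything; otherwise NA is contagious.
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return False


def kleene_and(a, b):
    # False wins over everything; otherwise NA is contagious.
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return True


def kleene_any(values, skipna=True):
    if skipna:
        values = [v for v in values if v is not NA]
    # False is the identity for OR, so an empty input gives False.
    return reduce(kleene_or, values, False)


def kleene_all(values, skipna=True):
    if skipna:
        values = [v for v in values if v is not NA]
    # True is the identity for AND, so an empty input gives True.
    return reduce(kleene_and, values, True)


def show(x):
    return "NA" if x is NA else str(x)


# Print one row per case in the table, for both skipna settings.
for skipna in (False, True):
    for func in (kleene_all, kleene_any):
        for case in ([True, NA], [False, NA], [NA], []):
            label = [show(v) for v in case]
            result = show(func(case, skipna=skipna))
            print(func.__name__, label, f"skipna={skipna}", "->", result)
```

Under these assumptions a non-boolean result (NA) only appears with skipna=False, and only when no element already decides the outcome (a True for any, a False for all).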
Some context:
Currently, if you have bools with NaNs, you end up with an object dtype, and the behaviour of any/all with object dtype has all kinds of corner cases. @xhochy recently opened #27709 for this (but I am opening a new issue since I want to focus here on the behaviour in boolean dtype; the behaviour in object dtype might still deviate).
The documentation of any says (https://dev.pandas.io/docs/reference/api/pandas.Series.any.html)
and similar for all (https://dev.pandas.io/docs/reference/api/pandas.Series.all.html).
Default behaviour with skipna=True
In case of some NA's and some True/False values, I think the behaviour is clear: any/all are reductions, and in pandas we use skipna=True for reductions.
So you get something like this:
(I am still using np.nan here as missing value, since the pd.NA PR is not yet merged / combined with the BooleanArray PR; but let's focus on the return value)
(although when interpreting NA as "unknown", it might look a bit strange to return True in the last case since the NA might still be True or False)
Behaviour for all-NA in case of skipna=True
This is a case that is described in the current docs: "If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column", and is indeed consistent with skipping all NAs -> any/all of empty set.
And then, we follow numpy's behaviour (False for any, True for all):
(although I don't find this necessarily very intuitive, this seems more a consequence of the algorithm starting with a base "identity" value of False/True for any/all)
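The "identity" point can be seen with Python's built-in any / all, which give the same answers on empty input. A minimal sketch in plain Python (not pandas or numpy code); fold_any and fold_all are hypothetical names used only for illustration.

```python
from functools import reduce
import operator


def fold_any(values):
    # Start from False (the identity of OR) and OR in each element.
    return reduce(operator.or_, values, False)


def fold_all(values):
    # Start from True (the identity of AND) and AND in each element.
    return reduce(operator.and_, values, True)


# With nothing to fold in, the starting value is returned unchanged.
print(fold_any([]), fold_all([]))
# The built-ins agree on empty input.
print(any([]), all([]))
```

So once all NAs are skipped and nothing is left, the result is just the starting value of the reduction.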
Behaviour with skipna=False
Here comes the more tricky part. Currently, with object dtype, we have some buggy behaviour (see #27709), and it depends on the order of the values and which missing value (np.nan or None) is used.
With BooleanArray we won't have this problem (there is only a single NA + we don't need to rely on numpy's buggy object dtype behaviour). But I am not sure we should follow what is currently in the docs:
This follows from numpy's behaviour with floats:
and while this might make sense in float context, I am not sure we should follow this behaviour and our docs and do:
I think this should rather give False or NA instead of True.
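The float behaviour comes from NaN being a non-zero value, which counts as truthy. A minimal sketch using only plain Python floats, as an analogy for the numpy float case (this is not pandas code):

```python
nan = float("nan")

# NaN is not equal to zero, so it is truthy like any other non-zero float.
print(bool(nan))

# A truthiness-based reduction therefore treats NaN like True.
print(any([nan]), all([nan]))

# Combined with a falsy value, any() still sees the truthy NaN.
print(any([0.0, nan]), all([0.0, nan]))
```

That is fine when NaN is just another float, but it does not match reading NA as "unknown" in a boolean context.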
While for object dtype it might make sense to align the behaviour with float (as argued in #27709 (comment)), for a boolean dtype we can probably use the behaviour we defined for NA in logical operations (eg False | NA = NA, so in that case, the above should give NA).
But are we ok with any/all not returning a boolean in this case? (note, you only have this if someone specifically set skipna=False)