API: any/all in context of boolean dtype with missing values #29686

jorisvandenbossche · 2019-11-18T14:23:11Z

In the new missing values support, and especially while implementing the BooleanArray (#29555), the question comes up: what should any and all do in presence of missing values?

edit from Tom: Here's a proposed table of behavior

| case | input | output |
| --- | --- | --- |
| 1 | `all([True, NA], skipna=False)` | NA |
| 2 | `all([False, NA], skipna=False)` | False |
| 3 | `all([NA], skipna=False)` | NA |
| 4 | `all([], skipna=False)` | True |
| 5 | `any([True, NA], skipna=False)` | True |
| 6 | `any([False, NA], skipna=False)` | NA |
| 7 | `any([NA], skipna=False)` | NA |
| 8 | `any([], skipna=False)` | False |

| case | input | output |
| --- | --- | --- |
| 9 | `all([True, NA], skipna=True)` | True |
| 10 | `all([False, NA], skipna=True)` | False |
| 11 | `all([NA], skipna=True)` | True |
| 12 | `all([], skipna=True)` | True |
| 13 | `any([True, NA], skipna=True)` | True |
| 14 | `any([False, NA], skipna=True)` | False |
| 15 | `any([NA], skipna=True)` | False |
| 16 | `any([], skipna=True)` | False |
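
The sixteen cases above can be reproduced with a small pure-Python sketch. This is only an illustration of the proposed semantics, not the pandas implementation: `NA` is a stand-in (plain `None`) for `pd.NA`, and the function names `na_any` / `na_all` are hypothetical.

```python
NA = None  # stand-in for pd.NA in this sketch (an assumption, not the pandas object)


def na_any(values, skipna=True):
    """OR-style reduction that returns True, False, or NA."""
    if skipna:
        # Drop missing values first, then reduce what is left.
        values = [v for v in values if v is not NA]
    if any(v is True for v in values):
        return True  # one True decides the result, whatever the NAs are
    if any(v is NA for v in values):
        return NA  # no True seen, but an unknown value could still be True
    return False


def na_all(values, skipna=True):
    """AND-style reduction that returns True, False, or NA."""
    if skipna:
        values = [v for v in values if v is not NA]
    if any(v is False for v in values):
        return False  # one False decides the result, whatever the NAs are
    if any(v is NA for v in values):
        return NA  # no False seen, but an unknown value could still be False
    return True
```

With `skipna=True` the NA branch can never be reached, so the result is always a plain boolean; only an explicit `skipna=False` can produce NA.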

Some context:

Currently, if you have bools with NaNs, you end up with an object dtype, and the behaviour of any/all with object dtype has all kinds of corner cases. @xhochy recently opened #27709 for this (I am opening a new issue because I want to focus here on the behaviour for the boolean dtype; the behaviour for object dtype might still deviate).

The documentation of any says (https://dev.pandas.io/docs/reference/api/pandas.Series.any.html)

Return whether any element is True, potentially over an axis.

Returns False unless there at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).

...

skipna : bool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

and similar for all (https://dev.pandas.io/docs/reference/api/pandas.Series.all.html).

Default behaviour with `skipna=True`

In case of some NAs and some True/False values, I think the behaviour is clear: any/all are reductions, and in pandas we use skipna=True for reductions.

So you get something like this:
(I am still using np.nan here as missing value, since the pd.NA PR is not yet merged / combined with the BooleanArray PR; but let's focus on return value)

In [2]: pd.Series([True, False, np.nan]).any() 
Out[2]: True

In [3]: pd.Series([True, False, np.nan]).all()
Out[3]: False

In [4]: pd.Series([True, True, np.nan]).all() 
Out[4]: True

(although when interpreting NA as "unknown", it might look a bit strange to return True in the last case since the NA might still be True or False)

Behaviour for all-NA in case of `skipna=True`

This is a case that is described in the current docs: "If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column", and is indeed consistent with skipping all NAs -> any/all of empty set.
And then, we follow numpy's behaviour (False for any, True for all):

In [8]: np.array([], dtype=bool).any() 
Out[8]: False

In [9]: np.array([], dtype=bool).all()
Out[9]: True

(although I don't find this necessarily very intuitive, this seems more a consequence of the algorithm starting with a base "identity" value of False/True for any/all)
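
The "identity value" remark can be made concrete with a fold. The sketch below uses only the standard library; the helper names `any_reduce` and `all_reduce` are made up for illustration and say nothing about how numpy implements its reductions.

```python
from functools import reduce
import operator


def any_reduce(values):
    # OR-reduction: start from False, the identity element of "or".
    return reduce(operator.or_, values, False)


def all_reduce(values):
    # AND-reduction: start from True, the identity element of "and".
    return reduce(operator.and_, values, True)


# With no elements to fold in, only the starting value is left over.
empty_any = any_reduce([])
empty_all = all_reduce([])
```

Python's built-in `any([])` and `all([])` give the same answers (False and True).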

Behaviour with `skipna=False`

Here comes the more tricky part. Currently, with object dtype, we have some buggy behaviour (see #27709), and it depends on the order of the values and which missing value (np.nan or None) is used.

With BooleanArray we won't have this problem (there is only a single NA + we don't need to rely on numpy's buggy object dtype behaviour). But I am not sure we should follow what is currently in the docs:

If skipna is False, then NA are treated as True, because these are not equal to zero.

This follows from numpy's behaviour with floats:

In [10]: np.array([0, np.nan]).any()
Out[10]: True

and while this might make sense in float context, I am not sure we should follow this behaviour and our docs and do:

>>> pd.Series([False, pd.NA], dtype="boolean").any(skipna=False)
True

I think this should rather give False or NA instead of True.
While for object dtype it might make sense to align the behaviour with float (as argued in #27709 (comment)), for a boolean dtype we can probably use the behaviour we defined for NA in logical operations (eg False | NA = NA, so in that case, the above should give NA).
But are we ok with any/all not returning a boolean in this case? (note, you only have this if someone specifically set skipna=False)
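
As a side note on the float behaviour quoted from the docs ("NA are treated as True, because these are not equal to zero"): the same truthiness rule can be seen with plain Python floats. This is a minimal sketch using built-ins only; it does not go through numpy or pandas.

```python
import math

nan = float("nan")

# NaN is not equal to zero, so a truth test on it gives True.
nan_is_truthy = bool(nan)

# Hence an "any" over [0.0, NaN] is True, mirroring the numpy example above.
float_any = any([0.0, nan])
```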

TomAugspurger · 2019-11-18T14:41:06Z

Agreed that the skipna=True case looks fine as is.

For skipna=False, I think that the presence of any NA would make the result NA, though it'd be good to survey what other systems do here.

jorisvandenbossche · 2019-11-18T14:54:35Z

> For skipna=False, I think that the presence of any NA would make the result NA, though it'd be good to survey what other systems do here.

So if we follow the NA behaviour for logical operations as discussed in #28778 (and implemented for pd.NA in #29597), this will mostly result in NA, but sometimes can also result in True, eg for:

>>> pd.Series([True, pd.NA], dtype="boolean").any(skipna=False)
True

since there is already one True, the result can always be True regardless of whether the NA is actually True or False.
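
One way to check this "regardless of what the NA actually is" reasoning is to enumerate every possible filling of the missing values and see whether the outcome ever changes. The sketch below does that in pure Python; `NA` is a stand-in (`None`) for `pd.NA` and `resolve` is a hypothetical helper. It is exponential in the number of NAs, so it is only meant as an illustration.

```python
from itertools import product

NA = None  # stand-in for pd.NA in this sketch


def resolve(values, reducer):
    """Apply `reducer` (built-in any or all) to every possible filling of the NAs.

    If all fillings agree, that value is the answer; otherwise the answer is NA.
    """
    na_positions = [i for i, v in enumerate(values) if v is NA]
    outcomes = set()
    for fill in product([True, False], repeat=len(na_positions)):
        candidate = list(values)
        for pos, value in zip(na_positions, fill):
            candidate[pos] = value
        outcomes.add(reducer(candidate))
    return outcomes.pop() if len(outcomes) == 1 else NA
```

For `[True, NA]` every filling gives `any(...) == True`, so the result is True; for `[False, NA]` the fillings disagree, so the result is NA.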

TomAugspurger · 2019-11-18T14:58:20Z

Should that example have a skipna=False?

I guess for any that makes sense. For .all, `Series([True, pd.NA], dtype="boolean").all(skipna=False)` would be NA?

jorisvandenbossche · 2019-11-18T14:58:43Z

> Should that example have a skipna=False?

Yes, updated

> For `Series([True, pd.NA], dtype="boolean").all(skipna=False)` would be NA?

Yes, that would be consistent with the logical op behaviour (True & NA = NA)

TomAugspurger · 2019-11-18T15:43:55Z

Thanks. So, just to make sure, can you verify this table, and add it to the original post if it's correct (and eventually the docs)?

| case | input | output |
| --- | --- | --- |
| 1 | `all([True, NA], skipna=False)` | NA |
| 2 | `all([False, NA], skipna=False)` | False |
| 3 | `all([NA], skipna=False)` | NA |
| 4 | `all([], skipna=False)` | True |
| 5 | `any([True, NA], skipna=False)` | True |
| 6 | `any([False, NA], skipna=False)` | NA |
| 7 | `any([NA], skipna=False)` | NA |
| 8 | `any([], skipna=False)` | False |

| case | input | output |
| --- | --- | --- |
| 9 | `all([True, NA], skipna=True)` | True |
| 10 | `all([False, NA], skipna=True)` | False |
| 11 | `all([NA], skipna=True)` | True |
| 12 | `all([], skipna=True)` | True |
| 13 | `any([True, NA], skipna=True)` | True |
| 14 | `any([False, NA], skipna=True)` | False |
| 15 | `any([NA], skipna=True)` | False |
| 16 | `any([], skipna=True)` | False |

jorisvandenbossche · 2019-11-18T16:25:37Z

Thanks for that overview! And yes, that seems correct based on my understanding.

This also seems to be what R is doing: https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/any (except that their default "skipna" is the opposite: na.rm=FALSE is the default)

TomAugspurger · 2019-11-18T16:31:15Z

OK. I'm happy to deviate from R in the default skipna.

jorisvandenbossche · 2019-11-18T16:32:44Z

> I'm happy to deviate from R in the default skipna.

And that's the case for all our reductions anyway

WillAyd · 2019-11-19T05:17:33Z

Sorry, I'm trying to read through all of the related items but not seeing it: what did we ultimately decide for comparison operators? IMO any should match whatever `or` does and all should match whatever `and` does.

jorisvandenbossche · 2019-11-19T09:34:54Z

The logical operations (and, or) are being discussed in #28778 (and implemented for pd.NA in #29597).
The table that Tom made with the behaviour for all possible cases for any/all matches the current conclusion / implementation in those issues regarding logical operations ("Kleene logic" or "three-valued logic", giving e.g. True | NA == True, True & NA == NA and False & NA == False; see the bottom of this comment: #28778 (comment)).

WillAyd · 2019-11-19T15:51:07Z

Gotcha thanks! So yea I think I agree with Tom's table then - any / all should follow the logic rules of OR / AND respectively across all elements
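
The idea that any / all are just OR / AND folded across all elements can be sketched as follows for the skipna=False case. This is a pure-Python illustration of Kleene logic, with `NA` as a stand-in (`None`) for `pd.NA` and hypothetical function names; it is not the code from the linked PRs.

```python
from functools import reduce

NA = None  # stand-in for pd.NA in this sketch


def kleene_or(a, b):
    if a is True or b is True:
        return True  # True | anything == True
    if a is NA or b is NA:
        return NA  # False | NA == NA, NA | NA == NA
    return False


def kleene_and(a, b):
    if a is False or b is False:
        return False  # False & anything == False
    if a is NA or b is NA:
        return NA  # True & NA == NA, NA & NA == NA
    return True


def any_kleene(values):
    # Fold OR over the values, starting from its identity element False.
    return reduce(kleene_or, values, False)


def all_kleene(values):
    # Fold AND over the values, starting from its identity element True.
    return reduce(kleene_and, values, True)
```

Folding from the identity elements also gives the empty-input rows of the table (False for any, True for all) without a special case.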

jorisvandenbossche added the ExtensionArray, Missing-data, and Needs Discussion labels Nov 18, 2019

jorisvandenbossche mentioned this issue Nov 20, 2019

ENH: add BooleanArray extension array #29555

Merged

TomAugspurger mentioned this issue Nov 25, 2019

Missing values proposal: concrete steps for 1.0 #29556

Closed

13 tasks

jorisvandenbossche mentioned this issue Dec 4, 2019

API: BooleanArray any/all with NA logic #30062

Merged

jreback added this to the 1.0 milestone Dec 8, 2019

jorisvandenbossche closed this as completed in #30062 Dec 12, 2019

jorisvandenbossche mentioned this issue Apr 3, 2020

BUG: BooleanArray.any with all False values and skipna=False is buggy #33253

Closed

jorisvandenbossche mentioned this issue Sep 30, 2020

ARROW-1846: [C++][Compute] Implement "any" reduction kernel for boolean data apache/arrow#8294

Closed
