BUG: indexing with DataFrame with nullable boolean dtype #36395

lgelmi · 2020-09-16T09:16:22Z

Premise: I tried to look for similar issues (closed or not), but (to my surprise) I couldn't find any.

Problem

I often filter my data by comparing it with given values, and this breaks when using pd.NA.

import pandas as pd
from numpy import nan
from pandas import NA

NA == 1, nan == 1
>> (<NA>, False)

NA != 1, nan != 1
>> (<NA>, True)

NA > 1, nan > 1
>> (<NA>, False)

NA < 1, nan < 1
>> (<NA>, False)

Which implies:

import pandas as pd
from pandas import NA

df = pd.DataFrame([1,2,NA], dtype="Int8")
df == 2
>>
       0
0  False
1   True
2   <NA>

and, even worse:

df[df == 2]
>>
Traceback (most recent call last):
  [ ... ]
  File "../lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 2861, in _extract_bool_array
    assert mask.dtype == bool, mask.dtype
AssertionError: object

As you surely know, this would have worked flawlessly with numpy.nan.

The solution I'd like

pd.NA and numpy.nan should behave the same, especially with regard to comparisons.

API breaking implications

As far as I know, pd.NA has been declared experimental, so this should not break much, and it may greatly simplify the transition to it for performance and type-consistency purposes (which are, from my point of view, its main advantages).

Describe alternatives you've considered

I'm not entirely sure pd.NA should be the same as nan in all regards. The documentation itself does not imply it at all.
That being said, I'd still like to be able to filter my data without so much pain. :)

jorisvandenbossche · 2020-09-16T12:17:54Z

The differences in behaviour between np.nan and pd.NA are on purpose (see the discussions in #28095 and #28778, though they are long discussions; the first issue links to a summary design document that explains this as well).
But ..

I'd still like to be able to filter my data without so much pain. :)

.. it's certainly still the goal that this works intuitively. Using a Series as an example (taken from your example DataFrame):

In [10]: s = df[0]  

In [11]: s   
Out[11]: 
0       1
1       2
2    <NA>
Name: 0, dtype: Int8

In [12]: s == 2  
Out[12]: 
0    False
1     True
2     <NA>
Name: 0, dtype: boolean

In [13]: s[s == 2] 
Out[13]: 
1    2
Name: 0, dtype: Int8

In [14]: df[s == 2]  
Out[14]: 
   0
1  2

The fact that this raises when using a boolean DataFrame as the filter (df[df == 2]) can certainly be considered a bug (or an oversight in the initial implementation).
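Until the DataFrame case is fixed, one way to sidestep the problem might be to turn the nullable mask into a plain NumPy bool mask before indexing. This is only a minimal sketch and not something proposed in this thread: treating <NA> as False via fillna(False) is an assumption that mirrors what the Series example above does, and it has not been checked against the affected pandas version.

```python
import pandas as pd

df = pd.DataFrame([1, 2, pd.NA], dtype="Int8")

# Row filtering with a Series mask, as in the session above:
# <NA> entries in the mask are treated as False.
s = df[0]
rows = df[s == 2]

# Element-wise masking with a DataFrame mask: replace <NA> with False
# (an assumption about the desired semantics) and cast to NumPy bool.
mask = (df == 2).fillna(False).astype(bool)
masked = df.where(mask)

print(rows)
print(masked)
```

The explicit fillna(False) makes the decision about missing values visible at the call site, instead of relying on how the indexer happens to treat <NA>.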

lgelmi · 2020-09-16T12:56:20Z

I suspected as much, and I can see (and agree with) the reasons that led to choosing a behavior different from numpy.nan.
I will wait patiently (but eagerly) for updates. :D

dsaxton · 2020-09-16T14:52:08Z

Indexing with a nullable boolean DataFrame works on the branch in PR #36201:

[ins] In [1]: import pandas as pd
         ...:
         ...:
         ...: arr = pd.array([1, 2, None])
         ...: df = pd.DataFrame({"a": arr, "b": arr})
         ...: print(df)
         ...: print(df[df == 1])
         ...:
      a     b
0     1     1
1     2     2
2  <NA>  <NA>
      a     b
0     1     1
1  <NA>  <NA>
2  <NA>  <NA>

jorisvandenbossche · 2020-09-16T15:26:47Z

@dsaxton thanks for noting! Can you add a test to that PR for this case as well then (and indicate that it will close this issue)? I will also try to take a look at that PR then ;)
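A regression test for this case could look roughly like the sketch below. The test name and the expected frame are illustrative assumptions based on the output shown above; this is not the test that was actually added in #36201.

```python
import pandas as pd
import pandas.testing as tm


def test_getitem_with_nullable_boolean_frame():
    # Hypothetical test name, chosen for illustration only.
    arr = pd.array([1, 2, None])  # inferred as nullable Int64
    df = pd.DataFrame({"a": arr, "b": arr})

    # df == 1 yields a nullable boolean DataFrame containing <NA>;
    # indexing with it should not raise.
    result = df[df == 1]

    # Only the entries equal to 1 are kept; everything else becomes <NA>.
    expected_arr = pd.array([1, None, None], dtype="Int64")
    expected = pd.DataFrame({"a": expected_arr, "b": expected_arr})
    tm.assert_frame_equal(result, expected)


test_getitem_with_nullable_boolean_frame()
```

On a pandas version that still has the bug, the indexing line is where this would be expected to fail with the AssertionError from the original report.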

lgelmi · 2021-01-05T09:09:28Z

Seems like version 1.2 solves the issue as promised.
Thanks! 👍

lgelmi added the Enhancement and Needs Triage labels Sep 16, 2020

dsaxton added the NA - MaskedArrays label and removed the Needs Triage label Sep 16, 2020

jorisvandenbossche changed the title ~~pd.NA should behave similar to np.nan.~~ BUG: indexing with DataFrame with nullable boolean dtype Sep 16, 2020

jorisvandenbossche added Bug and removed Enhancement labels Sep 16, 2020

dsaxton mentioned this issue Sep 16, 2020

BUG: Don't raise for NDFrame.mask with nullable boolean #36201

Closed

6 tasks

TomAugspurger mentioned this issue Oct 8, 2020

BUG: mask() and where() do not work pandas.core.arrays.boolean.BooleanDtype #36975

Closed

3 tasks

lgelmi closed this as completed Jan 5, 2021
