BUG: indexing with DataFrame with nullable boolean dtype #36395

Closed
lgelmi opened this issue Sep 16, 2020 · 5 comments
Labels
Bug, NA - MaskedArrays (related to pd.NA and nullable extension arrays)

Comments

@lgelmi

lgelmi commented Sep 16, 2020

Premise: I tried to look for similar issues (closed or not), but (to my surprise) I couldn't find any.

Problem

I often filter my data by comparing it against given values; this breaks when using pd.NA.

import pandas as pd
from numpy import nan
from pandas import NA

NA == 1, nan == 1
>> (<NA>, False)

NA != 1, nan != 1
>> (<NA>, True)

NA > 1, nan > 1
>> (<NA>, False)

NA < 1, nan < 1
>> (<NA>, False)

Which implies:

import pandas as pd

df = pd.DataFrame([1, 2, pd.NA], dtype="Int8")
df == 2
>>
       0
0  False
1   True
2   <NA>

and, even worse:

df[df == 2]
>>
Traceback (most recent call last):
  [ ... ]
  File "../lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 2861, in _extract_bool_array
    assert mask.dtype == bool, mask.dtype
AssertionError: object

As you surely know, this would have worked flawlessly with numpy.nan.
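
For comparison, here is a minimal sketch of the numpy.nan case mentioned above. The names `df_float`, `mask`, and `result` are only illustrative, and the frame uses the default float dtype rather than "Int8":

```python
import numpy as np
import pandas as pd

# Float column: the missing value is numpy.nan, so comparisons yield plain bools.
df_float = pd.DataFrame([1, 2, np.nan])

mask = df_float == 2      # nan == 2 evaluates to False, so the mask dtype is bool
result = df_float[mask]   # cells where the mask is False are replaced with NaN

print(mask)
print(result)
```

Because the mask contains only True and False, indexing with it raises no error.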

The solution I'd like

pd.NA and numpy.nan should behave the same, especially with regard to comparisons.

API breaking implications

As far as I know, pd.NA has been declared experimental, so this should not break much, and it may greatly simplify the transition to pd.NA for performance and type-consistency purposes (which are, from my point of view, its main advantages).

Describe alternatives you've considered

I'm not entirely sure pd.NA should be the same as nan in all regards. The documentation itself does not imply it at all.
That being said, I'd still like to be able to filter my data without so much pain. :)

@lgelmi added the Enhancement and Needs Triage labels Sep 16, 2020
@jorisvandenbossche
Member

The differences in behaviour between np.nan and pd.NA are intentional (see the discussions in #28095 and #28778, though they are long. The first issue links to a summary design document that explains this as well).
But ..

I'd still like to be able to filter my data without so much pain. :)

.. it's certainly still the goal that this works intuitively. Using a Series as an example (taken from your example DataFrame):

In [10]: s = df[0]  

In [11]: s   
Out[11]: 
0       1
1       2
2    <NA>
Name: 0, dtype: Int8

In [12]: s == 2  
Out[12]: 
0    False
1     True
2     <NA>
Name: 0, dtype: boolean

In [13]: s[s == 2] 
Out[13]: 
1    2
Name: 0, dtype: Int8

In [14]: df[s == 2]  
Out[14]: 
   0
1  2

The fact that this raises when using a boolean DataFrame as the filter (df[df == 2]) can certainly be considered a bug (or an oversight in the initial implementation).
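
The Series-based filtering shown above can be sketched as a standalone script, assuming pandas 1.2 or later. The explicit `fillna(False)` step is an optional addition that makes the handling of missing mask values visible; nothing in this thread prescribes it:

```python
import pandas as pd

df = pd.DataFrame([1, 2, pd.NA], dtype="Int8")
s = df[0]

# Comparing a nullable integer Series gives a nullable "boolean" mask:
# False, True, <NA>
mask = s == 2

# When used as an indexer, <NA> entries in the mask are treated as False.
filtered_series = s[mask]
filtered_frame = df[mask]

# Optional: state the treatment of missing values explicitly.
explicit = s[mask.fillna(False)]

print(filtered_series)
print(filtered_frame)
```

Both forms keep only the row whose value equals 2.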

@lgelmi
Author

lgelmi commented Sep 16, 2020

I suspected as much, and I can see (and agree with) the reasons that led to choosing a behavior different from numpy.nan.
I will wait patiently (but eagerly) for updates. :D

@dsaxton added the NA - MaskedArrays label and removed the Needs Triage label Sep 16, 2020
@dsaxton
Member

dsaxton commented Sep 16, 2020

Indexing with a nullable boolean DataFrame works on the branch in PR #36201:

[ins] In [1]: import pandas as pd
         ...:
         ...:
         ...: arr = pd.array([1, 2, None])
         ...: df = pd.DataFrame({"a": arr, "b": arr})
         ...: print(df)
         ...: print(df[df == 1])
         ...:
      a     b
0     1     1
1     2     2
2  <NA>  <NA>
      a     b
0     1     1
1  <NA>  <NA>
2  <NA>  <NA>

@jorisvandenbossche changed the title from "pd.NA should behave similar to np.nan." to "BUG: indexing with DataFrame with nullable boolean dtype" Sep 16, 2020
@jorisvandenbossche
Member

@dsaxton thanks for noting! Can you add a test to that PR for this case as well (and indicate that it will close this issue)? I will also try to take a look at that PR ;)
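
A check for this case might look roughly like the sketch below. It is a hypothetical illustration, not the test that was added to the PR, and it assumes pandas 1.2 or later; the explicit `dtype="Int64"` and the variable names are choices made here for clarity:

```python
import pandas as pd

arr = pd.array([1, 2, None], dtype="Int64")
df = pd.DataFrame({"a": arr, "b": arr})

# The mask is a DataFrame of nullable "boolean" columns: True, False, <NA>
mask = df == 1

# Indexing with the nullable boolean DataFrame should not raise;
# cells where the mask is False or <NA> become missing.
result = df[mask]

assert result["a"].isna().tolist() == [False, True, True]
assert result["b"].isna().tolist() == [False, True, True]
assert result.loc[0, "a"] == 1
print(result)
```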

@lgelmi
Author

lgelmi commented Jan 5, 2021

Seems like version 1.2 solves the issue as promised.
Thanks! 👍

@lgelmi lgelmi closed this as completed Jan 5, 2021