BUG: indexing with DataFrame with nullable boolean dtype #36395
Comments
The differences in behaviour between np.nan and pd.NA are on purpose (see the discussions in #28095 and #28778, though those are long discussions; the first issue links to a summary design document that explains this as well).
.. it's certainly still the goal that this works intuitively. Using a Series as an example (with your example DataFrame):
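A minimal sketch of the Series case, assuming a nullable `Int64` Series built from the same values as the example further down the thread (the names `s`, `mask`, `filtered` and `explicit` are illustrative, not taken from the original comment). On recent pandas versions, a nullable boolean mask used to index a Series treats `<NA>` entries as `False`:

```python
import pandas as pd

# Nullable integer Series; the missing value is stored as pd.NA.
s = pd.Series([1, 2, None], dtype="Int64")

# The comparison propagates NA, so the mask has the nullable "boolean" dtype.
mask = s == 1

# Indexing a Series with a nullable boolean mask treats NA as False.
filtered = s[mask]

# Equivalent, with the treatment of missing values spelled out explicitly.
explicit = s[mask.fillna(False)]

print(filtered)
```

Filling the mask explicitly is more verbose, but it makes the intended handling of missing values visible to the reader.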
The fact that this raises when using a boolean dataframe as filter (
I suspected as much, and I can see (and agree with) the reasons that led to choosing a different behavior from numpy.nan.
Indexing with a nullable boolean DataFrame works on the branch in PR #36201:
[ins] In [1]: import pandas as pd
         ...:
         ...: arr = pd.array([1, 2, None])
         ...: df = pd.DataFrame({"a": arr, "b": arr})
         ...: print(df)
         ...: print(df[df == 1])
         ...:
      a     b
0     1     1
1     2     2
2  <NA>  <NA>
      a     b
0     1     1
1  <NA>  <NA>
2  <NA>  <NA>
@dsaxton thanks for noting! Can you add a test to that PR for this case as well (and indicate that it will close this issue)? I will also try to take a look at that PR then ;)
Seems like version 1.2 solves the issue as promised.
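For reference, a small sketch of the DataFrame case on pandas 1.2 or later, reusing the example from the comment above. The `explicit_mask` variant is an assumption about how one might spell out the handling of missing values; it is not something proposed in this thread:

```python
import pandas as pd

arr = pd.array([1, 2, None])           # inferred as nullable Int64
df = pd.DataFrame({"a": arr, "b": arr})

# Element-wise comparison yields a DataFrame with nullable boolean columns.
mask = df == 1

# On pandas >= 1.2, NA entries in the mask are treated as False,
# so every cell that is not equal to 1 is replaced by <NA>.
result = df[mask]

# Assumed alternative: convert the mask to plain bool explicitly.
explicit_mask = mask.fillna(False).astype(bool)
explicit_result = df[explicit_mask]

print(result)
```

Both spellings keep the shape of `df` and blank out the non-matching cells, which matches the output shown earlier in the thread.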
Premise: I tried to look for similar issues (closed or not), but (to my surprise) I couldn't find any.
Problem
I often filter my data by comparing it with given values; this breaks when using pd.NA.
Which implies:
and, even worse:
As you surely know, this would have worked flawlessly with numpy.nan.
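The snippets from the original report are not preserved above. As a hedged illustration of the difference being described (the variable names are hypothetical, not the reporter's), comparisons against `numpy.nan` return a plain `False`, while comparisons against `pd.NA` propagate the missing value:

```python
import numpy as np
import pandas as pd

# np.nan follows IEEE 754 semantics: comparing it with anything gives False.
nan_result = np.nan == 1

# pd.NA propagates: the result of the comparison is itself pd.NA.
na_result = pd.NA == 1

float_series = pd.Series([1.0, 2.0, np.nan])
nullable_series = pd.Series([1, 2, None], dtype="Int64")

# Plain bool mask with no missing entries.
float_mask = float_series == 1

# Nullable "boolean" mask; the last entry is <NA> rather than False.
nullable_mask = nullable_series == 1

# pd.NA has no truth value, so using it in an `if` raises TypeError.
try:
    bool(pd.NA)
    na_is_ambiguous = False
except TypeError:
    na_is_ambiguous = True
```

The nullable mask is what later reaches the indexing code, which is why masks containing `<NA>` need dedicated handling there.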
The solution I'd like
pd.NA and numpy.nan should behave the same, especially with regard to comparisons.
API breaking implications
As far as I know, pd.NA has been declared experimental, so this should not break much, but it may greatly simplify the transition to it for performance and type-consistency purposes (which are, from my point of view, its main advantages).
Describe alternatives you've considered
I'm not entirely sure pd.NA should be the same as nan in all regards. The documentation itself does not imply it at all.
That being said, I'd still like to be able to filter my data without so much pain. :)