DISCUSS: boolean dtype with missing value support #28778
(also if we don't go for a new NA value, a boolean ExtensionArray with missing data support can be interesting, but in such a case it's probably more difficult to change the behaviour compared to what we currently have with np.nan)
I think I prefer what R and Julia do here but would be curious to hear counter arguments in support of the existing behavior. Just to clarify, you think any operation where one of the operands is NA should return NA, right? But something like
This is not accurate. These ops are basically nothing but corner cases, a handful of which do three-value logic. That's before considering DataFrame, which only sometimes behaves like Series. I'll elaborate later this afternoon.
@WillAyd I think the existing behaviour is mainly a consequence of using np.nan (for which the existing behaviour makes sense). And an argument to keep the existing behaviour would be that we have done it like that for a long time.
For comparisons yes, for logical operations it depends. I pasted below the more elaborate explanation with code examples that I wrote in the proposal on hackmd.
Yes, that wouldn't change compared to the current behaviour, I think.

Behaviour in comparison operations

In numerical operations, NA propagates (see also above). But for boolean operations the situation is less clear. Currently, we use the behaviour of np.nan:

>>> np.nan == 1
False
>>> np.nan < 1
False
>>> np.nan != 1
True

However, a missing value could also propagate:

>>> pd.NA == 1
NA
>>> pd.NA < 1
NA
>>> pd.NA != 1
NA

This is for example what Julia and R do.

Boolean data type with missing values and logical operations

If we propagate NA in comparison operations (see above), the consequence is that you end up with boolean masks with missing values. This means that we need to support a boolean dtype with NA support, and define the behaviour in logical operations and indexing.

Currently, the logical operations are not very consistently defined. On Series/DataFrame it mostly returns False, and for scalars it is not defined:

>>> pd.Series([True, False, np.nan]) & True
0 True
1 False
2 False
dtype: bool
>>> pd.Series([True, False, np.nan]) | True
0 True
1 True
2 False
dtype: bool
>>> np.nan & True
TypeError: unsupported operand type(s) for &: 'float' and 'bool'

For those logical operations, Julia, R and SQL opt for "three-valued logic" (only propagating missing values when it is logically required). See https://docs.julialang.org/en/v1/manual/missing/index.html for a good explanation. This would give:

>>> pd.Series([True, False, pd.NA]) & True
0 True
1 False
2 NA
dtype: bool
>>> pd.NA & True
NA
>>> pd.NA & False
False
>>> pd.NA | True
True
>>> pd.NA | False
NA
For the question around the indexing behaviour with boolean values in the presence of NAs, I think there are 3 options:
I looked at some other languages / libraries that deal with this. Postgres (SQL) filters only where True (thus interprets NULL as False in the filtering operation):
In R, it depends on the function. dplyr's filter drops the rows where the mask is NA:

> df <- tibble(col1 = c(1L, 2L, 3L), col2 = c(1L, NA, 3L))
> df
# A tibble: 3 x 2
col1 col2
<int> <int>
1 1 1
2 2 NA
3 3 3
> df %>% mutate(mask = col2 > 2)
# A tibble: 3 x 3
col1 col2 mask
<int> <int> <lgl>
1 1 1 FALSE
2 2 NA NA
3 3 3 TRUE
> df %>% filter(col2 > 2)
# A tibble: 1 x 2
col1 col2
<int> <int>
1 3 3

But in base R, it propagates NAs (a missing value in the index always yields a missing value in the output, from https://adv-r.hadley.nz/subsetting.html):

> x <- c(1, 2, 3)
> mask <- c(FALSE, NA, TRUE)
> x[mask]
[1] NA  3

Julia currently raises an error (not sure if this is on purpose or just not yet implemented; EDIT: based on https://julialang.org/blog/2018/06/missing this seems to be on purpose):

julia> arr = [1 2 3]
1×3 Array{Int64,2}:
1 2 3
julia> mask = [false missing true]
1×3 Array{Union{Missing, Bool},2}:
false missing true
julia> arr[mask]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
Stacktrace:
[1] checkindex(::Type{Bool}, ::Base.OneTo{Int64}, ::Missing) at ./abstractarray.jl:504
[2] checkindex at ./abstractarray.jl:519 [inlined]
[3] checkbounds at ./abstractarray.jl:434 [inlined]
[4] checkbounds at ./abstractarray.jl:449 [inlined]
[5] _getindex at ./multidimensional.jl:596 [inlined]
[6] getindex(::Array{Int64,2}, ::Array{Union{Missing, Bool},2}) at ./abstractarray.jl:905
[7] top-level scope at none:0

Apache Arrow C++ (pyarrow) currently has the same behaviour as base R (propagating):
(comment from the hackmd copied here)
In our ops code, pandas objects always have "priority" over numpy arrays, so if you do a mixed pandas/numpy operation, you get the pandas behaviour.

But it's certainly true that when actually performing similar operations on the equivalent numpy arrays, you can get different results, certainly if np.nan and pd.NA behave differently. So that is a clear drawback of opting for such different behaviour. Hypothetical example:
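A sketch of the kind of divergence meant here (using the nullable dtypes that pandas later shipped, so the exact classes are an assumption relative to this discussion; the point is only that np.nan and pd.NA answer the same comparison differently):

>>> import numpy as np
>>> import pandas as pd
>>> np.array([1.0, np.nan]) == 1.0            # np.nan compares as False
array([ True, False])
>>> pd.array([1, None], dtype="Int64") == 1   # pd.NA propagates
<BooleanArray>
[True, <NA>]
Length: 2, dtype: boolean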
But this also relates to the question: how do we convert to numpy? (which wasn't really discussed yet) By default, if there are NAs, we could also convert to object dtype (like now for IntegerArray), preserving the pd.NA, and then you wouldn't get this different behaviour. And then it could be an option for the user to ask for a conversion to np.nan (to get a non-object float array), but it would be an explicit request of the user to get something different than pd.NA.
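For reference, the conversion API that pandas later grew works along those lines (a sketch, assuming pandas >= 1.0 and using the nullable integer array the comment refers to): the default conversion preserves pd.NA in an object array, and converting to np.nan is an explicit request via na_value:

>>> import numpy as np
>>> import pandas as pd
>>> arr = pd.array([1, 2, None], dtype="Int64")
>>> arr.to_numpy()                                  # default: object dtype, NA preserved
array([1, 2, <NA>], dtype=object)
>>> arr.to_numpy(dtype="float64", na_value=np.nan)  # explicit opt-in to NaN
array([ 1.,  2., nan])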
I think I promised to offer examples of weird edge cases and then this got lost in my inbox while travelling. Is that still something that would be useful?
I recently made a prototype BooleanArray that deals with missings using the current pandas logic: https://uwekorn.com/2019/09/02/boolean-array-with-missings.html

It shouldn't be hard to adapt that to output results in Julia/Kleene logic and also implement other operations.
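For a concrete picture of what Kleene logic looks like on a values-plus-mask representation, here is a minimal sketch of a logical AND kernel (the function name and layout are illustrative only, not taken from the blog post or from pandas internals):

import numpy as np

def kleene_and(values1, mask1, values2, mask2):
    # Three-valued (Kleene) logical AND on a values + mask representation,
    # where mask == True marks a missing entry. The result is False wherever
    # either side is a known False, True where both sides are known True,
    # and missing otherwise.
    known_false = (~mask1 & ~values1) | (~mask2 & ~values2)
    result = values1 & values2               # only meaningful where not missing
    result_mask = (mask1 | mask2) & ~known_false
    result[result_mask] = False              # canonical fill value under the mask
    return result, result_mask

# [True, False, NA] & [NA, NA, NA] -> [NA, False, NA]
values1 = np.array([True, False, True])
mask1 = np.array([False, False, True])
values2 = np.array([True, True, True])
mask2 = np.array([True, True, True])
print(kleene_and(values1, mask1, values2, mask2))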
(extracted from #28095 (comment))

(NA == NA) = ?

In this case, I would expect NA as well.

Naming

In general, I like the Julia documentation, but I would prefer that we stick to the more widely known terminology of three-valued logic, like Kleene logic. With that, we have a good theoretical foundation we can refer to on what a computation should return, and it might make communication with e.g. the database community a lot easier.
Is there more feedback on this?
My understanding is that in (nearly?) every other situation, pd.NA refuses to cast to bool?
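That matches the behaviour of the pd.NA that was eventually implemented (a quick illustration, assuming pandas >= 1.0):

>>> import pandas as pd
>>> bool(pd.NA)
TypeError: boolean value of NA is ambiguous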
(discuss in #30265)

On indexing, propagating NAs presents some challenging dtype casting issues. For example, what is the dtype of the output of the following?

In [3]: s = pd.Series([1, 2, 3])
In [4]: mask = pd.array([True, False, None])
In [5]: s[mask]

Would that be Int64 dtype, with the last value being NA? Would we cast to float64 and use np.nan? And what if the original series was float dtype? A float dtype with NaN is about the only option we have right now, since we lack a float64 array that can hold NA.

I don't think that an indexing operation that introduces missing values should change the dtype of the array. I don't think anyone would realistically want that. So... do we just raise for now? What about cases where we are able to index without changing the dtype? Those would be:
IMO, that shouldn't have value-dependent behavior, so if

>>> pd.Series([1, 2])[pd.array([True, None])]

raises, then so should indexing with a nullable boolean mask that happens to contain no NAs. I think supporting 2 is fine, since it just depends on the dtypes:

>>> pd.Series([1, 2], dtype="Int64")[pd.array([True, None])]
Series([1, NA], dtype="Int64")
>>> pd.Series([1, 2], dtype="Int64")[pd.array([True, False])]
Series([1], dtype="Int64")
Pushed a prototype up for discussion at #30265. Let's move the indexing discussion over there.
Even leaving the complexities of implementation aside, I am not sure that we would actually want such propagation of missing values. I think the main take-away of the discussion was that there is not a clear "best" option.
That's my conclusion in #30265 (comment) as well.
I think this has been implemented; see https://dev.pandas.io/docs/user_guide/boolean.html and https://dev.pandas.io/docs/reference/api/pandas.arrays.BooleanArray.html#pandas.arrays.BooleanArray
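For completeness, a quick example of the Kleene behaviour that ended up in the nullable "boolean" dtype (assuming pandas >= 1.0):

>>> import pandas as pd
>>> s = pd.Series([True, False, None], dtype="boolean")
>>> s & True
0     True
1    False
2     <NA>
dtype: boolean
>>> pd.NA | True
True
>>> pd.NA & False
False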
A note for those who followed the discussion here: the specific issue about indexing (masking) with booleans in the presence of missing values has come up again in #31503.
The main inconvenience of the new behaviour, IMHO, is that all code written for prior versions is extremely likely to need a lot of workarounds.
@laserjeyes note that in the meantime, NAs are considered as False when it comes to filtering, which should normally lessen the need for that workaround.
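To illustrate (a sketch, assuming a recent pandas version where boolean masking treats NA as False):

>>> import pandas as pd
>>> s = pd.Series([1, 2, 3])
>>> mask = pd.array([True, None, False], dtype="boolean")
>>> s[mask]          # the NA position is simply not selected
0    1
dtype: int64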
Part of the discussion on missing value handling in #28095, detailed proposal at https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB.
If we go for a new NA value, we also need to decide the behaviour of this value in comparison operations. And consequently, we also need to decide on the behaviour of boolean values with missing data in logical operations and indexing operations.
So let's use this issue for that part of the discussion.
Some aspects of this:

- Comparison operations: value == np.nan -> False, values > np.nan -> False, but we can also propagate missing values (value == NA -> NA, ...)
- Logical operations (| or &) with missing data: we could also use a "three-valued logic" like Julia and SQL (this has, eg, NA | True = True or NA & True = NA).
- Indexing operations (TODO: should check how other languages do this)
Julia has a nice documentation page explaining how they support missing values; the above ideas largely match that.
Besides those behavioural API discussions, we also need to decide on how to approach this technically (a boolean ExtensionArray backed by a boolean numpy array + a mask for missing values?). Shall we discuss that here as well, or keep that separate?
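(For the record, that is essentially how the eventual BooleanArray stores its data: a numpy boolean array of values plus a numpy boolean mask where True marks a missing entry. A quick sketch, assuming pandas >= 1.0:)

>>> import numpy as np
>>> import pandas as pd
>>> values = np.array([True, False, True])
>>> mask = np.array([False, False, True])   # True marks a missing value
>>> pd.arrays.BooleanArray(values, mask)
<BooleanArray>
[True, False, <NA>]
Length: 3, dtype: boolean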
cc @pandas-dev/pandas-core