ROADMAP: Consistent missing value handling with new NA scalar #28095
Technical note: following our roadmap, I posted an issue with the proposal, but we still need to figure out how to have these discussions. As I posted the full proposal externally and not in a PR, it's more difficult to have inline comments (although I could also post it on e.g. Google Docs, which is a bit friendlier for that). Alternatively, we could do PRs instead of issues?
This is an interesting proposal! I'm mostly concerned about type stability / predictability. We might need an "NA dtype" for cases where no other dtype can be inferred. It would also be worth considering whether we would like to deviate from NaN semantics in some cases; for example, it might make sense to define comparisons involving NA differently than comparisons involving NaN. Julia has a nice documentation page explaining how they support missing values. It might be a good model to emulate for pandas, with a couple of notable differences in Julia's approach.
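To make the type-stability concern concrete, here is a small illustration using current pandas behaviour (the "NA dtype" itself remains hypothetical):

```python
import numpy as np
import pandas as pd

# Today, an all-missing column silently defaults to float64 via np.nan:
print(pd.Series([np.nan, np.nan]).dtype)  # float64

# With a dtype-agnostic pd.NA there is no obvious default dtype for
# pd.Series([pd.NA, pd.NA]) -- hence the suggestion of a dedicated "NA dtype".
```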
Thanks for the feedback!
Yes, that's an item I briefly mentioned in the proposal with similar examples. We can (and will need to) decide on certain rules for how to go about this, but they can still be unpredictable for users, of course. I also think we might need an "NA dtype"; I will need to think a bit more on the usage implications of this.
That's also something I have been thinking about. I have now included some additional text about this at the bottom of the proposal. Ideally, I think we would do something like that and follow what Julia (and also mostly R and SQL) are doing (meaning: propagating NA in comparisons, and three-valued logic for boolean operations). However, this also has some additional complications, certainly if we want to introduce it gradually.
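For illustration, a minimal sketch of what three-valued (Kleene) logic means in practice; the NA here is a stand-in singleton, not the pandas API:

```python
# A sketch of three-valued (Kleene) logic, as used by SQL, R, and Julia.
NA = object()  # stand-in "unknown" value for illustration

def kleene_and(a, b):
    # False dominates: the result is False even if the other operand is unknown.
    if a is False or b is False:
        return False
    if a is NA or b is NA:
        return NA
    return a and b

def kleene_or(a, b):
    # True dominates: the result is True even if the other operand is unknown.
    if a is True or b is True:
        return True
    if a is NA or b is NA:
        return NA
    return a or b

assert kleene_and(False, NA) is False  # known regardless of the NA
assert kleene_or(True, NA) is True     # known regardless of the NA
assert kleene_and(True, NA) is NA      # genuinely unknown
```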
Read through, but not sure about this. AFAIK, for other languages (particularly Julia) there was an issue distinguishing between NaN and a genuinely missing value. For extension types we'd (I assume) mostly be masking the location of missing values anyway, so what particular advantage do you see to returning pd.NA from those instead of just np.nan?
(sorry for the slow reply, busy with EuroSciPy)
Can you elaborate a bit more on this sentence?
For sure, we could also return np.nan as the scalar value (you could basically replace pd.NA with np.nan in the proposal), and this is something our users are already familiar with.
Originally I was planning to write a proposal combining the new sentinel value with consistently using a mask-based approach for all dtypes (which more easily enables being consistent with the scalar value as well). Choosing a specific implementation (e.g. mask-based) will have much more impact on our code base; the choice of pd.NA over np.nan as the scalar value is much more a user-facing API question, where I find that pd.NA gives a more consistent, easier-to-teach pandas API.
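As a rough sketch of the mask-based idea (all names here are illustrative, not pandas internals):

```python
import numpy as np

class MaskedIntArray:
    """Integers stored as plain int64 values plus a boolean 'missing' mask."""
    def __init__(self, values, mask):
        self.values = np.asarray(values, dtype="int64")
        self.mask = np.asarray(mask, dtype=bool)  # True marks a missing entry

    def __add__(self, other):
        # Arithmetic runs on the raw values; missingness propagates via the mask.
        return MaskedIntArray(self.values + other.values, self.mask | other.mask)

a = MaskedIntArray([1, 2, 3], [False, True, False])
b = MaskedIntArray([10, 20, 30], [False, False, True])
c = a + b
print(c.values, c.mask)  # [11 22 33] [False  True  True]
```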
So if I have a Series[float64], could it contain both np.nan and pd.NA?
Misc thoughts: is Arrow part of the motivation here? And did you consider None as the missing value indicator instead?
In theory, yes (that is what R, Julia, SQL, ... do). In practice, though, we will certainly need some kind of option (enabled by default in the beginning) to still treat np.nan as a missing value.
Partly, yes, but not as the main reason (you could see Arrow as "yet another example", next to R/Julia/SQL, of a system that distinguishes both concepts, and a Python-facing one at that). Note that Arrow nowadays uses "null" instead of "na" (but that's just the name, not a difference in concept). I am using R regularly as an example, but I am certainly not an expert in R. E.g. they distinguish both, but it seems that NaN behaves more like NA there compared to numpy's np.nan (so it's maybe not the best reference to compare to).
I suppose with the idea to avoid the dubious situations around what the result type should be? (e.g. NA + 1 -> float or int or ?) I don't really like that in the long term (for consistency reasons), but in the short term this might be what we do in practice anyway. But in that scenario, what sentinel would we use for nullable integers? Still np.nan, as today, I suppose. But that already gives you the same dubious situation (what is the result type then?).
That's certainly something to consider as well, I think. Although (given e.g. #28124), I suppose there are people that like to use None, since that can play better with other Python packages.
That example works, but more problematic examples involve the datetime-like dtypes, e.g. whether an NA result should be a timestamp or a timedelta.
Why is that more problematic? Because float and int, although different types, can still refer to the same (semantic) value, while timestamp and timedelta are different concepts. That's indeed a bigger difference. I think this thread is the place to discuss this.
I would like to revive this discussion (I will try to update the proposal with some more concrete details, and send a note to the mailing list). What are people's general thoughts about this?
cc @xhochy you might also be interested in this given your recent explorations of boolean extension arrays / missing values in any/all operations
@jorisvandenbossche reading through this, I don't see a clear answer to the "how do we do this while avoiding breaking arithmetic?" problem. Is your position something to the effect of "just don't do arithmetic with these"?
@jbrockmendel What do you mean by the "breaking arithmetic" problem, exactly? In the end, this is a choice to make. Yes, when dealing with scalars, having a single NA scalar loses some information compared to having dtype-specific scalar values. But note that we already have this in the current situation: e.g. we don't yet have separate not-a-timedelta or not-a-period values. We could add those, and that would be a counter-proposal to this one. But do we then also want to add a not-an-integer, not-a-string, not-a-...? My personal opinion here is that having consistency across dtypes with a single NA value is more valuable than preserving some more information in the scalar NA value for arithmetic with scalars.
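For reference, the shared-singleton situation is observable today: the same NaT object serves both datetimes and timedeltas:

```python
import pandas as pd

dti = pd.to_datetime(["2019-01-01", None])
tdi = pd.to_timedelta(["1 day", None])

# The missing value in both is the very same NaT object:
print(dti[1], tdi[1])    # NaT NaT
print(dti[1] is tdi[1])  # True
```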
Is the "NA dtype" proposal explained anywhere? IIUC, the idea would be that operations involving a bare pd.NA (or an all-NA array without a specified dtype) would produce results of this "NA dtype". What do we think about these cases?

```python
>>> Series([0, 1]) + pd.NA
Series([NA, NA], dtype=NAType)

>>> Series([0, 1]) + pd.Series([NA, NA], dtype="int")
Series([NA, NA], dtype="int")

>>> Series([0, 1]) + [NA, NA]  # equivalent to Series([0, 1]) + Series([NA, NA], dtype=NAType)
Series([NA, NA], dtype="NAType")
```

Is it strange that we don't have the property that adding a scalar NA behaves the same as adding a Series of NAs? I'm not sure what to make of it.
Not yet much more than a similar mention as in the discussion above. I have now included a bit more commentary (mainly based on your comment and the rest of this one).
The examples you give are what I had in mind. In addition, I think the idea would be that this "NA dtype" could be upcast to any other dtype (e.g. when concatenating, ...), so that in other operations it "doesn't get in the way" if you accidentally have this type. Of course, some dtype-specific operations can still be problematic (like dtype-specific methods).
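A hypothetical sketch of this upcasting behaviour; NAType and this concat behaviour are not implemented, the snippet only illustrates the idea:

```python
# Hypothetical REPL session -- NAType and this upcasting do not exist in pandas.
#
# >>> s = pd.Series([pd.NA, pd.NA])        # would get the generic "NA dtype"
# >>> s.dtype
# NAType
# >>> pd.concat([s, pd.Series([1, 2])])    # NAType upcasts, "doesn't get in the way"
# 0    <NA>
# 1    <NA>
# 0       1
# 1       2
# dtype: Int64
```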
Yes, that is correct. But note that we also don't have that property right now.
This is something that could in principle be solved by having separate NaT-like values for datetime/timedelta/period (cf. #24983), while with a single NA scalar we cannot fully solve it. It would be interesting to investigate how other libraries / languages deal with this (Julia has a more powerful Union{type, Missing} which handles it).
I think there's agreement that the current state (NaN for {int, float, bool, str, ...} and NaT for {datetime, timedelta, period}) isn't good. If we're changing things, I think we'll move to either a single dtype-agnostic NA or dtype-specialized NA values.
Agreed. I think Julia would be a good one to investigate. In their docs, I didn't see what the type of an array containing missing values ends up being.
I see two arguments for a single pd.NA: 1. a consistent user interface across dtypes, and 2. avoiding mis-use of the float NaN for non-float dtypes. Am I missing anything? Aiming for non-normative descriptions so far.
Collecting responses to a bunch of stuff:
Is there precedent for this "NA dtype" elsewhere?
I can imagine scenarios where "don't get in the way" is useful, but also scenarios where "please raise if I accidentally try to add a datetime to a float" is more important. My intuition is that the latter is more common, but I don't have any good ideas for how to measure that.
If we can retain np.nan because it has a different meaning from pd.NA, why not also keep pd.NaT, as it also has a different meaning?
It seems this is what Julia does:

```julia
julia> x = [1, 2, 3, missing]
4-element Array{Union{Missing, Int64},1}:
 1
 2
 3
 missing

julia> x .== missing
4-element Array{Missing,1}:
 missing
 missing
 missing
 missing
```

IIUC, the Array{Missing,1} there plays the role of the proposed "NA dtype". From what I can tell, Date / DateTime arithmetic doesn't handle missing well. I don't know if that's by design or if it's just not implemented.

```julia
julia> [Date(2014, 1, 1), missing] .+ Dates.Day(1)
ERROR: MethodError: no method matching +(::Missing, ::Day)
Closest candidates are:
  +(::Any, ::Any, ::Any, ::Any...) at operators.jl:529
  +(::Missing, ::Missing) at missing.jl:93
  +(::Missing) at missing.jl:79
  ...
Stacktrace:
 [1] _broadcast_getindex_evalf at ./broadcast.jl:625 [inlined]
 [2] _broadcast_getindex at ./broadcast.jl:598 [inlined]
 [3] getindex at ./broadcast.jl:558 [inlined]
 [4] macro expansion at ./broadcast.jl:888 [inlined]
 [5] macro expansion at ./simdloop.jl:77 [inlined]
 [6] copyto! at ./broadcast.jl:887 [inlined]
 [7] copyto! at ./broadcast.jl:842 [inlined]
 [8] copy at ./broadcast.jl:818 [inlined]
 [9] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(+),Tuple{Array{Union{Missing, Date},1},Base.RefValue{Day}}}) at ./broadcast.jl:798
 [10] top-level scope at REPL[47]:1
```
Some comparisons (but note I am not an expert on any of these):

**Postgres**

Postgres does not have a "NULL type". From testing, it seems to just define a certain resulting type for a scalar NULL operation. For example:

```sql
SELECT
    interval '1 day' + interval '1 hour' AS result1,       -- interval
    interval '1 day' + timestamp '2012-01-01' AS result2,  -- timestamp
    interval '1 day' + NULL AS result3;                    -- interval
```

has interval type for result1, timestamp type for result2, and interval type for result3 (where this could in principle be either interval or timestamp). Similar results hold for timestamp - timestamp = interval, timestamp - interval = timestamp, and timestamp - NULL = interval. So here it seems to assume that the NULL is of its own type to determine the type of the resulting NULL (just a guess from the results I see). That's for scalar, NULL-involving arithmetic operations. For the rest, Postgres is very similar to what is discussed in this proposal: they have a single NULL scalar usable for all types. For logical operators involving booleans and NULLs, they also have a "three-valued logic" (https://www.postgresql.org/docs/9.1/functions-logical.html) similar to Julia and to what I would propose for our BooleanArray.

**Julia**

As Tom already noted above, you get something like an Array{Missing,1}, where the result involving a scalar missing value effectively is of "missing type". But I am not sure this is really comparable to our situation, as they have the "Type Unions" system. As noted by @shoyer above, that e.g. solves the isinstance problem (as Missing is an instance of Union{Missing, Int64}). For the rest, it has behaviour in comparison and logical operators similar to SQL, and to what I would personally propose for our BooleanArray as well: https://docs.julialang.org/en/v1/manual/missing/

**R**

Trying some things out with the tidyverse:

```r
> library(tidyverse)
> df <- tibble(x = c(1L, 2L, NA))
> df %>% mutate(x1 = x + 1L, x2 = x + 1.5, x3 = x + NA)
# A tibble: 3 x 4
      x    x1    x2    x3
  <int> <int> <dbl> <int>
1     1     2   2.5    NA
2     2     3   3.5    NA
3    NA    NA    NA    NA

> library(lubridate)
> df <- tibble(x = c(today(), NA))
> df %>% mutate(x1 = x - ddays(10), x2 = x - ymd(20191001), x3 = x - NA)
# A tibble: 2 x 4
  x          x1         x2      x3
  <date>     <date>     <drtn>  <date>
1 2019-10-03 2019-09-23 2 days  NA
2 NA         NA         NA days NA
```

(here drtn = duration, their "timedelta" type; so translating to our terminology: timestamp - NA = timestamp)

Here it seems to preserve the original type of the column for arithmetic operations involving scalar NAs (when the original type is possible; for e.g. int / NA (division) it gives a float column of all NAs and not an int column, as expected). So R also has no "NA type": if you create a column of all NAs, it gives a logical-typed column (boolean). R actually also has multiple, type-specific NA values (NA_integer_, NA_real_, NA_character_, next to the default logical NA).

**Apache Arrow**

Arrow consistently handles missing values with a mask (so not with sentinel values such as NaN or NaT). It also has a NullType for arrays of only nulls without a specified type (https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#null-layout).
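For completeness, the Arrow behaviour can be demonstrated with pyarrow's current API:

```python
import pyarrow as pa

arr = pa.array([1, 2, None])
print(arr.type, arr.null_count)  # int64 1  (nulls tracked in a validity bitmap)

untyped = pa.array([None, None])
print(untyped.type)              # null  (Arrow's dedicated NullType)
```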
And now some specific answers (sorry for the long wall of text):
A third option could maybe be to have an NA for each type, but have them all display as "NA". That could help with the "scalar arithmetic" issues by preserving the originating type information in the NA value. On the other hand, this seems like a lot more work to implement (would e.g. this "NA_period" have all the methods of a Period?) and also potentially to code against.
In the document, I give 3 main arguments: 1) inconsistent user interface, 2) proliferation of "not a value" types, 3) mis-use of the NaN float (with longer descriptions for each in https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB). I think that my 1) maps to your 1., and my 3) maps more or less to your 2. My 2) is basically the inverse argument of the alternative (consistently using the "not a value" pattern for all types).
Assuming that we then want a "not a value" for each arithmetic type (otherwise you keep the same problems we have now with NaT being shared across the datetime-like types), that quickly leads to the proliferation of "not a value" types mentioned above.
Those things are indeed difficult to guess or measure. For the "don't get in the way" case, I was mainly thinking about examples of combining data, such as in concat or alignment operations.
Yes, given that the underlying numpy array has this "not a value" sentinel, we can keep that as well (if we combine it with a mask for the actual missing values). However, in practice (in the far future where an NA system would be fully rolled out), I think most of the time people will have NA rather than NaT in their values, as all operations that introduce missing values in pandas would give NA (IO with missing values, reindexing / alignment, ...).
Agree generally with having pd.NA; it's going to be much simpler from a user viewpoint and an implementation POV, but with some concerns.
So I would just say implement this using pyarrow masks (but not pyarrow memory for the arrays themselves; we have differing enough semantics that we likely want to keep the values in numpy arrays, at least until pyarrow grows stable operations).
TL;DR: can we break any non-controversial pieces of this off? The scope here is overwhelming.
Yes, I think we are on the same page description-wise.
I don't think that is a necessary assumption. The default alternative to this proposal is not a proliferation from implementing a bunch of new things; it is the status quo. The existing inconsistencies with NaT are annoying, but we generally have a handle on them. This is already a complicated discussion, and I'd prefer to keep #24983 separate to whatever extent is feasible.
I think that would be great; it would make our logical ops (which I'm currently working on, and which are a PITA) much more internally consistent. More importantly, it can be implemented independently of the rest of this proposal, which is kind of overwhelming in scope.
I understand the user pain point w/r/t the non-arithmetic dtypes (my point 2, part of Joris's 3), but I don't at all see how a single NA helps for the arithmetic dtypes.
AFAICT the contentious part of this is just about what operations involving the NA scalar should return.
...all of which I think is in agreement with what @jorisvandenbossche said above.
Yes, that's indeed what I tried to explain in words; thanks for the examples. Assuming we have this "generic" untyped pd.NA as well, you can still run into surprising cases, e.g. depending on whether your data ends up with the generic NA dtype or a specific dtype.
(sorry, posted this on the wrong thread previously) I actually ran into this in numpy a few days back. Given the way this currently works, the main question I am wondering about right now is how a generic NA should interact with (value-based) dtype promotion.
Note that this does not touch the NaN-vs-NA distinction itself.
(Posted in the wrong issue before, so moving it here.) The discussion by @seberg above leads me to the following idea, which I think I may have posted elsewhere. I think it is important to distinguish between NA meaning "data missing" versus NaN meaning "not a number". In the current pandas world, there is no differentiation. Consider this simple example:

```python
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: s = pd.Series([0, 4, np.nan], dtype=float)

In [4]: s2 = pd.Series([0, 2, 1], dtype=float)

In [5]: s
Out[5]:
0    0.0
1    4.0
2    NaN
dtype: float64

In [6]: s2
Out[6]:
0    0.0
1    2.0
2    1.0
dtype: float64

In [7]: s3 = s / s2

In [8]: s3
Out[8]:
0    NaN
1    2.0
2    NaN
dtype: float64
```

In this example, the NaN in element 0 of s3 is the result of computing 0/0, while the NaN in element 2 comes from the missing value in s. IMHO, these should be represented differently: the arithmetic/boolean operations that occur with a floating-point NaN have well-defined (IEEE) semantics, which are not the same thing as data being missing.
@seberg Is there a numpy issue/thread with discussion about this? Is it in your dtype / ufunc refactor that you want to get rid of value-based dtype promotion?
I am personally not fully convinced we need type-specialized NAs, but if we add them, I personally think it is important for usability that the generic NA works in most cases out of the box.
As mentioned above (#28095 (comment)), this proposal in principle allows distinguishing both concepts for floating-point data. In the example you give, we can indeed preserve the NaN in the case of an invalid arithmetic result: NaN values are part of the actual values, while NA values are kept track of in the mask.
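A small numpy sketch of how a values/mask split separates the two concepts (illustrative only, not pandas internals):

```python
import numpy as np

values = np.array([0.0, 4.0, 0.0])     # stored values (placeholder at the missing slot)
mask = np.array([False, False, True])  # True marks a missing (NA) entry

other = np.array([0.0, 2.0, 1.0])
with np.errstate(invalid="ignore"):
    result = values / other           # element 0 -> nan from 0/0: a genuine NaN
result_mask = mask.copy()             # missingness propagates via the mask

# element 0: NaN in the values, mask False -> invalid result (NaN)
# element 2: mask True -> missing data (NA), whatever the value slot holds
print(result, result_mask)
```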
@jorisvandenbossche My thread would be the closest, but no, I do not want to get rid of value-based promotion for Python types. I.e. see it as creating a new function (a similar one exists in the C-API). In any case, none of this is settled as such, but it is how I see things right now, so I think it is the most likely thing to happen. (I am going to be gone for the next few weeks, and may be around for only a few days.)
Anyone (@jorisvandenbossche?) care to summarize the state of this discussion, and maybe suggest paths to get it unstuck?
With some delay, trying to summarize the above discussion: I think in general there seems to be approval for the idea of a dedicated NA value. The main discussion item here can be reduced to a single, dtype-agnostic NA vs. dtype-specialized NAs, and the summary ended up more as a list of open questions for each option.
One other point worth noting: I opened a PR with a basic BooleanArray extension array (for now without any new NA behaviour): #29555
My preferred API would be dtype-specialized NAs along with a generic pd.NA entry point:

```python
pd.Series([pd.NA])                        # raises TypeError!
pd.Series([pd.NA], dtype=pd.NA['int64'])  # works (casts the pd.NA to a pd.NA['int64'] specialized dtype)
```

IMHO this gives the user the simplicity of a single NA to remember, while keeping dtype information attached where it matters. I guess ideally the specialized NAs would all be reachable from the generic pd.NA, as in the pd.NA['int64'] spelling above.
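A minimal sketch of what such a dtype-specialized NA API could look like; none of these names exist in pandas, and the pd.NA['int64'] spelling is the commenter's:

```python
# Hypothetical sketch of the dtype-specialized NA idea suggested above.

class SpecializedNA:
    """An NA value that remembers which dtype it originated from."""
    def __init__(self, dtype):
        self.dtype = dtype

    def __repr__(self):
        return f"NA[{self.dtype}]"

class GenericNA:
    """The generic entry point: indexing produces a specialized NA."""
    def __getitem__(self, dtype):
        return SpecializedNA(dtype)

    def __repr__(self):
        return "NA"

NA = GenericNA()
print(NA["int64"])  # NA[int64]
```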
For backward compatibility, I'd say they would need an option to keep the current behaviour.
Ideally (for me), yes. As someone else mentioned above, missing is a different concept from invalid, and for certain use cases it would be very useful to be able to distinguish them.
Thanks for the summary, Joris. I agree that there's probably consensus for moving forward with some kind of NA. I'm having a hard time judging which of a single NA or dtype-specialized NAs is the better end state. I would like to see something for 1.0, especially for StringArray, BooleanArray, and IntegerArray; I already know that StringArray shouldn't be using np.nan. If you expect a single pd.NA to be the eventual end state, it would be good to use it in those arrays from the start.
I just opened #29597 for a quick (pure Python) experiment for a single NA scalar.
Would prefer that we actually merge something and let it sit in master for a while, so -1 on doing this for 1.0 unless it's either not used anywhere or we significantly delay 1.0.
Reviewing this and #32265 in light of a little over two years of experience with pd.NA, the main question to which I think we need a definitive answer is: does/should np.nan mean the same thing as pd.NA? Depending on whether the answer is NO or YES, there are different sets of changes we can/need to make.
We should also reconsider whether we want the propagating behavior (in comparisons and logical operations).
To be clear, I think the propagating behavior is "kind of neat," but there is a tradeoff with maintenance burden to consider.
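For reference, the propagating behavior as it exists in pandas 1.0 and later:

```python
import pandas as pd

print(pd.NA + 1)       # <NA>  -- arithmetic propagates
print(pd.NA == pd.NA)  # <NA>  -- comparisons propagate too
print(True | pd.NA)    # True  -- Kleene logic: the result is known regardless of NA
print(False | pd.NA)   # <NA>  -- here the result genuinely depends on the NA
```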
I am working on a new variable-width UTF8 string dtype for numpy that supports arbitrary null sentinels, with an eye to explicitly supporting NA so we can replace object string arrays in pandas. This week I discovered that it's very difficult to identify the NA singleton in C code that needs to be portable across Python implementations. As far as I'm aware (please correct me if this is incorrect), the canonical way to identify NA is something like:

```python
if obj is pd.NA:
    # handle nulls
    ...
```

This is problematic for pypy, particularly if I'm writing code against pypy's cpyext CPython C API emulation, since there is currently no straightforward way to spell the equivalent of the Python `is` operator there. In CPython, you just do pointer equality, so it's tempting to do that. However, as I discovered this week, any C extension that uses pointer equality like this will be subtly broken on pypy under cpyext. The closest I have to a workaround is importing pandas and holding a reference to the pd.NA singleton. If instead NA behaved more like NaN (e.g. were identifiable by a type or value check rather than only by identity), this problem would go away.

Sorry to resurrect this old issue, but seeing how the documentation still refers to NA as experimental, I thought it would still be worth bringing up. Also, if this has come up already, please point me to the old discussion; I couldn't find any previous discussion about this perhaps niche issue.
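The asymmetry described here can be seen from pure Python with today's pd.NA:

```python
import pandas as pd

obj = pd.NA
print(obj == pd.NA)  # <NA>  -- equality propagates, so it cannot be used to detect NA
print(obj is pd.NA)  # True  -- identity is the canonical check (hard to spell portably in C)
print(pd.isna(obj))  # True  -- works, but requires calling back into pandas
```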
I cleaned up my initial write-up of the consistent missing values proposal (#27825 (comment)) and incorporated the items brought up in the last video chat, so I think it is ready for some more detailed discussion.
The last version of the full proposal can be found here: https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB
TL;DR:
- Introduce a new NA value (singleton) pd.NA that can be used as the missing value indicator (when accessing a single value, not necessarily how it is stored under the hood).
- Stop using np.nan or pd.NaT in new data types (e.g. nullable integers, potential string dtype).

cc @pandas-dev/pandas-core