Missing values proposal: concrete steps for 1.0 #29556
Comments
My initial reaction is +1 on having StringArray and IntegerArray start using this new pd.NA. If we're able to get BooleanArray done for 1.0, then that'd be great. I think time-wise, it's OK to settle for a simple ndarray + a private mask attribute. We can optimize to use pyarrow or a bitmask later. (edit: just saw #29555 (review) Nice!) So to be explicit, +1 on the proposed breaking changes to IntegerArray. |
Well, we would need a hard dependency on pyarrow to allow BooleanArray. Since we haven't had this discussion, -0 on it. pd.NA is fine, but I see this as too aggressive to use in IntegerArray for 1.0; this should be in master for a while and get usage. |
It seems to me like we could do integer array using a Boolean ndarray as a mask, and use pyarrow / a bitmask as an optimization later. It’s 2x the memory for now, but that’s maybe Ok.
And I think we’ll eventually want IntegerArray ops to return nullable booleans, so better to do before 1.0?
|
it’s not 2x it’s like 10x compared to a native impl |
I can't find an issue for it, but I still think we should just take pyarrow on as a hard dependency if it simplifies our development. If import time is the big blocker, IMO that's not really that big of a deal, and it would make it easier to manage / communicate BoolArray and similar developments. +0 on pd.NA. Not sure I understand the entire scope of it from what is described, but maybe it's something that just needs to be developed out. |
Sorry, I should have been clearer about this: everything I said above is meant to be pure numpy-based masked arrays, without any dependency on pyarrow, for now. This at least enables us to experiment with the API design without needing a heavy dependency such as pyarrow for the storage. Although I am clearly biased towards pyarrow, I don't think any of this needs pyarrow in the short term; we can perfectly well implement this using numpy (and doing so doesn't complicate it; on the contrary, there is still a lot more functionality available for numpy masks compared to pyarrow arrays). I indeed started a PR for the BooleanArray part of this: #29555. The numpy mask-based approach will not be the most efficient (both in memory and speed), but IMO it is now more important to first experiment with the design (the same can be said about the already existing IntegerArray).
For IntegerArray it is 2x (which is what Tom was saying I think), for BooleanArray the optimization could indeed be a lot bigger. |
Agreed (can add StringArray here too). I want to have 1.0 be as object-dtype free as possible :) |
For BooleanArrays with no missing values, isn't it exactly 2x more memory (assuming we store the mask when there are no missing values)? Then it's just the boolean ndarray of values + a boolean mask, so 2x. And we can, like pyarrow, skip allocating the mask if there are no NAs (though that's an optimization, not a requirement). For one with missing values, isn't it less memory than our current implementation?
In [13]: s = pd.Series([None] + [True] * 1000)
In [14]: s.memory_usage(deep=True, index=False)
Out[14]: 36024
versus
In [15]: mask = np.array([True] + [False] * 1000)
In [16]: 2 * mask.nbytes
Out[16]: 2002
Even if we're using NaN for a boolean with missing values, we still have an improvement, since we use float64 by default
In [19]: s.astype(float).nbytes
Out[19]: 8008 |
It's indeed already an optimization compared to object dtype array of True/False/None. But I think Jeff meant the potential optimization of using pyarrow's boolean array storage, which can give a lot of additional memory improvement (8x compared to the double numpy array):
But anyway, as I argued above, let's leave this for the future, and focus now on the API using numpy arrays under the hood (that use some more memory). |
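To make the "8x" figure above concrete, here is a rough sketch using only numpy. It assumes the same 1001-element example as earlier in the thread and uses np.packbits as a stand-in for a bit-packed layout; it is not pyarrow's actual storage code, and the variable names are illustrative.

```python
import numpy as np

# 1001 boolean values, the first one missing (mirrors the example above).
values = np.array([False] + [True] * 1000)
mask = np.array([True] + [False] * 1000)  # True marks a missing value

# numpy stores one byte per boolean, so values + mask cost 2 bytes per element.
bytes_numpy = values.nbytes + mask.nbytes

# Packing 8 booleans into each byte approximates a bitmap-style layout.
packed_values = np.packbits(values)
packed_mask = np.packbits(mask)
bytes_packed = packed_values.nbytes + packed_mask.nbytes

# The packed form round-trips back to the original booleans.
restored = np.unpackbits(packed_values)[: len(values)].astype(bool)

print(bytes_numpy, bytes_packed)
```

The ratio is slightly under 8 here only because 1001 is not a multiple of 8, so the last byte of each packed buffer is partly padding.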
From the call today, there were no objections to changing IntegerArray to start using NA instead of np.nan. @jorisvandenbossche do you intend to work on that? Otherwise, I can push up a WIP tomorrow (which will depend on #29597). |
Feel free to already do a WIP on that, there is plenty to do ;) |
BooleanArray is in master. Getting #29597 in will open up more followups. I think the order is roughly
@jorisvandenbossche feel free to update this list (or maybe the original post) as needed [done] |
Thanks Tom, the follow-ups I had listed in #29555 (comment) are more or less the same (added boolean indexing + any/all) |
And updated the top post with a to-do list |
Terji
can’t find this comment on github but
i would say that if we had a DictionaryEncoded type (that backs Categorical and maybe StringArray) then you could certainly use it in the implementation here
… On Nov 25, 2019, at 4:44 PM, Terji Petersen ***@***.***> wrote:
I'm obviously late to this party, but to me it seems like a BooleanArray could be implemented more efficiently in a Categorical-like structure, where the "categories" would be enforced to [False, True] and the "codes" would be an ndarray with possible values of -1, 0 and 1.
If we'd do that, we'd avoid the mask and the ops would be faster and the memory usage lower?
|
I think a big advantage of the mask approach is the consistent data model with other arrays such as IntegerArray now (and possibly others in the future). Also, in operations you often need or produce a mask (e.g. in an IntegerArray comparison producing a boolean array, or in |
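As a minimal sketch of that point, assuming each array is just a (values, mask) pair of numpy arrays with True in the mask meaning "missing": a comparison between two masked integer arrays naturally yields a boolean values array plus a combined mask. The helper name masked_eq is hypothetical and not part of pandas.

```python
import numpy as np

def masked_eq(left_values, left_mask, right_values, right_mask):
    """Elementwise == on two (values, mask) pairs; a True mask entry means missing."""
    result_values = left_values == right_values
    # The result is missing wherever either operand is missing.
    result_mask = left_mask | right_mask
    return result_values, result_mask

a_values = np.array([1, 2, 3])
a_mask = np.array([False, True, False])  # second element is missing
b_values = np.array([1, 2, 4])
b_mask = np.array([False, False, False])

eq_values, eq_mask = masked_eq(a_values, a_mask, b_values, b_mask)
```

The output has the same shape of data model as the inputs, which is what makes a mask-backed boolean array a convenient return type.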
@TomAugspurger are you planning to do a PR for using it in IntegerArray? (or at least you had a beginning branch for this, I think?) I opened a PR to do the BooleanArray case: #29961 |
@jorisvandenbossche I'm not finding a branch for IntegerArray using pd.NA right now. Do you have time to work on it this week? |
I think it is here: TomAugspurger/pandas@NA-scalar...TomAugspurger:NA-scalar+IntegerArray. Feel free to just open it as a PR, and then we can see who has time first |
Thanks, missed that somehow. Opened a draft in #29964 |
@jorisvandenbossche should the subitem |
I've added an item for returning BooleanArray from boolean |
@jorisvandenbossche I added a TODO for implementing
In [4]: import numpy as np; import pandas as pd
In [5]: np.bool_(True) | pd.NA
Out[5]: NA
I just hardcoded that to return NA as a test, but we'll want to do broadcasting, etc. We have to define some behaviors:
Logical operations use Kleene logic. Returns
>>> np.bool_(True) | NA
True
>>> np.array([True, False]) | NA
BooleanArray([True, NA])
Arithmetic operations treat
>>> 1 + pd.NA
NA
>>> np.array([1, 2], dtype='uint64') + NA
IntegerArray([NA, NA], dtype='uint64')
With floats, we return ..., what? An ndarray[float] using NaN (I vote no)? A PandasArray (maybe)?
>>> np.array([1.1, 2.2]) + NA
?
For other ufuncs like |
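The Kleene-logic rule for `|` described above can be sketched on plain numpy (values, mask) pairs. This is only an illustration under the assumption that True in the mask means "missing"; the function name kleene_or is hypothetical and this is not the pandas implementation.

```python
import numpy as np

def kleene_or(left, left_mask, right, right_mask):
    """Kleene OR on (values, mask) pairs; a True mask entry means missing."""
    # A known True on either side decides the result, even if the other side is missing.
    known_true = (left & ~left_mask) | (right & ~right_mask)
    # Otherwise the result is missing as soon as one operand is missing.
    result_mask = (left_mask | right_mask) & ~known_true
    return known_true, result_mask

values = np.array([True, False])
values_mask = np.zeros(2, dtype=bool)  # no missing values on the left

# A scalar NA broadcast to the same length: values are ignored, mask is all True.
na_values = np.zeros(2, dtype=bool)
na_mask = np.ones(2, dtype=bool)

or_values, or_mask = kleene_or(values, values_mask, na_values, na_mask)
```

Read together, or_values and or_mask correspond to the BooleanArray([True, NA]) result shown in the comment: True | NA is True, while False | NA stays missing.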
@jorisvandenbossche one of the items is
That's not quite right for StringArray right? Those won't implement logical ops? I think I may have added StringArray, when I meant to edit the list for comparison ops. And have you started on |
Yes, but I think it is also not correct for Integer, as integer logical ops return integers (which is also the case for Series):
(logical ops are not yet implemented for IntegerArray, but that's a separate issue) Maybe we were confused with comparison ops for a moment, but I think the full line can be removed.
No, not yet |
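For readers following the distinction drawn above, a small numpy-only illustration (not taken from the thread) of why logical ops on integers should not be expected to return a boolean array, while comparison ops do:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([3, 2, 1])

# Logical (bitwise) ops on integers stay in an integer dtype.
bitwise = a & b

# Comparison ops produce booleans, which is where a nullable boolean result applies.
compared = a == b

print(bitwise.dtype.kind, compared.dtype)
```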
I think we have open PRs for everything except
which are blocked by the outstanding PRs |
My question is prosaic and based on I ask as NA is a common abbreviation for 'Not Applicable', 'North America' et al, in a way that in my experience that Given this, if |
@anisotropi4 Thanks for joining in! Would you like to move that comment to a new issue? I think the several interactions of NA with string representation and conversion warrant a dedicated issue to discuss. |
As requested @jorisvandenbossche I have raised this as #30415 with some more detail about current behaviour. I hope that is what you had in mind. |
@jorisvandenbossche I think we can close this, unless you have additional tasks in mind. |
Closing, though will address any remaining tasks in dedicated issues. |
Yes, sorry for the slow reply, this was fine to close. The combined docs of all PRs about this issue could maybe use a check / review, but no need to keep it open for that. |
Updated with to do list:
- pd.NA scalar -> ENH: add NA scalar for missing value indicator, use in StringArray. #29597
- pd.NA in BooleanArray -> Use new NA scalar in BooleanArray #29961
- any/all reductions with skipna=False (API: any/all in context of boolean dtype with missing values #29686) -> API: BooleanArray any/all with NA logic #30062
- pd.NA in IntegerArray -> API: Uses pd.NA in IntegerArray #29964
- .str methods. -> API: Return BoolArray for string ops when backed by StringArray #30239
- NA.__array_ufunc__ -> Implement NA.__array_ufunc__ #30245
Original issue:
Issue to discuss the implementation strategy for #28095.
Opening a new issue, as the other one already has a lot of discussion in several directions, and I would propose to keep this one focused on the practical aspects of how to implement this (regardless of certain aspects of the NA proposal such as single NA vs dtype-specific NAs -> for that I will post a summary of the discussion on #28095 tomorrow).
I would like to propose the following way forward:
On the short term (ideally for 1.0):
- pd.NA scalar, and recognize it in the appropriate places as missing value (e.g. pd.isna). This way, it can already be used in external ExtensionArrays
- BooleanArray with support for missing values and appropriate NA behaviour. To start, we can just use a numpy masked array approach (similar to the existing IntegerArray), not involving any pyarrow memory optimizations.
- pd.NA as the missing value indicator for Integer/String/BooleanArray (breaking change for nullable integers)
On the intermediate term (after 1.0)
I think the main discussion point is if we are OK with such a breaking change for IntegerArray.
I would personally do this: IntegerArray was only introduced recently, is still regarded as experimental, and is the perfect use case for those changes. But it's certainly a clearly backwards-incompatible, breaking change.
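A minimal sketch of what the "numpy masked array approach" for a boolean array could look like, assuming a values ndarray plus a boolean mask where True marks a missing entry. The class name MaskedBooleanSketch and the NA stand-in below are hypothetical placeholders for illustration, not the actual pandas BooleanArray or pd.NA.

```python
import numpy as np

class _NAType:
    """Hypothetical stand-in for the proposed pd.NA scalar."""

    def __repr__(self):
        return "NA"

NA = _NAType()

class MaskedBooleanSketch:
    """Illustrative only: boolean values plus a boolean mask (True = missing)."""

    def __init__(self, data):
        # Record which positions are missing, then store a filler value there.
        self._mask = np.array([x is NA or x is None for x in data], dtype=bool)
        self._values = np.array(
            [False if missing else bool(x) for x, missing in zip(data, self._mask)],
            dtype=bool,
        )

    def isna(self):
        # A copy, so callers cannot mutate the internal mask.
        return self._mask.copy()

    def __getitem__(self, i):
        # Scalar access returns the NA sentinel for missing positions.
        return NA if self._mask[i] else bool(self._values[i])

    def __len__(self):
        return len(self._values)

arr = MaskedBooleanSketch([True, NA, False])
print([arr[i] for i in range(len(arr))])
```

The values stored under a masked position are arbitrary filler; only the mask decides whether an element counts as missing.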
cc @pandas-dev/pandas-core