-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add BooleanArray extension array #29555
Conversation
pandas/core/arrays/boolean.py
Outdated
data[self._mask] = self._na_value | ||
return data | ||
else: | ||
return self._data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I adapted this compared to IntegerArray to return bool dtype if there are no NaNs (this helps to enable boolean indexing automatically for this case). But we should probably make a consistent choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't think we want to return object-dtype here if we can avoid it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, a predictable dtype (not depending on the values in the array) is also nice ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, as I thought more about it the dtype stability was winning me over. Not sure what's best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched back to always object dtype for now. There are methods (like astype(bool)) that force you to give bool dtype if possible. We should maybe think about adding a to_numpy
to our arrays as well (the same for IntegerArray).
pandas/tests/arrays/test_boolean.py
Outdated
# def test_to_boolean_array_error(values): | ||
# # error in converting existing arrays to BooleanArray | ||
# with pytest.raises(TypeError): | ||
# pd.array(values, dtype="boolean") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently those actually work, as the coercion routine is using numpy coercion: np.array(['a'], dtype=bool)
actually works and use bool-ness of the values.
So the question here is if we want to allow this, or rather be more strict (eg only allow actual bools + integer 0/1)? (personally prefer to be strict)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm, as long as actual bools and integer 0/1 as accepted then either works for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to have a decision here, and get tests for it. I think we can justify either, since np.array([np.nan], dtype=bool)
will be array([True])
. So we're already deviating from NumPy in this limited case.
Do you have a plan for how to do the actual validation? Will it be somewhat expensive?
What are we allowing in __setitem__
? Do we convert the value to bool
, or do we require booleans?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So an option could be to allow: bool + int/float of 0's and 1's (the float is needed in case it are 0/1's with np.nan), and raise for the rest.
Do you have a plan for how to do the actual validation? Will it be somewhat expensive?
We can check for actual bool/int/float dtypes (and then for int and float check that the original is equal to the result converted back to numeric; to ensure it were only 0's and 1's). For object we can first do an infer_dtype, and then apply the same logic.
What are we allowing in setitem? Do we convert the value to bool, or do we require booleans?
Currently, we use the same coerce_to_array
function in __setitem__
to convert the value(s) to be set to boolean values+mask. So if we restrict like the above, then that follows the same pattern for __setitem__
(but we could also be more strict in __setitem__
if that is preferred)
pandas/tests/arrays/test_boolean.py
Outdated
|
||
|
||
# @pytest.mark.parametrize("ufunc", [np.add]) | ||
# def test_ufuncs_binary(ufunc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still need to implement arithmetic ops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the expected return types? Would you use IntegerArray for things returning an integer dtype?
ndarray[bool] will sometimes return a float dtype...
In [17]: a = np.array([True, False])
In [18]: a / 1
Out[18]: array([1., 0.])
would we return an ndarray there, with NaN for the missing values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess returning a PandasArray is also an option... If we want to stay in the algebra of op(BooleanArray, other)
returning an ExtensionArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I went with tthe dtype that numpy ops result in, and then if result is bool -> BooleanArray, if is int -> IntegerArray, otherwise if result is float dtype (or something else) return the numpy array.
should this be marked as a WIP? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason you did not inherit from IntegeryArray (and dtype) and just copied
surely these share much more code and could easily share a bae class at the very least
As I said in the top post: yes this needs to share code with IntegerArray through common base class, but initially focussed on a working BooleanArray and deciding on its behaviour. I don't think it should directly inherit IntegerArray, but personally prefer a common base class. I think that will make it clearer what is actually shared. |
i agree a common base class is good and as a follow up is fine |
@jbrockmendel not in the sense of "this is ready to be reviewed", and it also passing all extension array tests (except one, see below). The one failure in the base extension array tests, is for concatting mixed dtypes (
So it is expecting to return object dtype.
What to follow here? (we should probably have some discussion about clearer rules (and document this) for which dtypes coerce with which others. Or does this exist somewhere?) |
I'm confused by a potential double-negative. If this is not ready to be reviewed, then I suggest it be marked as WIP so that would-be reviewers focus elsewhere for the time being. |
Let me be clearer then: this is ready to be reviewed |
Are people OK with first reviewing / getting merged this version of a BooleanArray without using the new NA, and only in a follow-up adapt it to use NA ? |
yes let’s keep BooleanArray independent and merge first |
pandas/core/arrays/boolean.py
Outdated
else: | ||
mask = mask_values | ||
else: | ||
mask = np.asarray(mask, dtype=bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm when is this case reached? I'm a little confused about the mask | mask_values
below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle, you can pass a mask
manually to this function (and in that case it gets combined with the potential mask of the values). However, this functionality is actually not used right now by BooleanArray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added tests for this (as I think we want to expose this publicly at some point)
@TomAugspurger boolean indexing does not yet work. Well, it did work when BooleanArray coerced to a bool ndarray, and then it worked automatically (for the case that there are no NAs). |
SGTM. Have we discussed the API on that anywhere? Do missing values propagate, or are they treated as False? |
That's where I want feedback, see the discussion in #28778 |
pandas/core/arrays/boolean.py
Outdated
# may need to fill infs | ||
# and mask wraparound | ||
if is_float_dtype(result): | ||
mask |= (result == np.inf) | (result == -np.inf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it is copied from IntegerArray, and should not be an inplace operation. It risks alterting self._mask inplace. xref #27829
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like it is copied from IntegerArray, and should not be an inplace operation. It risks alterting self._mask inplace. xref #27829
OK, I removed this here altogether, as I don't think we should fill any infs in the BooleanArray division (we should separately also look at this for IntegerArray to see what behaviour we want)
This means that for BooleanArray I follow numpy's behaviour in returning float with inf:
In [19]: True / np.array([True, False, True])
/home/joris/miniconda3/envs/dev/bin/ipython:1: RuntimeWarning: divide by zero encountered in true_divide
#!/home/joris/miniconda3/envs/dev/bin/python
Out[19]: array([ 1., inf, 1.])
@jreback @TomAugspurger OK, made some substantial changes, recommend to look at the diff of the last commits: https://github.com/pandas-dev/pandas/pull/29555/files/a3e1e931392edd2631c3731a081e4eb16128c80a..90558d696c5407523e9fd0e8204c0a2ac8950653 |
have u tested this is indexing? i assume that would be a follow up |
Yep, see #29555 (comment) |
So to gather the follow-ups:
(will open an issue for this) |
Are you all fine with almost merging this? (or maybe a final round of review?) |
let me look one more round |
Other question: what do we want Currently, you get this behavour:
Do we want to return BooleanArray also when there are no missing values (which currently returns boolean PandasArray)? When a |
I would be +1 on using all of the extension arrays (SA, IA, BA) in |
I think our reasoning before was to keep this consistent with the main constructors (eg in Series), so that we could in principle move towards Series(..) being Series(array(..)) at some point. |
I don't think that we should have value-dependent behavior here. On the broader issue, I'd be fine with |
@jreback ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks fine. some questions, can be done on followup, mostly asking about test coverage. if you want to do a minor pass ok, or can merge.
pandas/core/arrays/boolean.py
Outdated
return True | ||
|
||
|
||
def coerce_to_array(values, mask=None, copy=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type these as much as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not really sure how to type those (I don't see anything in pandas._typing
for "list like" ?)
inferred_dtype = lib.infer_dtype(values_object, skipna=True) | ||
integer_like = ("floating", "integer", "mixed-integer-float") | ||
if inferred_dtype not in ("boolean", "empty") + integer_like: | ||
raise TypeError("Need to pass bool-like values") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests hit here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests hit here?
Yes, there is a test that passes all kinds of non-boolean-like values
In general, I ran locally pytest with coverage, and there is 97% coverage for this file. The main non-covered things are some parts of the ufunc related code, and some length mismatch errors in the ops code.
@classmethod | ||
def _from_sequence(cls, scalars, dtype=None, copy=False): | ||
if dtype: | ||
assert dtype == "boolean" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have tests for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have tests for this?
Do we need that? This is an internal assertion, that should never be raised to the user but is here to help the developer (I don't think we should add tests for those asserts). So the "test" is that this actually never occurs in the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I can also leave it out (the assert). For BooleanArray, the dtype
is not very useful. I think this parameter is mainly used for cases where multiple dtypes are possible per array (eg int64, int32 etc for IntegerArray)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right i think it actually should always be None here, as this is internally passed. however a user might try to pass something that is not None (so this should maybe be a ValueError)
"__rmod__", | ||
): | ||
# combine keeps boolean type | ||
expected = expected.astype("Int8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's what numpy does, eg:
In [17]: np.array([True, False]) ** 2
Out[17]: array([1, 0], dtype=int8)
So for those ops, I just followed numpy's behaviour with boolean arrays.
|
||
|
||
class TestComparisonOps(base.BaseComparisonOpsTests): | ||
def check_opname(self, s, op_name, other, exc=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you need these overrides here? (you are already doing it above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's to override that there is no exception being raised (will add a comment about that).
(above is for the arithmetic ones)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review!
pandas/core/arrays/boolean.py
Outdated
return True | ||
|
||
|
||
def coerce_to_array(values, mask=None, copy=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not really sure how to type those (I don't see anything in pandas._typing
for "list like" ?)
inferred_dtype = lib.infer_dtype(values_object, skipna=True) | ||
integer_like = ("floating", "integer", "mixed-integer-float") | ||
if inferred_dtype not in ("boolean", "empty") + integer_like: | ||
raise TypeError("Need to pass bool-like values") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests hit here?
Yes, there is a test that passes all kinds of non-boolean-like values
In general, I ran locally pytest with coverage, and there is 97% coverage for this file. The main non-covered things are some parts of the ufunc related code, and some length mismatch errors in the ops code.
raise TypeError( | ||
"mask should be boolean numpy array. Use " | ||
"the 'array' function instead" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only when you actually coerce, not here though
Not fully sure I understand. We don't do any coercing in this __init__
constructor, and a few lines above there are isinstance checks that the input can only be boolean ndarrays.
this is something we should check (here and in IntegerArray) as if you accidently pass a non ndim==1 then it would be an error (can be a followup)
As following your comment, I already added a ndim check; I now raise an error if ndim is not 1 (on the lines below)
@classmethod | ||
def _from_sequence(cls, scalars, dtype=None, copy=False): | ||
if dtype: | ||
assert dtype == "boolean" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have tests for this?
Do we need that? This is an internal assertion, that should never be raised to the user but is here to help the developer (I don't think we should add tests for those asserts). So the "test" is that this actually never occurs in the tests.
@classmethod | ||
def _from_sequence(cls, scalars, dtype=None, copy=False): | ||
if dtype: | ||
assert dtype == "boolean" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, I can also leave it out (the assert). For BooleanArray, the dtype
is not very useful. I think this parameter is mainly used for cases where multiple dtypes are possible per array (eg int64, int32 etc for IntegerArray)
"__rmod__", | ||
): | ||
# combine keeps boolean type | ||
expected = expected.astype("Int8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's what numpy does, eg:
In [17]: np.array([True, False]) ** 2
Out[17]: array([1, 0], dtype=int8)
So for those ops, I just followed numpy's behaviour with boolean arrays.
|
||
|
||
class TestComparisonOps(base.BaseComparisonOpsTests): | ||
def check_opname(self, s, op_name, other, exc=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's to override that there is no exception being raised (will add a comment about that).
(above is for the arithmetic ones)
thanks @jorisvandenbossche other things can be done as followups. |
@jorisvandenbossche for those of us who didn't follow the thread in real-time, can you summarize the design decisions that got made in the end? and any decisions that remain un-decided? |
…ndexing-1row-df * upstream/master: (185 commits) ENH: add BooleanArray extension array (pandas-dev#29555) DOC: Add link to dev calendar and meeting notes (pandas-dev#29737) ENH: Add built-in function for Styler to format the text displayed for missing values (pandas-dev#29118) DEPR: remove statsmodels/seaborn compat shims (pandas-dev#29822) DEPR: remove Index.summary (pandas-dev#29807) DEPR: passing an int to read_excel use_cols (pandas-dev#29795) STY: fstrings in io.pytables (pandas-dev#29758) BUG: Fix melt with mixed int/str columns (pandas-dev#29792) TST: add test for ffill/bfill for non unique multilevel (pandas-dev#29763) Changed description of parse_dates in read_excel(). (pandas-dev#29796) BUG: pivot_table not returning correct type when margin=True and aggfunc='mean' (pandas-dev#28248) REF: Create _lib/window directory (pandas-dev#29817) Fixed small mistake (pandas-dev#29815) minor cleanups (pandas-dev#29798) DEPR: enforce deprecations in core.internals (pandas-dev#29723) add test for unused level raises KeyError (pandas-dev#29760) Add documentation linking to sqlalchemy (pandas-dev#29373) io/parsers: ensure decimal is str on PythonParser (pandas-dev#29743) Reenabled no-unused-function (pandas-dev#29767) CLN:F-string in pandas/_libs/tslibs/*.pyx (pandas-dev#29775) ... # Conflicts: # pandas/tests/frame/indexing/test_indexing.py
@jbrockmendel started at #29556 (comment) |
@jbrockmendel this PR has no NA-related design decisions (it's just adding a BooleanArray that has not yet any specific NA behaviour). All discussions are still in the respective issues (#28095, #28778) and in the open PR to add NA (#29597), and will come in a future PR to add NA to BooleanArray. |
Closes #21778
xref closed PRs #22226 and #25415, and xref the discussions on missing values in #28095 and #28778
This PR adds a BooleanArray extension array ("boolean" dtype):