API: Return BoolArray for string ops when backed by StringArray #30239
@@ -1825,7 +1825,7 @@ def test_extractall_same_as_extract_subject_index(self):

     def test_empty_str_methods(self):
         empty_str = empty = Series(dtype=object)
-        empty_int = Series(dtype=int)
+        empty_int = Series(dtype="int64")
Can anyone with a 32-bit platform confirm the behavior on master for .str methods returning int dtype? Is it int32 or int64?
We may have been inconsistent before, and returned int32 for empty, but int64 for non-empty.
I think this is correct; I'm not sure we ever return a platform int, save maybe in niche cases.
Do we want to optimize the BooleanArray first before returning it from methods like this? I think this could have non-trivial memory impacts right now.
Recall this is just for StringDtype, not object-dtype backed Series. So there's no harm to the current users. And improving performance later is much easier than breaking API.

     mask = isna(arr)

     assert isinstance(arr, StringArray)
     arr = np.asarray(arr)

-    if is_integer_dtype(dtype):
+    if is_integer_dtype(dtype) or is_bool_dtype(dtype):
+        if is_integer_dtype(dtype):
MyPy apparently doesn't like this...
pandas/core/strings.py:164: error: Incompatible types in assignment (expression has type "Type[BooleanArray]", variable has type "Type[IntegerArray]")
Any suggestions on how to please the type checker?
Yeah, if you have an assignment in an if...else block, the type is inferred from the first one that appears.
So before the block you can just declare constructor: Type[Union[IntegerArray, BooleanArray]], or maybe even something simpler like constructor: Type[ExtensionArray], depending on what is valid.
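A minimal sketch of that suggestion, for illustration only. The two classes below are hypothetical stand-ins for the real pandas arrays, and pick_constructor is a made-up helper name, not something in pandas/core/strings.py:

```python
from typing import Type, Union


class IntegerArray:
    """Hypothetical stand-in for the pandas IntegerArray."""


class BooleanArray:
    """Hypothetical stand-in for the pandas BooleanArray."""


def pick_constructor(is_integer: bool) -> Type[Union[IntegerArray, BooleanArray]]:
    # Declare the variable's type once, before the branches, so the type
    # checker does not infer it from the first assignment alone.
    constructor: Type[Union[IntegerArray, BooleanArray]]
    if is_integer:
        constructor = IntegerArray
    else:
        constructor = BooleanArray
    return constructor
```

Without the up-front declaration, the first branch fixes the inferred type as Type[IntegerArray], and the assignment in the else branch is then reported as incompatible.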
nvm, I think I got it.

     mask = isna(arr)

     assert isinstance(arr, StringArray)
     arr = np.asarray(arr)

-    if is_integer_dtype(dtype):
+    if is_integer_dtype(dtype) or is_bool_dtype(dtype):
+        constructor: Union[Type[IntegerArray], Type[BooleanArray]]
-        constructor: Union[Type[IntegerArray], Type[BooleanArray]]
+        constructor: Type[Union[IntegerArray, BooleanArray]]
Optional, but less verbose if you put the Union inside of the Type.
lgtm
@@ -74,6 +74,7 @@ These are places where the behavior of ``StringDtype`` objects differ from
 1. For ``StringDtype``, :ref:`string accessor methods<api.series.str>`
    that return **numeric** output will always return a nullable integer dtype,
    rather than either int or float dtype, depending on the presence of NA values.
+   Methods returning **boolean** output will return a nullable boolean dtype.
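For illustration, a small sketch of the documented behavior, assuming pandas 1.0 or later (where the "string" dtype and the nullable boolean dtype exist). The sample values are made up:

```python
import pandas as pd

# Hypothetical sample data; dtype="string" selects StringDtype.
s = pd.Series(["1", None, "b"], dtype="string")

digits = s.str.isdigit()  # boolean-valued string method
lengths = s.str.len()     # integer-valued string method

print(digits.dtype)   # nullable boolean dtype
print(lengths.dtype)  # nullable integer dtype
```

The missing entry stays missing in both results instead of forcing a fallback to object or float dtype.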
maybe add a doc-link here (can be followup)
I'm going to hold off on this. I'm planning to restructure the docs for integer / boolean / NA once all these PRs are in.
         if is_integer_dtype(dtype):
             constructor = IntegerArray
         else:
             constructor = BooleanArray
Is there a way to combine the above if/else? Reading this is super confusing.
I don't see an easy way. dtype.construct_array_type isn't an option, since the functions using na_map use dtypes like bool and int: they work with both object-dtype arrays returning NumPy arrays and StringArray returning EAs. So we can have either a NumPy dtype or an extension type here.
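A rough sketch of why construct_array_type does not help here, assuming pandas 1.0+ and NumPy. This is not the code in this PR; it only shows that extension dtypes expose the method, while the plain bool / int passed by callers resolve to NumPy dtypes, which do not:

```python
import numpy as np
import pandas as pd

# Extension dtypes know which array class backs them.
bool_cls = pd.BooleanDtype().construct_array_type()
int_cls = pd.Int64Dtype().construct_array_type()

# Plain Python types resolve to NumPy dtypes, which have no such method.
numpy_has_method = {
    t.__name__: hasattr(np.dtype(t), "construct_array_type") for t in (bool, int)
}

print(bool_cls.__name__, int_cls.__name__)
print(numpy_has_method)
```

Since the same dtype argument has to serve both code paths, an explicit branch on the dtype kind is the straightforward option.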
Planning to merge this in a few hours.
What's your current plan?
…as-dev#30239) * API: Return BoolArray for string ops
ref #29556