-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: add ExtensionArray.to_numpy to have control over conversion to numpy array #30322
ENH: add ExtensionArray.to_numpy to have control over conversion to numpy array #30322
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice overall, thanks.
pandas/core/arrays/boolean.py
Outdated
Whether to ensure that the returned value is a not a view on | ||
the array. Note that ``copy=False`` does not *ensure* that | ||
``to_numpy()`` is no-copy. Rather, ``copy=True`` ensure that | ||
a copy is made, even if not strictly necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we give some guidance on when no-copy is possible? Is it only when there are no missing values and we're going to the numpy dtype (bool in this case)?
And thinking forward, can a pyarrow array with no NAs be converted to an ndarray without any copies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it is only possible with no NAs and bool dtype. I first thought int8 would also be possible, but numpy doesn't seem to do such conversion without copy.
And thinking forward, can a pyarrow array with no NAs be converted to an ndarray without any copies?
For boolean not (since it is bits, not bytes), but in general (eg for IntegerArray without nulls) yes
pandas/core/arrays/boolean.py
Outdated
else: | ||
raise ValueError( | ||
"cannot convert to bool numpy array in presence of missing values" | ||
) | ||
data = self._data.astype(dtype) | ||
data[self._mask] = na_value | ||
if self.isna().any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this up to before L325? It seems like an isna().any()
is going to happen at least once, so should avoid ever doing it twice.
Just a note: The signature here differs from Series/Index.to_numpy and DataFrame.to_numpy, since this includes an Should we add |
Would that already be useful for any of the other implementations / dtypes? Right now I was planning to pass-through additional keywords to the EA version (if the Series is backed with EA and has a to_numpy method, or something). But of course adding the keyword specifically makes it clearer to document / find it. |
Just passing through makes sense. I think we can actually add a default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think merge whenever you're comfortable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. do you need to add to api.rst?
Shouldn't need to. I think we autodoc all the methods here. |
I added the pass-through from Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, but the signature change to kwargs is not great, can you directly specify it?
pandas/core/base.py
Outdated
@@ -780,7 +780,7 @@ def array(self) -> ExtensionArray: | |||
|
|||
return result | |||
|
|||
def to_numpy(self, dtype=None, copy=False): | |||
def to_numpy(self, dtype=None, copy=False, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we instead add na_value here?
pandas/core/base.py
Outdated
@@ -780,7 +780,7 @@ def array(self) -> ExtensionArray: | |||
|
|||
return result | |||
|
|||
def to_numpy(self, dtype=None, copy=False): | |||
def to_numpy(self, dtype=None, copy=False, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is changing the signature of all EA, but are you adding tests? (e.g. StringArray / IntArray), or as followup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is changing the signature of all EA, but are you adding tests? (e.g. StringArray / IntArray), or as followup?
Note this is not changing the signature of the EA method (this is the Series/Index method)
I think in general it might be useful to pass through kwargs, in that way ExtensionArray authors can have more specific control over the conversion to numpy (if we do this, we should add a test for it for one of the test EAs, like DecimalArray, with an additional keyword in the to_numpy
method).
Specifically for na_value
, we could maybe add it to the actual signature. Although it is right now not implemented for most of the dtypes (only for boolean EA dtype).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, actually meant that. yeah I would be explict here I think. the question is can you? e.g. you want the default to be libmissing.NA
, but that is not the default for anything but BA now? I think it would be better to be explicit here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's what makes it more complicated here. It's really a EA-specific keyword, and the general solution for that is passing through keywords (as done here), I think.
Of course, for our own EAs, we can make exceptions and explicitly add them.
Now, since the other dtypes don't yet implement this, the default of libmissing.NA
might not necessarily be a problem (this default is not valid for other dtypes, but it's not used for those dtypes anyway). Now, having this keyword with such default might also be confusing, since the NA won't be used for most keywords.
If we add it here explicitly, I would probably use the lib._no_default
trick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add it here explicitly, I would probably use the lib._no_default trick
I think this would be the better long term soln ,yes?
can you rebase & update to comments |
I've pushed a somewhat large change that implements the discussion in #30322 (comment). It adds two main things
A few things to note
|
The CI failure is unrelated and is being fixed in #30660 |
Thanks for the updates! Indeed, the |
@@ -2232,7 +2232,8 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0, | |||
return objects | |||
|
|||
|
|||
_no_default = object() | |||
# Note: _no_default is exported to the public API in pandas.api.extensions | |||
_no_default = object() #: Sentinel indicating the default value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we rename this to no_default
? (no leading underscore)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about it, don't have a strong opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree we should just rename this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that we have a number of type we use object as the marker (so should change those in a followup)
This is passing again. |
Would like to merge this in a few hours, and put up a PR creating a base class for BooleanArray & IntegerArray later today. |
I already have a branch lying around with an initial take on this, in case you didn't start yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. I would do the rename of _no_default -> no_default though
@@ -2232,7 +2232,8 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0, | |||
return objects | |||
|
|||
|
|||
_no_default = object() | |||
# Note: _no_default is exported to the public API in pandas.api.extensions | |||
_no_default = object() #: Sentinel indicating the default value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree we should just rename this
@@ -2232,7 +2232,8 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0, | |||
return objects | |||
|
|||
|
|||
_no_default = object() | |||
# Note: _no_default is exported to the public API in pandas.api.extensions | |||
_no_default = object() #: Sentinel indicating the default value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that we have a number of type we use object as the marker (so should change those in a followup)
There is still a difference in the handling of NA in conversion to float between integer and boolean. For boolean, I implemented it as strict. So a float cannot hold NA, so if there are missing values, this errors, and you need to explicitly use
While for integer, you implemented this less strict (it's the base EA.to_numpy implementation that is getting this behaviour of
It would be nice to align this. This relates to the discussion in the issue #30038 |
Yes, +1. If we make it public to use for external EA authors, let's make that clear with the name you can use it (and the actual object itself is already in the private |
Rereading that again, it seems the conclusion was somewhat that we want to raise for this (so no automatic conversion of pd.NA to np.nan) |
OK with doing the _no_default -> no_default in a followup? |
FYI, I have a followup PR doing IntegerArray.to_numpy, which will fix the inconsistencies when converting to an ndarary (changing integer to match the behavior here). I think that needs to be its own PR, with a followup moving them to the base class. OK to merge this? |
You saw my comment above about this? (I have an initial branch for this, in case you didn't start it yet) But yes, I think that is good as a follow-up. |
xref #30038
Would still need to add this to the other arrays with NA (IntegerArray, StringArray), and need to pass through such option from Series.to_numpy. But already putting this up to check if we are OK with such interface and behaviour.