Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: EA._hasnans, EADtype.empty #45024

Merged
merged 3 commits into from
Dec 23, 2021
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 13 additions & 3 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -425,7 +425,7 @@ def __contains__(self, item: object) -> bool | np.bool_:
if not self._can_hold_na:
return False
elif item is self.dtype.na_value or isinstance(item, self.dtype.type):
return self.isna().any()
return self._hasnans
else:
return False
else:
Expand Down Expand Up @@ -603,6 +603,16 @@ def isna(self) -> np.ndarray | ExtensionArraySupportsAnyAll:
"""
raise AbstractMethodError(self)

@property
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not use a cached_property?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, EAs are mutable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right (though actually we could do this as we know when they are mutated, e.g keep a dirty bit). but out of scope

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've looked into this w/r/t DatetimeArray.freq and found that propagating it through views (e.g. arr[:5][0] = np.nan) makes this a giant pain

def _hasnans(self) -> bool:
# GH#22680
"""
Equivalent to `self.isna().any()`.

Some ExtensionArray subclasses may be able to optimize this check.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a method that subclasses can override, can you add it to the documentation at the top of this file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also rename it to _hasna or so, to be consistent with our public NA related properties (we could also make this public?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also rename it to _hasna or so, to be consistent with our public NA related properties

I used _hasnans to be consistent with the datetimelike arrays (didn't realize the masked arrays had _hasna until just now). Also consistent-ish with the hasnans we have in Index and Series.

We should revisit the naming convention(s) once the NA/NaN API design discussion reaches some kind of conclusion.

(we could also make this public?)

I chose to keep this private so we would have the option of changing it later.

(Side-note: "private" for EA methods is really ambiguous, could mean a) user shouldn't use use, b) should only be called from self, c) 3rd party authors shouldn't use, d) 3rd party implementors shouldn't override. Not sure how we can clarify or tighten this)

Copy link
Member

@jorisvandenbossche jorisvandenbossche Dec 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current situation is IMO implicitly: everything in the base class is automatically "public" for developers of third party EAs, unless explicitly documented.

_hasnans is being used in methods that EA authors can override, and is useful for them as well, and they are also allowed to override it. So I think this is "public" for developers (regardless of it being public for users or not), and thus should decide on a name now, IMO.

The question about making it public for users (no leading underscore) is indeed something that can also be handled later.

"""
return self.isna().any()

def _values_for_argsort(self) -> np.ndarray:
"""
Return values for sorting.
Expand Down Expand Up @@ -686,7 +696,7 @@ def argmin(self, skipna: bool = True) -> int:
ExtensionArray.argmax
"""
validate_bool_kwarg(skipna, "skipna")
if not skipna and self.isna().any():
if not skipna and self._hasnans:
raise NotImplementedError
return nargminmax(self, "argmin")

Expand All @@ -710,7 +720,7 @@ def argmax(self, skipna: bool = True) -> int:
ExtensionArray.argmin
"""
validate_bool_kwarg(skipna, "skipna")
if not skipna and self.isna().any():
if not skipna and self._hasnans:
raise NotImplementedError
return nargminmax(self, "argmax")

Expand Down
14 changes: 14 additions & 0 deletions pandas/core/dtypes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
from pandas._libs.hashtable import object_hash
from pandas._typing import (
DtypeObj,
Shape,
npt,
type_t,
)
Expand Down Expand Up @@ -208,6 +209,19 @@ def construct_array_type(cls) -> type_t[ExtensionArray]:
"""
raise AbstractMethodError(cls)

def empty(self, shape: Shape) -> type_t[ExtensionArray]:
"""
Construct an ExtensionArray of this dtype with the given shape.

Analogous to numpy.empty.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a Parameters section?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, just added this to me next follow-ups branch.

Returns
-------
ExtensionArray
"""
cls = self.construct_array_type()
return cls._empty(shape, dtype=self)

@classmethod
def construct_from_string(
cls: type_t[ExtensionDtypeT], string: str
Expand Down
8 changes: 7 additions & 1 deletion pandas/tests/extension/base/constructors.py
Original file line number Diff line number Diff line change
Expand Up @@ -126,6 +126,12 @@ def test_construct_empty_dataframe(self, dtype):
def test_empty(self, dtype):
cls = dtype.construct_array_type()
result = cls._empty((4,), dtype=dtype)

assert isinstance(result, cls)
assert result.dtype == dtype
assert result.shape == (4,)

# GH#19600 method on ExtensionDtype
result2 = dtype.empty((4,))
assert isinstance(result2, cls)
assert result2.dtype == dtype
assert result2.shape == (4,)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make this public, we should maybe a test that passing a scalar integer works as well (if we want to allow that, following np.empty)