-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: EA._hash_pandas_object #51319
ENH: EA._hash_pandas_object #51319
Conversation
@MarcoGorelli any thoughts here? |
I'll take a look this week, thanks for the ping |
@phofl thoughts? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@pytest.mark.xfail(reason="ValueError: setting an array element with a sequence") | ||
def test_hash_pandas_object(self, data): | ||
super().test_hash_pandas_object(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you expand on this please? why are we adding it, why does it fail, is it expected that it should pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JSONArray is kind of semi-abandoned, about 20% of the existing tests in this file are xfailed. So it is not "expected" in that I do not expect it to pass ever, but it is "expected" in that if anyone ever wanted to make it usable they would want to make this test pass.
Also, does it need adding here pandas/pandas/core/arrays/base.py Lines 117 to 146 in 089aa0c
? |
Yes, will update |
the docs job is failing , showing
seems related? |
its complaining about No Examples section found? how do i tell it to ignore that? |
the ignore list is in code_checks.sh |
thx @jbrockmendel |
""" | ||
from pandas.core.util.hashing import hash_array | ||
|
||
values = self.to_numpy(copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel is there a reason you didn't use self._values_for_factorize()
as the values to hash (as default), which would preserve the current behaviour?
In addition, if we want to allow EAs to override this, we should either document what the keyword arguments mean, and/or we should expose hash_array
publicly (eg in pandas.api.extensions
) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason you didn't use self._values_for_factorize()
IIRC it was some combination of _values_for_factorize not being required and trying to respect the _vfc shouldn't be used for anything but factorize policy.
In addition, if we want to allow EAs to override this, we should either document what the keyword arguments mean, and/or we should expose hash_array publicly (eg in pandas.api.extensions) ?
Documenting makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC it was some combination of _values_for_factorize not being required and trying to respect the _vfc shouldn't be used for anything but factorize policy.
I have said that before, but I don't see any problem with using that for both, as we have always done. They are very related, and typically values suitable for factorization will also be suitable for hashing, given that factorization is hash-based. And now with this new method, EAs can still override it if they want to use something else than _values_for_factorize
.
We actually have a bunch of internal doc comments indicating that _values_for_factorize
is also used for hash_pandas_object. And we can add this to the actual _values_for_factorize
docstring in the base class to make this use case explicit.
For external EAs that relied on this, this is a regression that we no longer use this by default (e.g. for geopandas this will now fail in hash_object_array
, falling back to casting to string before trying it again. But the string representation is not that faithful, and thus can give wrong hashes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And we can add this to the actual
_values_for_factorize
docstring in the base class to make this use case explicit.
It was actually documented, which was removed here.
@@ -1452,6 +1448,31 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs): | |||
# Non-Optimized Default Methods; in the case of the private methods here, | |||
# these are not guaranteed to be stable across pandas versions. | |||
|
|||
def _hash_pandas_object( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something else: you added this method in the "Non-Optimized Default Methods" section, which says that private methods in this section are not guaranteed to be stable (in contrast to well established methods with an underscore that are meant to be overridden like _values_for_factorize
, _from_sequence
, etc, which are defined more above).
Is that indeed the intention? (in that case it should not be mentioned in the documentation that this is a method to override? Currently it is added to the docstring higher up in this file)
(If we don't want that external EAs override this, that's a another reason we should certainly keep using _values_for_factorize
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in contrast to well established methods with an underscore
Is that what underscores are supposed to mean? I'm not eager to revive the EA-namespace-naming-scheme discussion, but that would fit well there.
IIRC the main motivation was allowing for a performant implementation for ArrowExtensionArray (without special-casing internally ArrowEA internally)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in contrast to well established methods with an underscore
Is that what underscores are supposed to mean? I'm not eager to revive the EA-namespace-naming-scheme discussion, but that would fit well there.
It's what the documented methods (that happen to have a underscore or not) mean: we currently document which methods can (or have to / are meant to) be overriden by EA implementations. And you added _hash_pandas_object
to the documentation.
For example _from_sequence
or _reduce
have an underscore, but for sure external EAs can rely on implementing those, regardless of the underscore.
I am not sure which discussion needs to be revived (unless you don't agree that some of the methods with an underscore can be overridden by external EAs?)
IIRC the main motivation was allowing for a performant implementation for ArrowExtensionArray (without special-casing internally ArrowEA internally)
Yes, and that's fine I think. But in general when adding methods to the EA base class, we need to be conscious about whether we see such a method as something external EA authors can also override or not. Because if we explicitly are OK with external EA authors using it, we need to give it some higher level of stability.
We could also consider adding methods to our own classes and not to the base class, or have a some pandas BaseArray(ExtensionArray)
class that does a bunch of extra things we only want to do for our own ones, where we can be more flexible and don't have to worry about backwards compatibility.
That of course has the disadvantage of creating a split (and having to check, whenever we call that, if the methods exists and have some fallback), but could be easier for certain optimizations where we don't directly want to promise stability / expose this to external EAs.
Speaking as such an external EA author, I think I would be fine with this (and when there is request, we could "promote" a method to the official public EA interface). As a pandas maintainer, it might give some more complexity to know if some method is in which category.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure which discussion needs to be revived (unless you don't agree that some of the methods with an underscore can be overridden by external EAs?)
I'm specifically not interested in reviving the naming convention discussion.
I assume anything in the EA namespace is liable to be overridden by subclass authors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume anything in the EA namespace is liable to be overridden by subclass authors.
The reason of my original comment here is because we have the following comment in the base EA class (one that you added I think):
pandas/pandas/core/arrays/base.py
Lines 1447 to 1450 in d06f2d3
# ------------------------------------------------------------------------ | |
# Non-Optimized Default Methods; in the case of the private methods here, | |
# these are not guaranteed to be stable across pandas versions. | |
So while you say that you think anything in the EA namespace can be overridden by EA authors, the above comment at least seems to indicate there are some methods for which we guarantee stability, while for others it's at your own risk (to potentially keep it compatible with every new minor version).
So essentially all that my original comment wanted to say is: if we want that EA authors need to implement _hash_object_array
(as you also suggest in #53501), we should move that method a bit higher up in this file to be above that comment about private methods not being stable.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.