ENH: allow EAs to override MergeOperation._get_join_indexers #53696
Comments
As an FYI, my EA (which extends PintArray to support uncertain magnitudes, hgrecco/pint-pandas#140) is getting hung up on a mismatch between what I understood to be the interface to _values_for_factorize and what rizer._value_from_factorize is returning. My grief comes from here (pandas/tests/extension/base/reshaping.py):
Down in the merge we have:
and the obviously NONUNIQUE values here:
My understanding was that values returned by _values_for_factorize should be unique (and should not contain the na_value if use_na_sentinel is true). When I allow my _values_for_factorize to return duplicated values, other tests fail. Clearly this code (from pandas/core/arrays/base.py) is doing nothing to unique-ify the values of the data, though the documentation snippet does refer to
Help?
An EA's _values_for_factorize does not need to be unique. The reference to "uniques" in the _values_for_factorize docstring is about what is returned by factorize. I agree this is confusing, and in fact am of the opinion that _values_for_factorize is a bad pattern that should go in general (xref #53501). As for things failing when you make _values_for_factorize return something non-unique, let's find a dedicated place to discuss that. Is pint-pandas#140 appropriate for that?
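To illustrate the distinction, here is a minimal sketch using the built-in nullable `Int64` array rather than a custom EA (that choice is an assumption made for brevity). The input holds duplicates and a missing value; only the output of `factorize` is unique, and missing entries are marked with the `-1` sentinel in the codes.

```python
import pandas as pd

# Input values are deliberately non-unique and include a missing entry.
arr = pd.array([10, 10, 20, None, 20], dtype="Int64")

# factorize() is what produces the unique values; the input need not be unique.
codes, uniques = arr.factorize()

# Each code indexes into `uniques`; -1 marks a missing value (NA sentinel).
print(codes.tolist())    # [0, 0, 1, -1, 1]
print(uniques.tolist())  # [10, 20]

# Taking `uniques` at `codes` (with -1 filled as NA) rebuilds the original.
roundtrip = uniques.take(codes, allow_fill=True)
print(roundtrip.equals(arr))  # True
```

A custom EA that overrides `factorize` itself has to uphold the same contract: unique values out, with codes that map each input position back to them.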
I got it sorted by aligning the non-unique _values_for_factorize with my EA factorize code that does the unique-ification itself. All good. I'm now passing. Please close.
_AsOfMerge._get_join_indexers calls to_numpy() on EAs, which can be costly. _MergeOperation._get_join_indexers is a bit more forgiving, using _values_for_factorize in _factorize_keys. The latter still requires a cast to numpy, which makes this a non-starter for (hypothetical) distributed/GPU EAs.
Can we push this up into an EA method that can be potentially overridden? This might be a use case for @mroeschke 's "ExtensionManager" idea.
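A rough sketch of what such an override point could look like, using toy classes rather than pandas internals. Everything here is hypothetical: `ToyArray`, `ToyDeviceArray`, `inner_join_indexers`, and the reuse of the name `_get_join_indexers` as an array-level hook are illustrative assumptions, not an existing pandas API.

```python
import numpy as np


class ToyArray:
    """Stand-in for an ExtensionArray (hypothetical, not a pandas class)."""

    def __init__(self, values):
        self._values = list(values)
        self.to_numpy_calls = 0  # counts host materialisations

    def to_numpy(self):
        self.to_numpy_calls += 1
        return np.asarray(self._values)


class ToyDeviceArray(ToyArray):
    """Pretends its data lives off-host, so it supplies its own join."""

    def _get_join_indexers(self, other):  # hypothetical hook name
        # Map each right-hand value to the positions where it occurs.
        positions = {}
        for j, value in enumerate(other._values):
            positions.setdefault(value, []).append(j)
        left, right = [], []
        for i, value in enumerate(self._values):
            for j in positions.get(value, []):
                left.append(i)
                right.append(j)
        return np.array(left, dtype=np.intp), np.array(right, dtype=np.intp)


def inner_join_indexers(left, right):
    """Use the array's own hook if present, else fall back to NumPy."""
    hook = getattr(left, "_get_join_indexers", None)
    if hook is not None:
        return hook(right)
    # Fallback path: both sides are cast to NumPy, the cost described above.
    lk, rk = left.to_numpy(), right.to_numpy()
    li, ri = np.nonzero(lk[:, None] == rk[None, :])
    return li.astype(np.intp), ri.astype(np.intp)


dev_left, dev_right = ToyDeviceArray([1, 2, 2]), ToyDeviceArray([2, 3, 1])
dev_li, dev_ri = inner_join_indexers(dev_left, dev_right)

host_left, host_right = ToyArray([1, 2, 2]), ToyArray([2, 3, 1])
host_li, host_ri = inner_join_indexers(host_left, host_right)
```

Both paths give the same indexers, but only the fallback calls `to_numpy()`; the array with the hook is never materialised on the host.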