CLN: use _values_for_argsort for join_non_unique, join_monotonic #32467

Merged
jreback merged 2 commits into pandas-dev:master on Mar 11, 2020

Conversation

jbrockmendel
Member

With the .copy() removed from Categorical._values_for_argsort, ea_backed_index._data._values_for_argsort() matches ea_backed_index._ndarray_values in all extant cases.

cc @jorisvandenbossche @TomAugspurger need to confirm

a) this is an intended-adjacent use of _values_for_argsort, and not just a coincidence that it matches extant behavior
b) the .copy() this removes from Categorical._values_for_argsort is not important for some un-tested reason

xref #32452, #32426

@TomAugspurger
Contributor

I believe the only requirement on values_for_argsort is that it's a monotonic transformation (it preserves the ordering).

I'm not sure why that copy was there.
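
To make the "monotonic transformation" requirement concrete, here is a minimal pure-Python sketch. It does not use pandas; the `argsort` helper, the category names, and the sample data are all made up for illustration. It shows that sorting positions by order-preserving integer codes, or by any strictly increasing function of those codes, gives the same ordering.

```python
def argsort(keys):
    """Return the positions that would sort `keys` (stable)."""
    return sorted(range(len(keys)), key=keys.__getitem__)


# Hypothetical ordered categories: "low" < "mid" < "high".
categories = ["low", "mid", "high"]
values = ["mid", "high", "low", "mid"]

# Order-preserving integer codes, analogous to what an array could
# hand back from a values-for-argsort style hook.
rank = {cat: i for i, cat in enumerate(categories)}
codes = [rank[v] for v in values]

# Any strictly increasing function of the codes sorts identically.
shifted = [10 * c + 3 for c in codes]

order_from_codes = argsort(codes)
order_from_shifted = argsort(shifted)
sorted_values = [values[i] for i in order_from_codes]
```

Both orderings agree, which is why only order preservation matters for this hook, not the particular numbers returned.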

@jbrockmendel
Member Author

> I believe the only requirement on values_for_argsort is that it's a monotonic transformation (it preserves the ordering).

It isn't clear to me what distinguishes this from values_for_factorize, as the docstring there also says "An array suitable for factorization. This should maintain order and be a supported dtype." (The "supported dtype" bit isn't in the values_for_argsort docstring, but seems implied since it returns an ndarray.)

This also came up in #30673 (attempt to implement value_counts in terms of EA methods). See also #32412.

Do we know of any 3rd-party EAs where _ndarray_values, values_for_argsort() and _values_for_factorize()[0] are meaningfully distinct?
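
As a rough illustration of the comparison being asked about, the sketch below uses a toy class, not a real pandas ExtensionArray. `ToyArray`, `hooks_agree`, and the `-1` missing-value sentinel are assumptions made for this example; only the hook names and the `(values, na_value)` return shape of the factorize hook mirror the discussion above.

```python
class ToyArray:
    """Hypothetical stand-in for an extension array; not pandas code."""

    def __init__(self, values, categories):
        rank = {cat: i for i, cat in enumerate(categories)}
        # Missing or unknown values get the sentinel code -1.
        self._codes = [rank.get(v, -1) for v in values]

    def _values_for_argsort(self):
        # Order-preserving values used for sorting.
        return list(self._codes)

    def _values_for_factorize(self):
        # A (values, na_value) pair; element [0] holds the values.
        return list(self._codes), -1


def hooks_agree(arr):
    """Check whether both hooks expose the same underlying values."""
    return arr._values_for_argsort() == arr._values_for_factorize()[0]


arr = ToyArray(["b", "a", None], categories=["a", "b"])
agree = hooks_agree(arr)
```

For a third-party array where the two hooks return different representations, a check like this would come back false, which is the case the question is probing for.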

Review thread on pandas/core/indexes/base.py (resolved)
@jreback added the Clean and ExtensionArray labels Mar 8, 2020
@jreback
Contributor

jreback commented Mar 8, 2020

are there any user facing things that this now allows? e.g. joins on EA?

@jbrockmendel
Copy link
Member Author

> are there any user facing things that this now allows? e.g. joins on EA?

behavior is unchanged

@jreback added this to the 1.1 milestone Mar 11, 2020
@jreback merged commit d4815a5 into pandas-dev:master Mar 11, 2020
@jreback
Contributor

jreback commented Mar 11, 2020

thanks, pls follow on with consolidations when you can

@jbrockmendel deleted the join_non_unique branch March 11, 2020 03:02
@jorisvandenbossche
Member

My feeling says that this should use _values_for_factorize, as joining is a factorize-based algo?

@jbrockmendel
Member Author

> My feeling says that this should use _values_for_factorize, as joining is a factorize-based algo?

I think you're right. ATM _ndarray_values matches values_for_argsort for all our Index-backing EAs, whereas values_for_factorize()[0] is slightly different for DTA/TDA (_data vs asi8). If we determine that we can change DTA/TDA _values_for_factorize (possibly as part of the discussion in #32586), then I'll switch over these usages.
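
For readers unfamiliar with why joining is described as factorize-based, the following is a minimal pure-Python sketch of the idea. It is not pandas' join implementation; `factorize` and `inner_join_positions` are hypothetical helpers written for this example. Keys from both sides are mapped to shared integer codes, and matching then happens on the codes alone.

```python
def factorize(values):
    """Map values to integer codes in order of first appearance."""
    uniques = []
    lookup = {}
    codes = []
    for v in values:
        if v not in lookup:
            lookup[v] = len(uniques)
            uniques.append(v)
        codes.append(lookup[v])
    return codes, uniques


def inner_join_positions(left, right):
    """Return (left_pos, right_pos) pairs whose keys are equal."""
    # Factorize both sides together so equal keys share one code.
    codes, _ = factorize(list(left) + list(right))
    left_codes, right_codes = codes[:len(left)], codes[len(left):]

    # Group right positions by code, then probe with each left code.
    by_code = {}
    for pos, code in enumerate(right_codes):
        by_code.setdefault(code, []).append(pos)
    return [(i, j) for i, code in enumerate(left_codes)
            for j in by_code.get(code, [])]


pairs = inner_join_positions(["a", "b", "a"], ["a", "c", "a"])
```

Note that this matching only needs equal keys to receive equal codes; it never relies on the codes preserving order, which is the contrast with the argsort-style hook discussed earlier in the thread.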

4 participants