Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EA interface - requirements for "hashable, value+order-preserving ndarray" #33276

Open
jorisvandenbossche opened this issue Apr 3, 2020 · 15 comments
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action

Comments

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 3, 2020

Across a set of issues/PRs (eg #32586, #32673, #33064), there has lately been quite some discussion regarding _values_for_factorzie / _values_for_argsort, the need for roundtrippability, the need for order-preserving, the need for _ndarray_values, ...

Those different questions and topics that came up (as far as I kept track, probably not complete! but already too long .. ;)):

  • In EA: revisit interface #32586, the question was raised what the difference is between _values_for_argsort and _values_for_factorize, and whether we need both.

    Some difference that came up:

    • The main difference might be that the values in _values_for_factorize need to be hashable, while the ones in _values_for_argsort don't need to be (although this difference is not properly documented). Looking back at ENH: Sorting of ExtensionArrays #19957 from @TomAugspurger, it was mentioned the that "sortable" is an easier requirement than what other algos like factorize might need.
      Related to this difference is that _values_for_factorize needs to return a dtype supported by the hashtables (int64, uint64, float64, object), while _values_for_argsort can return any sortable dtype (so also int8, int32, etc).
    • The return type is different: _values_for_factorize also returns a na_value sentinel, which means you can encode missing values in a different way than a "missing value" (eg nan in float dtype). While for _values_for_argsort, it simply returns one array (I would need to look into the details how missing values are handled here, it seems they are filtered out in nargsort, so it might not matter how they are encoded in the returned array).

    Is this sufficiently different to warrant two methods? Probably, with a bit work, they could be combined in a single method. However, their original purpose was only to help implement EA.factorize() and EA.argsort(). So for that purpose only, it might not necessarily be worth trying to combine them. And see the last bullet point for a more general "hashable, orderable array"

  • In addition, we actually also have _values_for_rank for Categorical, which we probably should try to get rid off as well -> BUG: Categorical.values_for_(factorize|argsort) dont preserve order #33245

  • I have argued that in general, we should also look at a "masked" version of eg _values_for_factorize. Having the option to return a (values, mask) tuple in addition to (values, na_sentinel) in case this is easier/cheaper to provide (which is especially the case for the masked arrays; this will need to support of the factorize algos for masks though -> eg ENH/PERF: use mask in factorize for nullable dtypes #33064)

  • We also had a vaguely defined _ndarray_values (API / internals: exact semantics of _ndarray_values #23565), that was recently removed (CLN: remove _ndarray_values #32768). It was eg used in indexing code (index engine, joining indexes), where it was replaced with _values_for_argsort (CLN: use _values_for_argsort for join_non_unique, join_monotonic #32467, REF: implement _get_engine_target #32611).

  • What else can they be used for internally? (_values_for_factorize / _values_for_argsort)
    As mentioned above, _values_for_argsort is since recently used for ExtensionIndex joining and engine values. Further, _values_for_factorize is used in the general merging code.

    However, the initial purpose of _values_for_factorize / _values_for_argsort was not to be used internally in pandas, but only has a helper to EA.factorize() and EA.argsort(). So following our current EA interface spec, we should not use them internally (which means we should fix the few cases where we started using them).
    The spec about factorize is clear that there are two ways to override its behaviour: implement _values_for_factorize/_from_factorized, or implement factorize itself:

    # Implementer note: There are two ways to override the behavior of
    # pandas.factorize
    # 1. _values_for_factorize and _from_factorize.
    # Specify the values passed to pandas' internal factorization
    # routines, and how to convert from those values back to the
    # original ExtensionArray.
    # 2. ExtensionArray.factorize.
    # Complete control over factorization.

    So external EAs are not guaranteed to have an efficient implementation of _values_for_factorize/_values_for_argsort (they still have the default astype(object) implementation).
    Fletcher is an example of external EAs that implement factorize and not _values_for_factorize.

    So ideally, for anything factorize/argsort-related, we should actually always call the EA.factorize() or EA.argsort() methods.

  • In API: should values_for_factorize and _from_factorized round-trip missing values? #32673, @jbrockmendel questioned whether the _values_for_factorize and _from_factorized combo should faithfully roundtrip? Currently, they do, but not necessarily when missing values are included.
    However, when only considering them as "internal" to the EA.factorize() implementation, this question doesn't actually matter. But it does matter when we want to use those values more generally.

  • I mentioned above that ideally we should use factorize() or argsort() directly as much as possible and avoid _values_for_factorize/argsort (since this is the official EA interface). However, there are still cases where such direct usage is not sufficient, and where we actually need some "values".

    For example, in the merging/joining code, you can't "just" factorize() the left and right array, because then the integer codes of left and right both don't necessarily can be matched (it depends on the uniques being present what those integers mean).

    I think it is clear we have some use case for "ndarray values", but so we should think about for which use cases we need that and what requirements we have for those.
    @jbrockmendel started to list some requirements here: EA: revisit interface #32586 (comment)


Having read again through all the recent issues and having written up the above, my current take-away point are:

  • To start, we should maybe put the questions around _values_for_factorize / _values_for_argsort aside for a moment. In principle they are internal to EA.factorize() / EA.argsort(), and so we could also remove those methods (if we wanted that) if we "just" require EA authors to implement factorize/argsort directly instead of through the helper _values_for.. .
    And if we figure out the general "ndarray values" requirements (see below), we can still come back to this to see if we can actually replace both _values_for_factorize / _values_for_argsort with this single "ndarray values" interface.

  • I now think that replacing _ndarray_values with _values_for_argsort to be able to remove _ndarray_values actually didn't solve much. We replaced one vaguely specified property (_ndarray_values) with another (_values_for_argsort for other purposes than just sorting, as there are currnetly also no guarantees / requirements outside of sorting specified for _values_for_argsort).

  • Instead, I would focus on figuring out what the requirements are for the "hashable / value preserving ndarray values":

    1. What are the exact use cases we need this for?
    2. What are the exact semantics needed for those use cases? (hashable, orderable, deterministic across arrays, ..)
    3. Do those use cases need roundtripping of those values?
    4. How would we implement those values for the internal EAs?*

    It might be that this ends up to be something close to what _values_for_argsort or _values_for_factorize now are. But I also think that my comment above about the possibility to include a mask in this interface is important for the nullable dtypes.

  • An alternative that we didn't really mention yet, is adding more to the EA interface instead of requiring this "ndarray values" concept. For example, if we want that external EAs can have control over joining, we could have a EA.__join__(other_EA, how) -> Tuple[ndarray[int], ndarray[int]] that returns indices into left and right EAs that determine how to join them.
    For joining that might be a relatively straightforward interface, for the indexing engine that looks more complex though (but so let's first define the use cases).

@jorisvandenbossche jorisvandenbossche added API Design Needs Discussion Requires discussion from core team before further action ExtensionArray Extending pandas with custom dtypes or arrays. labels Apr 3, 2020
@jorisvandenbossche
Copy link
Member Author

So you could summarize my long post as "let's put back _ndarray_values" ... ahum
(but then better defined! and maybe with a mask ;))

@jbrockmendel
Copy link
Member

Thanks for putting this together. You're definitely right that the threads have become hard to follow.

Is there anything in this that is a) currently actionable and b) to the best of your knowledge we have consensus on?

For the list of "places where we use _values_for_foo and maybe shouldnt": nargsort dispatches to EA._values_for_argsort instead of just dispatching to EA.argsort.

Short-term we can get rid of the _values_for_argsort usages in Index code by re-defining _get_engine_target on the relevant subclasses. I'll make an effort to clean up the other _values_for_foo usages that can be worked around.

Some more considerations:

  • mask works better than na_value for PandasArray[object] that may contain multiple distinct NA values
  • For ExtensionIndex.get_loc(..., tolerance=...) we would need _get_engine_target to be not just _ordinal_values but also _cardinal_values.

Quick Hits:

  • Does Categorical._values_for_factorize need the codes be cast to int64? Seems like a no-copy version of that would be preferred
  • Should BooleanArray._values_for_argsort cast like its _vff does?
  • EA._values_for_factorize uses self.astype(object); should that be np.asarray(self)? (similar, EA._values_for_argsort uses np.array, should be np.asarray?)
  • how should _values_for_argsort behave w/r/t na_location? PA returns i8, which will put NAs first, whereas DTA/TDA will out NAs last (in recent numpy)
  • I think PA._values_for_argsort (which returns i8 values) will be poorly behaved
  • Does IntegerArray._values_for_argsort risk overflows?

@jbrockmendel
Copy link
Member

Looking at removing the 4 _values_for_factorize usages in reshape.merge. The natural thing to do would be to use factorize directly. The trouble ATM is that in _factorize_keys we use the libhashtable.Factorizer object to call rizer.get_count() and rizer.uniques.to_array, which I dont think we have any analogue to in EA.factorize. pls advise.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Apr 4, 2020

Is there anything in this that is a) currently actionable and b) to the best of your knowledge we have consensus on?

Not yet, I think, so let's first discuss a bit more before jumping to PRs (now, properly thinking the requirements through might need some code experiments though).

Short-term we can get rid of the _values_for_argsort usages in Index code by re-defining _get_engine_target on the relevant subclasses. I'll make an effort to clean up the other _values_for_foo usages that can be worked around.

I would wait a bit with removing those, until we figure out what we want to replace it with. Yes, for our own Index subclasses, we can just redefine the engine values in the subclasses, but that's not a general solution for EAs. And ideally, our own index subclasses that are based on EAs, will be able to use this general solution (they are actually the best test cases for checking a solution).

Does Categorical._values_for_factorize need the codes be cast to int64? Seems like a no-copy version of that would be preferred

The hashtable is only implementd for int64, so that is the reason that a cast to int64 is indeed necessary. See eg this (used by pd.factorize before passing the values to the hashtable):

def _ensure_data(values, dtype=None):
"""
routine to ensure that our data is of the correct
input dtype for lower-level routines
This will coerce:
- ints -> int64
- uint -> uint64
- bool -> uint64 (TODO this should be uint8)

Should BooleanArray._values_for_argsort cast like its _vff does?

So since nargsort just sorts the values, there is no casting needed (numpy's argsort is used, which can handle all dtypes).
So this is actually one of the additional differences right now between _values_for_argsort and _values_for_factorize (will add this in the above overview).

I think PA._values_for_argsort (which returns i8 values) will be poorly behaved

Why do you think this would be the case? Only the non-missing values need to sort correctly within the array, which seems ok with the integer periods?

Does IntegerArray._values_for_argsort risk overflows?

Hmm, yes. But I think when we implemented that, we didn't look very well at what nargsort is actually doing with those values (me included). Looking at it now in more detail, nargsort masks out missing values, and only sorts the remaining values. So actually how you encode the missing values in the returned array should not matter?
(might be worth to have a separate issue to flesh this out, as it also seems that Series itself is masking missing values, which seems to duplicate what is in nargsort. This might also be related to the last point below, though).

For the list of "places where we use _values_for_foo and maybe shouldnt": nargsort dispatches to EA._values_for_argsort instead of just dispatching to EA.argsort.

Yeah, that was not how it was originally, but changed to fix a bug, apparently (looking back for it, I found discussion about this here #27137 (comment) (and comments below)). There is an open issue about fixing this: #27218

@jorisvandenbossche
Copy link
Member Author

Looking at removing the 4 _values_for_factorize usages in reshape.merge. The natural thing to do would be to use factorize directly. The trouble ATM is that in _factorize_keys we use the libhashtable.Factorizer object to call rizer.get_count() and rizer.uniques.to_array, which I dont think we have any analogue to in EA.factorize. pls advise.

Is it actually possible to use factorize for this?
To be clear, I also though it was possible earlier this week ;) (proven by my comment at #33276). But now I am not so sure anymore.

So simplified we have something like this right now in reshape/merge.py for EAs:

lk, _ = lk._values_for_factorize()
rk, _ = rk._values_for_factorize()

rizer = Factorizer(max(len(lk), len(rk)))

llab = rizer.factorize(lk)
rlab = rizer.factorize(rk)
uniques = rizer.uniques.to_array()

We could do like this:

llab, luniques = lk.factorize()
rlab, runiques = rk.factorzie()
uniques = union(luniques, runiques)

but the problem here is that the left keys and right keys get "factorized independently" (while in the actualy code as above, we are using the same Factorizer object to factorize both left and right keys).
So if we want do something like this with using EA.factorize(), we would need to either concatenate left and right keys in a single EA first to factorize them together, or otherwise need to "recode" the left and right labels based on the mapping of luniques/runiques to common uniques, or .. And those two possible solutions I could come up with, don't sound very efficient ...

@jreback
Copy link
Contributor

jreback commented Apr 7, 2020

discussed in an unrelated PR

i see quite a lot of code that must convert an EA to something we can pass to cython; so kind of like _values_for_argsort but with Nans filled (though if we always used masks that would maybe be better)

eg something like

vals = vals.to_numpy(dtype=float, na_value=np.nan)

for nullable integers, but ideally we just a method on EAs (and of course we want a non object type here)

@jorisvandenbossche
Copy link
Member Author

For future reference, the PR is this one: https://github.com/pandas-dev/pandas/pull/33138/files#r404458720, where we need to decide which values get passed into a cython algorithm (in this case a grouped one).

@jbrockmendel
Copy link
Member

AFAICT we haven't made any progress on getting rid of our own usages of _values_for_argsort and _values_for_factorize. If we don't expect that to change, we should move them from "optional convenience methods" to required methods.

@jbrockmendel
Copy link
Member

I've made some progress on _values_for_argsort. We're down to 3 places in the non-test code where we use it, only one of which is clearly A Problem and will be removed in an upcoming PR (in merge_asof).

The remaining two are in EA.argsort and in nargminmax. In EA.argsort is clearly OK. In nargminmax I'm not wild about but can live with. In both cases, we can be more specific about what is required/safe.

  1. It is safe to return a view on self as the caller will never alter the _values_for_argsort() inplace.
  2. The caller will ignore entries i with self.isna()[i].

In particular for MaskedArrays this means we can safely just return self._data and avoid making copies.

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Feb 2, 2022

In nargminmax I'm not wild about but can live with

I think for this case is also fine, because in the end this is 1) only used in the EA implementation (so it is similar as how _values_for_argsort is used in EA.argsort, only the implementation here lives in a different file), and 2) argsort and argmin/max are also very much related algorithms with the same requirements.

We should probably mainly update the implementers note in the base class that if you implement argsort directly instead of implementing _value_for_argsort, that you should also implement argmin/argmax directly.

In both cases, we can be more specific about what is required/safe.

  1. It is safe to return a view on self as the caller will never alter the _values_for_argsort() inplace.
  2. The caller will ignore entries i with self.isna()[i].

Yes, that's correct I think.

And in the meantime, you already have documented this better and updated it for the masked array implementation in #45434


So I think with that, _values_for_argsort is mainly resolved, which still leaves us the usage of _values_for_factorize?

I think the main use case we need to resolve is its usage in joining (merge.py::_factorize_keys). This usage might actually be fine in general (if we document it), but we should extend it to work for masked arrays as well (in the top post I mentioned the possibility to return (values, mask) instead of (values, na_sentinel) from _values_for_factorize).

@jbrockmendel
Copy link
Member

I think the main use case we need to resolve is its usage in joining (merge.py::_factorize_keys)

Correct. Also one in utill.hashing.hash_array

@jbrockmendel
Copy link
Member

I think for this case is also fine, because in the end this is 1) only used in the EA implementation (so it is similar as how _values_for_argsort is used in EA.argsort, only the implementation here lives in a different file), and 2) argsort and argmin/max are also very much related algorithms with the same requirements.

That's reasonable. We should document that authors that override argsort to not use _values_for_argsort probably also need to implement argmin/argmax.

@jbrockmendel
Copy link
Member

jbrockmendel commented Feb 22, 2022

Based on a review of the outstanding issues and conversation last week with @jorisvandenbossche and @TomAugspurger:

We have a bunch of places where we

  1. cast EA->ndarray in a X-preserving way ("X" described below)
  2. operate on that ndarray (often in cython)
  3. pseudo-round-trip the results ("pseudo" described below)

Some places that broadly fit this pattern:

The "pseudo" part of "pseudo-round-trip" is because some of the operations in question aren't quite dtype-preserving, but can be e.g. Int64->Float64

The "X" part of "X-preserving way" is bc we need to retain different characteristics in different contexts (could make this configurable by making an EA/EADtype method analogous to groupby.ops.WrappedCythonOp._disallow_invalid_ops)

  • value-preserving, requires both directions to be injective
  • order-preserving- for e.g. groupby.rank
  • cardinal-preserving- for e.g. std

Finally, we have a usage of _values_for_factorize in merge.py which (would be good if someone double-checks me on this) I think is relying on an undocumented assumption of stability in _values_for_factorize across different EAs of the same dtype, to the effect of:

lvals = left._values_for_factorize()[0]
rvals = right._values_for_factorize()[0]
result = np.concat([lvals, rvals])
expected = concat_compat([left, right)])._values_for_factorize()[0]
tm.assert_numpy_array_equal(result, expected)

This is a natural enough assumption that I think it's worth codifying.


Thoughts on a way forward.

I'm optimistic-but-not-certain that we can eventually combine all of these cases into a pattern like:

def _values_for_cython(self, method):
    if not self._supports_method(method):
        raise NotImplementedError/TypeError
    return ndarray, sentinel, optional_mask

def _from_cython_result(self, result):
     return EA

Shorter-term, some steps that I think will be useful regardless of whether we can ultimately get down to a single _values_for_X:

  1. Deprecate _values_for_factorize behavior so in the future it returns ndarray, sentinel, optional_mask
  2. Codify the concat-invariance requirement described above for _values_for_factorize
  3. Codify that _values_for_factorize()[0] has the same convenient characteristics as _values_for_argsort() w/r/t views documented in PERF: MaskedArray._values_for_argsort #45434
  4. Determine whether _values_for_argsort() and _values_for_factorize()[0] are redundant (AFAICT they are) and if so document/codify.
  5. Deprecate the behavior of _from_factorized, making it an instance method instead of a class method, removing the original arg.
  6. Add/document (with deprecation if necessary) a requirement on _from_factorized that it round-trip nicely.
    • The idea being to make it a special-case of the eventual more general _from_cython_result
  7. Incrementally refactor groupby.ops.WrappedCythonOp._disallow_invalid_ops, _get_result_dtype, _ea_wrap_cython_operation, _reconstruct_ea_result to EA methods, with an eye towards de-duplication.
    • Avoid making these part of the EA interface until they are reasonably pinned-down.

@jorisvandenbossche
Copy link
Member Author

We had some discussion about this at the last dev meeting. @jbrockmendel do you remember what was more or less the conclusion, or can you summarize how you are now thinking to move forward on this?

Finally, we have a usage of _values_for_factorize in merge.py which (would be good if someone double-checks me on this) I think is relying on an undocumented assumption of stability in _values_for_factorize across different EAs of the same dtype, to the effect of: ... This is a natural enough assumption that I think it's worth codifying.

Indeed, I think we said here that it is probably fine to simply document this de-facto stability requirement (since we are already relying on that anyway, and if this resulted in wrong merges for some downstream EA, we would probably have heard about it).

Although I think the more problematic aspect of this usage in the merge code is that EAs are not required to implement _values_for_factorize, but they can also implement factorize instead? Now, we do have a default object-fallback implementation for _values_for_factorize in the base class, but if an EA doesn't (or cannot easily) implement that, they get an suboptimal merge routine (which is what is currently happening for the nullable dtypes).

Making _values_for_factorize more easily implementable in general probably requires the (values, mask) return value in addition to (values, na_sentinel) (your point 1)). And if we have that, I suppose we can probably just require EAs to implement _values_for_factorize to be used in the merge code (in which case we need to update the implementer notes in ExtensionArray.factorize).

  1. Deprecate _values_for_factorize behavior so in the future it returns ndarray, sentinel, optional_mask

This could probably also be done without a deprecation, as it could be relatively easy to inspect the return value to see if it includes a mask or not.

2. Codify the concat-invariance requirement described above for _values_for_factorize
3. Codify that _values_for_factorize()[0] has the same convenient characteristics as _values_for_argsort() w/r/t views documented in

+1 on both.

4. Determine whether _values_for_argsort() and _values_for_factorize()[0] are redundant (AFAICT they are) and if so document/codify.

I mentioned this on the call as well, but an example where _values_for_argsort is not redundant is how we would want to use this in geopandas: we are defining a 1D "distance" metric between geometries in a 2D space, and then this distance metric gives sortable values. But those values can not be converted back into geometries.
So here we are explicitly relying on the fact that _values_for_argsort does not require roundtripping (in contrast to _values_for_factorize), and in that use case we also don't guarantee "different geometry -> different value" as would be required for _values_for_factorize (two different geometries could result in the same distance value, because the values are sorted based on their center)

5. Deprecate the behavior of _from_factorized, making it an instance method instead of a class method, removing the original arg.

What's the goal / reason for this?

@jbrockmendel
Copy link
Member

@jorisvandenbossche thanks for the reminder. I've gone down a non-nano rabbit hole and let things slip through the cracks.

Indeed, I think we said here that it is probably fine to simply document this de-facto stability requirement

I also recall there being consensus about this on the call.

  1. Determine whether _values_for_argsort() and _values_for_factorize()[0] are redundant (AFAICT they are) and if so document/codify.

I mentioned this on the call as well, [...]

Yes, this was a compelling example and has really helped clarify my thinking on the topic. Thanks for explaining it. I'll amend the relevant docstrings to make the differences clear.

Deprecate _values_for_factorize behavior so in the future it returns ndarray, sentinel, optional_mask

This could probably also be done without a deprecation, as it could be relatively easy to inspect the return value to see if it includes a mask or not.

I prefer the explicit-over-implciit, but don't care enough to make a stink over it.

  1. Deprecate the behavior of _from_factorized, making it an instance method instead of a class method, removing the original arg.

What's the goal / reason for this?

I find the status quo odd and complicated since AFAICT original is always an instance of cls. Is there a counterexample on hand?

Although I think the more problematic aspect of this usage in the merge code

Yah I don't think we landed on any great solutions here. IIRC there was some speculation about using factorize itself for [something that might have been merge?] in the same ballpark.


Big picture, I think the discussion is going to allow us to more precisely nail down the requirements for values_for_argsort and values_for_factorize, and do to so in a way that makes them potentially re-useable (the example I have in mind is reusing values_for_argsort for EA._rank) (probably with different names and deprecation cycles etc)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

3 participants