
RFC: add support for nan* reductions #621

Open
betatim opened this issue Apr 18, 2023 · 23 comments
Labels: API extension (Adds new functions or objects to the API.), Needs Discussion (Needs further discussion.), RFC (Request for comments. Feature requests and proposed changes.)
Milestone: v2024

Comments

@betatim
Member

betatim commented Apr 18, 2023

Is there advice on handling NaNs and how to translate NumPy code to the Array API?

In particular, I have code like np.nanmin(X, axis=0) that I would like to rewrite so that it works with NumPy, PyTorch, etc. arrays.

To me the "obvious" translation seems to be xp.min(X[~xp.isnan(X)], axis=0), but this doesn't work if X has a shape like (7500, 4), because boolean indexing flattens the array and the axis structure is lost (here xp is the namespace of the array X). Another option I looked for is a where= argument to min(), but that unfortunately doesn't exist.

Does anyone have advice on this topic or knows if there is work planned on this?

@rgommers
Member

Good question, thanks @betatim. I think we want a general pattern here that can apply to any reduction. That's a bit of an unsolved problem even in NumPy, where we rejected more niche nan* feature requests like nanptp and nancov.

This discussion seems relevant: pytorch/pytorch#38349 (comment). SciPy's nan_policy is now in better shape, but I'm still not sure what the best approach is in general. Options include:

  • a skip_nan keyword
  • a context manager or other generic method to achieve the same effect
  • a compat library approach, with a fast path for libraries known to offer specific functions like nanmin and a generic fallback implementation that iterates over the specified axis=.

A generic implementation is pretty fiddly:

import numpy as np

def nanreduction(func, x, axis=None):
    if axis is None:
        return func(x[~np.isnan(x)])
    else:
        # normalize the axis, then drop the reduced dimension from the shape
        axis = axis % x.ndim
        out_shape = list(x.shape)
        del out_shape[axis]
        out = np.empty_like(x, shape=tuple(out_shape))
        # call `func` on each 1-D slice along `axis` to fill `out`;
        # this loop is pretty annoying to get right
        ...

NumPy's machinery is pure Python, so it can be looked at for guidance. It also raises questions like how to deal with all-NaN slices (e.g., https://github.com/numpy/numpy/blob/main/numpy/lib/nanfunctions.py#L355-L360); those warnings tend not to be useful.
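For illustration, here is a minimal working sketch of such a generic fallback (my own sketch, assuming plain NumPy; np.apply_along_axis takes care of the per-slice loop elided above):

import numpy as np

def nanreduction(func, x, axis=None):
    # apply `func` to the non-NaN elements, globally or per 1-D slice
    if axis is None:
        return func(x[~np.isnan(x)])
    return np.apply_along_axis(
        lambda slc: func(slc[~np.isnan(slc)]), axis, x
    )

x = np.array([[1.0, np.nan], [3.0, 4.0]])
nanreduction(np.min, x, axis=0)  # array([1., 4.])

Note that an all-NaN slice still raises here (reducing an empty slice has no identity), which is exactly the edge case discussed below.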

@ogrisel

ogrisel commented Apr 20, 2023

how to deal with all-nan slices

I think the reduction should allow passing an explicit scalar value (e.g., typically 0.) as a kwarg, and raise an error if all-NaN slices occur while this kwarg was left as None.

@rgommers
Member

I think the reduction should allow passing an explicit scalar value (e.g., typically 0.) as a kwarg, and raise an error if all-NaN slices occur while this kwarg was left as None.

I'm not sure whether the better default is to return nan or raise an exception, but it should be one or the other I think. These warnings are quite annoying:

>>> np.nanmin([np.nan])
RuntimeWarning: All-NaN axis encountered
...
nan

@rgommers
Member

rgommers commented May 18, 2023

I think I had a better idea:

from typing import Callable

def nanreduction(func: Callable) -> Callable:
    # Takes any reduction in the standard as input, and returns a
    # `nan<reduction>` function that does the expected thing. This is fully
    # functional, has a small API surface, and as a bonus would be more
    # general than the 14 `nan*` functions that numpy has (numpy has rejected
    # requests for other nan-functions because of the impact on API size).
    ...

# Usage equivalent to `nanmax(x)`:
nanreduction(xp.max)(x)

The implementation would be quite straightforward: it could be a simple dict lookup that maps max to nanmax, and so on.

@betatim @ogrisel WDYT?

@rgommers rgommers added the API extension Adds new functions or objects to the API. label May 18, 2023
@ogrisel

ogrisel commented May 19, 2023

Sounds good to me but ideally with an extra parameter to control the scalar value used in case of all nans. The default behavior could be to raise an exception.

@rgommers
Member

but ideally with an extra parameter to control the scalar value used in case of all nans

Ugh, this seems like a pretty obvious bug in NumPy:

>>> np.max([])
ValueError: zero-size array to reduction operation maximum which has no identity
>>> np.max([], initial=3)
3.0

>>> np.nanmax([])
ValueError: zero-size array to reduction operation maximum which has no identity
>>> np.nanmax([], initial=3)
<ipython-input-13-63ff44c70a84>:1: RuntimeWarning: All-NaN axis encountered
  np.nanmax([], initial=3)
nan

Other than that, I think your request is basically to support the initial keyword in all reductions, right?

@ogrisel

ogrisel commented May 22, 2023

Other than that, I think your request is basically to support the initial keyword in all reductions, right?

Indeed, I did not know about initial. I think the warning could be omitted.

@ogrisel

ogrisel commented May 22, 2023

@betatim you may want to express your opinion on the proposed API above :)

@betatim
Member Author

betatim commented May 22, 2023

Sounds reasonable to me. I'm not sure I understand the comment:

The implementation would be quite straightforward, it could be a simple dict lookup that maps max to nanmax, and so on.

The way I understand the proposal is that the Array API provides nanreduction and that it is left to the user to use that together with xp.max to make themselves a nanmax.

@rgommers
Member

@betatim what I meant was that the implementation in numpy could look something like:

_nanreductions_mapping = {
    np.max: np.nanmax,
    np.min: np.nanmin,
    np.mean: np.nanmean,
    # and so on for all existing nan* funcs
}

def nanreduction(func: Callable) -> Callable:
    try:
        return _nanreductions_mapping[func]
    except KeyError:
        # generic implementation: create a new function that
        # 1. creates an array with the correct output shape
        # 2. per axis, fills the output array with func(x[~isnan(x)])
        ...
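Illustrative usage, given the sketch above (nanreduction as defined there):

nanmax = nanreduction(np.max)  # fast path: returns np.nanmax itself
nanptp = nanreduction(np.ptp)  # numpy has no nanptp: the generic fallback kicks in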

@asmeurer
Member

asmeurer commented May 23, 2023

The functional approach feels a little odd to me. Functions that return other functions aren't typically used in Python, except as decorators. Is the idea that this could in principle be applied to a user defined reduction?

I'm honestly not seeing why this is better than just adding a keyword to the various reductions.

For empty/all-NaN reductions, I think it should behave the same as the normal reduction. For min and max we specify that it's implementation-defined (https://data-apis.org/array-api/latest/API_specification/generated/array_api.max.html), but for mean and std we say it must be nan (https://data-apis.org/array-api/draft/API_specification/generated/array_api.mean.html, https://data-apis.org/array-api/draft/API_specification/generated/array_api.std.html).

@rgommers
Member

The functional approach feels a little odd to me. Functions that return other functions aren't typically used in Python, except as decorators. Is the idea that this could in principle be applied to a user defined reduction?

Yes, and also to reductions in numpy et al. that do not have a nanxxx variant.

I'm honestly not seeing why this is better than just adding a keyword to the various reductions.

The response I expect to that from numpy folks at least (and it'd be my own too) is "why a keyword, we already covered this with separate functions". If it's a more general API instead, the case for it is that it can apply to all reductions (including those outside of numpy), and there's a reason to introduce it beyond array API standard support.

@asmeurer
Member

I would be curious to hear the thoughts from numpy folks on this API. That should be done before anything is added to the standard.

@leofang
Contributor

leofang commented May 31, 2023

I would be curious to hear the thoughts from numpy folks on this API. That should be done before anything is added to the standard.

cc: @seberg for vis

@kgryte
Contributor

kgryte commented May 31, 2023

The response I expect to that from numpy folks at least (and it'd be my own too) is "why a keyword, we already covered this with separate functions".

NumPy 2.0 could be a time to move away from separate functions for nan* variants and consolidate into a single API (e.g., mean() with a kwarg). This is the approach now followed by MATLAB, the presumed original inspiration for the nan* variants. Julia avoided NaN variants altogether, delegating instead to function composition (e.g., mean(skipmissing(...))).

Given the precedent in dataframe-land (e.g., pandas), supporting a skipna or similar kwarg may be most consistent with community conventions and "modern" trends.
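For reference, the pandas convention looks like this (a quick illustration; pandas aggregations skip NaNs by default):

import pandas as pd

s = pd.Series([1.0, float("nan"), 3.0])
s.mean()              # 2.0 -- skipna defaults to True
s.mean(skipna=False)  # nan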

@seberg
Contributor

seberg commented May 31, 2023

I am not in favor of making 2.0 an excuse for such a change, but I am happy to entertain it in general and deprecate the nan... functions slowly.

The question is really how it would interact with the ufunc machinery and be implemented in it. Maybe it is close to the existing where=, or maybe it needs more thought and a concrete proposal.

@jorisvandenbossche
Member

Maybe it is close to the existing where=

You can already do that right now?

In [3]: arr = np.array([1.0, 2.0, np.nan, 4.0])

In [4]: np.sum(arr, where=~np.isnan(arr))
Out[4]: 7.0

In [5]: np.mean(arr, where=~np.isnan(arr))
Out[5]: 2.3333333333333335

@seberg
Contributor

seberg commented May 31, 2023

Not exactly:

In [2]: np.minimum.reduce(arr, where=~np.isnan(arr))
ValueError: reduction operation 'minimum' does not have an identity, so to use a where mask one has to specify 'initial'

The handling of slices with zero non-NaN entries may not be trivial without an identity.
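For completeness: supplying an explicit identity does make the masked reduce go through (my illustration, assuming NumPy >= 1.17, where ufunc.reduce gained where= and initial=), but then an all-NaN slice silently yields the initial value rather than nan or an error:

In [3]: np.minimum.reduce(arr, where=~np.isnan(arr), initial=np.inf)
Out[3]: 1.0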

@rgommers
Member

I am not in favor of making 2.0 an excuse for such a change, but I am happy to entertain it in general and deprecate the nan... functions slowly.

I agree with this. The nan* functions are heavily used, so they could be kept for a long time (or forever) so as not to be too disruptive.

The question is really how it would interact with the ufunc machinery and be implemented in it.

I guess that is relevant from a pragmatic implementation-in-numpy perspective, but should it matter much for this discussion? Only a subset of reductions (the ones with a binary ufunc equivalent) are implemented under the hood by forwarding to <ufunc>.reduce (e.g., min -> minimum.reduce, sum -> add.reduce); for others like std, mean, ptp that doesn't apply. It seems to me like whatever is decided upon for a public API should be implementable either in the numpy ufunc machinery or as a preprocessing step. The latter is what is already done in pure Python in nanfunctions, so that can be reused and no performance will be lost.

I would be curious to hear the thoughts from numpy folks on this API. That should be done before anything is added to the standard.

Agreed - time to post to the numpy-discussion list perhaps.

Most implementations, whether as a keyword like pandas/MATLAB or functional like Julia, only provide boolean options: "skip" or "include". I'll note that MATLAB does have multiple flags for treating NA/NaN/NaT differently.

SciPy on the other hand provides a string keyword with skip/include/raise options. Here is the design doc for the implementation: http://scipy.github.io/devdocs/dev/api-dev/nan_policy.html. The "raise" option also comes up in other places, like screening for NaNs and infs in linalg functions (a la check_finite), and that's bound to come up in the near future I suspect.

There's also the need to specify an identity value (e.g., with an initial keyword) for some reductions, as pointed out in previous comments. That makes it a one- or two-keyword pattern, depending on whether the identity value is natural or must be user-specified.

Does anyone else see the need to include SciPy's "raise" flavor? I suspect it's better handled with a separate function or keyword that checks and raises if any NaNs and/or infs are present.
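A hypothetical sketch of what such a separate check could look like, built only from standardized APIs (check_finite here is an invented name for illustration, not an existing function):

def check_finite(x, xp):
    # raise if any NaN or infinity is present anywhere in the array
    if xp.any(xp.isnan(x)) or xp.any(xp.isinf(x)):
        raise ValueError("array contains NaNs or infinities")
    return x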

@TomNicholas

Just to say it would be really nice for xarray if this were solved, because practically all our aggregations try to skip NaNs by default by using nan*, so a strictly array-API-compliant type fails on all these aggregations (unless the user opts out of skipping NaNs explicitly).

@kgryte
Contributor

kgryte commented Apr 4, 2024

Of the options discussed thus far, my personal preference is following SciPy in having a "nan_policy" kwarg. The "omit" and "propagate" options map cleanly to the nan* and non-nan versions, and the "raise" option is useful when wanting to assert that NaNs have not unintentionally crept in during preceding calculations.

While decorators as a design pattern are common in Python, we don't have any precedent for the type of decorator/factory function proposed above in prominent array libraries. While that, in and of itself, is not disqualifying, we should consider whether such a proposal would get buy-in, not just in NumPy, but elsewhere in the ecosystem, and whether the path to standardization might be better achieved through more established design patterns already present in array libraries and the current standard.

Thus far, when we've wanted to overload behavior, we've either parameterized through kwargs or we've split into separate functions (e.g., unique*). As a straw example, we haven't, e.g., in the statistical reduction APIs, standardized factory functions for retaining dimensions (e.g., def keepdims_reduction(f: Callable) -> Callable) or similar. And generally, across the ecosystem, we've seen adoption of kwargs to control na (or nan) behavior (e.g., pandas and skipna, or SciPy and nan_policy). We've also seen parameterization used in other languages (e.g., MATLAB).

With particular regard to NumPy, an advantage of SciPy's nan_policy is that it would ensure consistency in NaN handling between the two libraries, so users have only one design pattern to remember when moving between NumPy and SciPy (e.g., np.mean and scipy.stats.hmean). If standardized, this would make the mental model (more or less) the same across both array-producing and array-consuming libraries (e.g., pandas, SciPy).

NumPy (and others) would, of course, be free to keep the nan* variants for backward compatibility and internally dispatch from np.mean(x, nan_policy="omit") to np.nanmean(x), so the implementation lift is not overly burdensome. The new behavior would be nan_policy="raise", and there is a question as to whether that can be supported in NumPy's ufunc machinery; however, the fact that SciPy can support it suggests that this is doable, given SciPy's reliance on ufuncs.

So in short, I'd like to propose that we move to adopt SciPy's precedent of a nan_policy kwarg and add to the following reduction APIs:

  • min
  • max
  • mean
  • prod
  • std
  • sum
  • var
  • cumulative_sum

Adding nan_policy support across the ecosystem is likely to take time; however, I think we should be able to implement it in the array-api-compat layer using a combination of standardized APIs (e.g., xp.where + xp.mean) and, on a library-by-library basis, already available specialized APIs (e.g., np.nanmean).
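As a rough sketch of what that compat-layer fallback could look like for mean with nan_policy="omit" (my own sketch using only standardized APIs; mean_omit is an illustrative name, not an existing implementation, and xp is assumed to be an array-API namespace):

def mean_omit(x, axis=None):
    # replace NaNs with 0 for the sum, then divide by the count of non-NaN entries
    mask = xp.isnan(x)
    total = xp.sum(xp.where(mask, xp.zeros_like(x), x), axis=axis)
    count = xp.sum(xp.astype(~mask, x.dtype), axis=axis)
    return total / count  # an all-NaN slice gives 0/0 -> nan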

@kgryte kgryte changed the title How to translate nanmin and friends to the Array API? RFC: add support for nan* reductions Apr 4, 2024
@kgryte kgryte added the RFC Request for comments. Feature requests and proposed changes. label Apr 4, 2024
@kgryte kgryte added the Needs Discussion Needs further discussion. label Apr 4, 2024
@kgryte kgryte added this to the v2024 milestone Apr 4, 2024
@rgommers
Member

rgommers commented Apr 4, 2024

I think a keyword would be a very reasonable outcome. What would perhaps make sense to untangle the chicken-and-egg problem here is:

  1. Come to a preliminary conclusion here on the desired API
  2. Open a draft PR with the spec addition - to review/approve, but blocked for merging until we have an implementation at least in NumPy (or a PR that's in good shape)
  3. Someone has to propose the addition for numpy and write the implementation
  4. Once we have both the spec and an implementation and everything seems happy, merge the spec for the next version of the standard

That offers a way forward, while avoiding adding something in the standard that has a potentially large implementation burden (which would help no one, and make future versions of the standard harder to adopt if the work isn't done).

@mdhaber
Contributor

mdhaber commented Apr 24, 2024

I've worked on making nan_policy much more consistent in scipy.stats over the past few years. It is now accepted by most reduction functions / hypothesis tests, and all functions now follow the same rules for omit.

I'd agree that a standard nan_policy-like keyword argument would be appreciated, especially for the choice between behaviors like propagate, raise, and (sometimes requested) warn. However, I think it's unfortunate that nan is overloaded to mean both "undefined output" and "missing input", since they may need to be treated differently. Ideally, I would suggest (in lieu of a separate sentinel value) the use of a mask for missing data rather than an omit option. But I recognize that the array-API cannot necessarily change the way programmers use nan, so an omit option may be considered essential.

Either way, I think it's relevant to mention here: in scipy/scipy#20363, I drafted a function that accepts an array-API compatible namespace and returns a (nearly) array-API compatible masked array namespace (composed of calls from the provided namespace). This could be useful either as a default implementation for the nan_policy='omit' option in array-api-compat or as a way to avoid the need for all libraries to implement an omit behavior.
