Implement DataArray.idxmax() #60

shoyer · 2014-03-10T22:03:06Z

Should match the pandas function: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.idxmax.html

jcmgray · 2017-01-27T21:24:31Z

Just as I am interested in having this functionality, and the new apply_ufunc is available, would something along these lines suffice?

from wherever import argmax, take  # numpy or dask

def gufunc_idxmax(x, y, axis=None):
    indx = argmax(x, axis)
    return take(y, indx)

def idxmax(obj, dim):
    sig = ([(dim,), (dim,)], [()])
    kwargs = {'axis': -1}
    return apply_ufunc(gufunc_idxmin, obj, obj[dim],
                       signature=sig, kwargs=kwargs,
                       dask_array='allowed')

shoyer · 2017-01-30T00:54:09Z

See http://stackoverflow.com/questions/40179593/how-to-get-the-coordinates-of-the-maximum-in-xarray for examples of how to do this with the current version of xarray. @MaximilianR's answer using where is pretty clean, but maybe not the most efficient or exactly what we want. (I think it breaks in a few edge cases, such as if the max value appears multiple times, or the array is all NaN.)

@jcmgray Your proposal looks pretty close to me. But to handle higher dimension arrays, instead of take(y, indx), I think you need to NumPy style fancy indexing, y[indx,]. That doesn't work with dask, so you'll need to write a function function that uses dask.array.map_blocks when necessary.

I think something like the following would work:

def _index_from_1d_array(array, indices):
    return array[indices,]

def gufunc_idxmax(x, y, axis=None):
    # note: y is always a numpy.ndarray, because IndexVariable objects
    # always have their data loaded into memory
    indx = argmax(x, axis)
    func = functools.partial(_index_from_1d_array, y)

    if isinstance(array, dask_array_type):
        import dask.array as da
        return da.map_blocks(func, indx, dtype=indx.dtype)
    else:
        return func(indx)

jcmgray · 2017-01-31T00:06:02Z

So I thought take was just the functional equivalent of fancy indexing - I spotted it in the dask api and assumed it would work but having tried it does indeed just raise a 'not implemented error'. Just as a note, with the map_blocks approach above take is working for some cases where x[inds, ] is not -- related to #1237?

Regarding edge cases: multiple maxes is presumably fine as long as user is aware it just takes the first.
However, nanargmax is probably the actual desired function here, but looks like it will raise on all-nan slices. Would dropping these and then re-aligning be too much overhead?

shoyer · 2017-01-31T00:21:35Z

take is numpy function that only handles scalar or 1d arguments: https://docs.scipy.org/doc/numpy/reference/generated/numpy.take.html#numpy.take

I just merged #1237 -- see if it works with that.

multiple maxes is presumably fine as long as user is aware it just takes the first.

Yeah, that's not a problem here, only for the where based implementation.

However, nanargmax is probably the actual desired function here, but looks like it will raise on all-nan slices. Would dropping these and then re-aligning be too much overhead?

This behavior for nanargmax is unfortunate. The "right" behavior for xarray is probably to use NaN or NaT to mark the index in such locations but numpy makes this tricky. I think this could be achieved, though, with some mix of where, isnull and other vectorized operations. Basically you need to replace all NaN slices with some placeholder value before calculating nanargmax, and then use the locations of all NaN slices again to replace the results of nanargmax with the appropriate fill value.

jcmgray · 2017-02-01T00:26:24Z

Ah yes both ways are working now, thanks. Just had a little play around with timings, and this seems like a reasonably quick way to achieve correct NaN behaviour:

def xr_idxmax(obj, dim):
    sig = ([(dim,), (dim,)], [()])
    kwargs = {'axis': -1}
    
    allna = obj.isnull().all(dim)
    
    return apply_ufunc(gufunc_idxmax, obj.fillna(-np.inf), obj[dim],
                       signature=sig, kwargs=kwargs,
                       dask_array='allowed').where(~allna).fillna(np.nan)

i.e. originally replace all NaN values with -Inf, use the usual argmax, and remask the all-NaN values afterwards.

shoyer · 2017-02-01T00:30:32Z

Yes, that looks pretty reasonable. Two minor concerns:

obj.fillna(-np.inf) converts all dtypes to float. It would be better to stick to obj.fillna(0), though integers can't have NaNs anyways.
I'm pretty sure .fillna(np.nan) is a no-op, filling in NaNs with NaN.

jcmgray · 2017-02-01T00:43:52Z

Would using obj.fillna(0) not mess with argmax if for instance all the data is negative? Could fill with the min value instead?

Ah yes true. I was slightly anticipating e.g. filling with NaT if the dim was time-like, though time types are not something I am familiar with.

shoyer · 2017-02-01T01:01:27Z

Would using obj.fillna(0) not mess with argmax if for instance all the data is negative? Could fill with the min value instead?

Indeed, fillna(0) won't work right. For what I was thinking of, we could use the three argument version of where (#576) here, e.g., obj.where(allna, 0). But fillna with the min value could also work -- that's actually exactly how np.nanargmax works.

Ah yes true. I was slightly anticipating e.g. filling with NaT if the dim was time-like, though time types are not something I am familiar with.

Yes, ideally we would detect the dtype and find an appropriate fill or minimum value, similar to _maybe_promote. The argument to fillna would either be a scalar (for a DataArray)` or a dict (for a Dataset).

stale · 2019-01-24T03:22:02Z

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity
If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically

shoyer · 2019-01-24T04:05:17Z

This is still relevant

HiperMaximus · 2019-09-03T18:15:10Z

this is still very relevant

joemcglinchy · 2019-09-17T18:54:40Z

I got around this with some (masked) numpy operations. perhaps it is useful? I was seeing the np.argmax results on entries with all NaN evaluate to zero, which was not useful since the axis I was computing argmax across had valid entries if the result was 0 (think 0-index month, i.e., January, within a year). So I did this instead:

# test_arr is some array with some nodata value, and is of dims [channels, rows, columns]
nodata = -32768
ma = np.ma.masked_equal(test_arr, nodata)

# use np.any to get a mask of rows/columns which have all masked entries
spec_axis = 0
all_na_mask = np.any(ma, axis=spec_axis)

# get the argmax across specified axis
argm = np.argmax(test_arr, axis=spec_axis)
argm = np.ma.masked_less(argm, -np.inf)
argm.mask = ~all_na_mask

big piece here is modifying the mask directly and making sure that is correct. numpy docs advise against this approach but it seems to be giving me what I want.

mathause · 2020-03-13T00:16:32Z

How would idxmax be different to argmax? E.g.

import xarray as xr
xr.DataArray([1, 3, 2]).argmax()

Could this be closed?

shoyer · 2020-03-13T00:43:48Z

idxmax() should return the coordinate labels, not integer positions, corresponding to the max.

e.g., xr.DataArray([1, 3, 2], dims=['x'], coords={'x': [10, 20, 30]}).argmax()
should return 20 (but probably inside an xarray.DataArray)

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 2.3.1 to 2.3.2. - [Release notes](https://github.com/actions/setup-python/releases) - [Commits](actions/setup-python@v2.3.1...v2.3.2) --- updated-dependencies: - dependency-name: actions/setup-python dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

shoyer added the enhancement label Mar 10, 2014

shoyer changed the title ~~Implement DatasetArray.idxmax()~~ TODO: Implement DatasetArray.idxmax() Mar 24, 2014

shoyer changed the title ~~TODO: Implement DatasetArray.idxmax()~~ TODO: Implement DataArray.idxmax() Apr 11, 2014

shoyer changed the title ~~TODO: Implement DataArray.idxmax()~~ Implement DataArray.idxmax() May 6, 2014

shoyer added the todo label May 6, 2014

shoyer added the pandas features label Jul 15, 2014

shoyer added this to the 1.0 milestone Aug 3, 2014

shoyer mentioned this issue Jan 30, 2017

Insert minimal dimensions when broadcasting in apply_ufunc #1237

Merged

1 task

shoyer mentioned this issue Oct 18, 2017

DataArray.argsort should be deleted #1635

Open

stale bot added the stale label Jan 24, 2019

max-sixty removed the stale label Jan 24, 2019

toddrjen mentioned this issue Mar 20, 2020

Implement idxmax and idxmin functions #3871

Merged

4 tasks

max-sixty closed this as completed in #3871 Mar 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement DataArray.idxmax() #60

Implement DataArray.idxmax() #60

shoyer commented Mar 10, 2014

jcmgray commented Jan 27, 2017

shoyer commented Jan 30, 2017 •

edited

Loading

jcmgray commented Jan 31, 2017

shoyer commented Jan 31, 2017

jcmgray commented Feb 1, 2017

shoyer commented Feb 1, 2017

jcmgray commented Feb 1, 2017

shoyer commented Feb 1, 2017

stale bot commented Jan 24, 2019

shoyer commented Jan 24, 2019

HiperMaximus commented Sep 3, 2019

joemcglinchy commented Sep 17, 2019

mathause commented Mar 13, 2020

shoyer commented Mar 13, 2020

Implement DataArray.idxmax() #60

Implement DataArray.idxmax() #60

Comments

shoyer commented Mar 10, 2014

jcmgray commented Jan 27, 2017

shoyer commented Jan 30, 2017 • edited Loading

jcmgray commented Jan 31, 2017

shoyer commented Jan 31, 2017

jcmgray commented Feb 1, 2017

shoyer commented Feb 1, 2017

jcmgray commented Feb 1, 2017

shoyer commented Feb 1, 2017

stale bot commented Jan 24, 2019

shoyer commented Jan 24, 2019

HiperMaximus commented Sep 3, 2019

joemcglinchy commented Sep 17, 2019

mathause commented Mar 13, 2020

shoyer commented Mar 13, 2020

shoyer commented Jan 30, 2017 •

edited

Loading