Pointwise indexing -- something like sel_points #214

shoyer · 2014-08-15T23:17:42Z

@hdail suggested that it would be useful to be able to index points in addition to indexing dimensions separately. Right now, you can do this via something like:
xray.concat([ds.sel(x=x, y=y) for x, y in pts], dim='station')

This would be handy for sampling particular points out of multiple dimensions, e.g., to compare gridded and station data.

It's also probably worth implementing in xray because it can be done quickly using numpy's fancy indexing (I think) which would be more efficient than a loop, though I suspect the implementation is probably somewhat complex.
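To illustrate the difference between the loop and fancy indexing, here is a minimal sketch on a plain NumPy array rather than an xray object. The names `grid` and `pts` and the (y, x) dimension order are assumptions made for the example, not part of any xray API.

```python
import numpy as np

# Hypothetical 2-D grid with dimensions (y, x).
grid = np.arange(12).reshape(3, 4)

# Hypothetical integer positions of three points, given as (x, y) pairs.
pts = [(0, 0), (3, 1), (2, 2)]

# Loop version: one scalar lookup per point, analogous to concatenating
# many single-point selections.
looped = np.array([grid[y, x] for x, y in pts])

# Fancy-indexing version: pass one integer array per dimension and NumPy
# pairs them element by element, returning one value per point.
xs, ys = np.array(pts).T
pointwise = grid[ys, xs]

# Orthogonal (outer) indexing, by contrast, returns the full 3x3 block.
outer = grid[np.ix_(ys, xs)]
```

The pointwise result has one entry per point, while the outer result has one entry per combination of y and x, which is the distinction this issue is about.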


shoyer · 2014-08-21T02:48:07Z

This operation is actually sort of like reindexing. So perhaps this should be spelled ds.reindex_like(other) or ds.reindex(other.coords). With labeled dimensions and variables there is enough metadata to make the reshaping unambiguous.

WeatherGod · 2014-10-03T19:56:16Z

Unless I am missing something about xray, that selection operation could only work if pts had values that exactly matched coordinate values in ds. In most scenarios, that would not be the case. One would have to first build pts from a computation of nearest-neighbor indices between the stations and the model grid.

shoyer · 2014-10-03T20:03:53Z

@WeatherGod You are totally correct. The last dataset on which I have needed to do this was an unprojected grid with a constant increment of 0.5 degrees between points, so finding nearest neighbors was easy.
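On an evenly spaced axis, the nearest grid index can be computed arithmetically, with no search at all. The following is a rough sketch of that idea; the helper name `nearest_index_regular` and the axis parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def nearest_index_regular(values, origin, step, size):
    """Nearest grid index on an evenly spaced axis (hypothetical helper)."""
    idx = np.rint((np.asarray(values, dtype=float) - origin) / step).astype(int)
    # Clip so that points outside the grid map to the nearest edge cell.
    return np.clip(idx, 0, size - 1)

# Assumed axis: 21 longitudes starting at -100.0 in 0.5-degree steps.
lon_idx = nearest_index_regular(
    [-98.025, -96.665, -120.0], origin=-100.0, step=0.5, size=21
)
```

Clipping silently snaps out-of-range points to the grid edge; raising an error instead may be preferable depending on the use case.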

If you have a lot of points to select at, finding nearest neighbor points could be done efficiently with a tree, e.g., scipy.spatial.cKDTree: http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query
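A minimal sketch of the tree-based lookup, assuming SciPy is available. The grid and station values are made up, and plain Euclidean distance in longitude/latitude degrees is only an approximation of true distance on the sphere.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical grid points as an (n_points, 2) array of (lon, lat) pairs.
grid_lon, grid_lat = np.meshgrid(np.arange(0.0, 3.0), np.arange(10.0, 12.0))
grid_pts = np.column_stack([grid_lon.ravel(), grid_lat.ravel()])

# Hypothetical station locations, also as (lon, lat) pairs.
stations = np.array([[0.1, 10.2], [1.9, 10.9]])

tree = cKDTree(grid_pts)
# For each station, the distance to and flat index of the nearest grid point.
dists, flat_idx = tree.query(stations)

nearest = grid_pts[flat_idx]
```

Building the tree once and querying all stations in a single call is what makes this efficient when there are many points.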

WeatherGod · 2014-10-03T20:48:35Z

Just managed to implement this using your suggestion for my data:

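import numpy as np
import xray
from scipy.spatial import cKDTree as KDTree
# Build the tree from the flattened 2-D model coordinates (Python 2: zip returns a list).
kd = KDTree(zip(mod['longitude'].values.ravel(), mod['latitude'].values.ravel()))
dists, indx = kd.query(zip(obs['longitude'], obs['latitude']))
# Convert flat positions back to one index array per dimension of mod['longitude'].
indx = np.unravel_index(indx, mod['longitude'].shape)
mod_points = xray.concat([mod.isel(x=x, y=y) for y, x in zip(*indx)], dim='station')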

Not entirely certain why I needed to reverse y and x in that last part, but, oh well...

shoyer · 2014-10-03T21:19:57Z

@WeatherGod Very nice! I'm not entirely sure why you have to reverse y and x at the end, either -- what is the order of the dimensions on mod['longitude']?
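One possible explanation, assuming `mod['longitude']` has its dimensions ordered (y, x), which the thread does not confirm: `np.unravel_index` returns one index array per axis in the array's own axis order, so the first array would hold y positions. A small sketch with a made-up shape:

```python
import numpy as np

# Hypothetical 2-D coordinate array with dimensions ordered (y, x):
# 3 rows (y) by 4 columns (x).
lon2d = np.zeros((3, 4))

# Flat positions, such as a KD-tree built from lon2d.ravel() would return.
flat = np.array([1, 6, 11])

# unravel_index returns index arrays in axis order: y first, then x.
iy, ix = np.unravel_index(flat, lon2d.shape)
```

Under that assumption, unpacking `zip(*indx)` as `y, x` simply follows the axis order of the coordinate array.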

WeatherGod · 2014-10-09T17:58:25Z

Started using the above snippet for more datasets, some with interdependent coordinates and some without (so the coordinates would be 1-d). I think I have generalized it significantly...

def grid_to_points(grid, points, coord_names):
    not_spatial = set(grid.dims) - set(coord_names)
    spatial_selection = {n:0 for n in not_spatial}
    spat_only = grid.isel(**spatial_selection)
    coords = []
    for i, n in enumerate(spat_only.dims):
        if spat_only[n].ndim != len(spat_only.dims):
            # Needs new axes
            slices = [np.newaxis] * len(spat_only.dims)
            slices[i] = slice(None)
        else:
            slices = [slice(None)] * len(spat_only.dims)
        coords.append(spat_only[n].values[slices])
    coords = [c.flatten() for c in np.broadcast_arrays(*coords)]

    kd = KDTree(zip(*coords))
    _, indx = kd.query(zip(*[points[n].values for n in spat_only.dims]))
    indx = np.unravel_index(indx, spat_only.shape)

    return xray.concat((grid.isel(**{n:j for n, j in zip(spat_only.dims, i)})
                        for i in zip(*indx)), dim='station')

I can still imagine some situations where this won't work, such as a requested set of dimensions that are a mix of dependent and independent variables. Currently, if the dimensions are independent, then the number of dimensions of each one is assumed to be 1 and np.newaxis is used for the others. Meanwhile, if the dimensions are dependent, then the number of dimensions for each one is assumed to be the same as the number of dependent variables and is merely flattened (the broadcast is essentially no-op).

I should also note that this is technically not restricted to spatial coordinates even though the code says so. Just anything that can be represented in euclidean space.
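The two cases described above can be sketched with plain NumPy arrays. The coordinate values and names here are invented for illustration; the point is only how `np.newaxis` and `np.broadcast_arrays` bring independent 1-D coordinates to the same shape that dependent 2-D coordinates already have.

```python
import numpy as np

# Independent case: hypothetical 1-D coordinates along two separate dimensions.
y = np.array([10.0, 20.0])       # shape (2,)
x = np.array([1.0, 2.0, 3.0])    # shape (3,)

# Give each 1-D coordinate its own axis, then broadcast to the grid shape (2, 3).
yy, xx = np.broadcast_arrays(y[:, np.newaxis], x[np.newaxis, :])

# Dependent case: coordinates that are already 2-D keep their shape.
lat2d = np.arange(6.0).reshape(2, 3)
lon2d = lat2d + 100.0
lat_b, lon_b = np.broadcast_arrays(lat2d, lon2d)

# Either way, flattening gives one coordinate pair per grid cell,
# which is the input a KD-tree needs.
pairs = np.column_stack([yy.ravel(), xx.ravel()])
```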

WeatherGod · 2014-10-09T18:00:33Z

Oh, and it does take advantage of a bunch of Python 2.7 features such as dictionary comprehensions and generator expressions, so...

WeatherGod · 2014-10-09T18:06:56Z

And, I think I just realized how I could generalize it even more. Right now, grid can only be a DataArray, but I would like this to work for a Dataset as well. I bet if I use .sel() instead of .isel() and access the elements of the broadcasted arrays, I could make this work very nicely for both DataArray and Dataset.

WeatherGod · 2014-10-09T18:21:16Z

And, actually, the example I gave above has a bug in the dependent dimension case. This one should be much better (not fully tested yet, though):

def grid_to_points2(grid, points, coord_names):
    if not coord_names:
        raise ValueError("No coordinate names provided")
    not_spatial = set(grid.dims) - set(coord_names)
    spatial_selection = {n:0 for n in not_spatial}
    spat_only = grid.isel(**spatial_selection)
    coords = []
    for i, n in enumerate(spat_only.dims):
        if spat_only[n].ndim != len(spat_only.dims):
            # Needs new axes
            slices = [np.newaxis] * len(spat_only.dims)
            slices[i] = slice(None)
        else:
            slices = [slice(None)] * len(spat_only.dims)
        coords.append(spat_only[n].values[slices])
    coords = np.broadcast_arrays(*coords)

    kd = KDTree(zip(*[c.flatten() for c in coords]))
    _, indx = kd.query(zip(*[points[n].values for n in spat_only.dims]))
    indx = np.unravel_index(indx, coords[0].shape)

    return xray.concat(
            (grid.sel(**{n:c[i] for n, c in zip(spat_only.dims, coords)})
             for i in zip(*indx)),
            dim='station')

shoyer · 2014-10-09T18:21:24Z

The only part that wouldn't work for a Dataset is spat_only.shape. On a dataset, you can get that information from the values of the dims dictionary (the difference between dims on a dataset and dataarray is definitely an ugly corner of the API).

Also, you probably want to use c.ravel() instead of c.flatten(), because the latter always makes a copy.
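A quick way to see the difference, using `np.shares_memory` on a small made-up array:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # C-contiguous array

flat_copy = a.flatten()   # always a new array
flat_view = a.ravel()     # a view here, because `a` is contiguous

copy_shares = np.shares_memory(a, flat_copy)   # False
view_shares = np.shares_memory(a, flat_view)   # True

# ravel() still has to copy when no view is possible,
# for example on a transposed array.
transposed_shares = np.shares_memory(a, a.T.ravel())   # False
```

So `ravel()` avoids the copy only when the memory layout allows it, but it never copies more than `flatten()` does.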

shoyer · 2014-10-09T18:27:11Z

The main logic there -- it looks like this is a routine for broadcasting data arrays? I have something similar, but not exactly the same, in xray.core.variable.broadcast_variables. It's also very similar to the logic in xray.Dataset.to_dataframe, e.g., right now I think you could do the broadcasting by doing xray.Dataset.from_dataframe(spat_only.to_dataframe()).

WeatherGod · 2014-10-09T18:47:22Z

oooh, didn't realize that dims is different for Dataset and DataArray... Gonna have to fix that, too. I am checking out the broadcasting functions you pointed out. The one limitation I see right away with xray.core.variable.broadcast_variables is that it is limited to two variables (presumably, I would be broadcasting N coordinates, because the variables may or may not have extraneous dimensions that I don't care to broadcast).

WeatherGod · 2014-10-09T19:16:52Z

to/from_dataframe just ate up all my memory. I think I am going to stick with my broadcasting approach...

WeatherGod · 2014-10-09T19:43:08Z

Hmmm, limitation that I just encountered. When there are dependent coordinates, the variables representing those coordinates are not the index arrays (and thus, are not "dimensions" either), so my solution is completely broken for dependent coordinates.
If I were to go back to my DataArray-only solution, then I still need to correct the code to use the dimension names of the coordinate variables, and still need to fix the coordinates != dimensions issue.

shoyer · 2014-10-09T19:44:25Z

What do you mean by "dependent coordinates"?

WeatherGod · 2014-10-09T20:05:01Z

Consider the following Dataset:

<xray.Dataset>
Dimensions:           (lv_HTGL1: 2, lv_HTGL3: 2, lv_HTGL5: 2, lv_HTGL6: 2, lv_ISBL0: 37, lv_SPDL2: 6, lv_SPDL4: 3, time: 9, xgrid_0: 451, ygrid_0: 337)
Coordinates:
  * xgrid_0           (xgrid_0) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * ygrid_0           (ygrid_0) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * lv_ISBL0          (lv_ISBL0) float32 10000.0 12500.0 15000.0 17500.0 20000.0 ...
  * lv_HTGL6          (lv_HTGL6) float32 1000.0 4000.0
  * lv_HTGL1          (lv_HTGL1) float32 2.0 80.0
  * lv_HTGL3          (lv_HTGL3) float32 10.0 80.0
    latitude          (ygrid_0, xgrid_0) float32 16.281 16.3084 16.3356 16.3628 16.3898 ...
    longitude         (ygrid_0, xgrid_0) float32 233.862 233.984 234.106 234.229 ...
  * lv_HTGL5          (lv_HTGL5) int64 0 1
  * lv_SPDL2          (lv_SPDL2) int64 0 1 2 3 4 5
  * lv_SPDL4          (lv_SPDL4) int64 0 1 2
  * time              (time) datetime64[ns] 2014-09-25T01:00:00 ...
Variables:
    gridrot_0         (ygrid_0, xgrid_0) float32 -0.229676 -0.228775 -0.227873 ...
    TMP_P0_L103_GLC0  (time, lv_HTGL1, ygrid_0, xgrid_0) float64 295.8 295.7 295.7 295.7 ...

The latitude and longitude variables are both dependent upon xgrid_0 and ygrid_0. Meanwhile...

<xray.Dataset>
Dimensions:    (station: 120, time: 4)
Coordinates:
    latitude   (station) float32 34.805 34.795 34.585 36.705 34.245 34.915 34.195 36.075 ...
  * station    (station) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ...
    sixhourly  (time) int64 0 1 2 3
    longitude  (station) float32 -98.025 -96.665 -99.335 -98.705 -95.665 -98.295 ...
  * time       (time) datetime64[ns] 2014-10-07 2014-10-07T06:00:00 ...
Variables:
    MaxGust    (station, time) float64 7.794 7.47 8.675 4.788 7.071 7.903 8.641 5.533 ...

the latitude and longitude variables are independent of each other (they are 1-D).

The variable in the first one can not be accessed directly by lat/lon values, while the MaxGust variable in the second one can. This poses some difficulties.

WeatherGod · 2014-10-09T20:19:12Z

Ok, I think I got it (for reals this time...)

def bcast(spat_only, coord_names):
    coords = []
    for i, n in enumerate(coord_names):
        if spat_only[n].ndim != len(spat_only.dims):
            # Needs new axes
            slices = [np.newaxis] * len(spat_only.dims)
            slices[i] = slice(None)
        else:
            slices = [slice(None)] * len(spat_only.dims)
        # Index with a tuple: newer NumPy rejects a list of slices/None here.
        coords.append(spat_only[n].values[tuple(slices)])
    return np.broadcast_arrays(*coords)

def grid_to_points2(grid, points, coord_names):
    if not coord_names:
        raise ValueError("No coordinate names provided")
    spat_dims = {d for n in coord_names for d in grid[n].dims}
    not_spatial = set(grid.dims) - spat_dims
    spatial_selection = {n:0 for n in not_spatial}
    spat_only = grid.isel(**spatial_selection)

    coords = bcast(spat_only, coord_names)

    kd = KDTree(zip(*[c.ravel() for c in coords]))
    _, indx = kd.query(zip(*[points[n].values for n in coord_names]))
    indx = np.unravel_index(indx, coords[0].shape)

    return xray.concat(
            (grid.isel(**{n:j for n, j in zip(spat_only.dims, i)})
             for i in zip(*indx)),
            dim='station')

Needs a lot more tests and comments and such, but I think this works. Best part is that it seems to do a very decent job of keeping memory usage low, and only operates upon the coordinates that I specify. Everything else is left alone. So, I have used this on 4-D data, picking out grid points at specified lat/lon positions, and get back a 3D result (time, level, station). And I have used this on just 2D data, getting back just a 1D result (dimension='station').
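The dimensionality claim above (4-D in, 3-D out with a station axis) can be checked on a plain NumPy array. This sketch assumes dimensions ordered (time, level, y, x) and uses invented index arrays in place of the KD-tree output.

```python
import numpy as np

# Hypothetical 4-D field with dimensions (time, level, y, x).
data = np.random.rand(9, 2, 5, 7)

# Hypothetical nearest-neighbor grid indices for three stations.
iy = np.array([0, 2, 4])
ix = np.array([6, 3, 1])

# Index only the two spatial axes; the slices leave time and level untouched.
# The paired integer arrays collapse (y, x) into a single station axis.
at_stations = data[:, :, iy, ix]

# Equivalent loop, one station at a time, stacked along a new last axis.
looped = np.stack([data[:, :, y, x] for y, x in zip(iy, ix)], axis=-1)
```

The fancy-indexed version avoids the per-station Python loop and the concatenation step, which is the efficiency argument made at the top of this issue.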

wholmgren · 2015-07-07T00:51:04Z

+1 for this proposal.

I made a slight modification to @WeatherGod's code so that I could use string indices for the "station" coordinate, though I'm sure there is a better way to implement this. Also note the addition of a few list calls for Python 3 compat.

def grid_to_points2(grid, points, coord_names):
    if not coord_names:
        raise ValueError("No coordinate names provided")
    spat_dims = {d for n in coord_names for d in grid[n].dims}
    not_spatial = set(grid.dims) - spat_dims
    spatial_selection = {n:0 for n in not_spatial}
    spat_only = grid.isel(**spatial_selection)

    coords = bcast(spat_only, coord_names)

    kd = KDTree(list(zip(*[c.ravel() for c in coords])))
    _, indx = kd.query(list(zip(*[points[n].values for n in coord_names])))
    indx = np.unravel_index(indx, coords[0].shape)

    # Use the index of the points argument (not a global) as the station labels.
    station_da = xray.DataArray(name='station', dims='station', data=points.index.values)

    return xray.concat(
            (grid.isel(**{n:j for n, j in zip(spat_only.dims, i)})
             for i in zip(*indx)),
            dim=station_da)

In [97]:
stations = pd.DataFrame({'XLAT':[32.13, 32.43], 'XLONG':[-110.95, -112.02]}, index=['KTUS', 'KPHX'])
stations
Out[97]:
        XLAT    XLONG
KTUS   32.13  -110.95
KPHX   32.43  -112.02

In [98]:
station_ds = grid_to_points2(ds, stations, ('XLAT', 'XLONG'))
station_ds
Out[98]:
<xray.Dataset>
Dimensions:      (Times: 1681, station: 2)
Coordinates:
  * Times        (Times) datetime64[ns] 2015-07-02T06:00:00 ...
    XLAT         (station) float32 32.1239 32.4337
  * station      (station) object 'KTUS' 'KPHX'
    west_east    (station) int64 220 164
    XLONG        (station) float32 -110.947 -112.012
    south_north  (station) int64 116 134
Data variables:
    SWDNBRH      (station, Times) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    V10          (station, Times) float32 -2.09897 -1.94047 -1.55494 ...
    V80          (station, Times) float32 0.0 -1.95921 -1.87583 -1.86289 ...
    SWDNB        (station, Times) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    U10          (station, Times) float32 2.22951 1.89406 1.39955 1.04277 ...
    SWDDNI       (station, Times) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    SWDNBC       (station, Times) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    T2           (station, Times) float32 301.419 303.905 304.155 304.296 ...
    SWDDNIRH     (station, Times) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
    U80          (station, Times) float32 0.0 1.93936 1.7901 1.63011 1.69481 ...

In [100]:
station_ds.sel(station='KTUS')[['U10','V10']]
Out[100]:
<xray.Dataset>
Dimensions:      (Times: 1681)
Coordinates:
    west_east    int64 220
    south_north  int64 116
    XLONG        float32 -110.947
  * Times        (Times) datetime64[ns] 2015-07-02T06:00:00 ...
    station      object 'KTUS'
    XLAT         float32 32.1239
Data variables:
    U10          (Times) float32 2.22951 1.89406 1.39955 1.04277 1.16338 ...
    V10          (Times) float32 -2.09897 -1.94047 -1.55494 -1.34216 ...

shoyer · 2016-07-27T22:34:12Z

Fixed by #507


shoyer mentioned this issue Sep 30, 2014

Add example showing how to sample gridded data at points #241

Closed

shoyer added the topic-indexing label Oct 8, 2014

IamJeffG mentioned this issue Feb 23, 2015

Proposal: allow tuples instead of slice objects in sel or isel #280

Closed

shoyer mentioned this issue May 31, 2015

unexpected positional indexing behavior with Dataset and DataArray "isel" #411

Closed

jhamman mentioned this issue Jul 15, 2015

API design for pointwise indexing #475

Open

shoyer closed this as completed Jul 27, 2016

jhamman mentioned this issue Jul 27, 2017

ENH: points coord from isel/sel_points should be a MultiIndex #1493

Closed

keewis pushed a commit to keewis/xarray that referenced this issue Jan 17, 2024

Deprecate python 3.8 (pydata#214)

1694ed1

* update requirements and envs * whatsnew * added classifier for 3.11
