unexpected positional indexing behavior with Dataset and DataArray "isel" #411

jhamman · 2015-05-31T04:48:10Z

I may be missing something here but I think the indexing behavior in isel is surprisingly different to that of numpy and is incongruent with the xray documentation. Either this is a bug or a feature that I don't understand.

From the xray docs on positional indexing:

Indexing a DataArray directly works (mostly) just like it does for numpy arrays, except that the returned object is always another DataArray

My example below uses two 1d numpy arrays to select from a 3d numpy array. When using pure numpy, I get a 2d array back. In my view, this is the expected behavior. When using the xray.Dataset or xray.DataArray, I get an oddly shaped 3d array back with a duplicate dimension.

import numpy as np
import xray
import sys

print('python version:', sys.version)
print('numpy version:', np.version.full_version)
print('xray version:', xray.version.version)

python version: 3.4.3 |Anaconda 2.2.0 (x86_64)| (default, Mar  6 2015, 12:07:41) 
[GCC 4.2.1 (Apple Inc. build 5577)]
numpy version: 1.9.2
xray version: 0.4.1

# A few numpy arrays
time = np.arange(100)
lons = np.arange(40, 60)
lats = np.arange(25, 70)
np_data = np.random.random(size=(len(time), len(lats), len(lons)))

# pick some random points to select
ys, xs = np.nonzero(np_data[0] > 0.8)
print(len(ys))

# create a xray.DataArray and xray.Dataset
xr_data = xray.DataArray(np_data, [('time', time), ('y', lats), ('x', lons)])  # DataArray
xr_ds = xr_data.to_dataset(name='data')  # Dataset

# numpy indexing 
print('numpy: ', np_data[:, ys, xs].shape)

# xray positional indexing
print('xray1: ', xr_data.isel(y=ys, x=xs).shape)
print('xray2: ', xr_data[:, ys, xs].shape)
print('xray3: ', xr_ds.isel(y=ys, x=xs))

numpy:  (100, 176)
xray1:  (100, 176, 176)
xray2:  (100, 176, 176)
xray3:  <xray.Dataset>
Dimensions:  (time: 100, x: 176, y: 176)
Coordinates:
  * x        (x) int64 46 47 57 45 48 50 51 54 57 59 48 52 49 50 52 53 55 57 43 46 47 48 53 ...
  * time     (time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ...
  * y        (y) int64 25 25 25 26 26 26 26 26 26 26 27 27 28 28 28 28 28 28 29 29 29 29 29 ...
Data variables:
    data     (time, y, x) float64 0.9343 0.8311 0.8842 0.3188 0.02052 0.4506 0.04177 ...

The text was updated successfully, but these errors were encountered:

shoyer · 2015-05-31T07:46:38Z

This is expected -- we do orthogonal indexing (treating each coordinate independently), not NumPy style fancy indexing. If you read on in the indexing docs, you'll find this under "Indexing details": http://xray.readthedocs.org/en/latest/indexing.html#indexing-details

Perhaps we should note this more prominently!

The style of indexing you're describing here is definitely useful, though (e.g., for selecting the locations of stations from a grid) -- it's something that I would really like to support.

jhamman · 2015-05-31T15:45:49Z

Okay. Glad to hear I was missing something. For large datasets, I was getting a MemoryError with orthogonal indexing as the array size increased like it did in my example.

The indexing details link you show does explain the behavior better but it is somewhat contradictory to the quote I listed above. I think it would be worth reconciling the docs on that point.

The station selection use case you bring up is exactly what I was going for. It would be nice if we could stay in xray / pandas after the indexing. Ideally, I don't want to do: xr_data.values[:, ys, xs] and then recreate a DataArray.

Lastly, is there an issue open for support of numpy fancy indexing? This is something that would be useful for my work and I could contribute to the effort beginning in July.

shoyer · 2015-05-31T22:03:15Z

The indexing details link you show does explain the behavior better but it is somewhat contradictory to the quote I listed above. I think it would be worth reconciling the docs on that point.

I'll definitely add a note to qualify that description in the docs -- sorry you went to all the trouble of writing up the bug report!

The station selection use case you bring up is exactly what I was going for. It would be nice if we could stay in xray / pandas after the indexing. Ideally, I don't want to do: xr_data.values[:, ys, xs] and then recreate a DataArray.

The open issue for this is #214, which describes how I currently do "diagonal" style indexing to extract stations. It preserves all the metadata, but is a little slow if you have a very long list of stations:.

Note that the example in that first comment will work even with arrays that don't fit into memory if you use dask (which will be in the next xray release, which will be out today or tomorrow).

Lastly, is there an issue open for support of numpy fancy indexing?

Full NumPy style fancy indexing is out of scope for xray. You can simply do way too many complex things with it for which preserving the metadata is impossible (e.g., you can scramble the elements of a 2D array into any arbitrary desired positions). Moreover, the underlying array indexing operations are only possible to do efficiently if the underlying array is backed by NumPy -- there's no way we'll do that with netCDF4 or dask.array. But more limited cases, such as the "diagonal" or intersection style array indexing you want here are definitely possible to support, and help for that would certainly be appreciated.

shoyer · 2015-05-31T23:14:01Z

Added a warning in #412. Here's what that doc page looks like now:

jhamman · 2015-06-01T05:03:29Z

Thanks @shoyer . I think #412 suffices for this issue.

I will touch base regarding some point-wise/intersection/diagonal indexing (#214) when I return in July.

jhamman closed this as completed Jun 1, 2015

jhamman mentioned this issue Jul 15, 2015

API design for pointwise indexing #475

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unexpected positional indexing behavior with Dataset and DataArray "isel" #411

unexpected positional indexing behavior with Dataset and DataArray "isel" #411

jhamman commented May 31, 2015

shoyer commented May 31, 2015

jhamman commented May 31, 2015

shoyer commented May 31, 2015

shoyer commented May 31, 2015

jhamman commented Jun 1, 2015

unexpected positional indexing behavior with Dataset and DataArray "isel" #411

unexpected positional indexing behavior with Dataset and DataArray "isel" #411

Comments

jhamman commented May 31, 2015

shoyer commented May 31, 2015

jhamman commented May 31, 2015

shoyer commented May 31, 2015

shoyer commented May 31, 2015

jhamman commented Jun 1, 2015