Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unexpected positional indexing behavior with Dataset and DataArray "isel" #411

Closed
jhamman opened this issue May 31, 2015 · 5 comments
Closed

Comments

@jhamman
Copy link
Member

jhamman commented May 31, 2015

I may be missing something here but I think the indexing behavior in isel is surprisingly different to that of numpy and is incongruent with the xray documentation. Either this is a bug or a feature that I don't understand.

From the xray docs on positional indexing:

Indexing a DataArray directly works (mostly) just like it does for numpy arrays, except that the returned object is always another DataArray

My example below uses two 1d numpy arrays to select from a 3d numpy array. When using pure numpy, I get a 2d array back. In my view, this is the expected behavior. When using the xray.Dataset or xray.DataArray, I get an oddly shaped 3d array back with a duplicate dimension.

import numpy as np
import xray
import sys

print('python version:', sys.version)
print('numpy version:', np.version.full_version)
print('xray version:', xray.version.version)
python version: 3.4.3 |Anaconda 2.2.0 (x86_64)| (default, Mar  6 2015, 12:07:41) 
[GCC 4.2.1 (Apple Inc. build 5577)]
numpy version: 1.9.2
xray version: 0.4.1
# A few numpy arrays
time = np.arange(100)
lons = np.arange(40, 60)
lats = np.arange(25, 70)
np_data = np.random.random(size=(len(time), len(lats), len(lons)))

# pick some random points to select
ys, xs = np.nonzero(np_data[0] > 0.8)
print(len(ys))
176
# create a xray.DataArray and xray.Dataset
xr_data = xray.DataArray(np_data, [('time', time), ('y', lats), ('x', lons)])  # DataArray
xr_ds = xr_data.to_dataset(name='data')  # Dataset

# numpy indexing 
print('numpy: ', np_data[:, ys, xs].shape)

# xray positional indexing
print('xray1: ', xr_data.isel(y=ys, x=xs).shape)
print('xray2: ', xr_data[:, ys, xs].shape)
print('xray3: ', xr_ds.isel(y=ys, x=xs))
numpy:  (100, 176)
xray1:  (100, 176, 176)
xray2:  (100, 176, 176)
xray3:  <xray.Dataset>
Dimensions:  (time: 100, x: 176, y: 176)
Coordinates:
  * x        (x) int64 46 47 57 45 48 50 51 54 57 59 48 52 49 50 52 53 55 57 43 46 47 48 53 ...
  * time     (time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ...
  * y        (y) int64 25 25 25 26 26 26 26 26 26 26 27 27 28 28 28 28 28 28 29 29 29 29 29 ...
Data variables:
    data     (time, y, x) float64 0.9343 0.8311 0.8842 0.3188 0.02052 0.4506 0.04177 ...
@shoyer
Copy link
Member

shoyer commented May 31, 2015

This is expected -- we do orthogonal indexing (treating each coordinate independently), not NumPy style fancy indexing. If you read on in the indexing docs, you'll find this under "Indexing details": http://xray.readthedocs.org/en/latest/indexing.html#indexing-details

Perhaps we should note this more prominently!

The style of indexing you're describing here is definitely useful, though (e.g., for selecting the locations of stations from a grid) -- it's something that I would really like to support.

@jhamman
Copy link
Member Author

jhamman commented May 31, 2015

Okay. Glad to hear I was missing something. For large datasets, I was getting a MemoryError with orthogonal indexing as the array size increased like it did in my example.

The indexing details link you show does explain the behavior better but it is somewhat contradictory to the quote I listed above. I think it would be worth reconciling the docs on that point.

The station selection use case you bring up is exactly what I was going for. It would be nice if we could stay in xray / pandas after the indexing. Ideally, I don't want to do: xr_data.values[:, ys, xs] and then recreate a DataArray.

Lastly, is there an issue open for support of numpy fancy indexing? This is something that would be useful for my work and I could contribute to the effort beginning in July.

@shoyer
Copy link
Member

shoyer commented May 31, 2015

The indexing details link you show does explain the behavior better but it is somewhat contradictory to the quote I listed above. I think it would be worth reconciling the docs on that point.

I'll definitely add a note to qualify that description in the docs -- sorry you went to all the trouble of writing up the bug report!

The station selection use case you bring up is exactly what I was going for. It would be nice if we could stay in xray / pandas after the indexing. Ideally, I don't want to do: xr_data.values[:, ys, xs] and then recreate a DataArray.

The open issue for this is #214, which describes how I currently do "diagonal" style indexing to extract stations. It preserves all the metadata, but is a little slow if you have a very long list of stations:.

Note that the example in that first comment will work even with arrays that don't fit into memory if you use dask (which will be in the next xray release, which will be out today or tomorrow).

Lastly, is there an issue open for support of numpy fancy indexing?

Full NumPy style fancy indexing is out of scope for xray. You can simply do way too many complex things with it for which preserving the metadata is impossible (e.g., you can scramble the elements of a 2D array into any arbitrary desired positions). Moreover, the underlying array indexing operations are only possible to do efficiently if the underlying array is backed by NumPy -- there's no way we'll do that with netCDF4 or dask.array. But more limited cases, such as the "diagonal" or intersection style array indexing you want here are definitely possible to support, and help for that would certainly be appreciated.

@shoyer
Copy link
Member

shoyer commented May 31, 2015

Added a warning in #412. Here's what that doc page looks like now:

screen shot 2015-05-31 at 4 06 40 pm

@jhamman
Copy link
Member Author

jhamman commented Jun 1, 2015

Thanks @shoyer . I think #412 suffices for this issue.

I will touch base regarding some point-wise/intersection/diagonal indexing (#214) when I return in July.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants