Should indexing be possible on 1D coords, even if not dims? #934

Closed
max-sixty opened this issue Aug 2, 2016 · 6 comments

Comments

@max-sixty
Collaborator

max-sixty commented Aug 2, 2016

In [1]: arr = xr.DataArray(np.random.rand(4, 3),
   ...:                    [('time', pd.date_range('2000-01-01', periods=4)),
   ...:                     ('space', ['IA', 'IL', 'IN'])])

In [17]: arr.coords['space2'] = ('space', ['A','B','C'])

In [18]: arr
Out[18]: 
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.05187049,  0.04743067,  0.90329666],
       [ 0.59482538,  0.71014366,  0.86588207],
       [ 0.51893157,  0.49442107,  0.10697737],
       [ 0.16068189,  0.60756757,  0.31935279]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'
    space2   (space) |S1 'A' 'B' 'C'

Now try to select on the space2 coord:

In [19]: arr.sel(space2='A')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-eae5e4b64758> in <module>()
----> 1 arr.sel(space2='A')

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xarray/core/dataarray.pyc in sel(self, method, tolerance, **indexers)
    601         """
    602         return self.isel(**indexing.remap_label_indexers(
--> 603             self, indexers, method=method, tolerance=tolerance))
    604 
    605     def isel_points(self, dim='points', **indexers):

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xarray/core/dataarray.pyc in isel(self, **indexers)
    588         DataArray.sel
    589         """
--> 590         ds = self._to_temp_dataset().isel(**indexers)
    591         return self._from_temp_dataset(ds)
    592 

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/xarray/core/dataset.pyc in isel(self, **indexers)
    908         invalid = [k for k in indexers if k not in self.dims]
    909         if invalid:
--> 910             raise ValueError("dimensions %r do not exist" % invalid)
    911 
    912         # all indexers should be int, slice or np.ndarrays

ValueError: dimensions ['space2'] do not exist

Is there an easier way to do this? I couldn't think of anything...

CC @justinkuosixty

@fmaussion
Member

I tried to raise interest in this kind of indexing on the mailing list, without success so far:

https://groups.google.com/forum/#!topic/xarray/KTlG2snZabg

@fmaussion
Member

In your case:

arr.isel(space=(arr.space2=='A'))
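
For reference, a self-contained version of this workaround might look like the sketch below. It rebuilds the array from the original post and passes the mask as a plain NumPy boolean array (via `.values`), which is an assumption made here to keep the indexer type unambiguous; the variable names `mask` and `selected` are only illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild the example array from the original post.
arr = xr.DataArray(
    np.random.rand(4, 3),
    coords=[
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)
arr.coords["space2"] = ("space", ["A", "B", "C"])

# Compare the non-dimension coordinate against the label, then use the
# resulting boolean mask to index positionally along its dimension.
mask = (arr.space2 == "A").values
selected = arr.isel(space=mask)
```

Note that the `space` dimension is kept with length 1 rather than dropped, which differs from what a scalar `.sel` lookup would return.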

@shoyer
Member

shoyer commented Aug 2, 2016

Yes, this would be nice to support automatically.

Indexing requires constructing a hash table (in the form of a pandas.Index), which we currently cache on xarray.Coordinate variables. Coordinate is a Variable subclass used only for dimension coordinates (maybe we should rename it DimCoordinate or Coordinate1D).

The only material difference between Coordinate and Variable is that Coordinate caches values in the form of a pandas.Index, whereas Variable caches values in the form of a numpy array. This means that Coordinate is currently immutable (because Index is immutable), and that there are some subtle distinctions in how different types of data are stored due to Index vs ndarray differences (basically, keeping things as an index is more efficient for handling native pandas types like Period, but a little less efficient if you don't need indexing).

So there are a few approaches we could take here:

  1. Convert 1D coordinates that are not dimensions into a pandas.Index via .to_index() when indexing happens with .sel. This approach would be non-ideal, because we would need to recreate the hash table every time indexing happens.
  2. Switch all 1D coordinates (even non-dimensions) to use the Coordinate class. This would be the preferred approach, except it would be a breaking change because it would make them immutable.
  3. Cache the result of .to_index() on Variable objects, too, and invalidate it when they are changed with __setitem__. The downside is that it makes Variable a little more complex.
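
To make the trade-off between approaches 1 and 3 concrete, here is a rough sketch of the caching idea using only NumPy and pandas. `CachedIndexVariable` is a hypothetical stand-in written for this illustration, not xarray's actual `Variable` class, and its method names are assumptions.

```python
import numpy as np
import pandas as pd


class CachedIndexVariable:
    """Hypothetical 1D variable that lazily caches a pandas.Index."""

    def __init__(self, values):
        self._values = np.array(values)  # private, mutable copy of the data
        self._index = None               # built on first lookup

    def to_index(self):
        # Approach 1 would rebuild the index on every call; approach 3
        # keeps it until the data changes.
        if self._index is None:
            self._index = pd.Index(self._values)
        return self._index

    def __setitem__(self, key, value):
        self._values[key] = value
        self._index = None  # invalidate the cached index on mutation

    def get_loc(self, label):
        # Label -> integer position, using the cached index.
        return self.to_index().get_loc(label)


var = CachedIndexVariable(["A", "B", "C"])
position = var.get_loc("B")   # builds and caches the index
cached = var.to_index()       # reuses the cached index
var[1] = "Z"                  # mutation drops the cache
rebuilt = var.to_index()      # index is rebuilt on demand
```

The extra `_index` attribute and the invalidation in `__setitem__` are the added complexity that approach 3 refers to.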

@max-sixty
Collaborator Author

That's very clear @shoyer.

I know you've discussed in the past whether indexes are really that different from arrays (they are treated very differently in pandas, for example). To reiterate the above, the only real difference is that one is designed for lookups (and so uses a hash table), and the other is designed for data access (and so mutation is easier).

We try to never use mutation, but our data is not that big, so making a copy is generally OK. But that's probably not the main use case.

Another option (potentially 1b in your list) is to slice the array rather than select from an index, i.e. sugar over @fmaussion's solution above. Not as fast to do multiple times, but simple and probably as fast to do a single time.
Or to add that to the docs.
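
As a sketch of what such sugar could look like, the helper below wraps the boolean-mask approach. `sel_by_coord` is a hypothetical name invented for this illustration, not an xarray function, and it only handles exact matches on 1D coordinates.

```python
import numpy as np
import pandas as pd
import xarray as xr


def sel_by_coord(obj, **labels):
    """Hypothetical helper: select by label on any 1D coordinate."""
    for name, label in labels.items():
        coord = obj.coords[name]
        if coord.ndim != 1:
            raise ValueError("coordinate %r is not one-dimensional" % name)
        (dim,) = coord.dims
        # Linear scan instead of a hash lookup: simple, but repeated on
        # every call.
        mask = coord.values == label
        obj = obj.isel(**{dim: mask})
    return obj


arr = xr.DataArray(
    np.random.rand(4, 3),
    coords=[
        ("time", pd.date_range("2000-01-01", periods=4)),
        ("space", ["IA", "IL", "IN"]),
    ],
)
arr.coords["space2"] = ("space", ["A", "B", "C"])

result = sel_by_coord(arr, space2="B")
```

Each call scans the whole coordinate, so this trades lookup speed for not having to build or cache an index.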

@stale

stale bot commented Jan 27, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.
If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically.

@stale stale bot added the stale label Jan 27, 2019
@shoyer
Member

shoyer commented Jan 27, 2019

This will be part of the explicit indexes refactor (#1603)
