Iterating over a Dataset iterates only over its data_vars #884

max-sixty · 2016-06-15T19:35:50Z

This has been a small-but-persistent issue for me for a while. I suspect that my perspective might be dependent on my current outlook, but socializing it here to test if it's secular...

Currently Dataset.keys() returns both variables and coordinates (but not its attrs keys):

In [5]: ds=xr.Dataset({'a': (('x', 'y'), np.random.rand(10,2))})
In [12]: list(ds.keys())
Out[12]: ['a', 'x', 'y']

Is this conceptually correct? I would posit that a Dataset is a mapping of keys to variables, and the coordinates contain values that label that data.

So should Dataset.keys() instead return just the keys of the Variables?

We're often passing around a dataset as a Mapping of keys to values - but then when we run a function across each of the keys, we get something run on both the Variables' keys, and the Coordinate / label's keys.

In Pandas, DataFrame.keys() returns just the columns, so that conforms to what we need. While I think the xarray design is in general much better in these areas, this is one area that pandas seems to get correct - and because of the inconsistency between pandas & xarray, we're having to coerce our objects to pandas DataFrames before passing them off to functions that pull out their keys (this is also why we can't just look at ds.data_vars.keys() - because it breaks that duck-typing).
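A library-free sketch of the problem described above. `ToyDataset` and `describe_columns` are hypothetical names made up for illustration, not xarray or pandas API; the toy class only mimics the behavior under discussion, where iteration yields data variables and coordinates alike.

```python
from collections.abc import Mapping


class ToyDataset(Mapping):
    """Hypothetical stand-in for a Dataset; not xarray's implementation."""

    def __init__(self, data_vars, coords):
        self.data_vars = dict(data_vars)
        self.coords = dict(coords)

    def __getitem__(self, key):
        if key in self.data_vars:
            return self.data_vars[key]
        return self.coords[key]

    def __iter__(self):
        # Mimics the behavior being questioned: data variables first,
        # then coordinates.
        yield from self.data_vars
        yield from self.coords

    def __len__(self):
        return len(self.data_vars) + len(self.coords)


def describe_columns(mapping):
    """Generic helper that treats every key of a mapping as a column."""
    return {key: len(mapping[key]) for key in mapping}


ds = ToyDataset(
    data_vars={"a": [0.1, 0.2, 0.3]},
    coords={"x": [10, 20, 30]},
)

summary = describe_columns(ds)
# The coordinate "x" is processed as if it were data.
print(summary)
```

A helper written against the generic mapping protocol has no way to tell that `"x"` is a label rather than data, which is why the caller ends up converting to another type first.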

Does that make sense?

shoyer · 2016-06-15T21:40:53Z

An early version of xarray actually worked exactly like this, but we switched it to the current functionality. Let me see if I can dig up the relevant issues....

shoyer · 2016-06-15T21:44:13Z

Here is one place where I discussed this with myself!
#211

Then I merged the functionality you are describing, but at some point switched it back....

fmaussion · 2016-06-15T21:58:06Z

What would happen to the actual variables? Would ds['longitude'] also return a KeyError? This would be very far from the NetCDF model and would break many many things....

max-sixty · 2016-06-16T01:34:43Z

> Then I merged the functionality you are describing, but at some point switched it back....

Ha!

What are your thoughts now?

shoyer · 2016-08-05T08:06:32Z

> What would happen to the actual variables? Would ds['longitude'] also return a KeyError? This would be very far from the NetCDF model and would break many many things....

There is no way we would change this -- ds['longitude'] will always return the longitude array.

> What are your thoughts now?

I like this change for the reasons you outline above, and those I mentioned in the previous issue. I'm somewhat concerned about breaking a lot of user code, and I'm also not sure yet what the right solution here looks like, because of concerns about breaking duck-type compatibility with dictionaries.

Which of these should no longer include coordinates? Can we make things consistent without changing all of them?

- iter(ds) (currently we actually define Dataset.keys() implicitly via __iter__)
- 'x' in ds
- del ds['x']
- ds['x'] (this would be quite a compatibility break, as @fmaussion notes above)

We would also need to encourage users to use the low-level Dataset.variables property to check whether something is either a coordinate or a data variable.
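One way to picture the split being weighed here is the sketch below, in which iteration and length cover data variables only while lookup, membership and deletion still reach coordinates. `SplitDataset` is a hypothetical toy class, and its `variables` property only imitates the role of the low-level property mentioned above; none of this is xarray's implementation.

```python
from collections.abc import Mapping


class SplitDataset(Mapping):
    """Hypothetical sketch: iteration sees data variables, lookups see all."""

    def __init__(self, data_vars, coords):
        self.data_vars = dict(data_vars)
        self.coords = dict(coords)

    @property
    def variables(self):
        # Low-level view holding both data variables and coordinates.
        return {**self.data_vars, **self.coords}

    def __getitem__(self, key):
        # ds['x'] keeps working for coordinates.
        return self.variables[key]

    def __contains__(self, key):
        # 'x' in ds keeps working for coordinates.
        return key in self.data_vars or key in self.coords

    def __delitem__(self, key):
        # del ds['x'] keeps working for coordinates.
        if key in self.data_vars:
            del self.data_vars[key]
        else:
            del self.coords[key]

    def __iter__(self):
        # Only data variables are yielded, so keys() excludes coordinates.
        return iter(self.data_vars)

    def __len__(self):
        return len(self.data_vars)


ds = SplitDataset({"a": [1, 2]}, {"x": [10, 20]})
print(list(ds))              # data variables only
print("x" in ds, ds["x"])    # coordinates remain reachable by name
print(sorted(ds.variables))  # everything, via the low-level view
del ds["x"]
print("x" in ds)
```

In this sketch `"x" in ds` is true even though `"x"` never appears in `list(ds)`, a mismatch that ordinary dictionaries never exhibit.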

max-sixty · 2016-08-05T14:04:22Z

Of those, the latter three are trivial to keep.
The first (iter(ds)) would potentially break some contracts with the Mapping ABC.

shoyer · 2017-10-22T02:06:27Z

I'm pretty sure now that we should (at least) change iter(ds) and ds.keys() to only include data variables. This is a repeated source of annoyance.

For the most recent example, consider the API for argmin() suggested by @fujiisoup in #1388 (comment), where argmin() would return a Dataset. We'd like to make arr.isel(**arr.argmin(dim)) work, but this requires no additional members (beyond data variables) in arr.argmin(dim).keys().
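The reason `**` unpacking is sensitive to this is that it reads the mapping's keys() and then looks each key up. The sketch below uses hypothetical stand-ins (`ToyResult`, `toy_isel`) rather than xarray objects, and assumes the result carries one extra non-dimension member, here called `label`.

```python
from collections.abc import Mapping


class ToyResult(Mapping):
    """Hypothetical result object; `extra` plays the role of coordinates."""

    def __init__(self, data_vars, extra, iterate_extra):
        self._data_vars = dict(data_vars)
        self._extra = dict(extra)
        self._iterate_extra = iterate_extra

    def __getitem__(self, key):
        if key in self._data_vars:
            return self._data_vars[key]
        return self._extra[key]

    def __iter__(self):
        yield from self._data_vars
        if self._iterate_extra:
            yield from self._extra

    def __len__(self):
        extra = len(self._extra) if self._iterate_extra else 0
        return len(self._data_vars) + extra


def toy_isel(dims, **indexers):
    """Hypothetical indexer that rejects keys that are not dimension names."""
    unknown = [name for name in indexers if name not in dims]
    if unknown:
        raise ValueError(f"not dimensions: {unknown}")
    return indexers


dims = ("x", "y")
data_only = ToyResult({"x": 3, "y": 1}, {"label": "min"}, iterate_extra=False)
with_extra = ToyResult({"x": 3, "y": 1}, {"label": "min"}, iterate_extra=True)

# Works: only the dimension-named members are unpacked.
print(toy_isel(dims, **data_only))

# Fails: the extra member is unpacked as a keyword argument too.
try:
    toy_isel(dims, **with_extra)
except ValueError as err:
    print("error:", err)
```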

The question is how to do it: is it worth a deprecation cycle? My proposal:

1. In v0.10, start issuing FutureWarning when calling ds.__iter__. Suggest that users switch to iterating over .variables instead if they really want everything. (Many cases where every data variable and coordinate is being iterated over should probably already be using the low-level API.)
2. In v0.11, switch to the new behavior.

Note: __len__ should also be changed in lock-step, to ensure the invariant len(list(ds)) == len(ds).
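A minimal sketch of what the warning step could look like, assuming a toy class (`DeprecatingDataset`, hypothetical) and illustrative warning text; it is not the code or message that xarray shipped. Both `__iter__` and `__len__` warn and both still count coordinates, so the invariant above holds during the deprecation period.

```python
import warnings
from collections.abc import Mapping


class DeprecatingDataset(Mapping):
    """Hypothetical sketch of the deprecation step; not xarray's code."""

    def __init__(self, data_vars, coords):
        self.data_vars = dict(data_vars)
        self.coords = dict(coords)

    @property
    def variables(self):
        # Low-level view of everything; using it never warns.
        return {**self.data_vars, **self.coords}

    def _warn_about_change(self):
        # Illustrative wording only.
        warnings.warn(
            "iteration and len() will only cover data variables in a "
            "future version; use .variables to include coordinates",
            FutureWarning,
            stacklevel=3,
        )

    def __getitem__(self, key):
        return self.variables[key]

    def __iter__(self):
        self._warn_about_change()
        return iter(self.variables)

    def __len__(self):
        # Changed in lock-step with __iter__.
        self._warn_about_change()
        return len(self.variables)


ds = DeprecatingDataset({"a": [1, 2]}, {"x": [10, 20]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    names = list(ds)                 # old behavior, but warns
    size = len(ds)                   # consistent with iteration
    everything = list(ds.variables)  # forward-compatible spelling

print(names, size, everything)
print({w.category.__name__ for w in caught})
```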

shoyer · 2017-10-22T02:36:39Z

Another question is what to do with x in ds (i.e., Dataset.__contains__). Currently it checks data variables and coordinates, but not dimensions (as @fujiisoup pointed out in #1632 (comment)), which already feels somewhat inconsistent.

If we were starting over, I might change how __contains__ works for Dataset to only include data variables, but I don't think it's worth it at this time:

- Unlike the current version of __iter__, __contains__ is actually useful in its current state and I suspect is widely used. Issuing a deprecation warning for 'variable' in ds would annoy lots of users, for no particularly good reason. In many cases (checking data variables), the behavior would not actually change.
- There is also the expectation that k in ds is equivalent to ds[k] for mapping types, which is currently mostly true for xarray.Dataset.
- The main advantage I see to changing the behavior of k in ds is that it would remove most of the remaining use cases for Dataset.data_vars, which would let us eventually remove data_vars. But given that ds[k] for coordinates k will continue to be supported, I think I would still recommend explicitly writing k in ds.data_vars for cases where the data/coordinates distinction matters.

Related: over in #1645, I am deprecating the current behavior of DataArray.__contains__, so that we can make it check array values instead of DataArray.coords in the future.

If we aren't going to fundamentally change the behavior of k in ds to exclude coordinates, then we should probably update it, as @fujiisoup suggests, to also include MultiIndex levels and dimension names (i.e., the stuff checked in xarray.core.dataset._get_virtual_variable()).

fujiisoup · 2018-05-25T05:15:41Z

Do we need to change the behavior of dict(dataset) so that dict(dataset).keys() and dataset.keys() become consistent?

shoyer · 2018-05-25T06:09:07Z

> Do we need to change the behavior of dict(dataset) so that dict(dataset).keys() and dataset.keys() become consistent?

No, I think these are guaranteed to be consistent because we inherit from collections.Mapping to implement dict methods like keys(), values() and items() (via __iter__ and __getitem__).
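The guarantee can be checked with a minimal sketch. `MinimalMapping` is a hypothetical class that defines only the three abstract methods; in current Python the base class is spelled `collections.abc.Mapping`.

```python
from collections.abc import Mapping


class MinimalMapping(Mapping):
    """Hypothetical class defining only the three abstract methods."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


m = MinimalMapping({"a": 1, "b": 2})

# keys(), values() and items() are mixin methods built on
# __iter__ and __getitem__; none of them is defined above.
print(list(m.keys()))
print(list(m.values()))
print(list(m.items()))

# dict() consumes the same protocol, so its keys match m.keys().
print(list(dict(m).keys()) == list(m.keys()))
```

Because everything is derived from `__iter__` and `__getitem__`, changing what `__iter__` yields changes keys(), values(), items() and dict() together.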

fujiisoup · 2018-05-25T07:17:33Z

So, the behavior of dict(dataset) will change if we change the behavior of __iter__.
Can we issue a warning if dict(dataset) is called (or is it impossible)?

In #2162, you changed

for k, v in dataset.items():

to

dataset = OrderedDict(dataset)
for k, v in dataset.items():

to avoid the warning, but I am afraid it will cause unexpected behavior when we stop supporting iteration over coordinates.

shoyer mentioned this issue Oct 22, 2017

Support autocompletion dictionary access in ipython. #1632

Merged

4 tasks

shoyer added the API design label Oct 22, 2017

shoyer added this to the 0.10 milestone Oct 22, 2017

shoyer mentioned this issue Oct 25, 2017

Add a FutureWarning to Dataset.__iter__ and Dataset.__len__ #1658

Merged

4 tasks

jhamman modified the milestones: 0.10, 0.11 Nov 20, 2017

shoyer mentioned this issue Feb 4, 2018

argmin / argmax behavior doesn't match documentation #1388

Closed

fujiisoup mentioned this issue May 29, 2018

Test suite: explicitly ignore irrelevant warnings #2162

Merged

3 tasks

shoyer mentioned this issue Aug 2, 2018

Supplying a dataset to the dataset constructor #2330

Closed

4 tasks

max-sixty changed the title ~~Q: Should Dataset.keys() return only variable keys?~~ Iterating over a Dataset iterates only over its data_vars Aug 2, 2018

shoyer mentioned this issue Oct 24, 2018

xarray 0.11 release #2505

Closed

5 tasks

max-sixty mentioned this issue Oct 24, 2018

Iterate over data_vars only #2506

Merged

3 tasks

max-sixty closed this as completed in #2506 Oct 25, 2018

spencerkclark mentioned this issue May 10, 2023

Prevent unsafe concurrent coordinate writes spencerkclark/xpartition#17

Merged
