Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iterating over a Dataset iterates only over its data_vars #884

Closed
max-sixty opened this issue Jun 15, 2016 · 11 comments · Fixed by #2506
Closed

Iterating over a Dataset iterates only over its data_vars #884

max-sixty opened this issue Jun 15, 2016 · 11 comments · Fixed by #2506
Milestone

Comments

@max-sixty
Copy link
Collaborator

This has been a small-but-persistent issue for me for a while. I suspect that my perspective might be dependent on my current outlook, but socializing it here to test if it's secular...

Currently Dataset.keys() returns both variables and coordinates (but not its attrs keys):

In [5]: ds=xr.Dataset({'a': (('x', 'y'), np.random.rand(10,2))})
In [12]: list(ds.keys())
Out[12]: ['a', 'x', 'y']

Is this conceptually correct? I would posit that a Dataset is a mapping of keys to variables, and the coordinates contain values that label that data.

So should Dataset.keys() instead return just the keys of the Variables?

We're often passing around a dataset as a Mapping of keys to values - but then when we run a function across each of the keys, we get something run on both the Variables' keys, and the Coordinate / label's keys.

In Pandas, DataFrame.keys() returns just the columns, so that conforms to what we need. While I think the xarray design is in general much better in these areas, this is one area that pandas seems to get correct - and because of the inconsistency between pandas & xarray, we're having to coerce our objects to pandas DataFrames before passing them off to functions that pull out their keys (this is also why we can't just look at ds.data_vars.keys() - because it breaks that duck-typing).

Does that make sense?

@shoyer
Copy link
Member

shoyer commented Jun 15, 2016

An early version of xarray actually worked exactly like this, but we switched it to the current functionality. Let me see if I can dig up the relevant issues....

@shoyer
Copy link
Member

shoyer commented Jun 15, 2016

Here is one place where I discussed this with myself!
#211

Then I merged the functionality you are describing, but at some point switched it back....

@fmaussion
Copy link
Member

What would happen to the actual variables? Would ds['longitude'] also return a KeyError? This would be very far from the NetCDF model and would brake many many things....

@max-sixty
Copy link
Collaborator Author

Then I merged the functionality you are describing, but at some point switched it back....

Ha!

What are your thoughts now?

@shoyer
Copy link
Member

shoyer commented Aug 5, 2016

What would happen to the actual variables? Would ds['longitude'] also return a KeyError? This would be very far from the NetCDF model and would brake many many things....

There is no way we would change this -- ds['longitude'] will always return the longitude array.

What are your thoughts now?

I like this change for the reasons you outline above, and those I mentioned in the previous issue. I'm somewhat concerned about breaking a lot of user code, and also am not sure yet what the right solution here looks, because of concerns about breaking duck-type compatibility with dictionaries.

Which of these should no longer include coordinates? Can we make things consistent without changing all of them?

  • iter(ds) (currently we actually define Dataset.keys() implicitly via __iter__)
  • 'x' in ds
  • del ds['x']
  • ds['x'] (this would be quite a compatibility break, as @fmaussion notes above)

We would also need to encourage users to use the low level Dataset.variables property to check whether something is either a coordinate or data variables.

@max-sixty
Copy link
Collaborator Author

From those, the latter three are trivial to keep.
The first (iter(ds)) would potentially break some contracts with the Mapping abc

@shoyer
Copy link
Member

shoyer commented Oct 22, 2017

I'm pretty sure now that we should (at least) change iter(ds) and ds.keys() to only include data variables. This is a repeated source of annoyance.

For the most recent example, consider the API for argmin() suggested by @fujiisoup in #1388 (comment), where argmin() would return a Dataset. We'd like make arr.isel(**arr.argmin(dim)) work, but this requires no additional members (beyond data variables) in arr.argmin(dim).keys().

The question is how to do it: is it worth a deprecation cycle? My proposal:

  • In v0.10, start issuing FutureWarning when calling ds.__iter__. Suggest that users switch to iterating over .variables instead if they really want everything. (Many cases where every data variable and coordinate is being iterated over should probably already be using the low-level API.)
  • In v0.11, switch to the new behavior.

Note: __len__ should also be changed in lock-step, to ensure the invariant len(list(ds)) == len(ds).

@shoyer
Copy link
Member

shoyer commented Oct 22, 2017

Another question is what to do with x in ds (i.e.,Dataset.__contains__). Currently it checks data variables and coordinates, but not dimensions (as @fujiisoup pointed out in #1632 (comment)), which already feels somewhat inconsistent.

If we were starting over, I might change how __contains__ works for Dataset to only include data variables, but I don't think it's worth it at this time:

  • Unlike the current version of __iter__, __contains__ is actually useful in its current state and I suspect is widely used. Issuing a deprecation warning for 'variable' in ds would annoy lots of users, for no particularly good reason. In many cases (checking data variables), the behavior would not actually change.
  • There is also the expectation that k in ds is equivalent to ds[k] for mapping types, which is currently mostly true for xarray.Dataset.
  • The main advantage I see to changing the behavior of k in ds is that it would remove most of the remaining use cases for Dataset.data_vars, which would let us eventually remove data_vars. But given that ds[k] for coordinates k will continue to be supported, I think I would still recommend explicitly writing k in ds.data_vars for cases where the data/coordinates distinction matters.

Related: over in #1645, I am deprecating the current behavior of DataArray.__contains__, so that we can make it check array values instead of DataArray.coords in the future.

If we aren't going to fundamentally change the behavior of k in ds to exclude coordinates, then we should probably update it, at @fujiisoup suggests, to also include MultiIndex levels and dimension names (i.e., the stuff checked in xarray.core.dataset._get_virtual_variable()).

@fujiisoup
Copy link
Member

Do we need to change the behavior of dict(dataset) so that dict(dataset).keys() and dataset.keys() become consistent?

@shoyer
Copy link
Member

shoyer commented May 25, 2018

Do we need to change the behavior of dict(dataset) so that dict(dataset).keys() and dataset.keys() become consistent?

No, I think these are guaranteed to be consistent because we inherit from collections.Mapping to implement dict methods like keys(), values() and items() (via __iter__ and __getitem__).

@fujiisoup
Copy link
Member

So, the behavior of dict(dataset) will change if we changed the behavior of __iter__.
Can we issue a warning if dict(dataset) is called (or is it impossible)?

In #2162, you changed

for k, v in dataset.items():

to

dataset = OrderedDict(dataset)
for k, v in dataset.items():

to avoid the warning, but I am afraid it will cause the unexpected behavior
when we stop supporting iteration over coordinates.

@max-sixty max-sixty changed the title Q: Should Dataset.keys() return only variable keys? Iterating over a Dataset iterates only over its data_vars Aug 2, 2018
@shoyer shoyer mentioned this issue Oct 24, 2018
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants