Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Supplying a dataset to the dataset constructor #2330

Closed
wants to merge 8 commits into from

Conversation

max-sixty
Copy link
Collaborator

@max-sixty max-sixty commented Jul 31, 2018

  • Closes Warning on supplying a Dataset to the Dataset constructor #2056

  • Tests added (for all bug fixes or enhancements)

  • Tests passed (for all non-documentation changes)

  • Fully documented, including whats-new.rst for all changes and api.rst for new API

  • Do we want to raise if coords / attrs are passed in addition to a Dataset? Do we want to warn for a couple versions first?

  • Is this the best way to implement this? Do we want a consistent method to "make this object a shallow copy of the object that's been passed in"?

@max-sixty max-sixty force-pushed the dataset-to-dataset branch from 0d29ea9 to 3fee598 Compare August 1, 2018 15:34
@max-sixty
Copy link
Collaborator Author

Do we want to raise if coords / attrs are passed in addition to a Dataset? Do we want to warn for a couple versions first?

An alternative approach - more friendly but less explicit - is for any other argument (coords, attrs) supplied to overwrite the Dataset's

@shoyer
Copy link
Member

shoyer commented Aug 2, 2018

An alternative approach - more friendly but less explicit - is for any other argument (coords, attrs) supplied to overwrite the Dataset's

I think this would be slightly preferred, and also more consistent with the model of a Dataset as only iterating over data_vars. This suggests that at the very least, Dataset(ds) and Dataset(ds.data_vars) should be consistent after we finish the changes described in #884. Note that ds.data_vars will pull in coordinates associated with the DataArray objects.

To be more precise, I would probably make:

ds2 = Dataset(ds1, coords, attrs)

equivalent to

ds2 = ds1.copy()
ds2.coords.update(coords)
ds2.attrs.update(attrs)

@max-sixty
Copy link
Collaborator Author

OK good!

And, to confirm, you think it's OK that :

xr.Dataset(ds, coords=some_coords).coords != some_coords

@max-sixty
Copy link
Collaborator Author

And, to confirm, you think it's OK that :

Moving ahead with this, please let me know any objections from anyone

@fujiisoup
Copy link
Member

xr.Dataset(ds, coords=some_coords).coords != some_coords

I thought that xr.Dataset(ds, coords=some_coords).coords == some_coords.
I think Dataset should be regarded as a dict of dataarrays (data_vars), and thus
xr.Dataset(ds, coords=some_coords) should be equivalent to xr.Dataset(ds.data_vars, coords=some_coords).

@shoyer
Copy link
Member

shoyer commented Aug 8, 2018

Now I'm having second thoughts. It does make me a little nervous to add special cases that distinguish between xr.Dataset(ds, coords=some_coords) and xr.Dataset(ds.data_vars, coords=some_coords).

Maybe my proposed solution above is too complex, and it would be better simply to default the coords and attrs argument to those arguments from a passed Dataset object. This is basically what we do for DataArray.

That said, I do think we should try to resist the notion that every class constructor must also work for coercion to that class. This tends to result in a slightly confused interface. A separate as_dataset() function (even in user code) could give exactly the desired result in only be a few lines of code.

@fmaussion
Copy link
Member

From #2056 and this discussion I don't really understand the use case for this constructor - maybe an example of a real use case that could be added to the doc would help?

@max-sixty
Copy link
Collaborator Author

@fmaussion good question

The main reason I think this is worthwhile is the change in behavior between 0.10.x and 0.11.0, when iterating over a Dataset will iterate only over its data_vars, and so will change Dataset(ds) from ds to Dataset(ds.data_vars).
I don't have a strong view on which option we choose, but I do think we should have a deliberate decision rather than making a breaking-change-by-coincidence.

To answer your question though, we hit this when inheriting from a Dataset: the object is initialized with a dataset, which is super-ed up to the Dataset constructor. (I realize this is narrow!). Alternatively, there is another repo that linked to the issue here

@max-sixty
Copy link
Collaborator Author

I think we should do something here, rather than change behavior by coincidence.

@shoyer do you want to make the decision?

I would vote +0.1 for the current version in the PR. V happy to be outweighed!

@max-sixty
Copy link
Collaborator Author

Any votes?

I think the current version dominates the current master (i.e. it doesn't disable any useful functionality, it adds some clarity). So I would vote to merge unless anyone has any alternative suggestions (which I will gladly implement)

Thanks team

@shoyer
Copy link
Member

shoyer commented Sep 18, 2018

Sorry for taking so long here.

I'm still not entirely comfortable with this change -- it adds some additional complexity to the Dataset constructor that seems unnecessary. There's no inherit reason why calling a constructor on an object of the same type needs to work. If anything, I would like to deprecate/remove this behavior on DataArray (e.g., in favor of using a more explicit call to DataArray.copy(data=new_data)).

To answer your question though, we hit this when inheriting from a Dataset: the object is initialized with a dataset, which is super-ed up to the Dataset constructor

If I understand this correctly, you've written something like:

class MyDataset(xarray.Dataset):
    def __init__(self, dataset):
        super().__init__(dataset)

Is there any reason why you couldn't just write:

class MyDataset(xarray.Dataset):
    def __init__(self, dataset):
        super().__init__(dataset.data_vars, dataset.coords, dataset.attrs)

I think we should do something here, rather than change behavior by coincidence.

I don't think it's quite a coincidence here -- we are changing the behavior of Dataset.__iter__/Dataset.keys(), and this is a logical consequence of interpreting a Dataset as a mapping of its data variables :).

@max-sixty
Copy link
Collaborator Author

Yes, I agree - given the reluctance on the principle around "calling a constructor on an object of the same type needs to work" - it's right not to make this change. Thanks for holding that principle.

I'll close this; let me know if there's any change around this I make in its place.

@max-sixty max-sixty closed this Sep 25, 2018
@max-sixty max-sixty deleted the dataset-to-dataset branch September 25, 2018 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants