Skip to content
This repository has been archived by the owner on Oct 24, 2024. It is now read-only.

Store variables in DataTree instead of storing Dataset #2

Closed
TomNicholas opened this issue Aug 18, 2021 · 20 comments · Fixed by #41
Closed

Store variables in DataTree instead of storing Dataset #2

TomNicholas opened this issue Aug 18, 2021 · 20 comments · Fixed by #41

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Aug 18, 2021

Should we prefer inheritance or composition when making the node of a datatree behave like an xarray Dataset?

Inheritance

We really want the data-containing nodes of the datatree to behave as much like xarray datasets as possible, as we will likely be calling functions/methods on them, assigning them, extracting from them and saving them as if they were actually xarray.Dataset objects. We could imagine a tree node class which directly inherits from xarray.Dataset:

class DatasetNode(xarray.Dataset, NodeMixin):
    ...

This would have all the attributes and API of a Dataset, and pass isinstance() checks, but also the attributes and methods needed to function as a node in a tree (e.g. .children, .parent). We would still need to decorate most inherited methods in order to apply them to all child nodes in the tree though.

Mostly these don't collide, except in the important case of getting/setting children of a node. xarray.Datasets already use up __getitem__ for variable selection (i.e. ds[var]) as well as the .some_variable namespace via property-like access. This means we can't immediately have an API allowing operations like dt.weather = dt.weather.mean('time') because .weather is a child of the node, not a dataset variable. (It's possible we could have both behaviours simultaneously by overwriting __getitem__, but then we might restrict the possible names of children/variables.)

I think this approach would also have the side-effect that accessor methods registered with @register_dataset_accessor would also be callable on the tree nodes.

Composition

The alternative is instead of each node being a Dataset, each node merely wraps a Dataset. This has the advantage of keeping the Data class and the Node class separate, though they would still share a large API to allow applying a method (e.g. .mean() to all child nodes in a tree.

The disadvantage is that then all the variables and dataset attributes are behind a .ds property.

This type of syntax dt.weather = dt.weather.mean('time') would then be possible (at least if we didn't allow the tree objects to have their own .attrs, else it would have to be dt['weather'] = dt['weather'].mean('time')) because we would be calling the method of a DatasetNode (rather than Dataset) and then assigning to a DatasetNode.

Selecting a particular variable from a dataset stored at a particular node would then look like dt['weather'].ds['pressure'], which has the advantage of clarifying which one is the variable, but the disadvantage of breaking up the path-like structure to get from the root down to the variable. EDIT: As there is no problem with collisions between names of groups and variables, we can actually just override __getitem__ to check in both the data variables and the children, so we can have access like dt['weather']['pressure'].


(There is also a possible third option described in #4)


For now the second approach seemed better, but I'm looking for other opinions!

@TomNicholas
Copy link
Member Author

TomNicholas commented Aug 20, 2021

I think this really comes down to two questions:

  1. Do we want isinstance(DataTree, Dataset) to return True?

  2. Do we want to automatically inherit all Dataset methods, or only wrap the methods we have specifically chosen?

Pros of automatic inheritance:

  • We would not have to write out the entire massive API of xarray.Dataset again within the definition of DataTree.
  • Methods which rely on other Dataset methods to work should then automatically work, for example lots of methods internally rely on Dataset._construct_direct() to rebuild the result before returning. These don't need to be wrapped with map_over_subtree, so inheriting them undecorated is what we want.
  • We also get all property methods defined immediately.

Pros of manual wrapping:

  • There won't be any unexpected behaviour resulting from inheriting a method we hadn't thought about. Any method we didn't explicitly consider would not be defined on DataTree.
  • We still have to decorate most of the Dataset methods with @map_over_subtree, and this is easier to control without mistakes if they are wrapped rather than inherited.
  • We have a separate namespace for Dataset methods vs DataTree methods. This is helpful if we want to make identically-named methods on both classes easily available to the user. For example DataTree.merge(another_dt) and Dataset.merge(another_ds) clearly need different implementations, but would collide if both defined on DataTree. Calling the latter via DataTree.ds.merge(another_ds) would allow both to be accessible from the DataTree interface.

@TomNicholas
Copy link
Member Author

Another, probably much better, approach to inheritance would be to make benign changes to xarray.Dataset's inheritance pattern upstream that make it easier to pick and choose the correct methods to inherit on DataTree. For example, we currently have to awkwardly create our own DatasetPropertiesMixin by wrapping properties in Dataset, but we could instead factor out those property definitions in xarray into a xarray.core.Dataset.DatasetPropertiesMixin class, then inherit something like this:

xarray.core.dataset.py:

class DatasetPropertiesMixin:
    @property
    def dims(self):
        ...  # etc.

class DatasetMethodsMixin:
    def isel(self, indexers, ...):
        ...  # etc.

class Dataset(DatasetPropertiesMixin, DatasetMethodsMixin, DataWithCoords, DatasetArithmetic):
    ...

datatree.datatree.py:

from xarray.core.dataset import DatasetPropertiesMixin, DatasetMethodsMixin

DataTreeMethodsMixin = map_all_methods_over_subtree(DatasetMethodsMixin)
DataTreeArithmetic = map_all_methods_over_subtree(DatasetArithmetic)

class Dataset(DatasetPropertiesMixin, DataTreeMethodsMixin, DataWithCoords, DataTreeArithmetic):
    ...

(where that example also has another mixin class for mapped Dataset methods too.)

This would not make isinstance(DataTree, Dataset) return True, but it would probably be the shortest and least error-prone way to get all the right methods defined on all the right classes.

@TomNicholas
Copy link
Member Author

TomNicholas commented Aug 26, 2021

@shoyer or @max-sixty, I would appreciate either of your opinions on the above fundamental design question for DataTree, because this is practically the most core question, and at the moment I'm just talking to myself 😅

@TomNicholas
Copy link
Member Author

This also came up in #26, in the context of whether an empty netCDF group is better represented by an empty Dataset or None (a question also related to #4.)

@max-sixty
Copy link

this is practically the most core question, and at the moment I'm just talking to myself

eventually software ends up at philosophy...


I do think it's an interesting question. I'm really not sure where we'll end up.

Without having thought as deeply as you, I would probably start with composing them. Basically for the reasons you describe — inheritance a) couples the classes together and b) stepping back, it relies on a known ontology — so mixins like DatasetMethodsMixin & DatasetPropertiesMixin might seem like a good structure now — but then for another case we're going to need a slightly different structure, and the number of Mixins propagate until they're no longer useful groupings.

Having everything behind a .ds property does seem like a worse API, but it might be a reasonable way of getting some cool examples going (I already see some good ones!). Or to reduce the amount of boilerplate to get started, you could intercept methods with __getattribute__ and decide where to route them?

And this doesn't preclude a Dataset becoming a DataTree in the future — it's much easier to go composition -> inheritance (or even integration) than the other way around.

To confirm, I haven't thought about this sufficiently, so please weigh my view appropriately!

@shoyer
Copy link

shoyer commented Aug 26, 2021

I would also start with composition. It's much more verbose, but otherwise you'll never be quite sure when DataTree methods will work properly. It should not be too painful to use a DataTree.dataset property for any missing methods.

Making issubclass(DataTree, Dataset) true is also problematic, unless a DataTree is in fact 100% substitutable for a Dataset. Even if this can be made true (which I doubt), most of us have a hard time reasoning about inheritance.

Reusing Xarray's mixins would be reasonable, but only if/when DataTree lives inside the Xarray codebase. Otherwise it would be too easy to introduce errors.

@TomNicholas
Copy link
Member Author

Thanks both for your input.

I did in fact start with composition, so I'll keep it that way for now.

Reusing Xarray's mixins would be reasonable, but only if/when DataTree lives inside the Xarray codebase. Otherwise it would be too easy to introduce errors.

I'm not sure if this is quite what you meant, but I think what I'll do for now is define my own mixins internally, but make sure none of them actually inherit from any of xarray's mixins. That way I can keep the internal organisational difference between "methods on Dataset that need wrapping", "methods on Dataset that need wrapping and mapping over child nodes", "properties on Dataset that need wrapping" etc., but without actually depending on any private xarray API.

For example, this mixin for Datatree to inherit from does not itself inherit from (or even reference) any of xarray's mixins:

class MappedDatasetMethodsMixin:
    """
    Mixin to add Dataset methods like .mean(), but wrapped to map over all nodes in the subtree.
    """

    __slots__ = ()
    _wrap_then_attach_to_cls(
        cls_dict=vars(), 
        copy_methods_from=Dataset, 
        methods_to_copy=_ALL_DATASET_METHODS_TO_MAP, 
        wrap_func=map_over_subtree,
    )


class DataTree(
    TreeNode,
    DatasetPropertiesMixin,
    MappedDatasetMethodsMixin,
    MappedDataWithCoordsMixin,
    DataTreeArithmeticMixin,
):

@jhamman
Copy link

jhamman commented Sep 21, 2021

It seems that the current discussion may be hedging toward inheritance over composition to address issues like name collisions (#38) and mutability. However, it may be worth considering the possibilities of a third option that doesn't use the Dataset as a container at the node level. The concept would basically define each node as either a DataArray or a TreeNode (subtree). TreeNodes would be responsible for insuring alignment, coordinating ops among children, and would be directly convertible to Datasets with 0 or more variables.

This would solve the name collisions problem but would require much more logic here in datatree. Its possible this would be much harder than a direct inheritance implementation but I thought I would through the idea out there.

@shoyer
Copy link

shoyer commented Sep 21, 2021

The concept would basically define each node as either a DataArray or a TreeNode (subtree). TreeNodes would be responsible for insuring alignment, coordinating ops among children, and would be directly convertible to Datasets with 0 or more variables.

I like this idea a lot, but let me suggest a slight variation: store Variable rather than DataArray objects in the dict.

The core data model for xarray.Dataset includes:

  1. attrs: dict of attributes
  2. variables: dict of xarray.Variable
  3. coord_names: set of coordinate names
  4. dims: dict of dimension names to sizes
  5. indexes: dict from dimension name to xarray.Index objects (currently optional and can be left as None, but will be required after the explicit indexes refactor).

My suggestion would be to make DataTree quite similar, except that variables can also hold DataTree objects.

To implement methods on DataTree, you would filter DataTree objects out of variables, and then convert the remainder directly into an xarray.Dataset.

@TomNicholas
Copy link
Member Author

TomNicholas commented Sep 21, 2021

Thanks @jhamman and @shoyer - I started typing this before Stephan's comment just now so I will finish the thought - sounds like we're on the same page though:

What you're suggesting is basically what I was thinking of as a last-resort solution to solve #38 . Basically Dataset is very roughly like a dict whose values are Variables, whereas you could imagine re-implementing DataTree as a similar dict whose values can be Variables but also can be more DataTree nodes. This would get around #38 because you would no longer have a "dict storing a dict", so to speak, instead treating variables and nodes at the same level.

You could implement that with inheritance, which sounds conceptually neat but Stephan is probably right that it would cause more problems than it would solve. The alternative as you say Joe is "integration": essentially writing something that reimplements all the high-level aligning behaviour of Dataset, but also allows for storing DataTrees. More effort but basically guaranteed to work in the intended manner. Would probably also mean more direct importing from/coupling with xarray's internals.

You would then have some kind of DataTree.to_dataset() method defined for when you really need the data to be stored in an instance of Dataset.

@TomNicholas
Copy link
Member Author

I would do this by having an API something like:

  • .items: can be either Variables or DataTrees
  • .variables: only the Variables
  • .nodes: only the DataTrees

@TomNicholas
Copy link
Member Author

TomNicholas commented Sep 21, 2021

This idea is also closely related to https://github.com/TomNicholas/datatree/issues/3, and would re-open the question of whether to expose only local dims etc. like was discussed in https://github.com/TomNicholas/datatree/issues/36.

@TomNicholas
Copy link
Member Author

If we did implement it like this I personally think there would then be a much stronger case for eventually integrating DataTree into xarray, rather than it just living in xarray-contrib. The current approach simply wraps (and sometimes emulates) Dataset, so it makes sense for it to live in a separate library that lies atop xarray, but this new suggestion is a Dataset-like object which shares a huge amount of functionality and probably internal code with xarray.Dataset. A lot of the functionality could probably be tested by the same tests that are currently used for xarray Dataset too... Not sure what your thoughts on that are Stephan.

@TomNicholas
Copy link
Member Author

(@alexamici you will probably find this whole discussion interesting, and I would like to hear if you do have any thoughts!)

@shoyer
Copy link

shoyer commented Sep 21, 2021 via email

@TomNicholas
Copy link
Member Author

One disadvantage of the "integration" approach is a lack of clear separation for users between operations that act on just that single node, and operations which map over all nodes in that subtree. For example whilst .coords should return only the coordinates stored in that node, what should assign_coords do? It could assign coords to that node alone, or attempt to assign the same coords to every node in the subtree. Sure, the method could have a new kwarg map_over_subtree, but it's not going to be intuitively obvious to the user whether the pattern is that that would default to True or False.

Compare to the current implementation, where to act on a single node's dataset you always have to pull out .ds, and every method on the DataTree object maps over the whole subtree. That is definitely clearer.

@shoyer
Copy link

shoyer commented Sep 22, 2021 via email

@TomNicholas
Copy link
Member Author

TomNicholas commented Sep 23, 2021

I think we could probably still make the interface using .ds work. This would require modifying xarray.Dataset so it can support a custom mapping object for storing variables, but that seems doable?

I'm sorry but I'm very much not following 😅 Could you elaborate?

What would the custom mapping object do? How would that help with the original name collision problem? If .ds is Dataset, even a modified one, how can it check what children are contained in the class which wraps it? Or are you suggesting some kind of FrozenDataset?

Or alternatively are you imagining that variables are still stored privately in the DataTree level, but only accessed via DataTree.ds? That makes more sense, but what would the .ds property return? An actual (mutable) Dataset? A frozen Dataset, with a mutable one only via .to_dataset()? Some kind of DatasetProxy? And would that require sending operations on .ds back up to DataTree, e.g. dt.ds.assign_coords() would look at the .children of dt before altering anything? That would almost be like having a .ds accessor on the DataTree object...

@TomNicholas
Copy link
Member Author

TomNicholas commented Sep 29, 2021

I asked Stephan about his suggestion in the xarray dev meeting just now, and I'm going to write down my understanding of it here now (by answering my own questions), because I think it's a cool idea and I don't want to forget it!

What would the custom mapping object do?

It would be a dict-like container storing variables and possibly children, but the key point is that it would be accessible separately from the Dataset object, so that both a DataTree node and the wrapped Dataset would be able to point to a common record of variables/children stored.

How would that help with the original name collision problem?

If someone tries to add a variable to the wrapped Dataset (or a child to the wrapping DataTree), then by consulting the custom mapping object you would be able to check all variables and children for any name collisions.

And would that require sending operations on .ds back up to DataTree?

That is essentially what the custom mapping object would allow - any change to the variables in the Dataset would then be automatically reflected in the wrapping DataTree.

what would the .ds property return? An actual (mutable) Dataset? A frozen Dataset, with a mutable one only via .to_dataset()?

This is still a valid question though I think - if .ds returns a Dataset that is still linked to the wrapping DataTree, then you could end up with some behaviour like

ds = dt.ds
ds = many_inplace_operations_on_ds_later(ds)

dt['new_variable'] = blah
# but now ds has also been changed

Maybe that's fine, but it's also perhaps unintuitive - something to consider.

@TomNicholas
Copy link
Member Author

I wanted to convince myself that this was the only way to solve this problem, so I wrote out the problem and all the possible approaches in this gist.

I think storing a custom mapping object under ._variables is probably the best way overall though.

@TomNicholas TomNicholas changed the title Should Node classes inherit from xarray.Dataset? Store variables in DataTree instead of storing Dataset Apr 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants