Store variables in DataTree instead of storing Dataset #2
Comments
I think this really comes down to two questions:
Pros of automatic inheritance:
Pros of manual wrapping:
|
Another, probably much better, approach to inheritance would be to make benign changes to
class DatasetPropertiesMixin:
    @property
    def dims(self):
        ...  # etc.

class DatasetMethodsMixin:
    def isel(self, indexers=None, **indexers_kwargs):
        ...  # etc.

class Dataset(DatasetPropertiesMixin, DatasetMethodsMixin, DataWithCoords, DatasetArithmetic):
    ...

# DataTree could then reuse the property mixin and map the method mixins over the subtree
from xarray.core.dataset import DatasetPropertiesMixin, DatasetMethodsMixin

DataTreeMethodsMixin = map_all_methods_over_subtree(DatasetMethodsMixin)
DataTreeArithmetic = map_all_methods_over_subtree(DatasetArithmetic)

class DataTree(DatasetPropertiesMixin, DataTreeMethodsMixin, DataWithCoords, DataTreeArithmetic):
    ...

(where that example also has another mixin class for mapped Dataset methods too.) This would not make |
@shoyer or @max-sixty, I would appreciate either of your opinions on the above fundamental design question for |
eventually software ends up at philosophy... I do think it's an interesting question. I'm really not sure where we'll end up. Without having thought as deeply as you, I would probably start with composing them. Basically for the reasons you describe — inheritance a) couples the classes together and b) stepping back, it relies on a known ontology — so mixins like Having everything behind a And this doesn't preclude a To confirm, I haven't thought about this sufficiently, so please weigh my view appropriately! |
I would also start with composition. It's much more verbose, but otherwise you'll never be quite sure when Making Reusing Xarray's mixins would be reasonable, but only if/when |
Thanks both for your input. I did in fact start with composition, so I'll keep it that way for now.
I'm not sure if this is quite what you meant, but I think what I'll do for now is define my own mixins internally, but make sure none of them actually inherit from any of xarray's mixins. That way I can keep the internal organisational difference between "methods on Dataset that need wrapping", "methods on Dataset that need wrapping and mapping over child nodes", "properties on Dataset that need wrapping" etc., but without actually depending on any private xarray API. For example, this mixin for Datatree to inherit from does not itself inherit from (or even reference) any of xarray's mixins:

class MappedDatasetMethodsMixin:
    """
    Mixin to add Dataset methods like .mean(), but wrapped to map over all nodes in the subtree.
    """
    __slots__ = ()
    _wrap_then_attach_to_cls(
        cls_dict=vars(),
        copy_methods_from=Dataset,
        methods_to_copy=_ALL_DATASET_METHODS_TO_MAP,
        wrap_func=map_over_subtree,
    )

class DataTree(
    TreeNode,
    DatasetPropertiesMixin,
    MappedDatasetMethodsMixin,
    MappedDataWithCoordsMixin,
    DataTreeArithmeticMixin,
):
    ...
|
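To make the wrap-and-attach pattern concrete, here is a minimal, self-contained sketch of a mixin that copies methods from a data class and wraps them so they map over a subtree. It deliberately uses neither xarray nor datatree: `ToyDataset`, `ToyTree`, and the bodies of `map_over_subtree` and `wrap_then_attach_to_cls` are hypothetical stand-ins written for illustration, and the real helpers may well differ.

```python
import functools


class ToyDataset:
    """Hypothetical stand-in for xarray.Dataset: maps names to lists of numbers."""

    def __init__(self, variables):
        self.variables = dict(variables)

    def scale(self, factor):
        # Return a new dataset rather than mutating in place.
        return ToyDataset(
            {name: [x * factor for x in values] for name, values in self.variables.items()}
        )


def map_over_subtree(func):
    """Turn a ToyDataset method into a node method applied to the node and all descendants."""

    @functools.wraps(func)
    def mapped(node, *args, **kwargs):
        new_ds = func(node.ds, *args, **kwargs)
        new_children = {
            name: mapped(child, *args, **kwargs) for name, child in node.children.items()
        }
        return type(node)(new_ds, new_children)

    return mapped


def wrap_then_attach_to_cls(cls_dict, copy_methods_from, methods_to_copy, wrap_func):
    """Copy the named methods from one class into a class namespace, wrapped."""
    for name in methods_to_copy:
        cls_dict[name] = wrap_func(getattr(copy_methods_from, name))


class MappedToyDatasetMethodsMixin:
    __slots__ = ()
    # vars() inside a class body is the namespace of the class being defined.
    wrap_then_attach_to_cls(
        cls_dict=vars(),
        copy_methods_from=ToyDataset,
        methods_to_copy=["scale"],
        wrap_func=map_over_subtree,
    )


class ToyTree(MappedToyDatasetMethodsMixin):
    """Node that wraps a ToyDataset; it does not inherit from it."""

    def __init__(self, ds, children=None):
        self.ds = ds
        self.children = dict(children or {})


tree = ToyTree(ToyDataset({"a": [1, 2]}), {"child": ToyTree(ToyDataset({"b": [3]}))})
doubled = tree.scale(2)  # applies ToyDataset.scale to the root and to "child"
```

The mixin only reads attributes off `ToyDataset` at class-definition time, so the tree class never appears in the data class's inheritance chain.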
It seems that the current discussion may be leaning toward inheritance over composition to address issues like name collisions (#38) and mutability. However, it may be worth considering the possibilities of a third option that doesn't use the This would solve the name collisions problem but would require much more logic here in datatree. It's possible this would be much harder than a direct inheritance implementation but I thought I would throw the idea out there. |
I like this idea a lot, but let me suggest a slight variation: store The core data model for
My suggestion would be to make To implement methods on |
Thanks @jhamman and @shoyer - I started typing this before Stephan's comment just now so I will finish the thought - sounds like we're on the same page though: What you're suggesting is basically what I was thinking of as a last-resort solution to solve #38 . Basically You could implement that with inheritance, which sounds conceptually neat but Stephan is probably right that it would cause more problems than it would solve. The alternative as you say Joe is "integration": essentially writing something that reimplements all the high-level aligning behaviour of You would then have some kind of |
I would do this by having an API something like:
|
This idea is also closely related to https://github.com/TomNicholas/datatree/issues/3, and would re-open the question of whether to expose only local |
If we did implement it like this I personally think there would then be a much stronger case for eventually integrating DataTree into xarray, rather than it just living in xarray-contrib. The current approach simply wraps (and sometimes emulates) Dataset, so it makes sense for it to live in a separate library that lies atop xarray, but this new suggestion is a Dataset-like object which shares a huge amount of functionality and probably internal code with |
(@alexamici you will probably find this whole discussion interesting, and I would like to hear if you do have any thoughts!) |
+1 for integrating this into Xarray proper, possibly first as an
"experimental" feature. If we want to reuse internal helper functions, that
is the way to do it.
|
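One disadvantage of the "integration" approach is a lack of clear separation for users between operations that act on just that single node, and operations which map over all nodes in that subtree. For example whilst .coords should return only the coordinates stored in that node, what should assign_coords do? It could assign coords to that node alone, or attempt to assign the same coords to every node in the subtree. Sure, the method could have a new kwarg map_over_subtree, but it's not going to be intuitively obvious to the user whether that would default to True or False. Compare to the current implementation, where to act on a single node's dataset you always have to pull out .ds, and every method on the DataTree object maps over the whole subtree. That is definitely clearer. |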
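The ambiguity between node-local and subtree-wide behaviour can be illustrated with a small sketch. `ToyNode` is a hypothetical class, not part of datatree or xarray, and the default `map_over_subtree=False` is an arbitrary choice made only for this example; that arbitrariness is the concern being raised.

```python
class ToyNode:
    """Hypothetical tree node that stores its own coordinates directly."""

    def __init__(self, coords=None, children=None):
        self.coords = dict(coords or {})
        self.children = dict(children or {})

    def assign_coords(self, new_coords, map_over_subtree=False):
        # With map_over_subtree=False only this node gets the new coords and
        # the existing child objects are reused unchanged.
        children = {
            name: (
                child.assign_coords(new_coords, map_over_subtree=True)
                if map_over_subtree
                else child
            )
            for name, child in self.children.items()
        }
        return ToyNode({**self.coords, **new_coords}, children)


root = ToyNode({"x": [0, 1]}, {"child": ToyNode({"y": [5]})})

local = root.assign_coords({"time": [0]})
mapped = root.assign_coords({"time": [0]}, map_over_subtree=True)
```

A reader of `root.assign_coords(...)` cannot tell from the call alone which of the two results they will get without knowing the default.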
I think we could probably still make the interface using .ds work. This
would require modifying xarray.Dataset so it can support a custom mapping
object for storing variables, but that seems doable?
…On Wed, Sep 22, 2021 at 2:02 PM Tom Nicholas ***@***.***> wrote:
One disadvantage of the "integration" approach is a lack of clear
separation for users between operations that act on just that single node,
and operations which map over all nodes in that subtree. For example whilst
.coords should return only the coordinates stored in that node, what
should assign_coords do? It could assign coords to that node alone, or
attempt to assign the same coords to every node in the subtree. Sure, the
method could have a new kwarg map_over_subtree, but it's not going to be
intuitively obvious to the user whether the pattern is that that would
default to True or False.
Compare to the current implementation, where to act on a single node's
dataset you always have to pull out .ds, and every method on the DataTree
object maps over the whole subtree. That is definitely clearer.
|
I'm sorry but I'm very much not following 😅 Could you elaborate? What would the custom mapping object do? How would that help with the original name collision problem? If Or alternatively are you imagining that variables are still stored privately in the |
I asked Stephan about his suggestion in the xarray dev meeting just now, and I'm going to write down my understanding of it here now (by answering my own questions), because I think it's a cool idea and I don't want to forget it!
It would be a dict-like container storing variables and possibly children, but the key point is that it would be accessible separately from the
If someone tries to add a variable to the wrapped Dataset (or a child to the wrapping DataTree), then by consulting the custom mapping object you would be able to check all variables and children for any name collisions.
That is essentially what the custom mapping object would allow - any change to the variables in the
This is still a valid question though I think - if
ds = dt.ds
ds = many_inplace_operations_on_ds_later(ds)
dt['new_variable'] = blah
# but now ds has also been changed
Maybe that's fine, but it's also perhaps unintuitive - something to consider. |
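A rough sketch of the custom-mapping idea, under stated assumptions: `NodeStore`, `ToyDataset`, and `ToyTree` are hypothetical names invented here, and `ToyDataset` merely imitates a Dataset that accepts an arbitrary mapping for variable storage (something xarray.Dataset would need modifying to support). The sketch shows both the collision check and the shared-state behaviour raised in the question above.

```python
from collections.abc import MutableMapping


class NodeStore(MutableMapping):
    """Hypothetical shared store: holds variables, and tracks children to catch collisions."""

    def __init__(self):
        self._variables = {}
        self._children = {}

    def __getitem__(self, name):
        return self._variables[name]

    def __setitem__(self, name, variable):
        if name in self._children:
            raise KeyError(f"{name!r} is already the name of a child node")
        self._variables[name] = variable

    def __delitem__(self, name):
        del self._variables[name]

    def __iter__(self):
        return iter(self._variables)

    def __len__(self):
        return len(self._variables)

    def add_child(self, name, child):
        if name in self._variables:
            raise KeyError(f"{name!r} is already the name of a variable")
        self._children[name] = child


class ToyDataset:
    """Stand-in for a Dataset that accepts any mapping for variable storage."""

    def __init__(self, variables):
        self._variables = variables  # kept by reference, not copied

    def __setitem__(self, name, variable):
        self._variables[name] = variable

    def __contains__(self, name):
        return name in self._variables


class ToyTree:
    def __init__(self):
        self._store = NodeStore()

    @property
    def ds(self):
        # Every dataset handed out shares the node's store.
        return ToyDataset(self._store)

    def __setitem__(self, name, value):
        if isinstance(value, ToyTree):
            self._store.add_child(name, value)
        else:
            self._store[name] = value


dt = ToyTree()
ds = dt.ds
dt["weather"] = ToyTree()    # a child node
dt["pressure"] = [1.0, 2.0]  # a variable, added after ds was extracted

try:
    ds["weather"] = [0.0]    # collides with the child name
    collision_detected = False
except KeyError:
    collision_detected = True
```

Because `ds` holds the store by reference, the variable added through `dt` is visible through `ds` as well, which is the possibly unintuitive side of this design.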
I wanted to convince myself that this was the only way to solve this problem, so I wrote out the problem and all the possible approaches in this gist. I think storing a custom mapping object under |
Should we prefer inheritance or composition when making the node of a datatree behave like an xarray Dataset?
Inheritance
We really want the data-containing nodes of the datatree to behave as much like xarray datasets as possible, as we will likely be calling functions/methods on them, assigning them, extracting from them and saving them as if they were actually xarray.Dataset objects. We could imagine a tree node class which directly inherits from xarray.Dataset:
This would have all the attributes and API of a Dataset, and pass isinstance() checks, but also the attributes and methods needed to function as a node in a tree (e.g. .children, .parent). We would still need to decorate most inherited methods in order to apply them to all child nodes in the tree though.
Mostly these don't collide, except in the important case of getting/setting children of a node. xarray.Dataset objects already use up __getitem__ for variable selection (i.e. ds[var]) as well as the .some_variable namespace via property-like access. This means we can't immediately have an API allowing operations like dt.weather = dt.weather.mean('time') because .weather is a child of the node, not a dataset variable. (It's possible we could have both behaviours simultaneously by overwriting __getitem__, but then we might restrict the possible names of children/variables.)
I think this approach would also have the side-effect that accessor methods registered with @register_dataset_accessor would also be callable on the tree nodes.
Composition
The alternative is instead of each node being a Dataset, each node merely wraps a Dataset. This has the advantage of keeping the Data class and the Node class separate, though they would still share a large API to allow applying a method (e.g. .mean()) to all child nodes in a tree.
The disadvantage is that then all the variables and dataset attributes are behind a .ds property.
This type of syntax dt.weather = dt.weather.mean('time') would then be possible (at least if we didn't allow the tree objects to have their own .attrs, else it would have to be dt['weather'] = dt['weather'].mean('time')) because we would be calling the method of a DatasetNode (rather than Dataset) and then assigning to a DatasetNode.
Selecting a particular variable from a dataset stored at a particular node would then look like dt['weather'].ds['pressure'], which has the advantage of clarifying which one is the variable, but the disadvantage of breaking up the path-like structure to get from the root down to the variable. EDIT: As there is no problem with collisions between names of groups and variables, we can actually just override __getitem__ to check in both the data variables and the children, so we can have access like dt['weather']['pressure'].
(There is also a possible third option described in #4)
For now the second approach seemed better, but I'm looking for other opinions!
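As an illustration of the composition approach with an overridden __getitem__, here is a minimal sketch. `ToyDataset` and `ToyNode` are hypothetical classes rather than xarray or datatree API, and the lookup order (variables first, then children) is an assumption made for the example.

```python
class ToyDataset:
    """Hypothetical stand-in for xarray.Dataset."""

    def __init__(self, variables=None):
        self.variables = dict(variables or {})

    def __getitem__(self, name):
        return self.variables[name]


class ToyNode:
    """Node that wraps a dataset (composition) rather than inheriting from it."""

    def __init__(self, ds=None, children=None):
        self.ds = ds if ds is not None else ToyDataset()
        self.children = dict(children or {})

    def __getitem__(self, name):
        # Look in the wrapped dataset's variables first, then in the children.
        if name in self.ds.variables:
            return self.ds[name]
        if name in self.children:
            return self.children[name]
        raise KeyError(name)


weather = ToyNode(ToyDataset({"pressure": [1000.0, 1010.0]}))
dt = ToyNode(children={"weather": weather})

via_path = dt["weather"]["pressure"]    # path-like access
via_ds = dt["weather"].ds["pressure"]   # explicit access through .ds
```

Both access styles reach the same variable, while `dt` itself is never an instance of the dataset class.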