[Experimental] Refactor Dataset to store variables in a manifest #5961
self._manifest.variables = variables
# TODO if ds._variables properly pointed to ds._manifest.variables we wouldn't need this line
self._variables = self._manifest.variables
What I wanted was for `self._variables` to always point to `self._manifest.variables`, such that updating the former by definition updates the latter. But I'm not really sure if that kind of pointer-like behaviour is possible.
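One way to get this kind of pointer-like behaviour in plain Python is to make `_variables` a property that forwards to the manifest instead of a stored attribute. The sketch below uses hypothetical stand-in classes (`Manifest`, `Holder`), not xarray's real `Dataset`, and has not been tried against `Dataset` itself.

```python
class Manifest:
    """Hypothetical stand-in for the manifest: it owns the variables dict."""

    def __init__(self, variables=None):
        self.variables = dict(variables or {})


class Holder:
    """Hypothetical stand-in for Dataset; it stores no variables of its own."""

    def __init__(self, manifest):
        self._manifest = manifest

    @property
    def _variables(self):
        # Every read goes through the manifest, so the two cannot diverge.
        return self._manifest.variables

    @_variables.setter
    def _variables(self, value):
        # Rebinding ._variables rebinds the manifest's dict instead.
        self._manifest.variables = value


holder = Holder(Manifest({"a": 0}))
holder._variables = {"b": 1}  # assignment is forwarded to the manifest
holder._variables["c"] = 2  # in-place mutation hits the same dict
print(holder._manifest.variables)
```

With this shape the explicit re-assignment flagged in the TODO would no longer be needed, because there is only one place where the dict is stored.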
@shoyer What do you think about the way I've extended `Dataset` by adding a `._manifest` attribute?
I could have instead only altered `._variables` to point to the manifest, but then I end up with confusing code such as
class Dataset:
    @property
    def variables(self):
        return self._variables.variables
(I was pleasantly surprised by the very small number of places I needed to change the code to accommodate the manifest - the way that practically everything goes through `._replace()` makes it pretty easy to keep all the tests passing.)
@classmethod
def _construct_from_manifest(
    cls,
    manifest,
    coord_names,
    dims=None,
    attrs=None,
    indexes=None,
    encoding=None,
    close=None,
):
    """Creates a Dataset that is forced to be consistent with a DataTree node that shares its manifest."""
The idea is to use this when someone calls `dt.ds` - so that the `DataTree` returns an actual `Dataset`, but one whose contents are linked to the wrapping tree node.
How about making custom |
It just seems confusing to have an object referred to as `._variables`
which actually is just as much a container of child `DataTree` nodes as it
is of variables...
Also the `DataTree` class needs to wrap this "manifest" too (otherwise I'm
storing children in multiple places at once), and so I want it to
make sense in that context too.
…On Tue, 9 Nov 2021, 13:46 Stephan Hoyer, ***@***.***> wrote:
How about making custom Mapping for use as Dataset._variables directly,
which directly is a mapping of dataset variables? You could still be
storing the underlying variables in a different way.
From a xarray.Dataset perspective, My tentative suggestion would be to use a mixed dictionary with either |
That sounds nice, and might not require any changes to
I think it's a lot easier to have a dict of DataTree objects rather than a nested dict of data, as then each node just points to its child nodes instead of having a node which knows about all the data in the whole tree (if that's what you meant).
So this is my understanding of what you're suggesting - I'm just not sure if it solves all the requirements:
class DataManifest(MutableMapping):
    """
    Acts like a dict of keys to variables, but
    prevents setting variables to same key as any
    children
    """
    def __init__(self, variables={}, children={}):
        # check for collisions here
        self._variables = {}
        self._children = {}
    def __getitem__(self, key):
        # only expose the variables so this acts like a normal dict of variables
        return self._variables[key]
    def __setitem__(self, key, var):
        if key in self._children:
            raise KeyError(
                "key already in use to denote a child "
                "node in wrapping DataTree node"
            )
        # store in the same dict that __getitem__ reads from
        self._variables[key] = var
class Dataset:
    _variables: Mapping[Any, Variable]
    # in standard case just use dict of vars as before
    # Use ._construct_direct as the constructor
    # as it allows for setting ._variables directly
    # therefore no changes to Dataset required!
class DataTree:
    def __init__(self, name, data, parent, children):
        self._children
        self._variables
        self._coord_names
        self._dims
        ...
    @property
    def ds(self):
        manifest = DataManifest(self._variables, self._children)
        return Dataset._from_treenode(
            variables=manifest,
            coord_names=self._coord_names,
            dims=self._dims,
            ...
        )
    @ds.setter
    def ds(self, ds):
        # check for collisions between ds.data_vars and self.children
        ...
----------------
ds = Dataset({'a': 0})
subtree1 = DataTree('group1')
dt = DataTree('root', data=ds, children=[subtree1])
wrapped_ds = dt.ds
wrapped_ds['group1'] = 1  # raises KeyError - good!
subtree2 = DataTree('b')
dt.ds['b'] = 2  # this will happily add a variable to the dataset
dt.add_child(subtree2)  # want to ensure this raises a KeyError as it conflicts with the new variable, but with this design I'm not sure if it will...
EDIT: Actually maybe this would work? So long as in
class DataTree:
    self._variables = manifest
    self._children = manifest.children
Then adding a new child node would also update the manifest, meaning that the linked dataset should know about it too...
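The idea in the EDIT can be checked with a small stand-alone sketch. Everything in it is a hypothetical stand-in (`SketchManifest`, `SketchTree`, this `add_child`), with plain integers in place of `Variable` objects and no xarray import. The only point is that when the tree node and the mapping handed out by `.ds` share one manifest, a collision is caught whichever side writes first.

```python
from collections.abc import MutableMapping


class SketchManifest(MutableMapping):
    """Hypothetical manifest: a mapping of variables that also records children."""

    def __init__(self):
        self._variables = {}
        self.children = {}

    def __getitem__(self, key):
        return self._variables[key]

    def __setitem__(self, key, value):
        if key in self.children:
            raise KeyError(f"{key!r} already names a child node")
        self._variables[key] = value

    def __delitem__(self, key):
        del self._variables[key]

    def __iter__(self):
        return iter(self._variables)

    def __len__(self):
        return len(self._variables)


class SketchTree:
    """Hypothetical tree node that keeps its children inside the manifest."""

    def __init__(self, name):
        self.name = name
        self._manifest = SketchManifest()

    @property
    def ds(self):
        # Stand-in for the linked Dataset: hand out the shared manifest itself.
        return self._manifest

    def add_child(self, child):
        if child.name in self._manifest:
            raise KeyError(f"{child.name!r} already names a variable")
        self._manifest.children[child.name] = child


dt = SketchTree("root")
dt.add_child(SketchTree("group1"))
dt.ds["b"] = 2  # fine: no child is called "b"

for attempt in (
    lambda: dt.ds.__setitem__("group1", 1),  # variable colliding with a child
    lambda: dt.add_child(SketchTree("b")),  # child colliding with a variable
):
    try:
        attempt()
    except KeyError as err:
        print("rejected:", err)
```

Both attempts are rejected here because there is a single record of names; a real `Dataset` built around such a manifest would also need its constructors and `._replace()` to keep passing the same object along.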
Update: I tried making a custom mapping class (code in drop-down below), then swapping out It kind of works?
It's not quite as simple as this - you need a To get tests to pass I can either relax those type constraints (which leads to >2/3 of (EDIT: Though maybe inheriting from dict is more trouble than it's worth)
Code for custom mapping class
from collections.abc import MutableMapping
from typing import Dict, Hashable, Mapping, Iterator, Optional, Sequence

from xarray.core.variable import Variable

# from xarray.tree.datatree import DataTree


class DataTree:
    """Purely for type hinting purposes for now (and to avoid a circular import)"""

    ...


class DataManifest(MutableMapping):
    """
    Stores variables like a dict, but also stores children alongside in a hidden manner, to check against.
    Acts like a dict of keys to variables, but prevents setting variables to same key as any children. It prevents name
    collisions by acting as a common record of stored items for both the DataTree instance and its wrapped Dataset instance.
    """

    def __init__(
        self,
        variables: Optional[Dict[Hashable, Variable]] = None,
        children: Optional[Dict[Hashable, DataTree]] = None,
    ):
        # None defaults avoid sharing one mutable default dict between instances
        variables = {} if variables is None else variables
        children = {} if children is None else children
        if variables and children:
            keys_in_both = set(variables.keys()) & set(children.keys())
            if keys_in_both:
                raise KeyError(
                    f"The keys {keys_in_both} exist in both the variables and child nodes"
                )
        self._variables = variables
        self._children = children

    @property
    def children(self) -> Dict[Hashable, DataTree]:
        """Stores list of the node's children"""
        return self._children

    @children.setter
    def children(self, children: Dict[Hashable, DataTree]):
        for key, child in children.items():
            if key in self.keys():
                raise KeyError(
                    f"Cannot add child under key {key} because a variable is already stored under that key"
                )
            if not isinstance(child, DataTree):
                raise TypeError
        self._children = children

    def __getitem__(self, key: Hashable) -> Variable:
        """Forward to the variables here so the manifest acts like a normal dict of variables"""
        return self._variables[key]

    def __setitem__(self, key: Hashable, value: Variable):
        """Allow adding new variables, but first check if they conflict with children"""
        if key in self._children:
            raise KeyError(
                f"key {key} already in use to denote a child "
                "node in wrapping DataTree node"
            )
        if isinstance(value, Variable):
            self._variables[key] = value
        else:
            raise TypeError(f"Cannot store object of type {type(value)}")

    def __delitem__(self, key: Hashable):
        """Forward to the variables here so the manifest acts like a normal dict of variables"""
        if key in self._variables:
            del self._variables[key]
        elif key in self.children:
            # TODO might be better not to del children here?
            del self._children[key]
        else:
            raise KeyError(f"Cannot remove item because nothing is stored under {key}")

    def __contains__(self, item: object) -> bool:
        """Forward to the variables here so the manifest acts like a normal dict of variables"""
        return item in self._variables

    def __iter__(self) -> Iterator:
        """Forward to the variables here so the manifest acts like a normal dict of variables"""
        return iter(self._variables)

    def __len__(self) -> int:
        """Forward to the variables here so the manifest acts like a normal dict of variables"""
        return len(self._variables)

    def copy(self) -> "DataManifest":
        """Required for consistency with dict"""
        return DataManifest(variables=self._variables.copy(), children=self._children.copy())

    # TODO __repr__
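On the aside about inheriting from `dict`: one concrete difference is that, in CPython, `dict.update` does not call an overridden `__setitem__`, whereas the `update` mixin supplied by `MutableMapping` is built on `__setitem__`. The sketch below uses two hypothetical toy classes (not the `DataManifest` above) that reject a single reserved key, to show that only the `MutableMapping` version enforces the check on every write path.

```python
from collections.abc import MutableMapping

RESERVED = "child"  # hypothetical key that must never hold a variable


class DictBased(dict):
    def __setitem__(self, key, value):
        if key == RESERVED:
            raise KeyError(key)
        super().__setitem__(key, value)


class MappingBased(MutableMapping):
    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        if key == RESERVED:
            raise KeyError(key)
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


d = DictBased()
d.update({RESERVED: 1})  # dict.update bypasses the overridden __setitem__
print(RESERVED in d)

m = MappingBased()
try:
    m.update({RESERVED: 1})  # the mixin update() goes through __setitem__
except KeyError:
    print("update rejected")
print(RESERVED in m)
```

A `dict` subclass would therefore need `update`, `setdefault` and the constructor overridden as well before the collision check could be relied on.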
Question: Does this change to
In [34]: ds = xr.Dataset({'a': 0})
In [35]: ds.coords['c'] = 2
In [36]: ds
Out[36]:
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    c        int64 2
Data variables:
    a        int64 0
(That's a bit weird given that the docstring of
In [30]: da = xr.DataArray(0)
In [31]: da.coords['c'] = 1
In [32]: da
Out[32]:
<xarray.DataArray ()>
array(0)
Coordinates:
    c        int64 1
In [37]: ds = xr.Dataset({'a': 0})
In [38]: ds['a'].coords['c'] = 2
In [39]: ds
Out[39]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    a        int64 0
In [40]: ds['a']
Out[40]:
<xarray.DataArray 'a' ()>
array(0)
In [41]: ds['a'].coords
Out[41]:
Coordinates:
    *empty*
Bizarrely this does change though:
In [42]: coords = ds['a'].coords
In [43]: coords['c'] = 2
In [44]: coords
Out[44]:
Coordinates:
    c        int64 2
If altering
Closing for same reasons as in #6086 (comment)
This PR is part of an experiment to see how to integrate a `DataTree` into xarray.
What it does is refactor `Dataset` to store variables in a `DataManifest` class, which is also capable of maintaining a ledger of child tree nodes. The point of this is to prevent name collisions between stored variables and child datatree nodes, as first mentioned in https://github.com/TomNicholas/datatree/issues/38 and explained further in https://github.com/TomNicholas/datatree/issues/2.
("Manifest" in the old sense, of a noun meaning "a document giving comprehensive details of a ship and its cargo and other contents")