-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-implement map_over_datasets #2
Conversation
* test unary op * implement and generate unary ops * test for unary op with inherited coordinates * re-enable arithmetic tests * implementation for binary ops * test ds * dt commutativity * ensure other types defer to DataTree, thus fixing pydata#9365 * test for inplace binary op * pseudocode implementation of inplace binary op, and xfail test * remove some unneeded type: ignore comments * return type should be DataTree * type datatree ops as accepting dataset-compatible types too * use same type hinting hack as Dataset does for __eq__ not being same as Mapping * ignore return type * add some methods to api docs * don't try to import DataTree.astype in API docs * test to check that single-node trees aren't broadcast * return NotImplemented * remove pseudocode for inplace binary ops * map_over_subtree -> map_over_datasets
* sketch of migration guide * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * whatsnew * add date * spell out API changes in more detail * details on backends integration * explain alignment and open_groups * explain coordinate inheritance * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * re-trigger CI * remove bullet about map_over_subtree * Markdown formatting for important warning block Co-authored-by: Matt Savoie <[email protected]> * Reorder changes in order of importance Co-authored-by: Matt Savoie <[email protected]> * Clearer wording on setting relationships Co-authored-by: Matt Savoie <[email protected]> * remove "technically" Co-authored-by: Matt Savoie <[email protected]> --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Matt Savoie <[email protected]>
The main changes: - It is implemented using zip_subtrees, which means it should properly handle DataTrees where the nodes are defined in a different order. - For simplicity, I removed handling of `**kwargs`, in order to preserve some flexibility for adding keyword arugments. - I removed automatic skipping of empty nodes, because there are almost assuredly cases where that would make sense. This could be restored with a option keyword arugment.
8c224b2
to
739573a
Compare
As mentioned in pydata#2157, the docstring of `Dataset.groupby` does not reflect deprecation of squeeze (as the docstring of `DataArray.groupby` does) and states an incorrect default value.
* Add inherit=False option to DataTree.copy() This PR adds a inherit=False option to DataTree.copy, so users can decide if they want to inherit coordinates from parents or not when creating a subtree. The default behavior is `inherit=True`, which is a breaking change from the current behavior where parent coordinates are dropped (which I believe should be considered a bug). * fix typing * add migration guide note * ignore typing error
* Bug fixes for DataTree indexing and aggregation My implementation of indexing and aggregation was incorrect on child nodes, re-creating the child nodes from the root. There was also another bug when indexing inherited coordinates that meant formerly inherited coordinates were incorrectly dropped from results. * disable broken test
I will look at this in more detail in a bit, but there are definitely issues on the original datatree repo caused by not skipping empty nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is implemented using
zip_subtrees
, which means it should properly
handle DataTrees where the nodes are defined in a different order.
👍
For simplicity, I removed handling of
**kwargs
, in order to preserve
some flexibility for adding keyword arguments.
So again the idea here is that that is more similar to apply_ufunc
?
I removed automatic skipping of empty nodes, because there are almost
assuredly cases where that would make sense. This could be restored
with a option keyword arugment.I will look at this in more detail in a bit, but there are definitely issues on the original datatree repo caused by not skipping empty nodes.
The issue I was thinking of was xarray-contrib/datatree#262, but actually that would also be handled by pydata#9588.
I wonder if we should also change
map_over_datasets
from being called likemap_over_datasets(func)(*args)
tomap_over_datasets(func, *args)
, which would be more consistent withapply_ufunc
.
👍 The existing signature was motivated by me wanting to decorate inherited methods, but we don't really need that now. Consistency with apply_ufunc
is a good point though. I agree that map_over_datasets(func, *args)
might be better - that's also more similar to how the method version DataTree.map_over_datasets
currently works, and similar to Dataset.map
.
node_dataset_args.insert(i, arg) | ||
|
||
path = ( | ||
"/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that's intended? I think this might effectively orphan the result (possibly similar to pydata#9626 (review)).
* type hints for datatree ops tests * type hints for datatree aggregations tests * type hints for datatree indexing tests * type hint a lot more tests * more type hints
* Add zip_subtrees for paired iteration over DataTrees This should be used for implementing DataTree arithmetic inside map_over_datasets, so the result does not depend on the order in which child nodes are defined. I have also added a minimal implementation of breadth-first-search with an explicit queue the current recursion based solution in xarray.core.iterators (which has been removed). The new implementation is also slightly faster in my microbenchmark: In [1]: import xarray as xr In [2]: tree = xr.DataTree.from_dict({f"/x{i}": None for i in range(100)}) In [3]: %timeit _ = list(tree.subtree) # on main 87.2 μs ± 394 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) # with this branch 55.1 μs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each) * fix pytype error * Tweaks per review
The main changes:
handle DataTrees where the nodes are defined in a different order.
**kwargs
, in order to preservesome flexibility for adding keyword arugments.
assuredly cases where that would make sense. This could be restored
with a option keyword arugment.
I wonder if we should also change
map_over_datasets
from being called likemap_over_datasets(func)(*args)
tomap_over_datasets(func, *args)
, which would be more consistent withapply_ufunc
.