
Re-implement map_over_datasets #2

Closed · wants to merge 10 commits

Conversation

@shoyer (Owner) commented Oct 15, 2024

The main changes:

  • It is implemented using zip_subtrees, which means it should properly
    handle DataTrees where the nodes are defined in a different order.
  • For simplicity, I removed handling of **kwargs, in order to preserve
    some flexibility for adding keyword arguments.
  • I removed the automatic skipping of empty nodes, because there are
    almost assuredly cases where mapping over empty nodes makes sense.
    Skipping could be restored with an option keyword argument.

I wonder if we should also change map_over_datasets from being called like map_over_datasets(func)(*args) to map_over_datasets(func, *args), which would be more consistent with apply_ufunc.
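
To make the comparison concrete, here is a minimal sketch of the two calling conventions (the toy tree and the double function are purely illustrative, and this assumes map_over_datasets is exposed as xr.map_over_datasets):

    import xarray as xr

    tree = xr.DataTree.from_dict(
        {
            "/a": xr.Dataset({"x": ("t", [1, 2])}),
            "/b": xr.Dataset({"x": ("t", [3, 4])}),
        }
    )

    def double(ds):
        return 2 * ds

    # current convention: map_over_datasets(func) returns a wrapped function
    doubled = xr.map_over_datasets(double)(tree)

    # proposed convention, consistent with apply_ufunc:
    # doubled = xr.map_over_datasets(double, tree)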

TomNicholas and others added 3 commits October 15, 2024 14:14
* test unary op

* implement and generate unary ops

* test for unary op with inherited coordinates

* re-enable arithmetic tests

* implementation for binary ops

* test ds * dt commutativity

* ensure other types defer to DataTree, thus fixing pydata#9365

* test for inplace binary op

* pseudocode implementation of inplace binary op, and xfail test

* remove some unneeded type: ignore comments

* return type should be DataTree

* type datatree ops as accepting dataset-compatible types too

* use same type hinting hack as Dataset does for __eq__ not being same as Mapping

* ignore return type

* add some methods to api docs

* don't try to import DataTree.astype in API docs

* test to check that single-node trees aren't broadcast

* return NotImplemented

* remove pseudocode for inplace binary ops

* map_over_subtree -> map_over_datasets
* sketch of migration guide

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* whatsnew

* add date

* spell out API changes in more detail

* details on backends integration

* explain alignment and open_groups

* explain coordinate inheritance

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* re-trigger CI

* remove bullet about map_over_subtree

* Markdown formatting for important warning block

Co-authored-by: Matt Savoie <[email protected]>

* Reorder changes in order of importance

Co-authored-by: Matt Savoie <[email protected]>

* Clearer wording on setting relationships

Co-authored-by: Matt Savoie <[email protected]>

* remove "technically"

Co-authored-by: Matt Savoie <[email protected]>

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Matt Savoie <[email protected]>
Sibgatulin and others added 4 commits October 15, 2024 09:05
As mentioned in pydata#2157, the docstring of `Dataset.groupby` does not
reflect the deprecation of `squeeze` (as the docstring of `DataArray.groupby`
does) and states an incorrect default value.
* Add inherit=False option to DataTree.copy()

This PR adds an inherit=False option to DataTree.copy, so users can
decide whether or not to inherit coordinates from parent nodes when
copying a subtree (see the sketch below this commit message).

The default behavior is `inherit=True`, which is a breaking change from
the current behavior, where parent coordinates are dropped (which I
believe should be considered a bug).

* fix typing

* add migration guide note

* ignore typing error
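
A rough sketch of how the new option might be used, assuming the inherit keyword behaves as described above (the toy tree is illustrative only):

    import xarray as xr

    tree = xr.DataTree.from_dict(
        {
            "/": xr.Dataset(coords={"x": [10, 20]}),
            "/child": xr.Dataset({"foo": ("x", [1, 2])}),
        }
    )

    # keep the coordinates the child inherits from its parent (new default)
    with_coords = tree["child"].copy(inherit=True)

    # copy only what is defined on the child itself, dropping inherited
    # coordinates (the old behavior)
    without_coords = tree["child"].copy(inherit=False)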
* Bug fixes for DataTree indexing and aggregation

My implementation of indexing and aggregation was incorrect on child
nodes: it re-created the child nodes from the root.

There was also another bug when indexing inherited coordinates, which
meant that formerly inherited coordinates were incorrectly dropped from
the results. (A toy example of the affected operations follows below.)
* disable broken test
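
For context, a toy example of the kind of operations these fixes affect (the names are illustrative only):

    import xarray as xr

    tree = xr.DataTree.from_dict(
        {
            "/": xr.Dataset(coords={"time": [1, 2, 3]}),
            "/child": xr.Dataset({"foo": ("time", [10, 20, 30])}),
        }
    )

    # indexing a child node should act on that node and its descendants only,
    # and the inherited "time" coordinate should survive in the result
    subset = tree["child"].isel(time=0)

    # likewise, aggregating a child node should not re-create it from the root
    child_mean = tree["child"].mean()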
@TomNicholas commented

I removed the automatic skipping of empty nodes, because there are
almost assuredly cases where mapping over empty nodes makes sense.
Skipping could be restored with an option keyword argument.

I will look at this in more detail in a bit, but there are definitely issues on the original datatree repo caused by not skipping empty nodes.

@TomNicholas left a comment

It is implemented using zip_subtrees, which means it should properly
handle DataTrees where the nodes are defined in a different order.

👍

For simplicity, I removed handling of **kwargs, in order to preserve
some flexibility for adding keyword arguments.

So again, the idea here is that this is more similar to apply_ufunc?

I removed the automatic skipping of empty nodes, because there are
almost assuredly cases where mapping over empty nodes makes sense.
Skipping could be restored with an option keyword argument.

I will look at this in more detail in a bit, but there are definitely issues on the original datatree repo caused by not skipping empty nodes.

The issue I was thinking of was xarray-contrib/datatree#262, but actually that would also be handled by pydata#9588.

I wonder if we should also change map_over_datasets from being called like map_over_datasets(func)(*args) to map_over_datasets(func, *args), which would be more consistent with apply_ufunc.

👍 The existing signature was motivated by me wanting to decorate inherited methods, but we don't really need that now. Consistency with apply_ufunc is a good point though. I agree that map_over_datasets(func, *args) might be better - that's also more similar to how the method version DataTree.map_over_datasets currently works, and similar to Dataset.map.
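
A rough illustration of that parallel, assuming the method form applies the callable to the dataset at each node (the toy data is illustrative only):

    import xarray as xr

    ds = xr.Dataset({"a": ("x", [1, 2, 3])})
    tree = xr.DataTree.from_dict({"/": ds, "/child": ds + 10})

    # Dataset.map applies a function to every data variable
    ds.map(lambda da: da + 1)

    # the method form applies a function to the dataset at every node,
    # mirroring a map_over_datasets(func, *args) signature
    tree.map_over_datasets(lambda node_ds: node_ds + 1)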

    node_dataset_args.insert(i, arg)

    path = (
        "/"

Are you sure that's intended? I think this might effectively orphan the result (possibly similar to pydata#9626 (review)).

TomNicholas and others added 3 commits October 16, 2024 08:42
* type hints for datatree ops tests

* type hints for datatree aggregations tests

* type hints for datatree indexing tests

* type hint a lot more tests

* more type hints
* Add zip_subtrees for paired iteration over DataTrees

This should be used for implementing DataTree arithmetic inside
map_over_datasets, so the result does not depend on the order in which
child nodes are defined.

I have also added a minimal implementation of breadth-first search with
an explicit queue, replacing the current recursion-based solution in
xarray.core.iterators (which has been removed). The new implementation
is also slightly faster in my microbenchmark (see the generic sketch of
the queue-based traversal at the end of this commit message):

    In [1]: import xarray as xr

    In [2]: tree = xr.DataTree.from_dict({f"/x{i}": None for i in range(100)})

    In [3]: %timeit _ = list(tree.subtree)
    # on main
    87.2 μs ± 394 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    # with this branch
    55.1 μs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

* fix pytype error

* Tweaks per review
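
A generic sketch of breadth-first iteration with an explicit queue, in the spirit of the change described above (this is not the actual xarray implementation; it only assumes that nodes expose a .children mapping, as DataTree does):

    from collections import deque

    def breadth_first(node):
        """Yield a node and all of its descendants, breadth first."""
        queue = deque([node])
        while queue:
            current = queue.popleft()
            yield current
            # children is a mapping of child name -> child node
            queue.extend(current.children.values())

For a tree built as in the benchmark above, list(breadth_first(tree)) visits the root before any of its children, which is the level order that, per this commit, tree.subtree uses.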