Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add copy option in DataTree.from_dict #9193

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

Illviljan
Copy link
Contributor

@Illviljan Illviljan commented Jun 29, 2024

Significantly speed up from_dict by removing unwanted copying.

  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
import xarray as xr
from xarray.core.datatree import DataTree


run1 = DataTree.from_dict({"run1": xr.Dataset({"a": 1})})
d = {"run1": run1}
%timeit DataTree.from_dict(d, copy=True)  # 1.36 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit DataTree.from_dict(d, copy=False) # 178 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

@Illviljan
Copy link
Contributor Author

FAILED xarray/tests/test_treenode.py::TestSetNodes::test_set_child_node - TypeError: _set() got an unexpected keyword argument 'copy'
FAILED xarray/tests/test_treenode.py::TestSetNodes::test_set_grandchild - TypeError: _set() got an unexpected keyword argument 'copy'
FAILED xarray/tests/test_treenode.py::TestSetNodes::test_create_intermediate_child - TypeError: _set() got an unexpected keyword argument 'copy'
FAILED xarray/tests/test_treenode.py::TestSetNodes::test_overwrite_child - TypeError: _set() got an unexpected keyword argument 'copy'
FAILED xarray/tests/test_treenode.py::TestPruning::test_del_child - TypeError: _set() got an unexpected keyword argument 'copy'

Is treenode used anywhere else besides as a mixin-class for DataTree? DataTree overrides TreeNode._set anyway.

xarray/core/treenode.py Outdated Show resolved Hide resolved
@Illviljan
Copy link
Contributor Author

Illviljan commented Jun 29, 2024

FAILED xarray/tests/test_datatree.py::TestSetItem::test_grafted_subtree_retains_name - AssertionError: assert 'new_subtree_name' == 'original_subtree_name'
  
  - original_subtree_name
  ? ----- ^^
  + new_subtree_name
  ?  ^^

This test is a mutability test and a typo caused this error. This test used root["new_subtree_name"] = subtree # noqa which uses _set_item under the hood and should use copy=True by default.

@Illviljan Illviljan marked this pull request as ready for review June 29, 2024 21:02
@Illviljan Illviljan added the run-benchmark Run the ASV benchmark workflow label Jun 29, 2024
@Illviljan Illviljan added the topic-DataTree Related to the implementation of a DataTree class label Jun 29, 2024
@shoyer
Copy link
Member

shoyer commented Jun 30, 2024

Elsewhere in Xarray (e.g., on Dataset.copy()), the copy keyword argument refers specifically to whether copying array data is copied, and metadata fields are always copied. This seems potentially inconsistent with that?

@Illviljan
Copy link
Contributor Author

Maybe _fastpath is better?

@shoyer
Copy link
Member

shoyer commented Jun 30, 2024

Maybe _fastpath is better?

👍 for a private constructor, if we want to use this internally inside Xarray.

@Illviljan
Copy link
Contributor Author

Illviljan commented Jul 1, 2024

Well it wouldn't be for internal xarray usage for me, it would be for a backend/mfdatatree function. But it is a typical way we use to speed things up and implying checks are being bypassed, do we have any other public examples?

My real problem is pretty similar to the intro example but I have 100+ datatrees instead, such a simple operation takes 1.5 minutes because of all the copies, it goes down to around 19ms with this pr.

@shoyer
Copy link
Member

shoyer commented Jul 1, 2024

My real problem is pretty similar to the intro example but I have 100+ datatrees instead, such a simple operation takes 1.5 minutes because of all the copies, it goes down to around 19ms with this pr.

Yikes, that is pretty bad!

Do you have a slightly more realistic benchmark you could share, and/or add to the asv tests?

I am sure we can do much better for performance in from_dict, but there are a few related issues here that I would like get right first, including a major refactor to DataTree internals (#9063) and changing the DataTree constructor (#9196).

Once we have the right behavior, it makes sense to start working on performance. I am optimistic that we can figure out how to improve performance without needing to expand the public API.

asv_bench/benchmarks/datatree.py Outdated Show resolved Hide resolved
doc/whats-new.rst Outdated Show resolved Hide resolved
xarray/core/datatree.py Outdated Show resolved Hide resolved
xarray/core/datatree.py Outdated Show resolved Hide resolved
xarray/core/datatree.py Outdated Show resolved Hide resolved
xarray/core/treenode.py Outdated Show resolved Hide resolved
xarray/core/treenode.py Outdated Show resolved Hide resolved
xarray/core/treenode.py Outdated Show resolved Hide resolved
xarray/core/treenode.py Outdated Show resolved Hide resolved
xarray/tests/test_datatree.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-benchmark Run the ASV benchmark workflow topic-DataTree Related to the implementation of a DataTree class
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants