Performance of deep DataTrees #9511
I'd say my use case is a lot of files with a lot of variables with different resolutions/groups. My experience with datatree is still limited, since I have had a hard time getting past working with
Trees with thousands of nodes are certainly a compelling use case, especially with lazy data. A simple improvement would be to automatically truncate reprs when they get too large. I guess we might be able to improve performance of large trees by up to ~10x with clever optimizations of the existing code, but if we need ~100x performance gains we will need to think about alternative strategies: there are limits on how far you can optimize pure Python code with thousands or millions of objects. One solution that comes to mind, with minimal implications for Xarray's API, is lazy creation/loading of sub-trees. You would write something like
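(The code example from the original comment is missing from this copy of the thread. As a rough, hypothetical illustration of the lazy sub-tree idea — `LazyNode`, `loader`, and `make_subtree` are invented names, not xarray's actual API — deferring construction of children until first access might look like:)

```python
class LazyNode:
    """Toy tree node that defers building its children until first access."""

    def __init__(self, name, loader=None):
        self.name = name
        self._loader = loader    # callable returning {name: LazyNode}, or None
        self._children = None    # None means "not materialized yet"

    @property
    def children(self):
        if self._children is None:
            self._children = self._loader() if self._loader else {}
        return self._children


def make_subtree(name, depth):
    """Describe a binary tree of the given depth without building it."""
    if depth == 0:
        return LazyNode(name)
    return LazyNode(
        name,
        loader=lambda: {
            str(i): make_subtree(f"{name}/{i}", depth - 1) for i in range(2)
        },
    )


root = make_subtree("root", depth=30)   # 2**30 leaves described, none built
child = root.children["0"]              # materializes only root's level
print(child.name)                       # -> "root/0"
print(child._children is None)          # -> True: grandchildren still lazy
```

Only the nodes actually visited ever get constructed, so opening a huge store stays cheap until you descend into it.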
About the performance of the (static HTML) reprs, I'm afraid there's no way around truncating them for large trees. There are many more possibilities with dynamic (widget) reprs supporting bi-directional communication. https://github.com/benbovy/xarray-fancy-repr doesn't work with DataTree yet, but I think it would be pretty straightforward to support it (all the repr parts are already available as reusable React components). It would also work seamlessly with lazy loading of sub-trees. The "hardest" task would be to design some UI elements for navigating into large trees.
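Truncating a plain-text tree repr, the simpler static option mentioned above, is straightforward. A minimal sketch (the `Node`, `tree_repr`, and `max_children` names are illustrative, not xarray's actual repr machinery):

```python
class Node:
    """Minimal stand-in for a tree node with named children."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def tree_repr(node, max_children=3, depth=0):
    """Render a tree as indented lines, showing at most
    `max_children` children per node and summarizing the rest."""
    lines = ["  " * depth + node.name]
    for child in node.children[:max_children]:
        lines.extend(tree_repr(child, max_children, depth + 1))
    hidden = len(node.children) - max_children
    if hidden > 0:
        lines.append("  " * (depth + 1) + f"... ({hidden} more children)")
    return lines


big = Node("root", [Node(f"group_{i}") for i in range(1000)])
print("\n".join(tree_repr(big)))
```

A 1000-child node renders as five lines instead of a thousand; the same cap could be applied per level of a deep tree.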
Hi everyone, I've been working with hierarchical structures to store weather radar data. We're leveraging xradar and datatree to manage these datasets efficiently. Currently, we are using the standard WMO CfRadial2.1/FM301 format to build a datatree model that stores historical weather radar datasets.

I think our data model works, at least in this beta stage; however, as the dataset grows (we have stores ~15 GB and ~80 GB in size), we've noticed longer load times when opening/reading it, and I've worked with larger datasets, which take even more time to open/read. The datatree structure contains 11 nodes, each representing a point where live-updating data is appended.

This is a minimal reproducible example, in case you want to look at it:

```python
import s3fs
import xarray as xr
from time import time


def main():
    print(xr.__version__)
    st = time()

    # S3 bucket connection
    URL = 'https://js2.jetstream-cloud.org:8001/'
    path = 'pythia/radar/erad2024'
    fs = s3fs.S3FileSystem(anon=True,
                           client_kwargs=dict(endpoint_url=URL))
    file = s3fs.S3Map(f"{path}/zarr_radar/Guaviare_test.zarr", s3=fs)

    # opening datatree stored in zarr
    dtree = xr.backends.api.open_datatree(
        file,
        engine='zarr',
        consolidated=True,
        chunks={},
    )
    print(f"total time: {time() - st}")


if __name__ == "__main__":
    main()
```

and the output is:

```
2024.9.1.dev23+g52f13d44
total time: 5.198976516723633
```

For more information about the data model, you can check this.
Following up on my previous post: I found that replacing the `StoreBackendEntrypoint` path (commented out below) with a direct `open_dataset(..., engine="zarr")` call per group speeds things up:

```python
for path_group, store in stores.items():
    # store_entrypoint = StoreBackendEntrypoint()
    #
    # with close_on_error(store):
    #     group_ds = store_entrypoint.open_dataset(
    #         store,
    #         mask_and_scale=mask_and_scale,
    #         decode_times=decode_times,
    #         concat_characters=concat_characters,
    #         decode_coords=decode_coords,
    #         drop_variables=drop_variables,
    #         use_cftime=use_cftime,
    #         decode_timedelta=decode_timedelta,
    #     )
    group_ds = open_dataset(
        filename_or_obj,
        store=store,
        group=path_group,
        engine="zarr",
        mask_and_scale=mask_and_scale,
        decode_times=decode_times,
        concat_characters=concat_characters,
        decode_coords=decode_coords,
        drop_variables=drop_variables,
        use_cftime=use_cftime,
        decode_timedelta=decode_timedelta,
        **kwargs,
    )
    group_name = str(NodePath(path_group))
    groups_dict[group_name] = group_ds
```

I got the following results by running a test locally over the minimal reproducible example:

```
2024.9.1.dev23+g52f13d44
total time: 3.808659553527832
```

We went from ~5.2 to ~3.8 seconds (around 1.37x faster). Please let me know your thoughts.
@benbovy that's interesting! #9633 brings the datatree (static) HTML repr up to date, so you or anyone else who is interested in playing with this (@jsignell perhaps?) can start from there.

Presumably you're referring to more than just clickable dropdown arrows here?
What is your issue?
The `DataTree` structure was not designed with performance of very large trees in mind. It doesn't do anything obviously wasteful, but the priority has been making decisions about the data model and user API, with performance secondary. Now that the model is more established (or soon should be), we're in a better position to talk about improving performance.

There are two possible performance issues that @shoyer pointed out:

- The internal structure is a lot of linked Python classes, resulting in a lot of method calls to do things like tree traversal. This is good for clarity and for evolving a prototype, but introduces significant overhead per tree operation.
- There are one or two places which might cause quadratic scaling with tree depth. In particular, inserting a node via the `DataTree.__init__` constructor causes the entire tree to be checked for consistency, so creating a tree by repeatedly calling this constructor could be quadratically expensive. `DataTree.from_dict` could be optimized to avoid this problem because it creates the tree from the root, so you can just check subtrees as they are added.

I personally think that the primary use case of a `DataTree` is small numbers of nodes, each containing large arrays (rather than large numbers of nodes containing small arrays). But I'm sure someone will immediately be like "well in my use case I need a tree with 10k nodes" 😆 In fact, because it is possible to represent huge amounts of archival data with a single `DataTree`, someone will probably attempt to represent the entire CMIP6 catalog as a `DataTree` and then complain after hitting a performance limit...

If anyone has ideas for how to improve performance without changing the user API, let's use this issue to collate and track them.
(Note that this issue is different from the issue of dask in datatree. (xref #9355, #9502, #9504) Here I'm talking specifically about optimizations that can be performed even without dask installed.)
cc @Illviljan who I'm sure has thoughts about this