Tree-aware dataset handling/selection #254

observingClouds · 2023-08-03T21:57:18Z

I'm looking for a good way to apply a function to a subset of nodes that share some common characteristics encoded in the subtree path.

Imagine the following data tree

import xarray as xr
import datatree
from datatree import map_over_subtree

dt = datatree.DataTree.from_dict({
    'control/lowRes' : xr.Dataset({'z':(('x'),[0,1,2])}),
    'control/highRes' : xr.Dataset({'z':(('x'),[0,1,2,3,4,5])}),
    'plus4K/lowRes' : xr.Dataset({'z':(('x'),[0,1,2])}),
    'plus4K/highRes' : xr.Dataset({'z':(('x'),[0,1,2,3,4,5])})
})

To apply a function to all control or all plus4K nodes is straight forward by just selecting the specific subtree, e.g. dt['control']. However, in case all lowRes dataset should be manipulated this becomes more elaborative and I wonder what the best approach would be.

dt['control/lowRes','plus4K/lowRes'] is not yet implemented and would also be complex for large data trees
dt['*/lowRes'] could be one idea to make the subtree selection more straight forward, where * is a wildcard
dt.search(regex) could make this even more general

Currently, I use the @map_over_subtree decorator, which also has some limitations as the function does not know its tree origin (as noted in the code) and it needs to be inferred from the dataset itself, which is sometimes possible (here the length of the dataset) but does not need to be always the case.

@map_over_subtree
def resolution_specific_func(ds):
    if len(ds.x) == 3:
        ds = ds.z*2
    elif len(ds.x) == 6:
        ds = ds.z*4
    return ds

z= resolution_specific_func(dt)

I do not know how the tree information could be passed through the decorator, but maybe it is okay if the DatasetView class has an additional property (e.g. _path) that could be filled with dt.path during the call of DatasetView._from_node()?. This would lead to

@map_over_subtree
def resolution_specific_func(ds):
    if 'lowRes' in ds._path:
        ds = ds.z*2
    if 'highRes' in ds._path:
        ds = ds.z*4
    return ds

and would allow for tree-aware manipulation of the datasets.

What do you think? Happy to open a PR if this makes sense.

The text was updated successfully, but these errors were encountered:

dcherian · 2023-08-04T17:20:07Z

I like this idea: dt['*/lowRes']
I have an extract_leaf utility function that I like so far: dt.dc.extract_leaf('lowRes') This returns a new tree with two nodes'control' and 'plus4K'.
I've also wondered if a reorganize_nodes would be useful. This would take input in symbolic form so dt.reorganize_nodes('a/b -> b/a')would reorganize so

lowRes
- Control
- plus4K
highRes
- Control
- plus4K

observingClouds · 2023-08-04T21:26:58Z

Thanks for your comments! I also thought about the restructuring of the datatree similar to your point 3 and think it would be a nice addition. For this particular case I like 1 a bit more because it also allows for data trees with different depths.

TomNicholas · 2023-10-24T14:13:03Z

I have been thinking about adding similar functions for a while, thanks to both of you for laying out your ideas clearly! In fact your two-level-combinatoric tree structure @observingClouds is the same thing @jbusecke and I discussed in #186.

I like this idea: dt['*/lowRes']

I agree this could be useful, but it's a shortcut for the more obvious dt.search(regex), so I think I should add that method first and see how people get on with it.

I have an extract_leaf utility function that I like so far: dt.dc.extract_leaf('lowRes') This returns a new tree with two nodes 'control' and 'plus4K'.

I can also see this being useful, but I'm not totally sure what exact behavior you're after from your description. Does it return {/control/lowRes/, /plus4k/lowRes/}? Or just a shallower tree like {/control/, /plus4k/}? The idea is conceptually similar to using dt.leaves.search('lowRes'), but the types wouldn't currently work out if you tried that.

take input in symbolic form so dt.reorganize_nodes('a/b -> b/a')

@dcherian I think this suggestion of taking input in symbolic form might solve the ambuigities @jbusecke and I discussed in #186. The main questions I would have is what to do with nodes that don't have exactly 2 levels, i.e. a/ or a/b/c?

@observingClouds your DatasetView._path idea is a really interesting suggestion - I've opened a new issue for it in #266.

TomNicholas · 2023-10-24T15:53:52Z

I agree this could be useful, but it's a shortcut for the more obvious dt.search(regex), so I think I should add that method first and see how people get on with it.

Actually the simplest "searching" utility would be use glob, not regex. Added in #267.

dcherian · 2023-10-24T16:26:14Z

I would have is what to do with nodes that don't have exactly 2 levels, i.e. a/ or a/b/c?

Raise and error! there's an implied minimum depth necessary to do this?

TomNicholas · 2023-10-24T16:40:56Z

Raise and error! there's an implied minimum depth necessary to do this?

Right, but what if some nodes have this depth and others don't? I'm thinking of something heterogenous like:

demographics/
weather/highres/model1
weather/highres/model2
weather/lowres/model1
weather/lowres/model1

Actually I guess in this particular case the user would just need to apply the operation to the weather/* subtree... Is there a scenario in which some leaves are deeper than others? 🤔

Thinking aloud more, @jbusecke 's CMIP6 case would look like:

model1/historical
model1/scenario1
model1/scenario2
model2/historical
model2/scenario1
model2/scenario2

which after dt.reorganize_nodes('a/b -> b/a') becomes

historical/model1
historical/model2
scenario1/model1
scenario1/model2
scenario2/model1
scenario2/model2

which seems fine

dcherian · 2023-10-24T16:42:22Z

Is there a scenario in which some leaves are deeper than others?

deeper than the symbolic depth is fine AFAICT.
shallower than the symbolic depth should raise an error.

TomNicholas · 2023-10-24T17:28:15Z

deeper than the symbolic depth is fine AFAICT.

Is it? If I have /1/2/3/, and I ask for 'a/b -> b/a', isn't it ambiguous whether the result should be /1/3/2/ or /2/1/3/?

You could imagine maybe supporting something like '*/a/b -> */b/a', but it's getting more complex.

EDIT: Or maybe you call the method .reorganize_leaves and have it always give /1/3/2

EDIT2: Or maybe use a flag: dt.reorganize('a/b -> b/a', from_root=True)

dcherian · 2023-10-24T17:51:04Z

I have /1/2/3/, and I ask for 'a/b -> b/a', isn't it ambiguous whether the result should be /1/3/2/ or /2/1/3/?

Personally i'd just expect /2/1/3

TomNicholas · 2023-10-24T17:59:06Z

Personally i'd just expect /2/1/3

Well I personally would have expected the opposite 🤣

Or maybe use a flag: dt.reorganize('a/b -> b/a', from_root=True)

I think this could be a good solution:

It will do something consistent by default, and if you don't like that default you can change it.
It's the minimum extra complexity to remove the root-first/leaves-first ambiguity.
It avoids using asterisks at all, which is good because they suggest string matching, which we are not doing here.
Avoiding asterisks prevents you from even expressing ambiguous requests, such as '*/a/b/* -> */b/a/*'
The fail cases are obvious: it will work for anything with a long enough path.

TomNicholas · 2023-10-24T20:18:38Z

Another idea: use preceding slashes to distinguish the two cases, i.e.
'/a/b -> /b/a' would result in 2/1/3, but
'a/b -> b/a' would result in 1/3/2.

This has all the advatages of the the previous suggestion, but is more filesystem-like, doesn't require an extra kwarg, and allows moving nodes from root to leaves (e.g. '/a/b -> b/a').

dcherian · 2023-10-24T20:23:47Z

I would suggest '...a/b -> ...b/a' Otherwise it's really hard to see at first glance.

dcherian · 2023-10-25T16:53:58Z

I can also see this being useful, but I'm not totally sure what exact behavior you're after from your description.

dt.dc.extract_leaf('lowRes') returns a tree with two nodes Control and plus4K. This new tree is shallower, which then lets me pass it to another function that can't handle a deeper tree.

Here's my list of helper utilities: https://github.com/dcherian/dcpy/blob/master/dcpy/datatree.py

I found them all quite useful, but obviously there's little error checking :) and it does assume trees with a nicely balanced structure,.

TomNicholas · 2023-10-25T17:15:16Z

Here's my list of helper utilities

Thanks @dcherian !

dt.dc.extract_leaf('lowRes') returns a tree with two nodes Control and plus4K. This new tree is shallower, which then lets me pass it to another function that can't handle a deeper tree.

I think I get it now - that does seem potentially useful!

I found them all quite useful, but obviously there's little error checking

What's the input type for order is to reorder_nodes? Is it a list of integer indices?

it does assume trees with a nicely balanced structure,.

I suspect you might be talking about a concept I've been referring to as a "hollow tree". I'll raise a separate issue for that.

dcherian · 2023-10-25T17:58:07Z

What's the input type for order is to reorder_nodes? Is it a list of integer indices?

it expects a list of node names : e.g. tree.reorder(["TAO", ...]) reorders so that "TAO" is the first node.
https://github.com/dcherian/dcpy/blob/7fb24b50410422ab9fc3257721c2727a075a34cb/dcpy/datatree.py#L24-L27

keewis · 2024-08-13T16:23:33Z

closed in favor of pydata/xarray#9346

TomNicholas mentioned this issue Oct 24, 2023

Make DatasetView in map_over_subtree aware of its path? #266

Closed

TomNicholas mentioned this issue Oct 24, 2023

Method to match node paths via glob #267

Merged

5 tasks

TomNicholas added enhancement New feature or request design question labels Oct 24, 2023

TomNicholas mentioned this issue Oct 25, 2023

Method to reorder nodes #271

Closed

5 tasks

TomNicholas mentioned this issue Dec 22, 2023

Track merging datatree into xarray pydata/xarray#8572

Closed

27 tasks

TomNicholas mentioned this issue Feb 14, 2024

Datatree design discussions - weekly meeting pydata/xarray#8747

Closed

11 tasks

keewis mentioned this issue Aug 13, 2024

datatree: Tree-aware dataset handling/selection pydata/xarray#9346

Open

keewis closed this as completed Aug 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tree-aware dataset handling/selection #254

Tree-aware dataset handling/selection #254

observingClouds commented Aug 3, 2023 •

edited

Loading

dcherian commented Aug 4, 2023 •

edited

Loading

observingClouds commented Aug 4, 2023

TomNicholas commented Oct 24, 2023

TomNicholas commented Oct 24, 2023

dcherian commented Oct 24, 2023

TomNicholas commented Oct 24, 2023

dcherian commented Oct 24, 2023

TomNicholas commented Oct 24, 2023 •

edited

Loading

dcherian commented Oct 24, 2023

TomNicholas commented Oct 24, 2023 •

edited

Loading

TomNicholas commented Oct 24, 2023

dcherian commented Oct 24, 2023

dcherian commented Oct 25, 2023 •

edited

Loading

TomNicholas commented Oct 25, 2023

dcherian commented Oct 25, 2023

keewis commented Aug 13, 2024

Tree-aware dataset handling/selection #254

Tree-aware dataset handling/selection #254

Comments

observingClouds commented Aug 3, 2023 • edited Loading

dcherian commented Aug 4, 2023 • edited Loading

observingClouds commented Aug 4, 2023

TomNicholas commented Oct 24, 2023

TomNicholas commented Oct 24, 2023

dcherian commented Oct 24, 2023

TomNicholas commented Oct 24, 2023

dcherian commented Oct 24, 2023

TomNicholas commented Oct 24, 2023 • edited Loading

dcherian commented Oct 24, 2023

TomNicholas commented Oct 24, 2023 • edited Loading

TomNicholas commented Oct 24, 2023

dcherian commented Oct 24, 2023

dcherian commented Oct 25, 2023 • edited Loading

TomNicholas commented Oct 25, 2023

dcherian commented Oct 25, 2023

keewis commented Aug 13, 2024

observingClouds commented Aug 3, 2023 •

edited

Loading

dcherian commented Aug 4, 2023 •

edited

Loading

TomNicholas commented Oct 24, 2023 •

edited

Loading

TomNicholas commented Oct 24, 2023 •

edited

Loading

dcherian commented Oct 25, 2023 •

edited

Loading