Skip to content
This repository has been archived by the owner on Oct 24, 2024. It is now read-only.

Tree-aware dataset handling/selection #254

Closed
observingClouds opened this issue Aug 3, 2023 · 16 comments
Closed

Tree-aware dataset handling/selection #254

observingClouds opened this issue Aug 3, 2023 · 16 comments
Labels
design question enhancement New feature or request

Comments

@observingClouds
Copy link

observingClouds commented Aug 3, 2023

I'm looking for a good way to apply a function to a subset of nodes that share some common characteristics encoded in the subtree path.

Imagine the following data tree

import xarray as xr
import datatree
from datatree import map_over_subtree

dt = datatree.DataTree.from_dict({
    'control/lowRes' : xr.Dataset({'z':(('x'),[0,1,2])}),
    'control/highRes' : xr.Dataset({'z':(('x'),[0,1,2,3,4,5])}),
    'plus4K/lowRes' : xr.Dataset({'z':(('x'),[0,1,2])}),
    'plus4K/highRes' : xr.Dataset({'z':(('x'),[0,1,2,3,4,5])})
})

To apply a function to all control or all plus4K nodes is straight forward by just selecting the specific subtree, e.g. dt['control']. However, in case all lowRes dataset should be manipulated this becomes more elaborative and I wonder what the best approach would be.

  • dt['control/lowRes','plus4K/lowRes'] is not yet implemented and would also be complex for large data trees
  • dt['*/lowRes'] could be one idea to make the subtree selection more straight forward, where * is a wildcard
  • dt.search(regex) could make this even more general

Currently, I use the @map_over_subtree decorator, which also has some limitations as the function does not know its tree origin (as noted in the code) and it needs to be inferred from the dataset itself, which is sometimes possible (here the length of the dataset) but does not need to be always the case.

@map_over_subtree
def resolution_specific_func(ds):
    if len(ds.x) == 3:
        ds = ds.z*2
    elif len(ds.x) == 6:
        ds = ds.z*4
    return ds

z= resolution_specific_func(dt)

I do not know how the tree information could be passed through the decorator, but maybe it is okay if the DatasetView class has an additional property (e.g. _path) that could be filled with dt.path during the call of DatasetView._from_node()?. This would lead to

@map_over_subtree
def resolution_specific_func(ds):
    if 'lowRes' in ds._path:
        ds = ds.z*2
    if 'highRes' in ds._path:
        ds = ds.z*4
    return ds

and would allow for tree-aware manipulation of the datasets.

What do you think? Happy to open a PR if this makes sense.

@dcherian
Copy link

dcherian commented Aug 4, 2023

  1. I like this idea: dt['*/lowRes']

  2. I have an extract_leaf utility function that I like so far: dt.dc.extract_leaf('lowRes') This returns a new tree with two nodes'control' and 'plus4K'.

  3. I've also wondered if a reorganize_nodes would be useful. This would take input in symbolic form so dt.reorganize_nodes('a/b -> b/a')would reorganize so

  • lowRes
    • Control
    • plus4K
  • highRes
    • Control
    • plus4K

@observingClouds
Copy link
Author

Thanks for your comments! I also thought about the restructuring of the datatree similar to your point 3 and think it would be a nice addition. For this particular case I like 1 a bit more because it also allows for data trees with different depths.

@TomNicholas
Copy link
Member

I have been thinking about adding similar functions for a while, thanks to both of you for laying out your ideas clearly! In fact your two-level-combinatoric tree structure @observingClouds is the same thing @jbusecke and I discussed in #186.

I like this idea: dt['*/lowRes']

I agree this could be useful, but it's a shortcut for the more obvious dt.search(regex), so I think I should add that method first and see how people get on with it.

I have an extract_leaf utility function that I like so far: dt.dc.extract_leaf('lowRes') This returns a new tree with two nodes 'control' and 'plus4K'.

I can also see this being useful, but I'm not totally sure what exact behavior you're after from your description. Does it return {/control/lowRes/, /plus4k/lowRes/}? Or just a shallower tree like {/control/, /plus4k/}? The idea is conceptually similar to using dt.leaves.search('lowRes'), but the types wouldn't currently work out if you tried that.

take input in symbolic form so dt.reorganize_nodes('a/b -> b/a')

@dcherian I think this suggestion of taking input in symbolic form might solve the ambuigities @jbusecke and I discussed in #186. The main questions I would have is what to do with nodes that don't have exactly 2 levels, i.e. a/ or a/b/c?

@observingClouds your DatasetView._path idea is a really interesting suggestion - I've opened a new issue for it in #266.

@TomNicholas
Copy link
Member

I agree this could be useful, but it's a shortcut for the more obvious dt.search(regex), so I think I should add that method first and see how people get on with it.

Actually the simplest "searching" utility would be use glob, not regex. Added in #267.

@dcherian
Copy link

I would have is what to do with nodes that don't have exactly 2 levels, i.e. a/ or a/b/c?

Raise and error! there's an implied minimum depth necessary to do this?

@TomNicholas
Copy link
Member

Raise and error! there's an implied minimum depth necessary to do this?

Right, but what if some nodes have this depth and others don't? I'm thinking of something heterogenous like:

demographics/
weather/highres/model1
weather/highres/model2
weather/lowres/model1
weather/lowres/model1

Actually I guess in this particular case the user would just need to apply the operation to the weather/* subtree... Is there a scenario in which some leaves are deeper than others? 🤔

Thinking aloud more, @jbusecke 's CMIP6 case would look like:

model1/historical
model1/scenario1
model1/scenario2
model2/historical
model2/scenario1
model2/scenario2

which after dt.reorganize_nodes('a/b -> b/a') becomes

historical/model1
historical/model2
scenario1/model1
scenario1/model2
scenario2/model1
scenario2/model2

which seems fine

@dcherian
Copy link

Is there a scenario in which some leaves are deeper than others?

deeper than the symbolic depth is fine AFAICT.
shallower than the symbolic depth should raise an error.

@TomNicholas
Copy link
Member

TomNicholas commented Oct 24, 2023

deeper than the symbolic depth is fine AFAICT.

Is it? If I have /1/2/3/, and I ask for 'a/b -> b/a', isn't it ambiguous whether the result should be /1/3/2/ or /2/1/3/?

You could imagine maybe supporting something like '*/a/b -> */b/a', but it's getting more complex.

EDIT: Or maybe you call the method .reorganize_leaves and have it always give /1/3/2

EDIT2: Or maybe use a flag: dt.reorganize('a/b -> b/a', from_root=True)

@dcherian
Copy link

I have /1/2/3/, and I ask for 'a/b -> b/a', isn't it ambiguous whether the result should be /1/3/2/ or /2/1/3/?

Personally i'd just expect /2/1/3

@TomNicholas
Copy link
Member

TomNicholas commented Oct 24, 2023

Personally i'd just expect /2/1/3

Well I personally would have expected the opposite 🤣

Or maybe use a flag: dt.reorganize('a/b -> b/a', from_root=True)

I think this could be a good solution:

  1. It will do something consistent by default, and if you don't like that default you can change it.
  2. It's the minimum extra complexity to remove the root-first/leaves-first ambiguity.
  3. It avoids using asterisks at all, which is good because they suggest string matching, which we are not doing here.
  4. Avoiding asterisks prevents you from even expressing ambiguous requests, such as '*/a/b/* -> */b/a/*'
  5. The fail cases are obvious: it will work for anything with a long enough path.

@TomNicholas TomNicholas added enhancement New feature or request design question labels Oct 24, 2023
@TomNicholas
Copy link
Member

Another idea: use preceding slashes to distinguish the two cases, i.e.
'/a/b -> /b/a' would result in 2/1/3, but
'a/b -> b/a' would result in 1/3/2.

This has all the advatages of the the previous suggestion, but is more filesystem-like, doesn't require an extra kwarg, and allows moving nodes from root to leaves (e.g. '/a/b -> b/a').

@dcherian
Copy link

I would suggest '...a/b -> ...b/a' Otherwise it's really hard to see at first glance.

@dcherian
Copy link

dcherian commented Oct 25, 2023

I can also see this being useful, but I'm not totally sure what exact behavior you're after from your description.

dt.dc.extract_leaf('lowRes') returns a tree with two nodes Control and plus4K. This new tree is shallower, which then lets me pass it to another function that can't handle a deeper tree.

Here's my list of helper utilities: https://github.com/dcherian/dcpy/blob/master/dcpy/datatree.py

I found them all quite useful, but obviously there's little error checking :) and it does assume trees with a nicely balanced structure,.

@TomNicholas
Copy link
Member

Here's my list of helper utilities

Thanks @dcherian !

dt.dc.extract_leaf('lowRes') returns a tree with two nodes Control and plus4K. This new tree is shallower, which then lets me pass it to another function that can't handle a deeper tree.

I think I get it now - that does seem potentially useful!

I found them all quite useful, but obviously there's little error checking

What's the input type for order is to reorder_nodes? Is it a list of integer indices?

it does assume trees with a nicely balanced structure,.

I suspect you might be talking about a concept I've been referring to as a "hollow tree". I'll raise a separate issue for that.

@dcherian
Copy link

What's the input type for order is to reorder_nodes? Is it a list of integer indices?

it expects a list of node names : e.g. tree.reorder(["TAO", ...]) reorders so that "TAO" is the first node.
https://github.com/dcherian/dcpy/blob/7fb24b50410422ab9fc3257721c2727a075a34cb/dcpy/datatree.py#L24-L27

@keewis
Copy link
Contributor

keewis commented Aug 13, 2024

closed in favor of pydata/xarray#9346

@keewis keewis closed this as completed Aug 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
design question enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants