-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Import datatree in xarray? #7418
Conversation
xarray/__init__.py
Outdated
try: | ||
from datatree import DataTree, open_datatree, register_datatree_accessor | ||
except ImportError: | ||
... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is almost certainly a better way to make from xarray import DataTree
work than this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine as a temporary solution.
Add a py.typed file in datatree to fix the mypy error: https://github.com/pydata/xarray/blob/main/xarray/py.typed |
There is also at least one bug in datatree that cannot be fixed without a (small) change to xarray, and having datatree as an optional import means I could fix it here. |
My opinion is that Datatree should move into Xarray now, ideally in a way that does not disrupt any existing user code, and that Datatree should become a first-class Xarray object (together with DataArray, and Dataset). Since it's a new feature, we don't necessarily have to be super conservative here. I think it is more than good enough / stable enough in its current state.
Since Datatree sits above DataArray and Dataset, it should not interfere with any of our existing API. As long as test coverage is good, documentation is solid, and the code style matches the rest of Xarray, I think we can just bring it in.
I think that it is inevitable that you Tom will be the main owner of the Datatree code at the beginning (as @shoyer was of all of Xarray when he first released it). Over time, if people use it, some fraction of users will become maintainers, starting with the existing dev team.
Why? Are its dependencies different from Xarray? |
No, datatree has no additional dependencies. I was just asking because if we went for the "import from second repository" plan we may want to test that the import works as part of our CI. Not a major issue though. |
I personally favor just copying the code into Xarray and archiving the old repo. |
I also lean in this direction. At this point, I see little downside to making this change at this point. My suggestion to import xarray-datatree into xarray was meant low-lift compromise. |
I don't have strong opinions for or against including datatree in Xarray. It indeed makes sense if it is using many Xarray internals and if there are many existing or potential applications for it. Additional load (CI) is fine if datatree doesn't bring any extra dependency and won't do so in the near future (which seems to be the case).
Would it mean that if someone wants to later add any feature "x" or "y" into Xarray, they just need implementing the feature for Dataset (and possibly DataArray) and it will be guaranteed to work with Datatree? (I guess so but I'm not familiar enough with Datatree to know it for sure). Otherwise, if there is any extra implementation effort required to make feature "x" or "y" work with Datatree, then I'm concerned about the additional burden or obstacle for future contributors and maintainers. Or we could say that this is OK to leave datatree support and wait for someone to take care of it later, but I don't think it is ideal to have such non-synchronized state within Xarray itself. |
Again, there is likely more good reasons merging the Datatree code with Xarray than not doing it, but IMHO such decision should be made very carefully. You certainly do know better than me what positive vs. negative impacts it would have here! I'm just speaking generally from my experience of having struggled while doing some heavy refactoring in Xarray recently :) |
Basically yes, it would immediately work with Datatree. Datatree currently implements most dataset methods by literally copying them and their docstrings, and they work by mapping the method over every node in the tree. We could integrate Datatree in such a way that the additional developer effort to get a method on dataset working on Datatree would be negligible (think adding a single element with the method name to an internal list, or copy-pasting a docstring).
This is an argument for waiting before integrating.
I appreciate the input @benbovy! I think the main difference between this effort and your (heroic) indexes effort is that Datatree doesn't touch any existing API. I guess my main concern is that integrating prematurely into Xarray might give a false sense of stability - I don't want to later realize I should redesign Datatree, and have people be annoyed because they thought it was as stable as the rest of xarray. |
Integrating upstream into xarray might also help with people trying to open their nested data formats as a datatree objects, because then we can immediately begin integrating with xarray's backend engines. See for example this datatree issue asking about opening grib files as a datatree. It would be nice to be able to do
|
I agree, datatree is an important data structure for Xarray. My preferred way to do this would be follow @rabernat's suggestion and to fork the code the existing repo into the Xarray main codebase. My main concern is that we should carefully evaluate the datatree API to make sure we won't want to change it soon. Once we bring it into Xarray, there will be a higher expectation that the interface will remain stable. |
I agree with this. We could use the PR process for this review. |
for more information, see https://pre-commit.ci
… into import_datatree
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
… into import_datatree
I think this PR is ready now, it just fails mypy (@Illviljan I added a Once this is merged we can push on with implementing |
Hmm, I don't understand. Adding py.typed should be all that's needed, did that in flox and it worked great there: xarray-contrib/flox#92 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion is to follow Tom's plan to simply import datatree for 1-2 release cycles.
sounds good to me!
xarray/__init__.py
Outdated
@@ -52,6 +52,12 @@ | |||
# Disable minimum version checks on downstream libraries. | |||
__version__ = "999" | |||
|
|||
try: | |||
from datatree import DataTree # noqa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from datatree import DataTree # noqa | |
from datatree import DataTree, register_datatree_accessors # noqa: F401 |
I think this is why the docs build is failing. Also, not sure if the error code still works with ruff
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, not sure if the error code still works with ruff?
I actually found that if I don't add that error code then ruff replaces
try:
from datatree import DataTree
with
try:
pass
which obviously caused an ImportError. I thought that was surprising behavior for a linter too...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed this to
from datatree import DataTree, register_datatree_accessor, open_datatree # noqa
now though
import xarray.testing as xrt | ||
from xarray import Dataset | ||
from xarray.tests import requires_datatree | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since the whole module depends on datatree
, I'd call pytest.importorskip("datatree")
somewhere at the top of the module:
pytest.importorskip("datatree") | |
then we don't need to decorate every test with requires_datatree
If we want to reuse requires_datatree
, we can use:
pytestmark = [requires_datatree] | |
.. warning:: The DataTree structure is considered experimental, | ||
and the API is less solidified than for other xarray features. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if I just don't know enough about rst, but I wonder if it would be better to move the whole text into the block?
.. warning:: The DataTree structure is considered experimental, | |
and the API is less solidified than for other xarray features. | |
.. warning:: | |
The DataTree structure is considered experimental, and the API | |
is less solidified than for other xarray features. |
WARNING: The DataTree structure is considered experimental, | ||
and the API is less solidified than for other xarray features. | ||
|
||
The returned tree will only consist of a single node. | ||
That node will contain a copy of the dataarray's data, | ||
meaning including its coordinates, dimensions and attributes. | ||
|
||
Requires the xarray-datatree package to be installed. | ||
Find it at https://github.com/xarray-contrib/datatree. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this also be moved into a warning block?
.. _netcdf.group.warning: | ||
|
||
.. warning:: | ||
``DataTree`` objects do not follow the exact same data model as netCDF files, which means that perfect round-tripping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that intentionally preformatted, or would it make sense to convert it to a link? (that's really minor, though)
- pip: | ||
- git+https://github.com/xarray-contrib/datatree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason why we're installing from github here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I want to see if this commit to datatree fixes the mypy issue without releasing a whole new version of datatree just to check.
ci/install-upstream-wheels.sh
Outdated
@@ -20,6 +20,7 @@ conda uninstall -y --force \ | |||
bottleneck \ | |||
sparse \ | |||
flox \ | |||
datatree \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's probably fine as-is (and I'm always confused about the name), but should this be xarray-datatree
, given that that's the package we are installing below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably!
So I let this get stale, but today I met with @owenlittlejohns and @asteiker who are both very interested in helping get datatree merged into xarray upstream, which is great! I would like to:
@owenlittlejohns especially seemed interested in actually understanding the guts of datatree. It would be helpful if at least one other @pydata/xarray dev was interested and had time to review such PRs, so that I'm not just approving merging my own code. @jhamman ? @shoyer ? @keewis ? You guys have been particularly helpful with datatree so far. |
@flamingbear, @eni-awowale and @lsterzinger - I just wanted to tag you here as a starting point for discussing this effort a bit more, as you all were interested in helping with this work. |
Thanks for coming to the meeting today @owenlittlejohns , @eni-awowale and @flamingbear! We decided that opening hierarchical files into xarray is the first priority, so ideally integration of datatree into xarray should start with the backends. There is a prototype of this in #7437, but that's gone stale (cc @jthielen ?) so I think we should just use it as reference and forge ahead! That would give us a temporary situation in which |
Closing this in favour of tracking a proper integration in #8572. |
I want datatree to live in xarray main, as right now it's in a separate package but imports many xarray internals.
This presents a few questions:
Today @jhamman and @keewis suggested for now I make it so that you can
from xarray import DataTree
, using the current xarray-datatree package as an optional dependency. That way I can create a smoother on-ramp, get some more users testing it, but without committing all the code into this repo yet.@pydata/xarray what do you think? Any other thoughts about best practices when moving a good few thousand lines of code into xarray?
whats-new.rst
api.rst