Allow fsspec/zarr/mfdataset #4461
Conversation
xarray/backends/api.py
Outdated
```diff
@@ -324,6 +324,8 @@ def open_dataset(
     backend_kwargs=None,
     use_cftime=None,
     decode_timedelta=None,
+    storage_options=None,
```
Shouldn't `storage_options` and `fs` be `backend_kwargs`?
Happy to move it - this is a POC, so I just needed to put it somewhere.
However, it isn't exactly part of the backend kwargs; it just so happens that only the zarr backend uses it right now.
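For context, the two spellings under discussion would look roughly like this (a sketch only; where the parameter ends up is exactly what this thread is deciding, and the URL is illustrative):

```python
import xarray as xr

url = "s3://some-bucket/store.zarr"  # hypothetical store

# As a top-level keyword, the way this POC diff spells it:
ds = xr.open_dataset(url, engine="zarr", storage_options={"anon": True})

# Tucked into backend_kwargs instead, as suggested above:
ds = xr.open_dataset(
    url, engine="zarr", backend_kwargs={"storage_options": {"anon": True}}
)
```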
Question: to eventually get tests to pass, this will need changes that are only just now going into zarr. Those may be released some time soon, but in the meantime is it reasonable to install from master?
you might want to take a look at the …
We'll have to maintain backward compatibility with older zarr versions for a bit, so you'll have to skip the tests appropriately using a version check. EDIT: I didn't realize there are no tests in this PR yet. We definitely want the current CI tests passing with older zarr versions.
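A hedged sketch of such a version guard (the threshold is a placeholder for whatever zarr release ships the needed changes):

```python
import pytest
import zarr
from packaging.version import Version

# Skip fsspec-dependent tests when the installed zarr predates the new
# store handling; "2.6.0" is an assumed minimum, not a confirmed one.
requires_new_zarr = pytest.mark.skipif(
    Version(zarr.__version__) < Version("2.6.0"),
    reason="requires a zarr release with the new fsspec support",
)
```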
This can probably be cleaned up...
Signed-off-by: Martin Durant <[email protected]>
```python
overwrite_encoded_chunks = backend_kwargs.pop(
    "overwrite_encoded_chunks", None
)
extra_kwargs["mode"] = "r"
extra_kwargs["group"] = group
if fs is not None:
    filename_or_obj = fs.get_mapper(filename_or_obj)
```
Note that we are working on a refactor of the backend API that, among other things, aims at removing all knowledge of what backends can or can't do from `open_dataset`. Adding logic inside `if engine == "zarr"` will probably result in merge conflicts.
I would suggest moving the call to `fs.get_mapper(filename_or_obj)` inside the zarr backend.
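A rough sketch of that suggestion (hedged: the function name, signature, and `storage_options` plumbing here are illustrative, not the refactored API):

```python
import fsspec


def open_backend_dataset_zarr(filename_or_obj, storage_options=None, **kwargs):
    # Resolve URL strings to a key/value mapping inside the zarr backend,
    # so that open_dataset() itself needs no zarr-specific branching.
    if isinstance(filename_or_obj, str):
        filename_or_obj = fsspec.get_mapper(
            filename_or_obj, **(storage_options or {})
        )
    ...
```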
Thanks for the heads up. I already did one slightly complex conflict resolution.
It isn't totally clear, though, that the logic can be buried in the zarr engine, for two reasons:
- when using open_mf, the globbing of remote files/directories happens early, before establishing individual zarr instances (sketched below)
- actually the file instances that fsspec makes from URLs can be used by some other backends; that just happens not to be the emphasis here

Happy to go whichever way is most convenient for the library.
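To illustrate the first point, a sketch of the early glob expansion (the URL and storage options are made up, and the real `open_mfdataset` wiring may differ):

```python
from fsspec.core import get_fs_token_paths

# The glob over remote stores must be expanded up front, before any
# per-store zarr object exists -- which is why this step is hard to
# bury inside the zarr engine alone.
fs, _, paths = get_fs_token_paths(
    "s3://some-bucket/collection/*.zarr", storage_options={"anon": True}
)
mappers = [fs.get_mapper(path) for path in paths]
```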
We need to resolve this discussion in order to decide what to do about this PR. Any more thoughts from other devs?
In my view, some of the fsspec logic introduced in the PR should eventually move to the generic `open_mfdataset` function, as it is not really specific to Zarr. However, I don't see a strong downside to adding it to `open_zarr` right now. Eventually `open_zarr` will be deprecated. But the pattern used here could be copied and incorporated into the backend refactor.
```python
if fs is not None:
    filename_or_obj = fs.get_mapper(filename_or_obj)
```
Rather than adding the `fs` keyword argument, why not just encourage passing in an appropriate mapping for `filename_or_obj`?
That works already, and will continue to work. However, the whole point of this PR is to allow working out those details in a single call to `open_dataset`, which turns out to be very convenient for encoding in an Intake catalog, for instance, or indeed for the `open_mfdataset` implementation here.
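A sketch of the convenience being argued for (the store URL and options are illustrative; the single-call spelling follows this PR):

```python
import fsspec
import xarray as xr

# Works today: build the mapping yourself, then hand it over.
mapper = fsspec.get_mapper("s3://some-bucket/store.zarr", anon=True)
ds = xr.open_zarr(mapper, consolidated=True)

# What this PR adds: one self-contained call, which a catalog entry
# (e.g. in Intake) can encode as plain keyword arguments.
ds = xr.open_dataset(
    "s3://some-bucket/store.zarr",
    engine="zarr",
    backend_kwargs={"consolidated": True, "storage_options": {"anon": True}},
)
```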
```diff
@@ -876,6 +876,7 @@ can be omitted as it will internally be set to ``'a'``.

     .. ipython:: python

         ds1 = xr.Dataset(
```
bad merge? This makes the docs build fail with a `SyntaxError`
Hm, interesting. Correcting...
Thanks @martindurant , this looks good! (I'll wait to see if others have any final thoughts before merging)
Co-authored-by: Maximilian Roos <[email protected]>
Hello @martindurant! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-30 14:39:18 UTC
(failures look like something in pandas dev)
yep, that's #4516
One completely unrelated failure (test_polyfit_warnings). Can I please get a final say here (@max-sixty, @alexamici)?
We let this go stale again. I just resolved the conflicts.
We need to decide what to do with this PR. I have a few comments, but in general I favor merging.
Co-authored-by: Ryan Abernathey <[email protected]>
ping @pydata/xarray
ping again
I'm really looking forward to getting this merged so I can open the National Water Model Zarr I created last week thusly:

```python
ds = xr.open_dataset(
    "s3://noaa-nwm-retro-v2.0-zarr-pds",
    engine="zarr",
    backend_kwargs={"consolidated": True, "storage_options": {"anon": True}},
)
```

@martindurant tells me this takes only 3 s with the new async capability! That would be pretty awesome, because right now it takes 1 min 15 s to open this dataset!
We are excited about adding this feature! We love fsspec and think this would be very useful for xarray's users. In the long term, we would love to support fsspec for all the file formats that can handle file objects, e.g., including …

The concern right now is that this adds special-case logic for zarr in `open_dataset()`, which @alexamici and @aurghs are presently (simultaneously!) trying to remove as part of paying down technical debt in the ongoing backends refactor. I see two potential paths forward: …

@alexamici could you please take a look and weigh in here? In particular, it would be helpful if you could point to where this would belong in the new refactor. This is also a good motivation for deleting the "v1" API code as soon as possible in favor of the "v2" code -- nothing is worse than needing to implement a new feature twice!
@rsignell-usgs: note that your example works without this PR (but with the just-released zarr 2.6.1) as follows:

```python
mapper = fsspec.get_mapper("s3://noaa-nwm-retro-v2.0-zarr-pds")
ds = xr.open_zarr(mapper, consolidated=True)
```

Took 4 s on my laptop (outside of AWS).
@rabernat, awesome! I was stunned by the difference -- I guess the async loading of coordinate data is the big win, right?
I think @shoyer has laid out the options in a very clear way. I weakly favor option 2, as I think it is preferable in terms of software architecture and our broader roadmap for Xarray. However, I am cognizant of the significant effort that @martindurant has put into this, and I don't want his effort to go to waste. Some mitigating factors are: …
I'm not sure, it's been a while now...
@rsignell-usgs one other thing that can largely speed up loading of metadata/coordinates is ensuring coordinate variables are stored in one single chunk. For this particular dataset, chunk size for …

One thing we have been having performance issues with is loading coordinates/metadata from zarr archives that have too many chunks (millions), even when metadata is consolidated and coordinates are in one single chunk. There is an open issue in dask about this.
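A minimal sketch of that advice (store path and sizes are made up): keep the coordinate in a single zarr chunk so that opening the store reads it in one request.

```python
import numpy as np
import xarray as xr

n = 1_000_000
ds = xr.Dataset(
    {"flow": (("time",), np.random.rand(n))},
    coords={"time": ("time", np.arange(n))},
)

# Chunk the data variable finely, but force the coordinate into ONE
# zarr chunk so loading it costs a single read.
ds.chunk({"time": 10_000}).to_zarr(
    "example.zarr",
    mode="w",
    encoding={"time": {"chunks": (n,)}},
)
```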
All interested parties, please see the new attempt at #4823
Requires zarr-developers/zarr-python#606
- Closes #xxxx
- isort . && black . && mypy . && flake8
- whats-new.rst
- api.rst