Allow fsspec/zarr/mfdataset #4461

Closed · wants to merge 25 commits

Conversation

martindurant
Contributor

@martindurant martindurant commented Sep 25, 2020

Requires zarr-developers/zarr-python#606

  • Closes #xxxx
  • Tests added
  • Passes isort . && black . && mypy . && flake8
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@@ -324,6 +324,8 @@ def open_dataset(
backend_kwargs=None,
use_cftime=None,
decode_timedelta=None,
storage_options=None,
Collaborator


Shouldn't storage_options and fs be part of backend_kwargs?

Contributor Author


Happy to move it - this is a POC, so I just needed to put it somewhere.
However, it's not exactly part of the backend kwargs; it just so happens that it's only used by the zarr backend right now.

@martindurant
Contributor Author

Question: to eventually get the tests to pass, this will need changes that are only just now going into zarr. Those may be released soon, but in the meantime, is it reasonable to install from master?

@keewis
Collaborator

keewis commented Sep 25, 2020

is it reasonable to install from master

you might want to take a look at the upstream-dev CI which installs zarr from github (and is currently passing)

@dcherian
Contributor

dcherian commented Sep 25, 2020

We'll have to maintain backward compatibility with older zarr versions for a bit, so you'll have to skip the tests appropriately using a version check

EDIT: I didn't realize there are no tests in this PR yet. We definitely want current CI tests passing with older zarr versions.
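For illustration, a minimal sketch of the kind of version-gated skip being asked for here; the marker name and the minimum zarr version are assumptions, since the thread does not pin a specific release:

import pytest
import zarr
from distutils.version import LooseVersion

# Hypothetical gate: skip tests that need the newer zarr store behaviour when an
# older zarr is installed ("2.6.0" is a placeholder cutoff, not taken from this PR).
requires_new_zarr = pytest.mark.skipif(
    LooseVersion(zarr.__version__) < LooseVersion("2.6.0"),
    reason="requires a zarr release with the updated fsspec/store support",
)

@requires_new_zarr
def test_open_dataset_with_storage_options():
    ...  # exercise the new code path here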

Martin Durant added 2 commits September 25, 2020 21:04
This can probably be cleaned up...
@martindurant martindurant marked this pull request as ready for review September 29, 2020 19:43
overwrite_encoded_chunks = backend_kwargs.pop(
"overwrite_encoded_chunks", None
)
extra_kwargs["mode"] = "r"
extra_kwargs["group"] = group
if fs is not None:
filename_or_obj = fs.get_mapper(filename_or_obj)
Collaborator

@alexamici alexamici Oct 1, 2020


Note that we are working on a refactor of the backend API that, among other things, aims at removing all knowledge of what backends can or can't do from open_dataset. Adding logic inside if engine == "zarr" will probably result in merge conflicts.

I would suggest moving the call to fs.get_mapper(filename_or_obj) inside the zarr backend.

Contributor Author


Thanks for the heads up. I already did one slightly complex conflict resolution.

It isn't totally clear, though, that the logic can be buried in the zarr engine, for two reasons:

  • when using open_mf, the globbing of remote files/directories happens early, before establishing individual zarr instances (see the sketch after this comment)
  • actually, the file instances that fsspec makes from URLs can be used by some other backends; that just happens not to be the emphasis here

Happy to go whichever way is most convenient for the library.
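To make the globbing point above concrete, here is a rough sketch of the flow being described, assuming a public S3 bucket holding several zarr stores (the bucket and paths are hypothetical):

import fsspec

# Filesystem creation is where storage_options (e.g. anon=True) come in.
fs = fsspec.filesystem("s3", anon=True)

# The glob over remote directories happens up front, before any zarr object exists...
paths = fs.glob("some-bucket/collection/*.zarr")

# ...and each path is only then turned into a mapper that a backend can consume.
mappers = [fs.get_mapper(p) for p in paths]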


Comment on lines +569 to +570
if fs is not None:
filename_or_obj = fs.get_mapper(filename_or_obj)
Member


Rather than adding the fs keyword argument, why not just encourage passing in an appropriate mapping for filename_or_obj?

Contributor Author


That works already, and will continue to work. However, the whole point of this PR is to allow working out those details in a single call to open_dataset, which turns out to be very convenient for encoding in an Intake catalog, for instance, or indeed for the open_mfdataset implementation here.
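For context, a hedged sketch of the two calling patterns being contrasted here; the bucket URL and options are illustrative, and the exact placement of storage_options (top-level versus inside backend_kwargs) was still under discussion above:

import fsspec
import xarray as xr

# Works without this PR: build the mapper yourself, then hand it to xarray.
mapper = fsspec.get_mapper("s3://some-bucket/store.zarr", anon=True)
ds = xr.open_zarr(mapper, consolidated=True)

# What this PR aims for: everything expressed in a single open_dataset call,
# which is easy to encode in, e.g., an Intake catalog entry.
ds = xr.open_dataset(
    "s3://some-bucket/store.zarr",
    engine="zarr",
    backend_kwargs={"consolidated": True, "storage_options": {"anon": True}},
)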

@@ -876,6 +876,7 @@ can be omitted as it will internally be set to ``'a'``.

.. ipython:: python

ds1 = xr.Dataset(
Collaborator


bad merge? This makes the docs build fail with a SyntaxError

Contributor Author


Hm, interesting. Correcting...

Collaborator

@max-sixty max-sixty left a comment


Thanks @martindurant , this looks good! (I'll wait to see if others have any final thoughts before merging)

@pep8speaks

pep8speaks commented Oct 19, 2020

Hello @martindurant! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-30 14:39:18 UTC

@martindurant
Contributor Author

(failures look like something in pandas dev)

@keewis
Collaborator

keewis commented Oct 19, 2020

(failures look like something in pandas dev)

yep, that's #4516

@martindurant
Contributor Author

One completely unrelated failure (test_polyfit_warnings). Can I please get a final say here (@max-sixty, @alexamici)?

@rabernat
Contributor

We let this go stale again. I just resolved the conflicts.

@rabernat rabernat mentioned this pull request Nov 30, 2020
Contributor

@rabernat rabernat left a comment


We need to decide what to do with this PR. I have a few comments, but in general I favor merging.

overwrite_encoded_chunks = backend_kwargs.pop(
"overwrite_encoded_chunks", None
)
extra_kwargs["mode"] = "r"
extra_kwargs["group"] = group
if fs is not None:
filename_or_obj = fs.get_mapper(filename_or_obj)
Contributor


We need to resolve this discussion in order to decide what to do about this PR. Any more thoughts from other devs?

In my view, some of the fsspec logic introduced in the PR should eventually move to the generic open_mfdataset function, as it is not really specific to Zarr. However, I don't see a strong downside to adding it to open_zarr right now. Eventually open_zarr will be deprecated. But the pattern used here could be copied and incorporated into the backend refactor.

Co-authored-by: Ryan Abernathey <[email protected]>
@dcherian
Contributor

dcherian commented Dec 3, 2020

We need to resolve this discussion in order to decide what to do about this PR. Any more thoughts from other devs?

ping @pydata/xarray

@martindurant
Contributor Author

ping again

@rsignell-usgs

rsignell-usgs commented Dec 9, 2020

I'm really looking forward to getting this merged so I can open the National Water Model Zarr I created last week thusly:

ds = xr.open_dataset('s3://noaa-nwm-retro-v2.0-zarr-pds', engine='zarr',
        backend_kwargs={'consolidated': True, "storage_options": {'anon': True}})

@martindurant tells me this takes only 3 s with the new async capability!

That would be pretty awesome, because now it takes 1min 15s to open this dataset!

@shoyer
Member

shoyer commented Dec 9, 2020

We are excited about adding this feature! We love fsspec and think this would be very useful for xarray's users. In the long term, we would love to support fsspec for all the file formats that can handle file objects, e.g., including engine='h5netcdf' and engine='scipy'.

The concern right now is that this adds special case logic for zarr in open_dataset(), which @alexamici and @aurghs are presently (simultaneously!) trying to remove as part of paying down technical debt in the ongoing backends refactor.

I see two potential paths forwards:

  1. Merge this as is. It has good test coverage, and porting it later should (hopefully!) be relatively straightforward.
  2. Insert this into the new backend API code instead, and require the v2 backend API for this feature.

@alexamici could you please take a look and weigh in here? In particular, it would be helpful if you could point to where this would belong in the new refactor. This is also a good motivation for deleting the "v1" API code as soon as possible in favor of the "v2" code -- nothing is worse than needing to implement a new feature twice!

@rabernat
Contributor

rabernat commented Dec 9, 2020

@rsignell-usgs: note that your example works without this PR (but with the just-released zarr 2.6.1) as follows

import fsspec
import xarray as xr

mapper = fsspec.get_mapper('s3://noaa-nwm-retro-v2.0-zarr-pds')
ds = xr.open_zarr(mapper, consolidated=True)

Took 4s on my laptop (outside of AWS).

@rsignell-usgs

@rabernat , awesome! I was stunned by the difference -- I guess the async loading of coordinate data is the big win, right?

@rabernat
Contributor

rabernat commented Dec 9, 2020

I think @shoyer has laid out the options in a very clear way.

I weakly favor option 2, as I think it is preferable in terms of software architecture and our broader roadmap for Xarray. However, I am cognizant of the significant effort that @martindurant has put into this, and I don't want his effort to go to waste.

Some mitigating factors are:

  • The example I gave above (Allow fsspec/zarr/mfdataset #4461 (comment)) shows that one high-impact feature that users want (async capabilities in Zarr) already works, albeit with a different syntax. So this PR is more about convenience.
  • Presumably the knowledge about Xarray that Martin has gained by implementing this PR is transferable to a different context, so we would not be starting from scratch if we went with 2.

@martindurant
Contributor Author

Martin has gained by implementing this PR is transferable

I'm not sure, it's been a while now...

@rafa-guedes
Contributor

rafa-guedes commented Dec 20, 2020

@rabernat , awesome! I was stunned by the difference -- I guess the async loading of coordinate data is the big win, right?

@rsignell-usgs one other thing that can greatly speed up loading of metadata / coordinates is ensuring coordinate variables are stored in one single chunk. For this particular dataset, the chunk size for the time coordinate is 672, yielding 339 chunks, which can take a while to load from remote bucket stores. If you rewrite the time coordinate, setting dset.time.encoding["chunks"] = (227904,), you should see a very large performance increase.

One thing we have been doing for zarr archives that are appended in time is defining the time coordinate with a very large chunk size (e.g., dset.time.encoding["chunks"] = (10000000,)) when we first write the store. This ensures the time coordinate will still fit in one single chunk after appending over the time dimension, and does not affect the chunking of the actual data variables.
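As a self-contained illustration of the pattern just described (the dataset, path, and chunk size are made up for the example):

import numpy as np
import pandas as pd
import xarray as xr

# A small stand-in for the archive being written for the first time.
dset = xr.Dataset(
    {"var": (("time",), np.arange(100.0))},
    coords={"time": pd.date_range("2000-01-01", periods=100)},
)

# Store the time coordinate as one very large chunk so it stays a single chunk
# even after many appends along "time"; the data variables keep their own chunking.
dset.time.encoding["chunks"] = (10_000_000,)
dset.to_zarr("archive.zarr", mode="w", consolidated=True)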

One thing we have been having performance issues with is loading coordinates / metadata from zarr archives that have too many chunks (millions), even when metadata is consolidated and coordinates are in a single chunk. There is an open issue in dask about this.

@martindurant
Contributor Author

All interested parties, please see the new attempt at #4823.
