Allow fsspec/zarr/mfdataset #4461

Closed · wants to merge 25 commits

Conversation

martindurant
Contributor

@martindurant martindurant commented Sep 25, 2020

Requires zarr-developers/zarr-python#606

  • Closes #xxxx
  • Tests added
  • Passes isort . && black . && mypy . && flake8
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@@ -324,6 +324,8 @@ def open_dataset(
backend_kwargs=None,
use_cftime=None,
decode_timedelta=None,
storage_options=None,
Collaborator


Shouldn't storage_options and fs be part of backend_kwargs?

Contributor Author


Happy to move it - this is a POC, so I just needed to put it somewhere.
However, it's not exactly part of the backend kwargs; it just so happens that it's only used by the zarr backend right now.

@martindurant
Contributor Author

Question: to eventually get the tests to pass, this will need changes that are only just now going into zarr. Those may be released soon, but in the meantime, is it reasonable to install from master?

@keewis
Collaborator

keewis commented Sep 25, 2020

is it reasonable to install from master

you might want to take a look at the upstream-dev CI which installs zarr from github (and is currently passing)

@dcherian
Contributor

dcherian commented Sep 25, 2020

We'll have to maintain backward compatibility with older zarr versions for a bit, so you'll have to skip the tests appropriately using a version check

EDIT: I didn't realize there are no tests in this PR yet. We definitely want current CI tests passing with older zarr versions.
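For illustration, a minimal sketch of the kind of version-gated skip being asked for here; the marker name and the minimum zarr version are assumptions, since the thread does not pin a specific release:

import pytest
import zarr
from distutils.version import LooseVersion

# Hypothetical gate: skip tests that need the newer zarr store behaviour when an
# older zarr is installed ("2.6.0" is a placeholder cutoff, not taken from this PR).
requires_new_zarr = pytest.mark.skipif(
    LooseVersion(zarr.__version__) < LooseVersion("2.6.0"),
    reason="requires a zarr release with the updated fsspec/store support",
)

@requires_new_zarr
def test_open_dataset_with_storage_options():
    ...  # exercise the new code path here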

Martin Durant added 2 commits September 25, 2020 21:04
This can probably be cleaned up...
@martindurant martindurant marked this pull request as ready for review September 29, 2020 19:43
overwrite_encoded_chunks = backend_kwargs.pop(
"overwrite_encoded_chunks", None
)
extra_kwargs["mode"] = "r"
extra_kwargs["group"] = group
if fs is not None:
filename_or_obj = fs.get_mapper(filename_or_obj)
Collaborator

@alexamici alexamici Oct 1, 2020


Note that we are working on a refactor of the backend API that, among other things, aims at removing all knowledge of what backends can or can't do from open_dataset. Adding logic inside if engine == "zarr" will probably result in merge conflicts.

I would suggest moving the call to fs.get_mapper(filename_or_obj) inside the zarr backend.

Contributor Author


Thanks for the heads up. I already did one slightly complex conflict resolution.

It isn't totally clear, though, that the logic can be buried in the zarr engine, for two reasons:

  • when using open_mf, the globbing of remote files/directories happens early, before establishing individual zarr instances (see the sketch after this comment)
  • actually, the file instances that fsspec makes from URLs can be used by some other backends; that just happens not to be the emphasis here

Happy to go whichever way is most convenient for the library.
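To make the globbing point above concrete, here is a rough sketch of the flow being described, assuming a public S3 bucket holding several zarr stores (the bucket and paths are hypothetical):

import fsspec

# Filesystem creation is where storage_options (e.g. anon=True) come in.
fs = fsspec.filesystem("s3", anon=True)

# The glob over remote directories happens up front, before any zarr object exists...
paths = fs.glob("some-bucket/collection/*.zarr")

# ...and each path is only then turned into a mapper that a backend can consume.
mappers = [fs.get_mapper(p) for p in paths]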


Comment on lines +569 to +570
if fs is not None:
filename_or_obj = fs.get_mapper(filename_or_obj)
Member


Rather than adding the fs keyword argument, why not just encourage passing in an appropriate mapping for filename_or_obj?

Contributor Author


That works already, and will continue to work. However, the whole point of this PR is to allow working out those details in a single call to open_dataset, which turns out to be very convenient for encoding in an Intake catalog, for instance, or indeed for the open_mfdataset implementation here.
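For context, a hedged sketch of the two calling patterns being contrasted here; the bucket URL and options are illustrative, and the exact placement of storage_options (top-level versus inside backend_kwargs) was still under discussion above:

import fsspec
import xarray as xr

# Works without this PR: build the mapper yourself, then hand it to xarray.
mapper = fsspec.get_mapper("s3://some-bucket/store.zarr", anon=True)
ds = xr.open_zarr(mapper, consolidated=True)

# What this PR aims for: everything expressed in a single open_dataset call,
# which is easy to encode in, e.g., an Intake catalog entry.
ds = xr.open_dataset(
    "s3://some-bucket/store.zarr",
    engine="zarr",
    backend_kwargs={"consolidated": True, "storage_options": {"anon": True}},
)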

@@ -876,6 +876,7 @@ can be omitted as it will internally be set to ``'a'``.

.. ipython:: python

ds1 = xr.Dataset(
Collaborator


bad merge? This makes the docs build fail with a SyntaxError

Contributor Author


Hm, interesting. Correcting...

Collaborator

@max-sixty max-sixty left a comment


Thanks @martindurant , this looks good! (I'll wait to see if others have any final thoughts before merging)

@pep8speaks

pep8speaks commented Oct 19, 2020

Hello @martindurant! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-30 14:39:18 UTC

@martindurant
Contributor Author

(failures look like something in pandas dev)

@keewis
Collaborator

keewis commented Oct 19, 2020

(failures look like something in pandas dev)

yep, that's #4516

@martindurant
Contributor Author

One completely unrelated failure (test_polyfit_warnings). Can I please get a final say here (@max-sixty, @alexamici)?

@rabernat
Contributor

We let this go stale again. I just resolved the conflicts.

@rabernat rabernat mentioned this pull request Nov 30, 2020
Contributor

@rabernat rabernat left a comment


We need to decide what to do with this PR. I have a few comments, but in general I favor merging.

overwrite_encoded_chunks = backend_kwargs.pop(
"overwrite_encoded_chunks", None
)
extra_kwargs["mode"] = "r"
extra_kwargs["group"] = group
if fs is not None:
filename_or_obj = fs.get_mapper(filename_or_obj)
Contributor


We need to resolve this discussion in order to decide what to do about this PR. Any more thoughts from other devs?

In my view, some of the fsspec logic introduced in the PR should eventually move to the generic open_mfdataset function, as it is not really specific to Zarr. However, I don't see a strong downside to adding it to open_zarr right now. Eventually open_zarr will be deprecated. But the pattern used here could be copied and incorporated into the backend refactor.

Co-authored-by: Ryan Abernathey <[email protected]>
@dcherian
Contributor

dcherian commented Dec 3, 2020

We need to resolve this discussion in order to decide what to do about this PR. Any more thoughts from other devs?

ping @pydata/xarray

@martindurant
Contributor Author

ping again

@rsignell-usgs

rsignell-usgs commented Dec 9, 2020

I'm really looking forward to getting this merged so I can open the National Water Model Zarr I created last week thusly:

ds = xr.open_dataset('s3://noaa-nwm-retro-v2.0-zarr-pds', engine='zarr',
        backend_kwargs={'consolidated': True, "storage_options": {'anon': True}})

@martindurant tells me this takes only 3 s with the new async capability!

That would be pretty awesome, because now it takes 1min 15s to open this dataset!

@shoyer
Member

shoyer commented Dec 9, 2020

We are excited about adding this feature! We love fsspec and think this would be very useful for xarray's users. In the long term, we would love to support fsspec for all the file formats that can handle file objects, e.g., including engine='h5netcdf' and engine='scipy'.

The concern right now is that this adds special case logic for zarr in open_dataset(), which @alexamici and @aurghs are presently (simultaneously!) trying to remove as part of paying down technical debt in the ongoing backends refactor.

I see two potential paths forwards:

  1. Merge this as is. It has good test coverage, and porting it later should (hopefully!) be relatively straightforward.
  2. Insert this into the new backend API code instead, and require the v2 backend API for this feature.

@alexamici could you please take a look and weigh in here? In particular, it would be helpful if you could point to where this would belong in the new refactor. This is also a good motivation for deleting the "v1" API code as soon as possible in favor of the "v2" code -- nothing is worse than needing to implement a new feature twice!

@rabernat
Contributor

rabernat commented Dec 9, 2020

@rsignell-usgs: note that your example works without this PR (but with the just-released zarr 2.6.1) as follows

import fsspec
import xarray as xr

mapper = fsspec.get_mapper('s3://noaa-nwm-retro-v2.0-zarr-pds')
ds = xr.open_zarr(mapper, consolidated=True)

Took 4s on my laptop (outside of AWS).

@rsignell-usgs

@rabernat , awesome! I was stunned by the difference -- I guess the async loading of coordinate data is the big win, right?

@rabernat
Contributor

rabernat commented Dec 9, 2020

I think @shoyer has laid out the options in a very clear way.

I weakly favor option 2, as I think it is preferable in terms of software architecture and our broader roadmap for Xarray. However, I am cognizant of the significant effort that @martindurant has put into this, and I don't want his effort to go to waste.

Some mitigating factors are:

  • The example I gave above (Allow fsspec/zarr/mfdataset #4461 (comment)) shows that one high-impact feature that users want (async capabilities in Zarr) already works, albeit with a different syntax. So this PR is more about convenience.
  • Presumably the knowledge about Xarray that Martin has gained by implementing this PR is transferable to a different context, so we would not be starting from scratch if we went with 2.

@martindurant
Contributor Author

Martin has gained by implementing this PR is transferable

I'm not sure, it's been a while now...

@rafa-guedes
Contributor

rafa-guedes commented Dec 20, 2020

@rabernat , awesome! I was stunned by the difference -- I guess the async loading of coordinate data is the big win, right?

@rsignell-usgs one other thing that can greatly speed up loading of metadata / coordinates is ensuring coordinate variables are stored in one single chunk. For this particular dataset, the chunk size for the time coordinate is 672, yielding 339 chunks, which can take a while to load from remote bucket stores. If you rewrite the time coordinate, setting dset.time.encoding["chunks"] = (227904,), you should see a very large performance increase.

One thing we have been doing for zarr archives that are appended in time is defining the time coordinate with a very large chunk size (e.g., dset.time.encoding["chunks"] = (10000000,)) when we first write the store. This ensures the time coordinate will still fit in one single chunk after appending over the time dimension, and does not affect the chunking of the actual data variables.
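As a self-contained illustration of the pattern just described (the dataset, path, and chunk size are made up for the example):

import numpy as np
import pandas as pd
import xarray as xr

# A small stand-in for the archive being written for the first time.
dset = xr.Dataset(
    {"var": (("time",), np.arange(100.0))},
    coords={"time": pd.date_range("2000-01-01", periods=100)},
)

# Store the time coordinate as one very large chunk so it stays a single chunk
# even after many appends along "time"; the data variables keep their own chunking.
dset.time.encoding["chunks"] = (10_000_000,)
dset.to_zarr("archive.zarr", mode="w", consolidated=True)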

One thing we have been having performance issues with is loading coordinates / metadata from zarr archives that have too many chunks (millions), even when metadata is consolidated and coordinates are in a single chunk. There is an open issue in dask about this.

@martindurant
Contributor Author

All interested parties, please see the new attempt at #4823.
