Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow fsspec URLs in open_(mf)dataset #4823

Merged
merged 24 commits into from
Feb 16, 2021
Merged

Conversation

martindurant
Copy link
Contributor

@martindurant martindurant commented Jan 18, 2021

  • Closes Allow fsspec/zarr/mfdataset #4461 and related
  • Tests added
  • Passes pre-commit run --all-files
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@martindurant
Copy link
Contributor Author

Docs to be added

@martindurant
Copy link
Contributor Author

Question: should HTTP URLs be passed through unprocessed as before? I think that might be required by some of the netCDF engines, but we probably don't test this.

@martindurant
Copy link
Contributor Author

martindurant commented Jan 18, 2021

pint errors in xarray/tests/test_units.py::TestVariable appear unrelated

@martindurant martindurant mentioned this pull request Jan 18, 2021
5 tasks
@charlesbluca
Copy link

Looking over the changes to is_remote_uri(), it looks like this should resolve the issue I had at #4691!

xarray/backends/zarr.py Outdated Show resolved Hide resolved
@martindurant
Copy link
Contributor Author

Next open question: aside from zarr, few of the other backends will know what to do with fsspec's dict-like mappers. Should we prevent them from passing through? Should we attempt to distinguish between directories and files, and make fsspec file-like objects? We could just allow the backends to fail later on incorrect input.

@martindurant martindurant changed the title Fsspec mk2 Allow fsspec URLs in open_(mf)dataset Jan 20, 2021
@alexamici
Copy link
Collaborator

@martindurant it is OK to let every backend raise errors for unsupported input. So no need to add any additional logic here IMO.

@alexamici
Copy link
Collaborator

@martindurant with respect to the backend API (old and new) looks good to me. I'm don't know fsspec so I can't comment on that.

@martindurant
Copy link
Contributor Author

(please definitely do not merge until I've added documentation)

@keewis keewis marked this pull request as draft January 21, 2021 15:25
)
paths = sorted(glob(_normalize_path(paths)))
paths = fs.glob(fs._strip_protocol(paths))
paths = [fs.get_mapper(path) for path in paths]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit tricky. This assumes the backend is to want a mapper object (as the zarr backend does). But, what if the glob returns a list of netcdf files? Wouldn't we want a list of file objects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this is my comment for "should we actually special case zarr". It could make files - for now it would just error. We don't have tests for this, though, but now might be the time to start.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now tracking with the comments above, I think we have two options:

  1. inject some backend specific logic in the api here to make a decision about what sort of object to return (if engine=='zarr', return mapper; else return file_obj)
  2. only support globbing remote zarr stores

(1) seems to be the more reasonable thing to do here but is slightly less principled as we've been working to cleanly separate the api from the backends.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose a third alternative might be to pass the paths through, and create mappers in the zarr backend (will re-instantiate the FS, but that's fine) and add the opening of remote files into each of the other backends that can handle it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have essentially done 1), but excluded HTTP for the non-zarr path, because it has a special place for some backends (dap...). In any case, I don't suppose anyone is using globbing with http, since it's generally unreliable.

@martindurant
Copy link
Contributor Author

I am marking this PR as ready, but please ask me for specific test cases that might be relevant and should be included.

@kmuehlbauer
Copy link
Contributor

@martindurant Should be fixed by #4845. Probably just needs rebase.

@martindurant
Copy link
Contributor Author

Thanks, @kmuehlbauer

"{!r}. Instead, supply paths as an explicit list of strings.".format(
paths
)
from fsspec import open_files
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to handle the case when fsspec is not installed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is, this would raise ImportError. Would it be better to have a try/except ImportError/raise from, with an extra sentence?
"To open fsspec-compatible [remote?] URLs, you must install fsspec"
Note that you might still get an error in this block if, for example, it's an s3 URL, but s3fs is not installed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought because L893 cannot be reached without fsspec.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I don't really mind - just because you mention backward compatibility - I am also fine to leave it as is)

xarray/backends/api.py Outdated Show resolved Hide resolved
"{!r}. Instead, supply paths as an explicit list of strings.".format(
paths
)
from fsspec import open_files
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought because L893 cannot be reached without fsspec.

@martindurant
Copy link
Contributor Author

I have decided, on reflection, to back away on the scope here and only implement for zarr for now, since, frankly, I am confused about what should happen for other backends, and they are not tested. Yes, some of them are happy to accept file-like objects, but others either don't do that at all, or want the URL passing through. My code would have changed how things were handled, depending on whether it passed through open_dataset or open_mfdataset. Best would be to set up a set of expectations as tests.

@raybellwaves
Copy link
Contributor

I think my Q on SO is related to this PR https://stackoverflow.com/questions/66145459/open-mfdataset-on-remote-zarr-store-giving-zarr-errors-groupnotfounderror

Was looking at reading a remote zarr store using open_mfdataset

@martindurant suggested putting the single "file" (mapping) in a list which works
ds = xr.open_mfdataset([file], engine="zarr")

but I also wanted to test the other suggestion

ds = xr.open_mfdataset(uri, engine="zarr", backend_kwargs=dict(storage_options={'anon': True}))

On current xarray master I get

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ray.bell/Documents/PYTHON_dev/xarray/xarray/backends/api.py", line 884, in open_mfdataset
    raise OSError("no files to open")
OSError: no files to open

On this branch it works

git clone https://github.com/martindurant/xarray.git
git checkout fsspec_mk2
conda create -c conda-forge -n xarray-tests python=3.8
conda env update -f ci/requirements/environment.yml
conda activate xarray-tests
pip install -e .
pip install s3fs

import xarray as xr
uri = "s3://era5-pds/zarr/2020/12/data/eastward_wind_at_10_metres.zarr"
ds = xr.open_mfdataset(uri, engine="zarr", backend_kwargs=dict(storage_options={'anon': True}))

@martindurant
Copy link
Contributor Author

@raybellwaves , might I paraphrase to "this PR is useful, please merge!" ?

Copy link
Collaborator

@keewis keewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor comments, but otherwise looks good to me.

with raises_regex(ValueError, "wild-card"):
open_mfdataset("http://some/remote/uri")

@requires_fsspec
def test_open_mfdataset_no_files(self):
pytest.importorskip("aiobotocore")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this means this test is always skipped. Should this be added to some of the environments?

xarray/tests/test_backends.py Outdated Show resolved Hide resolved
@keewis
Copy link
Collaborator

keewis commented Feb 15, 2021

could you add aiobotocore to environment.yml, too? py38-all-but-dask should be environment.yml without dask / distributed. Also, could you merge in recent master to resolve the conflicts?

@martindurant
Copy link
Contributor Author

Can someone please explain the minimum version policy that is failing

Package           Required             Policy               Status
----------------- -------------------- -------------------- ------
aiobotocore       1.1.2   (2020-08-18) 0.12    (2020-02-24) > (!) (w)

Martin Durant added 2 commits February 16, 2021 10:02
because it pins othr deps too tightly, makes solver panic
@keewis
Copy link
Collaborator

keewis commented Feb 16, 2021

if you wait two days the minimum version policy check should pass with aiobotocore=1.1.2, too.

@dcherian
Copy link
Contributor

There are 3 approvals and @martindurant has been quite patient! Thanks. merging. We can update the min_deps_check later.

@dcherian dcherian merged commit 8bf415a into pydata:master Feb 16, 2021
@martindurant martindurant deleted the fsspec_mk2 branch February 16, 2021 21:26
@martindurant
Copy link
Contributor Author

Thank you, @dcherian

dcherian added a commit to dcherian/xarray that referenced this pull request Feb 17, 2021
* upstream/master:
  FIX: h5py>=3 string decoding (pydata#4893)
  Update matplotlib's canonical (pydata#4919)
  Adding vectorized indexing docs (pydata#4711)
  Allow fsspec URLs in open_(mf)dataset (pydata#4823)
  Fix typos in example notebooks (pydata#4908)
  pre-commit autoupdate CI (pydata#4906)
  replace the ci-trigger action with a external one (pydata#4905)
  Update area_weighted_temperature.ipynb (pydata#4903)
  hide the decorator from the test traceback (pydata#4900)
  Sort backends (pydata#4886)
  Compatibility with dask 2021.02.0 (pydata#4884)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants