Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default to RemoteStore for fsspec URIs #2198

Merged
merged 13 commits into from
Sep 19, 2024

Conversation

TomAugspurger
Copy link
Contributor

This allows for specifying a RemoteStore by an fsspec URI + storage options, as a convenience (and backwards compatibility shim) for creating a RemoteStore:

In [1]: import zarr.store

In [2]: g = zarr.open_group(store="az://daymet-zarr/annual/hi.zarr", zarr_format=2, storage_options={"account_name": "daymeteuwest"})

In [3]: list(g.arrays())
Object at .zmetadata is not recognized as a component of a Zarr hierarchy.
Out[3]: 
[<Array az://daymet-zarr/annual/hi.zarr/lambert_conformal_conic shape=() dtype=int16>,
 <Array az://daymet-zarr/annual/hi.zarr/lat shape=(584, 284) dtype=float32>,
...

Closes #2195

elif isinstance(store_like, str):
return StorePath(await LocalStore.open(root=Path(store_like), mode=mode or "r"))
try:
fs, path = fsspec.url_to_fs(store_like, **storage_options)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: this constructs the actual filesystem class instance, but we have to go back to a URL (and reconstruct the filesystem instance) for the RemoteStore interface. It'd be nice to avoid that.

try:
fs, path = fsspec.url_to_fs(store_like, **storage_options)
except Exception:
# not sure what to do here, but I don't want this to fail...
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bad, but in my experience fsspec.url_to_fs can run pretty much arbitrary code, depending on what packages you have installed and what URI you pass. I really don't want this to fail and cause issues for people who just want a local path. maybe we'd always get an fsspec LocalFileSystem for a plain, non-fsspec URI string, but I'm not sure.

else:
raise TypeError

if path is not None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I forgot to revert this. It's not used yet.

We call make_store_path multiple times in a single call by the user (e.g. save_array calls make_store_path and passes the result to AsyncArray.create which also calls make_store_path). I was thinking about consolidating all the store / path stuff in a single spot, but it ended up being a bit much and I wasn't sure what the tolerance was for expanding the API of methods like AsyncArray.create even more. Do we want users to do

AsyncArray.create(store=StorePath(RemoteStore(..., **storage_options), path))

or

AsyncArray.create(store="s3://bucket/path.zarr", storage_options=...)

Copy link
Contributor

@dstansby dstansby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have much opinion on the actual change, but is it worth adding warnings everywhere if storage_options is passed by the user (ie not None), but they haven't passed an fsspec URL so they aren't used?

src/zarr/api/asynchronous.py Show resolved Hide resolved
elif isinstance(store_like, Path):
return StorePath(await LocalStore.open(root=store_like, mode=mode or "r"))
result = StorePath(await LocalStore.open(root=store_like, mode=mode or "r"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can pass **storage_options to LocalStore as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be missing something, but I don't think that'll work. LocalStore.open will call LocalStore.__init__, which just takes root and mode, which are passed as regular args here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. I was thinking auto_mkdir would be passed through but if that's not the case, let's not get distracted here.

elif isinstance(store_like, str):
return StorePath(await LocalStore.open(root=Path(store_like), mode=mode or "r"))
storage_options = storage_options or {}
fs, _ = fsspec.url_to_fs(store_like, **storage_options)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few thoughts here:

  1. It would be nice to avoid creating the FileSystem if possible. A lighter approach could attempt to parse the string and look for :// (and not file://).
  2. If we do have to create filesystem, it would be nice to reuse it when creating the RemoteStore. Something like RemoteStore.from_filesystem(...) could be a nice pattern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below for the way we used to do this. May be worth replicating parts of this

if "://" in store or "::" in store:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed that refactor. It's a bit bigger than I'd like, but the basic version is that RemoteStores's __init__ matches what it uses internally: you can only given it an async filesystem.

There's a from_url and from_path that handles the other stuff that was previously in __init__.

I removed RemoteStore._url. I think that that's safe to remove, but I need to actually think through how it interacted with store.path

@jhamman jhamman merged commit f538662 into zarr-developers:v3 Sep 19, 2024
26 checks passed
dcherian added a commit to dcherian/zarr-python that referenced this pull request Sep 24, 2024
* v3:
  chore: update pre-commit hooks (zarr-developers#2222)
  fix: validate v3 dtypes when loading/creating v3 metadata (zarr-developers#2209)
  fix typo in store integration test (zarr-developers#2223)
  Basic Zarr-python 2.x compatibility changes (zarr-developers#2098)
  Make Group.arrays, groups compatible with v2 (zarr-developers#2213)
  Typing fixes to test_indexing (zarr-developers#2193)
  Default to RemoteStore for fsspec URIs (zarr-developers#2198)
  Make MemoryStore serialiazable (zarr-developers#2204)
  [v3] Implement Group methods for empty, full, ones, and zeros (zarr-developers#2210)
  implement `store.list_prefix` and `store._set_many` (zarr-developers#2064)
  Fixed codec for v2 data with no fill value (zarr-developers#2207)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Default to a RemoteStore for (some) string StoreLike
3 participants