Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

N5Store support of cloud buckets #540

Closed
chrisroat opened this issue Feb 26, 2020 · 32 comments · Fixed by #546
Closed

N5Store support of cloud buckets #540

chrisroat opened this issue Feb 26, 2020 · 32 comments · Fixed by #546

Comments

@chrisroat
Copy link

The N5Store works great for storing local files. What are the technical challenges in moving to support something like GCS for storage? Other pieces of the zarr library seamlessly work if I move to gs:// prefixes (with the right packages installed), but N5Store ends up writing to the local filesystem when using that prefix.

Is it a matter of going through n5.py and storage.py and changing uses of the os library to uses of fsspec? Or are there operations that fundamentally won't work on cloud objects?

@jakirkham
Copy link
Member

Funny this came up in conversation today. Also relates to issue ( #395 ) and issue ( zarr-developers/n5py#9 ).

cc @ryan-williams @joshmoore

@d-v-b
Copy link
Contributor

d-v-b commented Mar 2, 2020

I'm also affected by this issue -- I have n5 data on s3 (because that's what neuroglancer wants) but I'd also like to use zarr locally to access these data via s3fs. Concretely, what needs to be done to get this working?

@jakirkham
Copy link
Member

This is probably not a difficult thing to solve if you are interested. As implemented N5Store inherits from NestedDirectoryStore. However there is no reason it needs to do that. Instead we might consider giving N5Store a constructor that takes a Store as an argument. This would allow it to operate on other stores we might provide it. Internally we would replace super calls to using the store we gave it.

A similar thing may be needed for NestedDirectoryStore.

If we decide that we would rather not modify N5Store or NestedDirectoryStore, we could consider new classes N5Mapping and NestedDirectoryMapping to provide the same behavior as above, but in new classes. We could then simplify N5Store and NestedDirectoryStore to use these new Mapping classes.

Thoughts? 🙂

@d-v-b
Copy link
Contributor

d-v-b commented Mar 2, 2020

as a non-maintainer, either one looks great to me :) I will start with the first strategy and see how far I get.

@jakirkham
Copy link
Member

Sounds good. Please let us know if there are other ways we can help 🙂

@chrisroat
Copy link
Author

I like the first strategy, as well. Meta-question: what sorts of Store's do you expect?

The reason I ask is that the various implementations of fsspec could be used in a single FilesystemStore, which detects which fssspec implementation to use. This avoids having a S3Store, GcsStore, LocalFsStore, etc. Is this a worthwhile approach?

@mzjp2
Copy link
Member

mzjp2 commented Mar 2, 2020

I like the first strategy, as well. Meta-question: what sorts of Store's do you expect?

The reason I ask is that the various implementations of fsspec could be used in a single FilesystemStore, which detects which fssspec implementation to use. This avoids having a S3Store, GcsStore, LocalFsStore, etc. Is this a worthwhile approach?

Indeed, see @rabernat PR: #373 (comment) -- I was going to try a quick implementation this weekend, happy to collaborate though!

@martindurant
Copy link
Member

Let me know what you come up with @mzjp2 . It's also on my radar, but I don't know when.

@perlman
Copy link

perlman commented Mar 17, 2020

Has anyone made progress on n5+http? I like @jakirkham's suggested approach, but figure my time would be better spent testing/tweaking someone else's code if it's hiding in a branch somewhere.

@martindurant
Copy link
Member

It is on my list, but also happy to review rather than write!

@d-v-b
Copy link
Contributor

d-v-b commented Mar 17, 2020

it's on my list too :) but i haven't gotten to it yet

@chrisroat
Copy link
Author

chrisroat commented Apr 1, 2020

With #546 in the works, let me get my head around what we want here. @alimanfoo has the good idea of re-writing N5Store to take a FSStore (or another store/mapping) and be simply a transformation. It looks like this would mean replacing super calls with calls to the passed in mapping? (The listdir method and other methods calling os might need some special attention.)

If I understand correctly, N5Store would then be a MutableMapping that takes another MutableMapping?

The code would look like:

fs_store = zarr.FSStore('gs://bucket/foo')
n5_store = zarr.N5Store(fs_store)  # Or rename this to zarr.N5Transformer
arr.to_zarr(n5_store)

@martindurant
Copy link
Member

Sure, that would work fine.

However, it would also be really nice to have top level
zarr.open_*(generic_url, storage_options=, is_n5, is_consolidated, is_nested)
functions rather than have users passing around store objects. Maybe I'm asking too much, perhaps n5 is too different.
On the other hand, this could be implemented via dispatch.

@shoyer
Copy link
Contributor

shoyer commented Jul 1, 2020

I have n5 data on s3 (because that's what neuroglancer wants)

Side note: Neuroglancer now supports directly zarr files, so that might be a good option for you!

@chrisroat
Copy link
Author

Thanks for the heads up. I'd like to migrate to that, as it'd be a bit more efficient than my current setup!

I have a cloudvolume+igneous deployment going now -- the existing igneous code+config sets up downscaling pyramids out-of-the-box, which is super helpful and I'd have to replicate (maybe not hard... just time).

@joshmoore
Copy link
Member

@martindurant : sorry, I missed the comment in #546 -- this isn't fully implemented yet, correct? But rather "NestedDirectoryStore support of cloud buckets", right?

@martindurant
Copy link
Member

Correct, FSStore allows nesting by specifying the key separator. I don't know what more N5 needs, but presumably it could use of of these stores.

@joshmoore
Copy link
Member

Ok. Re-opening this issue then.

@d-v-b
Copy link
Contributor

d-v-b commented Mar 10, 2021

I got FSStore-backed n5 writing working in my own repo: https://github.com/janelia-cosem/fibsem-tools/blob/master/src/fibsem_tools/io/storage.py

I created a class N5FSStore that inherits from FSStore, much like how N5Store inherits from NestedDirectoryStore
To get this to work, I had to modiify FSStore in two ways:

  • instead of defining the zarr meta keys (.zarray, .zgroup, .zattrs) as properties of the FSStore class, FSStore takes meta_keys as a keyword argument in its constructor. This allows subclasses to provide their own meta keys (e.g., attributes.json for n5).

  • FSStore.getitems takes keys as input, transforms them via _normalize_keys(), and returns a dict of {transformed_keys, values}. This didn't work with n5 because the core zarr library asks for "." separated keys, which the n5 adapter code converts to "/"-separated keys in reverse order (e.g., ("0.1.2" -> "2/1/0"). If getitems returns a dict with the transformed keys, then the core zarr code doesn't recognize them and all the values are discarded. So I modified getitems to return a dict with the original keys instead.

Beyond these two changes, I could recycle a lot of the existing key transformation logic from N5Store in N5FSStore.

I need this for my own work so I haven't put together a PR yet. Would the changes I made to FSStore be acceptable? And would we want to have 2 N5 implementations (one derived from NestedDirectoryStore, and another derived from FSStore?

@joshmoore
Copy link
Member

  • FSStore takes meta_keys as a keyword argument in its constructor

👍

  • So I modified getitems to return a dict with the original keys instead.

I don't know the impact of this offhand, but if tests are passing...

Would the changes I made to FSStore be acceptable?

From my POV, having them would be great!

would we want to have 2 N5 implementations

Since NestedDirectoryStore can't deal with remote access, I think we minimally need an FSStore version which can handle N5. (I had some hope that we wouldn't need to subclass and instead could use composition, but I haven't looked into it further.)

I think the question is whether or not we maintain NDS or if it gets deprecated.

@martindurant
Copy link
Member

I originally wrote FSStore to optionally contain N5 also, but it proved too unwieldy. I don't think there's a problem to have multiple store implementations. I haven't had a look at the changes yet, but passing the tests is a good indication.

@d-v-b
Copy link
Contributor

d-v-b commented Mar 11, 2021

speaking of passing tests... I noticed that some of the tests for fsstore are incomplete, and you can't actually read from a "/"-separated fsstore without the aforementioned changes to fsstore.getitems: see #709

@martindurant
Copy link
Member

I noticed that some of the tests for fsstore are incomplete

Please flesh out as you see fit; if this justifies the changes in FSStore, all the better.

@perlman
Copy link

perlman commented Jun 11, 2021

I found myself trying to open a remote N5 this morning. I had completely forgotten about this issue. 🤦

@d-v-b
Copy link
Contributor

d-v-b commented Jun 11, 2021

@perlman see https://github.com/janelia-cosem/fibsem-tools/blob/master/src/fibsem_tools/io/storage.py#L223

Ultimately I want to take this code out of my own repo and get it here...

@martindurant
Copy link
Member

This has been mentioned elsewhere, but it may be better instead of subclassing, to have a layered approach, i.e., a N5 storage class that takes another storage class as an arguments.
On the other hand if FSStore can handle a great variety of backends, then it might be fine to have a couple more based on it.

@martindurant
Copy link
Member

And to be clear: code that works now is a great thing to have!

@perlman
Copy link

perlman commented Jun 15, 2021

Thanks @d-v-b, I'll give it a whirl. I might have a few cycles to work on moving this over (...I would certainly use it).

This has been mentioned elsewhere, but it may be better instead of subclassing, to have a layered approach, i.e., a N5 storage class that takes another storage class as an arguments.

@martindurant: Is there example to follow, style-wise, for doing this within python-zarr?

@joshmoore
Copy link
Member

Not as far as I know. cc: @jakirkham who mentioned something similar to me as well in case he has an example.

@joshmoore
Copy link
Member

@d-v-b: would you like to do the honors?

@d-v-b
Copy link
Contributor

d-v-b commented Sep 22, 2021

this should be addressed by #793

@joshmoore
Copy link
Member

💯

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants