-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make kerchunk dependency entirely optional #258
Comments
I think Being able to read kerchunk reference format without explicitly needing kerchunk would go a long way making this package useful for the interim until kerchunk can support v3 |
I agree - it allows people to use VirtualiZarr to move their references from Kerchunk's format to Icechunk's format. We merged support for that yesterday! #251 We could aim for a working docs example along these lines, i.e: # kerchunk is not installed
vds = open_virtual_datasets('refs.parquet', format='kerchunk')
vds.virtualize.to_icechunk(icechunkstore) |
I guess actually @norlandrhagen and I weren't explicitly thinking about avoiding a kerchunk dependency in the kerchunk reader. But it looks like it only depends on ujson. |
That's true for the json kerchunk format, but the parquet one is a bit more intertwined. I took the easy route of using LazyReferenceFileSystem to read the Zarquet references, but we could also probably assembled the reference by reading all the group/variable level parquets with pyarrow (or something lighter arro3?! 🫥) |
Do we even need to go that far? The aim here is to get something that works with zarr v3, and if fsspec can work with zarr v3 (can it??) then that should be okay. |
isn't that defined in |
I’ll be out of the country until November but I’m very interested in this change. Don’t wait on my review unless you can wait 2 weeks.
|
@norlandrhagen after looking closer I'm pretty sure we can achieve this fairly easily - it just requires some modifications to the fixures for the tests you submitted in #251. Specifically we need to have data for those tests without actually invoking kerchunk at any point. I think we should just make a very small example json/parquet file (like literally 1 variable with 3 values) and save that into the repo, then have a fixture that returns that. Then we should be able to remove |
Great! Happy to work on that. Also ties in a bit to the local data testing PR |
Awesome! Thanks @norlandrhagen . FYI I should have said this in your previous PR but I think we should try to separate the kerchunk "reader" into two files, one "reader.py" which contains code that accepts a path and returns kerchunk references, and one "translator.py" which turns kerchunk-formatted in-memory references into a virtual dataset. Or something along those lines. That would keep the two use cases of |
This has been done in #261 |
Virtualizarr doesn't need to have an explicit dependency on Kerchunk. Kerchunk is one (well it reads many formats) amongst many readers for creating virtual references, and it is now one of two options for writing virtual references (the other being Icechunk). Issue #78 points this out for the readers, but we could make the entire package only optionally depend on kerchunk.
A more urgent reason to change this is Kerchunk does not currently support zarr-python v3 (fsspec/kerchunk#516), which is preventing us ensuring that the rest of this package works with zarr v3 (#182), which is needed for testing Icechunk compatibility (#256).
Changing the virtualizarr business logic to make kerchunk optional is easy - the hard part is that lots of our tests are more reliant on kerchunk than they need to be. Ideally we would either rewrite those integration tests to start from in-memory references rather than files on disk, or use a non-kerchunk approach to generate references from example HDF5 files (#87). I think the former would be neater, but alternatively once the #87 is merged the latter would be a simple change.
@mpiannucci @norlandrhagen @ghidalgo3 @sharkinsspatial
The text was updated successfully, but these errors were encountered: