Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make kerchunk dependency entirely optional #258

Closed
TomNicholas opened this issue Oct 17, 2024 · 11 comments · Fixed by #259
Closed

Make kerchunk dependency entirely optional #258

TomNicholas opened this issue Oct 17, 2024 · 11 comments · Fixed by #259
Labels
dependencies Updates a dependency Kerchunk Relating to the kerchunk library / specification itself

Comments

@TomNicholas
Copy link
Member

Virtualizarr doesn't need to have an explicit dependency on Kerchunk. Kerchunk is one (well it reads many formats) amongst many readers for creating virtual references, and it is now one of two options for writing virtual references (the other being Icechunk). Issue #78 points this out for the readers, but we could make the entire package only optionally depend on kerchunk.

A more urgent reason to change this is Kerchunk does not currently support zarr-python v3 (fsspec/kerchunk#516), which is preventing us ensuring that the rest of this package works with zarr v3 (#182), which is needed for testing Icechunk compatibility (#256).

Changing the virtualizarr business logic to make kerchunk optional is easy - the hard part is that lots of our tests are more reliant on kerchunk than they need to be. Ideally we would either rewrite those integration tests to start from in-memory references rather than files on disk, or use a non-kerchunk approach to generate references from example HDF5 files (#87). I think the former would be neater, but alternatively once the #87 is merged the latter would be a simple change.

@mpiannucci @norlandrhagen @ghidalgo3 @sharkinsspatial

@TomNicholas TomNicholas added Kerchunk Relating to the kerchunk library / specification itself dependencies Updates a dependency labels Oct 17, 2024
@mpiannucci
Copy link
Contributor

I think Being able to read kerchunk reference format without explicitly needing kerchunk would go a long way making this package useful for the interim until kerchunk can support v3

@TomNicholas
Copy link
Member Author

able to read kerchunk reference format without explicitly needing kerchunk

I agree - it allows people to use VirtualiZarr to move their references from Kerchunk's format to Icechunk's format. We merged support for that yesterday! #251

We could aim for a working docs example along these lines, i.e:

# kerchunk is not installed
vds = open_virtual_datasets('refs.parquet', format='kerchunk')
vds.virtualize.to_icechunk(icechunkstore)

@TomNicholas
Copy link
Member Author

I guess actually @norlandrhagen and I weren't explicitly thinking about avoiding a kerchunk dependency in the kerchunk reader. But it looks like it only depends on ujson.

@norlandrhagen
Copy link
Collaborator

I guess actually @norlandrhagen and I weren't explicitly thinking about avoiding a kerchunk dependency in the kerchunk reader. But it looks like it only depends on ujson.

That's true for the json kerchunk format, but the parquet one is a bit more intertwined. I took the easy route of using LazyReferenceFileSystem to read the Zarquet references, but we could also probably assembled the reference by reading all the group/variable level parquets with pyarrow (or something lighter arro3?! 🫥)

@TomNicholas
Copy link
Member Author

we could also probably assembled the reference by reading all the group/variable level parquets with pyarrow (or something lighter arro3?! 🫥)

Do we even need to go that far? The aim here is to get something that works with zarr v3, and if fsspec can work with zarr v3 (can it??) then that should be okay.

@keewis
Copy link
Contributor

keewis commented Oct 17, 2024

LazyReferenceFileSystem

isn't that defined in fsspec?

@ghidalgo3
Copy link
Contributor

ghidalgo3 commented Oct 17, 2024 via email

@TomNicholas
Copy link
Member Author

TomNicholas commented Oct 17, 2024

read kerchunk reference format without explicitly needing kerchunk

@norlandrhagen after looking closer I'm pretty sure we can achieve this fairly easily - it just requires some modifications to the fixures for the tests you submitted in #251. Specifically we need to have data for those tests without actually invoking kerchunk at any point. I think we should just make a very small example json/parquet file (like literally 1 variable with 3 values) and save that into the repo, then have a fixture that returns that. Then we should be able to remove @requires_kerchunk for your two tests.

@norlandrhagen
Copy link
Collaborator

we could also probably assembled the reference by reading all the group/variable level parquets with pyarrow (or something lighter arro3?! 🫥)

Do we even need to go that far? The aim here is to get something that works with zarr v3, and if fsspec can work with zarr v3 (can it??) then that should be okay.

Great! Happy to work on that. Also ties in a bit to the local data testing PR

@TomNicholas
Copy link
Member Author

Awesome! Thanks @norlandrhagen . FYI I should have said this in your previous PR but I think we should try to separate the kerchunk "reader" into two files, one "reader.py" which contains code that accepts a path and returns kerchunk references, and one "translator.py" which turns kerchunk-formatted in-memory references into a virtual dataset. Or something along those lines. That would keep the two use cases of dataset_from_kerchunk_refs more distinct.

@TomNicholas
Copy link
Member Author

I think we should try to separate the kerchunk "reader" into two files, one "reader.py" which contains code that accepts a path and returns kerchunk references, and one "translator.py"

This has been done in #261

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Updates a dependency Kerchunk Relating to the kerchunk library / specification itself
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants