Make kerchunk dependency entirely optional #258

TomNicholas · 2024-10-17T05:12:48Z

VirtualiZarr doesn't need to have an explicit dependency on Kerchunk. Kerchunk is one (well, it reads many formats) amongst many readers for creating virtual references, and it is now one of two options for writing virtual references (the other being Icechunk). Issue #78 points this out for the readers, but we could make the entire package only optionally depend on kerchunk.

A more urgent reason to change this is that Kerchunk does not currently support zarr-python v3 (fsspec/kerchunk#516), which is preventing us from ensuring that the rest of this package works with zarr v3 (#182), which is needed for testing Icechunk compatibility (#256).

Changing the VirtualiZarr business logic to make kerchunk optional is easy - the hard part is that lots of our tests are more reliant on kerchunk than they need to be. Ideally we would either rewrite those integration tests to start from in-memory references rather than files on disk, or use a non-kerchunk approach to generate references from example HDF5 files (#87). I think the former would be neater, but alternatively once #87 is merged the latter would be a simple change.

@mpiannucci @norlandrhagen @ghidalgo3 @sharkinsspatial
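For illustration, the optional dependency described in this issue could be guarded roughly like this. This is a minimal sketch using only the standard library; the helper names `has_package` and `require_package` are hypothetical and are not taken from the VirtualiZarr codebase.

```python
import importlib.util


def has_package(name: str) -> bool:
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def require_package(name: str, purpose: str) -> None:
    """Raise a helpful ImportError if an optional dependency is missing."""
    if not has_package(name):
        raise ImportError(
            f"{name} is required for {purpose}, but it is not installed. "
            f"Try `pip install {name}`."
        )


# Hypothetical call site: only code paths that really need kerchunk would call
#     require_package("kerchunk", "generating references from HDF5 files")
# so that importing the rest of the package never touches kerchunk.
```

A test-skipping marker could be built on the same check, for example by passing `not has_package("kerchunk")` as the condition of `pytest.mark.skipif`.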

mpiannucci · 2024-10-17T11:22:51Z

I think being able to read the kerchunk reference format without explicitly needing kerchunk would go a long way towards making this package useful in the interim, until kerchunk can support v3.

TomNicholas · 2024-10-17T13:40:47Z

able to read kerchunk reference format without explicitly needing kerchunk

I agree - it allows people to use VirtualiZarr to move their references from Kerchunk's format to Icechunk's format. We merged support for that yesterday! #251

We could aim for a working docs example along these lines, i.e.:

# kerchunk is not installed
vds = open_virtual_dataset('refs.parquet', format='kerchunk')
vds.virtualize.to_icechunk(icechunkstore)

TomNicholas · 2024-10-17T13:42:40Z

I guess actually @norlandrhagen and I weren't explicitly thinking about avoiding a kerchunk dependency in the kerchunk reader. But it looks like it only depends on ujson.
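To make the "only depends on ujson" point concrete, here is a rough sketch of loading JSON-format references with no kerchunk import at all. It uses the standard-library `json` module in place of ujson, and it assumes the common layout in which version 1 files nest the mapping under a `"refs"` key while unversioned files are the flat mapping itself. The function names are hypothetical, and `"templates"` expansion is deliberately not handled.

```python
import json


def parse_kerchunk_json(text: str) -> dict:
    """Parse kerchunk-style JSON text into a flat {key: reference} dict."""
    doc = json.loads(text)
    version = doc.get("version")
    if version is None:
        # Assumed unversioned layout: the document is already the flat mapping.
        return doc
    if version == 1:
        # Assumed version 1 layout: references live under "refs".
        return doc["refs"]
    raise ValueError(f"Unsupported reference format version: {version!r}")


def read_kerchunk_json(path: str) -> dict:
    """Read a references file from disk and parse it."""
    with open(path) as f:
        return parse_kerchunk_json(f.read())
```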

norlandrhagen · 2024-10-17T13:52:37Z

I guess actually @norlandrhagen and I weren't explicitly thinking about avoiding a kerchunk dependency in the kerchunk reader. But it looks like it only depends on ujson.

That's true for the json kerchunk format, but the parquet one is a bit more intertwined. I took the easy route of using LazyReferenceFileSystem to read the Zarquet references, but we could also probably assemble the references by reading all the group/variable level parquets with pyarrow (or something lighter, like arro3?! 🫥)

TomNicholas · 2024-10-17T13:58:03Z

we could also probably assemble the references by reading all the group/variable level parquets with pyarrow (or something lighter, like arro3?! 🫥)

Do we even need to go that far? The aim here is to get something that works with zarr v3, and if fsspec can work with zarr v3 (can it??) then that should be okay.

keewis · 2024-10-17T14:16:20Z

LazyReferenceFileSystem

isn't that defined in fsspec?

ghidalgo3 · 2024-10-17T16:20:35Z

I’ll be out of the country until November but I’m very interested in this change. Don’t wait on my review unless you can wait 2 weeks.

TomNicholas · 2024-10-17T18:39:25Z

read kerchunk reference format without explicitly needing kerchunk

@norlandrhagen after looking closer I'm pretty sure we can achieve this fairly easily - it just requires some modifications to the fixtures for the tests you submitted in #251. Specifically, we need to have data for those tests without actually invoking kerchunk at any point. I think we should just make a very small example json/parquet file (like literally 1 variable with 3 values) and save that into the repo, then have a fixture that returns that. Then we should be able to remove @requires_kerchunk for your two tests.
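A tiny hand-written references file of the kind suggested here (one variable, three values) might be generated once and committed, roughly as sketched below. Everything specific is an assumption for illustration: the variable name `a`, the target file `data.bin`, the `<i4` dtype, and the 12-byte chunk length (3 values × 4 bytes). The layout follows the JSON references format with Zarr v2 metadata stored as JSON strings.

```python
import json


def make_tiny_kerchunk_refs(data_url: str = "data.bin") -> dict:
    """Build in-memory references for one 1D variable holding 3 int32 values."""
    zarray = {
        "chunks": [3],
        "compressor": None,
        "dtype": "<i4",
        "fill_value": None,
        "filters": None,
        "order": "C",
        "shape": [3],
        "zarr_format": 2,
    }
    return {
        "version": 1,
        "refs": {
            ".zgroup": json.dumps({"zarr_format": 2}),
            "a/.zarray": json.dumps(zarray),
            "a/.zattrs": json.dumps({"_ARRAY_DIMENSIONS": ["x"]}),
            # A single chunk: [url, byte offset, byte length].
            "a/0": [data_url, 0, 12],
        },
    }


def write_tiny_kerchunk_refs(path: str) -> None:
    """Write the references to disk so they can be committed to the repo."""
    with open(path, "w") as f:
        json.dump(make_tiny_kerchunk_refs(), f, indent=2)
```

A test fixture could then simply return the path of the committed file, so that no test setup calls kerchunk.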

norlandrhagen · 2024-10-17T19:08:44Z

we could also probably assemble the references by reading all the group/variable level parquets with pyarrow (or something lighter, like arro3?! 🫥)

Do we even need to go that far? The aim here is to get something that works with zarr v3, and if fsspec can work with zarr v3 (can it??) then that should be okay.

Great! Happy to work on that. Also ties in a bit to the local data testing PR

TomNicholas · 2024-10-17T19:17:00Z

Awesome! Thanks @norlandrhagen. FYI I should have said this in your previous PR, but I think we should try to separate the kerchunk "reader" into two files: one "reader.py" which contains code that accepts a path and returns kerchunk references, and one "translator.py" which turns kerchunk-formatted in-memory references into a virtual dataset. Or something along those lines. That would keep the two use cases of dataset_from_kerchunk_refs more distinct.
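As a rough illustration of the "translator" half of that split, the sketch below takes in-memory references (already loaded by some reader) and groups them per variable. It is not the code from the actual PR: the function name and the output structure are hypothetical, and it assumes a single flat group with Zarr v2 style keys. Building a real virtual dataset from the result is left out.

```python
import json


def group_refs_by_variable(refs: dict) -> dict:
    """Group flat kerchunk-style refs into per-variable metadata and chunks."""
    variables = {}
    for key, value in refs.items():
        var, sep, name = key.rpartition("/")
        if not sep:
            continue  # top-level keys such as ".zgroup" or ".zattrs"
        entry = variables.setdefault(
            var, {"zarray": None, "zattrs": {}, "chunks": {}}
        )
        if name in (".zarray", ".zattrs"):
            # Metadata may be stored as a JSON string or as an already-parsed dict.
            parsed = json.loads(value) if isinstance(value, str) else value
            entry[name.lstrip(".")] = parsed
        else:
            # Chunk key (e.g. "0" or "0.1") mapped to its reference entry.
            entry["chunks"][name] = value
    return variables
```

Keeping this step free of any file access means it can be tested entirely from dictionaries, which is the property the proposed split is after.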

TomNicholas · 2024-10-18T13:57:59Z

I think we should try to separate the kerchunk "reader" into two files, one "reader.py" which contains code that accepts a path and returns kerchunk references, and one "translator.py"

This has been done in #261

TomNicholas added the Kerchunk (Relating to the kerchunk library / specification itself) and dependencies (Updates a dependency) labels Oct 17, 2024

TomNicholas mentioned this issue Oct 17, 2024

Skip tests that require kerchunk #259

Merged

1 task

TomNicholas closed this as completed in #259 Oct 17, 2024

TomNicholas mentioned this issue Oct 17, 2024

Split kerchunk reader up #261

Merged

7 tasks
