Allow `open_virtual_dataset` to read existing Kerchunk references #251

norlandrhagen · 2024-10-08T16:01:56Z

Allow `open_virtual_dataset` to read existing Kerchunk references as virtual datasets

Closes #118 (Open on-disk kerchunk references as a virtual dataset)
Tests added
Tests passing
Changes are documented in docs/releases.rst
New functionality has documentation

To Do:

[JSON] Trailing \\ in variable names. Fixed: the JSON writer was adding trailing slashes when writing references.
[JSON] Coordinates not recognized. Fixed: this was related to the JSON write.
[JSON] Coordinates set as data variables. Fixed: this was also related to the JSON write.
[PARQUET] Extract references either 1. from LazyReferenceMapper or 2. by reconstructing them from the ✨Zarrquet✨ format.
[FUTURE PR] Convert inlined vars into numpy arrays. Note: After talking with @sharkinsspatial, it seems like the HDF reader he is working on won't inline at all. Should we: 1. Raise an error if any inline data exists or 2. Add logic to convert any inlined bytes to numpy arrays to maintain more compatibility with other Kerchunk readers?
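
To make the two options above concrete, here is a minimal sketch. It is not the code from this PR. It assumes the Kerchunk version 1 reference layout, where a chunk entry is either a `[url, offset, length]` list or an inlined string that is optionally prefixed with `base64:`. The helper names `split_refs` and `reject_inlined` and the example file name are hypothetical, and turning the decoded bytes into numpy arrays (which also needs the dtype, shape and codecs from `.zarray`) is left out.

```python
import base64

# Zarr v2 metadata entries are stored as JSON strings, so they must not be
# mistaken for inlined chunk data.
METADATA_NAMES = {".zgroup", ".zarray", ".zattrs", ".zmetadata"}


def split_refs(refs):
    """Split a Kerchunk-style ``refs`` mapping into inlined and byte-range chunks.

    Hypothetical helper: returns ``(inlined, ranges)``, where ``inlined`` maps
    chunk keys to raw bytes and ``ranges`` maps chunk keys to reference lists.
    """
    inlined, ranges = {}, {}
    for key, value in refs.items():
        if key.rsplit("/", 1)[-1] in METADATA_NAMES:
            continue  # array or group metadata, not chunk data
        if isinstance(value, str):
            if value.startswith("base64:"):
                inlined[key] = base64.b64decode(value[len("base64:"):])
            else:
                inlined[key] = value.encode()  # plain-text inlined data
        else:
            ranges[key] = list(value)
    return inlined, ranges


def reject_inlined(refs):
    """Option 1: raise if any chunk is inlined."""
    inlined, _ = split_refs(refs)
    if inlined:
        raise NotImplementedError(
            f"inlined references are not supported: {sorted(inlined)}"
        )


example_refs = {
    ".zgroup": '{"zarr_format": 2}',
    "time/.zarray": '{"shape": [1], "dtype": "<i8"}',
    "time/0": "base64:AAAAAAAAAAA=",          # eight zero bytes, inlined
    "air/0.0.0": ["air.nc", 15419, 7738000],  # byte range in a hypothetical file
}

inlined, ranges = split_refs(example_refs)
```

Option 2 would start from the `inlined` mapping and decode each value according to the matching `.zarray` metadata.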

Notes:

Mypy checks in CI are currently disabled! See [Tracking] Scheduled CI is failing on main #249.
The utility function `_fsspec_openfile_from_filepath` forced the fsspec filesystem to open a filepath (the fault of past me). I replaced it with a simple class that has an `.open_file` method that can be used as needed. The Kerchunk LazyReferenceMapper requires an fsspec filesystem.

keewis · 2024-10-08T18:29:55Z

Did you see #186? That was a second attempt, after #119, to implement this, and it contains some documentation that you might be able to reuse.

norlandrhagen · 2024-10-08T22:41:14Z

@keewis thanks for the rec. Having some docs already is nice.

TomNicholas

Great work @norlandrhagen !

> Convert inlined vars into numpy arrays. Note: After talking with @sharkinsspatial, it seems like the HDF reader he is working on won't inline at all. Should we:
>
> 1. Raise an error if any inline data exists or
> 2. Add logic to convert any inlined bytes to numpy arrays to maintain more compatibility with other Kerchunk readers?

I feel like this mixes a few issues together. We do need to be able to read back inlined kerchunk references (though it's fine to add that feature in a follow-up PR), partly because even with Sean's PR we're still going to want to use the other kerchunk readers sometimes. I think it makes sense for @sharkinsspatial's HDF reader not to create inlined refs, so long as we have another way to create inlined refs (i.e. using the normal xarray backend for that filetype).

> Mypy checks in CI are currently disabled!

Should be fixed by #252

> The utility function _fsspec_openfile_from_filepath forced the fsspec filesystem to open a filepath

Not totally sure I understand this but abstracting away fsspec details sounds good.

codecov · 2024-10-11T22:21:20Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.37%. Comparing base (53a609f) to head (8d53227).
Report is 3 commits behind head on main.

@@            Coverage Diff             @@
##             main     #251      +/-   ##
==========================================
+ Coverage   91.20%   91.37%   +0.17%     
==========================================
  Files          32       32              
  Lines        2057     2098      +41     
==========================================
+ Hits         1876     1917      +41     
  Misses        181      181

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `91.37% <100.00%> (+0.17%)` | ⬆️ |

norlandrhagen · 2024-10-16T19:59:53Z

Thanks for the feedback @keewis and @TomNicholas. I think everything is addressed. The auto-detection is pretty rough, but hopefully if you're using this, you know the difference between parquet and json 🤷
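
For readers wondering what such auto-detection could look like, here is a rough sketch. It is not the logic merged in this PR. It assumes that Kerchunk parquet references are a directory containing a top-level `.zmetadata` file and that JSON references are a single file holding a JSON object with a `refs` key. The function name `guess_kerchunk_format` and the file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def guess_kerchunk_format(path):
    """Guess whether ``path`` holds Kerchunk references as "json" or "parquet".

    Hypothetical helper based on the assumptions stated above; raises
    ``ValueError`` when neither layout is recognised.
    """
    p = Path(path)
    if p.is_dir():
        if (p / ".zmetadata").is_file():
            return "parquet"
        raise ValueError(f"{p} is a directory without a .zmetadata file")
    try:
        with p.open("r", encoding="utf-8") as f:
            content = json.load(f)
    except (UnicodeDecodeError, json.JSONDecodeError) as err:
        raise ValueError(f"{p} is not a JSON file") from err
    if isinstance(content, dict) and "refs" in content:
        return "json"
    raise ValueError(f"{p} is JSON but has no 'refs' key")


# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    json_refs = Path(tmp) / "combined.json"
    json_refs.write_text(json.dumps({"version": 1, "refs": {".zgroup": "{}"}}))

    parquet_refs = Path(tmp) / "combined.parq"
    parquet_refs.mkdir()
    (parquet_refs / ".zmetadata").write_text("{}")

    detected = [
        guess_kerchunk_format(json_refs),
        guess_kerchunk_format(parquet_refs),
    ]
```

Parsing the whole JSON file just to detect its type is wasteful for large reference sets, which is one reason an explicit format argument can be preferable to guessing.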

TomNicholas · 2024-10-16T20:09:35Z

Thanks @norlandrhagen - happy to merge!

reading existing refs - wip

18f0deb

norlandrhagen added the Kerchunk (Relating to the kerchunk library / specification itself) and references formats (Storing byte range info on disk) labels Oct 8, 2024

norlandrhagen temporarily deployed to test-release October 8, 2024 16:02 — with GitHub Actions Inactive

ujson stub to mypy overrides in pyproject.toml

4d31016

norlandrhagen temporarily deployed to test-release October 8, 2024 16:23 — with GitHub Actions Inactive

added xfail to kerchunk json

e2b14fa

norlandrhagen temporarily deployed to test-release October 8, 2024 16:25 — with GitHub Actions Inactive

updated reference writing to remove trailing //

571991b

norlandrhagen temporarily deployed to test-release October 8, 2024 17:59 — with GitHub Actions Inactive

MYPY TEMP DISABLED

12a331f

norlandrhagen temporarily deployed to test-release October 8, 2024 18:00 — with GitHub Actions Inactive

added section to usage docs + updated releases.rst

6a445aa

norlandrhagen temporarily deployed to test-release October 8, 2024 22:48 — with GitHub Actions Inactive

test

a5a8f80

norlandrhagen temporarily deployed to test-release October 8, 2024 22:52 — with GitHub Actions Inactive

remove test deps from doc.yaml build

a5dcef0

norlandrhagen temporarily deployed to test-release October 8, 2024 22:59 — with GitHub Actions Inactive

tests passing for reading parquet references to virtual dataset, refactored _fsspec_open... to class

ba7daca

norlandrhagen temporarily deployed to test-release October 9, 2024 00:56 — with GitHub Actions Inactive

norlandrhagen changed the title ~~[DRAFT] Allow open_virual_dataset to read existing Kerchunk references~~ Allow open_virual_dataset to read existing Kerchunk references Oct 9, 2024

norlandrhagen changed the title ~~Allow open_virual_dataset to read existing Kerchunk references~~ Allow open_virtual_dataset to read existing Kerchunk references Oct 9, 2024

norlandrhagen requested a review from TomNicholas October 9, 2024 17:34

This was referenced Oct 9, 2024

Team Planning - Wednesday, October 9th leap-stc/data-and-compute-team#30

Closed

Open Kerchunk refs as Virtual Dataset #119

Closed

keewis reviewed Oct 10, 2024

pyproject.toml (outdated, resolved)

Update pyproject.toml

77e80ce

Co-authored-by: Justus Magin <[email protected]>

norlandrhagen temporarily deployed to test-release October 10, 2024 15:27 — with GitHub Actions Inactive

norlandrhagen temporarily deployed to test-release October 10, 2024 20:05 — with GitHub Actions Inactive

TomNicholas reviewed Oct 11, 2024

Update .github/workflows/main.yml

be709df

Co-authored-by: Tom Nicholas <[email protected]>

norlandrhagen temporarily deployed to test-release October 11, 2024 15:47 — with GitHub Actions Inactive

Merge branch 'main' into read_existing_refs

8d53227

norlandrhagen temporarily deployed to test-release October 11, 2024 22:19 — with GitHub Actions Inactive

norlandrhagen mentioned this pull request Oct 15, 2024

Ensure every reader uses dataset_from_kerchunk_refs before returning a virtual dataset #257

Closed

Dict -> dict, -> engine option. very flaky autodetection

15c5471

norlandrhagen temporarily deployed to test-release October 15, 2024 19:14 — with GitHub Actions Inactive

removed version from parquet refs

e3e086c

norlandrhagen temporarily deployed to test-release October 15, 2024 19:37 — with GitHub Actions Inactive

keewis reviewed Oct 16, 2024

virtualizarr/backend.py (outdated, resolved)

keewis reviewed Oct 16, 2024

virtualizarr/backend.py (outdated, resolved)

TomNicholas approved these changes Oct 16, 2024

docs/usage.md (outdated, resolved)

docs/usage.md (resolved)

keewis reviewed Oct 16, 2024

virtualizarr/backend.py (resolved)

norlandrhagen added 2 commits October 16, 2024 13:53

adds path for invalid kerchunk format + test

56c4053

updates existing references docs

f6fd5aa

norlandrhagen temporarily deployed to test-release October 16, 2024 19:57 — with GitHub Actions Inactive

TomNicholas added the references generation (Reading byte ranges from archival files) label Oct 16, 2024

norlandrhagen merged commit ec8e465 into main Oct 16, 2024
9 checks passed

TomNicholas mentioned this pull request Oct 17, 2024

Open kerchunk ref as virtual dataset, only json (from PR 119) #186

Closed

6 tasks

TomNicholas deleted the read_existing_refs branch October 17, 2024 13:38

This was referenced Oct 17, 2024

Make kerchunk dependency entirely optional #258

Closed

Store test datasets in repo #235

Open

Add Zarr v3 dependency #182

Draft

keewis mentioned this pull request Oct 21, 2024

file type discovery for the parquet format fsspec/kerchunk#519

Open

TomNicholas mentioned this pull request Oct 23, 2024

Appending to references on disk #21

Closed
