Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow open_virtual_dataset to read existing Kerchunk references #251

Merged
merged 17 commits into from
Oct 16, 2024

Conversation

norlandrhagen
Copy link
Collaborator

@norlandrhagen norlandrhagen commented Oct 8, 2024

Allow open_virtual_dataset to read existing Kerchunk references as virtual datasets

To Do:

  • [JSON] trailing \\ in variable names. Fixed. json was adding trailing slashes on ref write.
  • [JSON] coordinates not recognized. Fixed. This was related to json write.
  • [JSON] coordinates set as data variables. Fixed. This was related to json write.
  • [PARQUET] extract references from 1. LazyReferenceMapper or 2. Reconstruct reference from ✨Zarrquet ✨ format.
  • [FUTURE PR] Convert inlined vars into numpy arrays. Note: After talking with @sharkinsspatial, it seems like the HDF reader he is working on won't inline at all. Should we: 1. Raise an error if any inline data exists or 2. Add logic to convert any inlined bytes to numpy arrays to maintain more compatibility with other Kerchunk readers?

Notes:

  • Mypy checks in CI are currently disabled! [Tracking] Scheduled CI is failing on main #249
  • The utility function _fsspec_openfile_from_filepath forced the fsspec filesystem to open a filepath (the fault of past me). I replaced it with a simple class that has a .open_file method that can be used as needed. The Kerchunk LazyReferenceMapper required a fsspec filesystem.

@norlandrhagen norlandrhagen added Kerchunk Relating to the kerchunk library / specification itself references formats Storing byte range info on disk labels Oct 8, 2024
@keewis
Copy link
Contributor

keewis commented Oct 8, 2024

did you see #186? That was a second attempt after #119 to implement this, and contains some documentation that you might be able to reuse?

@norlandrhagen
Copy link
Collaborator Author

@keewis thanks for the rec. Having some docs already is nice.

@norlandrhagen norlandrhagen changed the title [DRAFT] Allow open_virual_dataset to read existing Kerchunk references Allow open_virual_dataset to read existing Kerchunk references Oct 9, 2024
@norlandrhagen norlandrhagen changed the title Allow open_virual_dataset to read existing Kerchunk references Allow open_virtual_dataset to read existing Kerchunk references Oct 9, 2024
Co-authored-by: Justus Magin <[email protected]>
Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @norlandrhagen !

Convert inlined vars into numpy arrays. Note: After talking with @sharkinsspatial, it seems like the HDF reader he is working on won't inline at all. Should we:

  1. Raise an error if any inline data exists or
  2. Add logic to convert any inlined bytes to numpy arrays to maintain more compatibility with other Kerchunk readers?

I feel like this mixes a few issues together. We do need to be able to read back inlined kerchunk references (though it's fine to add that feature in a follow-up PR), partly because even with Sean's PR we're still going to want to use the other kerchunk readers sometimes. I think it makes sense for @sharkinsspatial 's HDF reader not to create inlined refs, so long as we have another way to create inlined refs (i.e. using the normal xarray backend for that filetype).

Mypy checks in CI are currently disabled!

Should be fixed by #252

The utility function _fsspec_openfile_from_filepath forced the fsspec filesystem to open a filepath

Not totally sure I understand this but abstracting away fsspec details sounds good.

Copy link

codecov bot commented Oct 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.37%. Comparing base (53a609f) to head (8d53227).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #251      +/-   ##
==========================================
+ Coverage   91.20%   91.37%   +0.17%     
==========================================
  Files          32       32              
  Lines        2057     2098      +41     
==========================================
+ Hits         1876     1917      +41     
  Misses        181      181              
Flag Coverage Δ
unittests 91.37% <100.00%> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@norlandrhagen
Copy link
Collaborator Author

Thanks for the feedback @keewis and @TomNicholas. I think everything is addressed. The auto-detection is pretty rough, but hopefully if you're using this, you know the difference between parquet and json 🤷

@TomNicholas TomNicholas added the references generation Reading byte ranges from archival files label Oct 16, 2024
@TomNicholas
Copy link
Member

Thanks @norlandrhagen - happy to merge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Kerchunk Relating to the kerchunk library / specification itself references formats Storing byte range info on disk references generation Reading byte ranges from archival files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants