[V3] v2 -> v3 data migration #1798

Open
Tracked by #2412
d-v-b opened this issue Apr 17, 2024 · 8 comments

@d-v-b
Contributor

d-v-b commented Apr 17, 2024

We should invest in tools to make the v2 -> v3 conversion simple for people who are motivated to convert their data. A few high-level ideas:

  • A simple CLI that converts an array, or a group (recursive or not) from v2 to v3, in a new location.
    • Someone should investigate how complicated in-place conversions would be. On a local filesystem where mv is cheap, this could be attractive. V3 is designed to make array conversions easy, requiring only the creation of new metadata.
    • The CLI should use functions that are accessible from scripts that don't use the CLI. We can look at work @normanrz did in Zarrita.
  • Documentation of the key differences between zarr-python v2 and v3, and a migration guide. This should have its own page in the docs.
    • We should consider options for people who don't want to re-save their data. I'm not presently a kerchunk user, but I presume that kerchunk could map v2 to v3, for people who don't want to convert their data? cc @martindurant.
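
A minimal sketch of the "CLI on top of importable functions" idea from the list above. The program name `zarr-migrate`, the flags, and the `convert_node` function are all hypothetical names chosen for illustration; nothing here is an existing zarr-python API, and the conversion logic itself is left out.

```python
import argparse
from typing import Callable, Optional, Sequence


def convert_node(source: str, dest: Optional[str], recursive: bool) -> None:
    """Hypothetical library entry point that scripts could import directly."""
    raise NotImplementedError("conversion logic is out of scope for this sketch")


def build_parser() -> argparse.ArgumentParser:
    # All option names are assumptions, not an agreed interface.
    parser = argparse.ArgumentParser(
        prog="zarr-migrate",
        description="Convert Zarr v2 metadata to v3 (sketch).",
    )
    parser.add_argument("source", help="path to a v2 array or group")
    parser.add_argument("--dest", default=None,
                        help="new location; omit to convert in place")
    parser.add_argument("--recursive", action="store_true",
                        help="also convert the children of a group")
    return parser


def main(
    argv: Optional[Sequence[str]] = None,
    convert: Callable[[str, Optional[str], bool], None] = convert_node,
) -> int:
    # The CLI only parses arguments and delegates to the library function.
    args = build_parser().parse_args(argv)
    convert(args.source, args.dest, args.recursive)
    return 0
```

Keeping `main` as a thin wrapper means a script can call `convert_node` (or whatever the real function ends up being called) without going through the command line.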
@jeromekelleher
Member

jeromekelleher commented Apr 17, 2024

Big +1 on this. I'm working on a conversion tool for large-scale genomics data (hundreds of TB) which is usually held in file systems (for the moment; it will probably migrate to object stores later on). A CLI tool that does an in-place migration from v2 to v3 would be a big help. I'm hoping to move to v3 early on, before too many datasets are converted into v2 format, so that most users never need to know about v2.

My assumption was that the migration was largely a case of writing a new JSON metadata file per array, and that it should be possible to do both cheaply and safely?

@d-v-b
Contributor Author

d-v-b commented Apr 17, 2024

> My assumption was that the migration was largely a case of writing a new JSON metadata file per array, and that it should be possible to do both cheaply and safely?

Yes, I think this is right. Besides the metadata, which will live in a completely new JSON document (zarr.json), V3 supports a backwards-compatible layout for the chunks.
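
As a rough sketch of what "writing a new JSON metadata file per array" could look like, the function below builds a v3 `zarr.json` document from the contents of a v2 `.zarray` (and optional `.zattrs`). The function name and the dtype table are illustrative assumptions, not zarr-python API. It only handles a handful of numeric dtypes, no filters, and either no compressor or gzip; it also assumes a null v2 fill value can be replaced by 0 (or `false` for booleans). Anything else raises, since those cases may need the chunks rewritten.

```python
from typing import Optional

# Partial, illustrative mapping from NumPy-style v2 dtype codes to v3 names.
_V3_DATA_TYPES = {
    "b1": "bool",
    "i1": "int8", "i2": "int16", "i4": "int32", "i8": "int64",
    "u1": "uint8", "u2": "uint16", "u4": "uint32", "u8": "uint64",
    "f4": "float32", "f8": "float64",
}


def v2_to_v3_array_metadata(zarray: dict, zattrs: Optional[dict] = None) -> dict:
    """Build v3 array metadata that describes existing v2 chunks (sketch)."""
    if zarray.get("filters"):
        raise ValueError("filters are not handled in this sketch")
    dtype = zarray["dtype"]
    if not isinstance(dtype, str) or dtype[1:] not in _V3_DATA_TYPES:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    byteorder, code = dtype[0], dtype[1:]
    ndim = len(zarray["shape"])

    codecs = []
    if zarray.get("order", "C") == "F":
        # Fortran-ordered chunks: reverse the axes before serialising to bytes.
        codecs.append({"name": "transpose",
                       "configuration": {"order": list(range(ndim - 1, -1, -1))}})
    bytes_codec = {"name": "bytes"}
    if byteorder in ("<", ">"):
        bytes_codec["configuration"] = {
            "endian": "little" if byteorder == "<" else "big"}
    codecs.append(bytes_codec)

    compressor = zarray.get("compressor")
    if compressor is not None:
        if compressor.get("id") != "gzip":
            raise ValueError(f"compressor not handled here: {compressor.get('id')!r}")
        codecs.append({"name": "gzip",
                       "configuration": {"level": compressor["level"]}})

    fill_value = zarray.get("fill_value")
    if fill_value is None:
        # Assumption: v3 needs a concrete fill value, so fall back to zero.
        fill_value = False if code == "b1" else 0

    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": list(zarray["shape"]),
        "data_type": _V3_DATA_TYPES[code],
        "chunk_grid": {"name": "regular",
                       "configuration": {"chunk_shape": list(zarray["chunks"])}},
        # The "v2" encoding keeps chunk keys exactly where v2 wrote them.
        "chunk_key_encoding": {
            "name": "v2",
            "configuration": {"separator": zarray.get("dimension_separator", ".")}},
        "fill_value": fill_value,
        "codecs": codecs,
        "attributes": dict(zattrs or {}),
    }
```

The returned dictionary would then be written as `zarr.json` next to the existing `.zarray`, for example with `json.dump`.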

@jeromekelleher
Member

Thanks, yes. I've been aiming for v3 forwards compatibility by using "/" as the default dimension separator. Then, iterating over the chunks in the first dimension and renaming them to have a "c" prefix should be relatively cheap (I forgot about this difference).
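
The rename described here could be sketched as below for an array stored on a local filesystem with "/" as the dimension separator. The function name is hypothetical. It assumes every entry in the array directory whose name consists only of digits is a first-dimension chunk directory (or a chunk file, for a 1-D array); a child node that happens to have a numeric name would be moved by mistake. As later replies in this thread note, the rename is optional, because v3 can also address chunks through the v2 chunk key encoding.

```python
from pathlib import Path


def move_chunks_under_c(array_dir: Path) -> int:
    """Move first-dimension chunk entries into a "c" subdirectory (sketch).

    Turns <array>/0/1/2 into <array>/c/0/1/2 and returns the number of
    entries moved. Only top-level entries are renamed, so the cost is
    proportional to the number of chunks along the first dimension.
    """
    # Collect first, so the directory is not modified while iterating over it.
    entries = [p for p in array_dir.iterdir() if p.name.isdigit()]
    target = array_dir / "c"
    target.mkdir(exist_ok=True)
    for entry in entries:
        entry.rename(target / entry.name)
    return len(entries)
```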

Is there some developer documentation with recommendations for forwards/backwards compatibility?

@normanrz
Member

> Someone should investigate how complicated in-place conversions would be. On a local filesystem where mv is cheap, this could be attractive.

In most cases, the migration only requires adding zarr.json files throughout the hierarchy. There should be no need to even touch the chunk files. zarr.json and .zarray files can also live side by side. So, why would a mv be needed?
Only when using a non-supported codec or filter do chunks need to be rewritten.
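
A small sketch of "adding zarr.json files throughout the hierarchy" for a directory-based store. The function name is hypothetical. It writes v3 group metadata next to each existing `.zgroup`, leaves the v2 files untouched, and returns the array directories that still need their own `zarr.json`; producing the array metadata is deliberately left out.

```python
import json
import os
from typing import List


def add_group_metadata(root: str) -> List[str]:
    """Write zarr.json beside every .zgroup under root (sketch).

    Returns the directories containing a .zarray but no zarr.json yet.
    """
    pending_arrays = []
    for dirpath, dirnames, filenames in os.walk(root):
        if ".zarray" in filenames:
            if "zarr.json" not in filenames:
                pending_arrays.append(dirpath)
            dirnames.clear()  # do not descend into chunk directories
            continue
        if ".zgroup" in filenames and "zarr.json" not in filenames:
            attrs = {}
            if ".zattrs" in filenames:
                with open(os.path.join(dirpath, ".zattrs")) as f:
                    attrs = json.load(f)
            # v3 keeps user attributes inside zarr.json itself.
            meta = {"zarr_format": 3, "node_type": "group", "attributes": attrs}
            with open(os.path.join(dirpath, "zarr.json"), "w") as f:
                json.dump(meta, f, indent=2)
    return sorted(pending_arrays)
```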

@d-v-b
Contributor Author

d-v-b commented Apr 17, 2024

> So, why would a mv be needed?
> Only when using a non-supported codec or filter do chunks need to be rewritten.

This is correct. When I wrote up this issue, I forgot about the v2 chunk key encoding supported by v3 🤦
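
For reference, the difference between the two chunk key encodings can be sketched in a few lines. The helper name is made up for illustration; the behaviour follows my reading of the v3 specification, where the "default" encoding prefixes keys with "c" and uses "/" unless told otherwise, and the "v2" encoding has no prefix and uses "." unless told otherwise.

```python
from typing import Optional, Sequence


def chunk_key(coords: Sequence[int], encoding: str = "default",
              separator: Optional[str] = None) -> str:
    """Return the store key (relative to the array) for a chunk (sketch)."""
    if encoding == "default":
        sep = "/" if separator is None else separator
        # A zero-dimensional array has the single key "c".
        return sep.join(["c", *map(str, coords)])
    if encoding == "v2":
        sep = "." if separator is None else separator
        # A zero-dimensional array has the single key "0".
        return sep.join(map(str, coords)) or "0"
    raise ValueError(f"unknown chunk key encoding: {encoding!r}")


print(chunk_key((1, 2, 3)))             # c/1/2/3
print(chunk_key((1, 2, 3), "v2"))       # 1.2.3
print(chunk_key((1, 2, 3), "v2", "/"))  # 1/2/3
```

Because a v3 array can declare the "v2" encoding in its metadata, existing v2 chunk keys stay valid and no files have to move.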

@d-v-b
Contributor Author

d-v-b commented Apr 17, 2024

I updated the issue to be more accurate :)

@normanrz
Member

I agree that a CLI tool that can convert an entire hierarchy would be great!

@jhamman jhamman added the V3 label Apr 19, 2024
@jhamman jhamman added this to the After 3.0.0 milestone Apr 19, 2024
@jhamman
Member

jhamman commented Apr 19, 2024

Today I learned that there is a v1 to v2 migrator in the zarr-python codebase:

zarr-python/zarr/storage.py

Lines 1941 to 1956 in 6105ef2

def migrate_1to2(store):
    """Migrate array metadata in `store` from Zarr format version 1 to
    version 2.

    Parameters
    ----------
    store : Store
        Store to be migrated.

    Notes
    -----
    Version 1 did not support hierarchies, so this migration function will
    look for a single array in `store` and migrate the array metadata to
    version 2.

    """

@jhamman jhamman modified the milestones: After 3.0.0, 3.0.0 Apr 22, 2024
@jhamman jhamman moved this to Todo in Zarr-Python - 3.0 Apr 22, 2024
@dstansby dstansby removed the V3 label Dec 12, 2024
@dstansby dstansby changed the title [V3] v2 -> v3 migration [V3] v2 -> v3 datamigration Dec 16, 2024
@dstansby dstansby changed the title [V3] v2 -> v3 datamigration [V3] v2 -> v3 data migration Dec 16, 2024
@jhamman jhamman marked this as a duplicate of #2564 Dec 18, 2024
@dstansby dstansby marked this as not a duplicate of #2564 Dec 18, 2024
@jhamman jhamman marked this as not a duplicate of #2564 Dec 19, 2024
Labels
None yet
Projects
Status: In Progress

5 participants