ZEP9 (phase 1): add clarifications for extension naming #330

joshmoore · 2025-02-14T14:21:49Z

This PR clarifies the extension mechanism concept in the v3 specification. Comments on any changes which will break existing implementations are STRONGLY encouraged. Please see zarr-developers/zeps#65 for background material.

TODOs:

clarify the file numbering (currently 3.0.rst)
move definitions to the appropriate location (core, subtype page or ext

Post-merge:

remove preview label from zarr-extensions
return to zarr-specs v3.1 follow-ons #329

rabernat · 2025-02-16T02:00:01Z

@joshmoore - really glad you got this started! 🙌

My feedback is that the PR is hard to review. It touches 15 files, including a ton of minor, unrelated formatting changes to the core spec document.

If we want folks to engage and give meaningful feedback, we need to make it easier to review. I'd recommend starting fresh with a minimal PR in which the diffs are reflective exclusively of the actual proposed changes.

joshmoore · 2025-02-16T11:39:29Z

@rabernat
> really glad you got this started! 🙌

👍

> It touches 15 files

You're right. I've extracted out #331.

> including a ton of minor, unrelated formatting changes to the core spec document.

I disagree that they are unrelated. Take a look. The sections I've modified were basically already un-parseable. Since I was adding sections, the outline was getting more convoluted.

> I'd recommend starting fresh with a minimal PR in which the diffs are reflective exclusively of the actual proposed changes.

👍 Give it a look and let me know what you think.

jbms · 2025-02-17T05:07:54Z

Thanks for all of your work on this!

My current understanding of the practical effect of the proposal is as follows:

- Raw names will be granted fairly easily, e.g. zstd, bfloat16, and others I've proposed would be assigned to me, and the ones that zarr-python has started using (string, bytes, vlen-utf8, etc.) would be assigned to someone from zarr-python. URL names will be used only for really experimental stuff; all commonly-used extensions will have raw names since they will be minimal effort. Therefore, the verbosity of the URLs is not really a problem in practice.

- The ZEP process, or really any mandatory review process at all, will not be used for proposing extensions that fit into any of the existing extension points, only for entirely new extension points. At most someone might ask around for comments informally before adopting something.

The lack of basically any review worries me a bit. But ultimately I'm in favor of this proposal because I think it reflects the reality that the ZEP process isn't working for the existing extension points, and it would be better to just rely on a less formal process.

normanrz · 2025-02-18T12:07:32Z

> The lack of basically any review worries me a bit. But ultimately I'm in favor of this proposal because I think it reflects the reality that the ZEP process isn't working for the existing extension points, and it would be better to just rely on a less formal process.

I share your concerns to some degree. I think we can adapt the governance structure for extensions in the future, if we think that a more thorough review process would be necessary. We are thinking of forming a zarr specs team that could take on that responsibility.

d-v-b · 2025-02-20T09:41:33Z

thanks so much for working on this josh! I have a few high-level comments:

URIs as names still feels unmotivated.

How would we explain to someone developing a new data type why they would need to use a URI for the name? I don't think I could give a good justification for this decision right now, and that's a problem.
By typical naming standards, URIs are not good names, as good names tend to be short and expressive, and URIs are not. So there must be a good reason to use a verbose, structured string like a URI. What problem does a URI based name solve?

Are we using URIs to ensure that data type / codec names are scoped? In that case, why is this better than a simple scoping scheme, like <github handle>/<name> (given that all zarr development is on github)?
Are we using URIs to ensure that extensions carry around a reference to their documentation? In that case, why not use a dedicated field for this, e.g. {"name": "my_codec", "docs": "https://my-codec.com/docs", "config": {...}}
Are we using URIs because we want unique identifiers? Putting aside that URIs are not guaranteed to be unique, why is a URI better than separating the human-readable name of the entity from its identifier, e.g. {"name": "foo", "id": <long UUID>, "config": {...}}? Although I think <github handle>/<name> would also be unique enough for our purposes.
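
For readers comparing the options raised in the questions above, here is a minimal sketch that writes each one out as a metadata fragment next to a URL-style name. None of these layouts is taken from the spec: the `config` key follows the spelling used in this comment, and the handle `example-handle`, the host `example.com`, and the UUID are placeholders.

```python
import json
import uuid

# Every name, handle, and URL here is a placeholder for illustration only.
alternatives = {
    # Name given as a URL, as discussed in this PR.
    "url name": {"name": "https://example.com/my_codec", "config": {}},
    # Scoping via <github handle>/<name>.
    "scoped name": {"name": "example-handle/my_codec", "config": {}},
    # Short name plus a dedicated documentation field.
    "docs field": {
        "name": "my_codec",
        "docs": "https://my-codec.com/docs",
        "config": {},
    },
    # Human-readable name kept separate from a unique identifier.
    "separate id": {"name": "foo", "id": str(uuid.uuid4()), "config": {}},
}

for label, fragment in alternatives.items():
    print(f"{label}: {json.dumps(fragment)}")
```

Laid out this way, the trade-off in the questions is easier to see: the URL form packs scoping and a possible documentation link into one string, while the alternatives spread those roles over separate fields.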

On a practical level, I think it would be good to have a guide for people who want to make their own codecs / datatypes / chunk grids. Should they use a "raw name" or a URI name? That isn't clear in the text right now.

normanrz · 2025-02-20T10:50:45Z

It feels like we have given an explanation for the reasons of the URI names a few times now. Let me reiterate one more time:

Since we only want folks to use URIs they own, URIs can be used without any coordination and provide protection against naming conflicts
URIs can link to specification documents. We would like that but are not requiring that
URIs would tie in nicely with json-ld, which we are interested in exploring in the future
If folks want a short and expressive name, they should register a "raw name". It is very easy

We know that there might be other options here, but that is the design we landed on.

d-v-b · 2025-02-20T11:11:41Z

> It feels like we have given an explanation for the reasons of the URI names a few times now.

In this comment I asked several design questions which are all posed in the following form: "if we want <feature>, why is a URI-based name better than <alternative>?" I'm sorry if it feels repetitive but if I could get answers to those questions it would really help get me on board.

d-v-b · 2025-02-20T11:48:58Z

docs/v3/stores.rst

+   Stores are *not* extension points since they define the mechanism
+   for loading metadata documents such that extensions can be loaded.


If "extensibility" is defined as a property of a field in a metadata document, then we don't need this note, because stores are not defined in metadata.

d-v-b · 2025-02-20T12:46:17Z

Another very basic question about, e.g., a new data type that uses a URI for its name. That new data type should have a standard handle (like "complex256") that can be used as an identifier in most programming languages, which rules out a URI. Should the URI be defined such that the final path component of the URI sans "."-delimited suffixes is the handle for the data type? e.g. "http://foo.com/complex256.schema.json" would be a URI for a data type called "complex256"?

Without a constraint like this, it's not clear that an extension has a human-and-software-friendly name, but I think this is an important feature.
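
The constraint floated here (final path component, minus "."-delimited suffixes) could be sketched as below. This only illustrates the suggestion and is not something the spec defines; `handle_from_uri` is a hypothetical helper name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def handle_from_uri(uri: str) -> str:
    """Return the final path component of a URI with all '.'-suffixes removed.

    Hypothetical helper: it implements the rule proposed in the comment above,
    not any rule from the Zarr specification.
    """
    path = urlsplit(uri).path                 # "/complex256.schema.json"
    last_component = PurePosixPath(path).name  # "complex256.schema.json"
    return last_component.split(".", 1)[0]     # "complex256"


print(handle_from_uri("http://foo.com/complex256.schema.json"))  # complex256
```

One consequence of such a rule is that two URIs on different hosts can yield the same handle, so the handle alone would not be unique.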

joshmoore · 2025-02-20T14:39:34Z

Thanks for the various suggestions, @d-v-b. I've pushed a commit for the comments that I've resolved to this point.

Based on the discussion above, I went ahead and restricted this PR as phase 1 of ZEP9 to just discuss URLs and depending on that discussion phase 3 can address the issue of URIs if at all.

That leaves as the other major next steps:

rename the subdocuments and split out the chunk documents
decide on the terminology question -- ZEP9 (phase 1): add clarifications for extension naming #330 (review) etc.
update the versioning/stability section (or defer that discussion) -- ZEP9 (phase 1): add clarifications for extension naming #330 (review)

d-v-b · 2025-02-20T14:48:03Z

> Based on the discussion above, I went ahead and restricted this PR as phase 1 of ZEP9 to just discuss URLs and depending on that discussion phase 3 can address the issue of URIs if at all.

I think my concerns in the discussion above apply equally to URLs and URIs alike.

d-v-b · 2025-02-20T16:50:39Z

How should implementations interpret the requirement that extension names be either a string registered on a zarr extensions github repo, or a URL?

Suppose a user is working on a new dtype. It's unpublished code on a single computer; it has no spec, and no URL. Should zarr-python query github if they try to register their new dtype for testing purposes? Should zarr-python require that they provide a dummy URL? Neither of these options seems good.

I think it's really important we support this use case, because solo tinkering and experimentation is where many new dtypes / codecs / etc come from. I think this argues against stating that extension identifiers MUST be registered on github or a URL. More broadly, I don't think the spec should make any MUST statements about things that cannot be locally evaluated at runtime, which excludes any dynamic online registry lookups.

Of course it's vital that extensions are discoverable, documented properly, maintained, etc. But IMO the rules for this process should be defined outside the core spec. Otherwise we will make normative statements that are very hard for implementations to work with.
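
To make "locally evaluated at runtime" concrete, here is a minimal sketch of a check that looks only at the shape of a name and never touches the network. The raw-name pattern is an assumption made for this example, not a rule from the spec, and `check_name_offline` is a hypothetical function, not part of zarr-python.

```python
import re
from urllib.parse import urlsplit

# Assumed shape for raw names (lowercase letters, digits, "_", ".", "-").
# This pattern is illustrative only and does not come from the spec.
RAW_NAME = re.compile(r"[a-z0-9][a-z0-9_.-]*")


def check_name_offline(name: str) -> str:
    """Classify an extension name using only its syntax, with no registry lookup."""
    parts = urlsplit(name)
    if parts.scheme in ("http", "https") and parts.netloc:
        return "url"
    if RAW_NAME.fullmatch(name):
        # Whether the name is actually registered cannot be known offline.
        return "raw-shaped"
    return "invalid"


for candidate in ("vlen-utf8", "https://example.com/dtypes/complex256", "My Dtype!"):
    print(candidate, "->", check_name_offline(candidate))
```

A check like this can tell a URL from a raw-shaped string, but it cannot tell whether a raw name has been registered, which is the part that would need an online lookup.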

normanrz · 2025-02-20T19:59:50Z

I don't see that as a practical issue. The spec defines spec-compliant metadata and behavior with the intention of organizing interoperability.
I think implementations should allow room for non-compliance, in particular when tinkering and developing. But, it is up to the implementations to decide how and when to enforce spec-compliance.
If people don't share their data and don't expect interoperability with other implementations, they can do whatever they want with their data. As soon as they want to share their data under the Zarr brand, the data needs to be spec-compliant.

Here is what I think zarr-python might do:

Should it check the registry when opening an array? That would be overkill. It could have some well-known extensions built-in, though, that are validated at build time.
Should it ship with non-compliant extensions? No
Should it try to prevent loading plugins (via entry points or similar) that implement non-compliant extensions? Probably not. It could check for naming conflicts with built-in extensions and already-loaded plugins, though.
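
The conflict check in the last point could look roughly like the following. This is a sketch under assumptions, not zarr-python's actual plugin machinery: `ExtensionRegistry` and the plugin names are hypothetical.

```python
class ExtensionRegistry:
    """Hypothetical in-memory registry that rejects duplicate extension names."""

    def __init__(self, builtin):
        # Map each extension name to a description of what provides it.
        self._entries = {name: "built-in" for name in builtin}

    def register(self, name, source):
        """Record a plugin-provided extension, refusing names already taken."""
        if name in self._entries:
            raise ValueError(
                f"extension name {name!r} is already provided by {self._entries[name]}"
            )
        self._entries[name] = source


registry = ExtensionRegistry(builtin=["zstd", "bytes"])

# A plugin with a name nobody else uses loads without any compliance check.
registry.register("https://example.com/my-codec", source="plugin example_plugin")

# A plugin that reuses a built-in name is rejected.
try:
    registry.register("zstd", source="plugin other_plugin")
except ValueError as exc:
    print(exc)
```

The registry only compares names it already knows about, so it needs no network access and makes no judgment about whether a plugin's name is spec-compliant.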

joshmoore mentioned this pull request Feb 14, 2025

Add ZEP 9 (extension naming) draft zarr-developers/zeps#65

Merged

5 tasks

joshmoore added 12 commits February 16, 2025 12:34

Merge Davis proposal with ZEP0009

65bc69f

Remaining text blocks are likely to be re-used under the more general "Extension points" section. see: zarr-developers#312

Start changelog

b109fb7

Add definitions

4c0e494

Fix definitions

05b4fa4

slightly longer change log

f9508d4

New extensions section

34ac282

Update array metadata section

16e34ca

Update group metadata section

d2f6f9d

Clean the extension listing pages

1d85e70

Also list no datatypes as defined

43e3862

Link more terms to extensions

c1accfe

More crosslinks and identifier clarifications

454faaf

joshmoore force-pushed the zep9-ext-naming branch from 549cc16 to 454faaf Compare February 16, 2025 11:36

joshmoore mentioned this pull request Feb 16, 2025

Cleanup the spec docs & builds for 3.1 #331

Merged

jbms reviewed Feb 17, 2025

docs/v3/data-types.rst

jbms reviewed Feb 17, 2025

docs/v3/core/v3.0.rst

normanrz added 2 commits February 17, 2025 15:44

add zarr-extensions repo

ef69ff1

Merge remote-tracking branch 'origin/main' into zep9-ext-naming

db7db15

joshmoore mentioned this pull request Feb 18, 2025

Add zstd codec #256

Open

joshmoore mentioned this pull request Feb 18, 2025

Define the list of codecs in the v3 spec #312

Closed

joshmoore added 5 commits February 18, 2025 13:40

Remove TODOs with PR and repo link

3d448c1

Move 'core data types' to a subpage

e6200c8

Clarify concept of 'core'

0e0a03b

Unify listing of all extensions on subpages

1600ee9

No changes to content were made!

Rename core/v3.0 to core/index

429988a

d-v-b reviewed Feb 19, 2025