ZEP9 (phase 1): add clarifications for extension naming #330

Open
wants to merge 29 commits into
base: main

joshmoore
Member

@joshmoore joshmoore commented Feb 14, 2025

This PR clarifies the extension mechanism concept in the v3 specification. Comments on any changes which will break existing implementations are STRONGLY encouraged. Please see zarr-developers/zeps#65 for background material.

TODOs:

  • clarify the file numbering (currently 3.0.rst)
  • move definitions to the appropriate location (core, subtype page or ext

Post-merge:

@rabernat
Contributor

@joshmoore - really glad you got this started! 🙌

My feedback is that the PR is hard to review. It touches 15 files, including a ton of minor, unrelated formatting changes to the core spec document.

If we want folks to engage and give meaningful feedback, we need to make it easier to review. I'd recommend starting fresh with a minimal PR in which the diffs are reflective exclusively of the actual proposed changes.

@joshmoore
Member Author

@rabernat
> really glad you got this started! 🙌

👍

> It touches 15 files

You're right. I've extracted out #331.

> including a ton of minor, unrelated formatting changes to the core spec document.

I disagree that they are unrelated. Take a look. The sections I've modified were basically already un-parseable. Since I was adding sections, the outline was getting more convoluted.

> I'd recommend starting fresh with a minimal PR in which the diffs are reflective exclusively of the actual proposed changes.

👍 Give it a look and let me know what you think.

@jbms
Contributor

jbms commented Feb 17, 2025

Thanks for all of your work on this!

My current understanding of the practical effect of the proposal is as follows:

  • raw names will be granted fairly easily: zstd, bfloat16, and the others I've proposed would be assigned to me, and the ones that zarr-python has started using (string, bytes, vlen-utf8, etc.) would be assigned to someone from zarr-python. URL names will be used only for really experimental stuff; all commonly used extensions will have raw names, since those will take minimal effort to obtain. Therefore, the verbosity of the URLs is not really a problem in practice.

  • the ZEP process, or really any mandatory review process at all, will not be used for proposing extensions that fit into any of the existing extension points, only for entirely new extension points. At most someone might ask around for comments informally before adopting something.

The lack of basically any review worries me a bit. But ultimately I'm in favor of this proposal because I think it reflects the reality that the ZEP process isn't working for the existing extension points, and it would be better to just rely on a less formal process.

@joshmoore joshmoore mentioned this pull request Feb 18, 2025
@normanrz
Member

> The lack of basically any review worries me a bit. But ultimately I'm in favor of this proposal because I think it reflects the reality that the ZEP process isn't working for the existing extension points, and it would be better to just rely on a less formal process.

I share your concerns to some degree. I think we can adapt the governance structure for extensions in the future, if we think that a more thorough review process would be necessary. We are thinking of forming a zarr specs team that could take on that responsibility.

@d-v-b
Contributor

d-v-b commented Feb 20, 2025

thanks so much for working on this josh! I have a few high-level comments:

URIs as names still feels unmotivated.

How would we explain to someone developing a new data type why they would need to use a URI for the name? I don't think I could give a good justification for this decision right now, and that's a problem.
By typical naming standards, URIs are not good names, as good names tend to be short and expressive, and URIs are not. So there must be a good reason to use a verbose, structured string like a URI. What problem does a URI based name solve?

  • Are we using URIs to ensure that data type / codec names are scoped? In that case, why is this better than a simple scoping scheme, like <github handle>/<name> (given that all zarr development is on github)?
  • Are we using URIs to ensure that extensions carry around a reference to their documentation? In that case, why not use a dedicated field for this, e.g. {"name": "my_codec", "docs": "https://my-codec.com/docs", "config": {...}}
  • Are we using URIs because we want unique identifiers? Putting aside that URIs are not guaranteed to be unique, why is a URI better than separating the human-readable name of the entity from its identifier, e.g. {"name": "foo", "id": <long UUID>, "config": {...}}? Although I think <github handle>/<name> would also be unique enough for our purposes.
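
For concreteness, the three alternatives in the list above might look like this as metadata entries. This is only a sketch: the handle `some-handle`, the all-zero UUID, and the empty `config` objects are placeholders, and none of these shapes is part of the proposal under review.

```python
import json

# Placeholder values throughout; none of these names is registered anywhere.

# Alternative 1: scope the name with a GitHub handle.
scoped = {"name": "some-handle/my_codec", "config": {}}

# Alternative 2: keep a short name and carry the documentation link separately.
with_docs = {"name": "my_codec", "docs": "https://my-codec.com/docs", "config": {}}

# Alternative 3: separate the human-readable name from a unique identifier.
with_id = {
    "name": "foo",
    "id": "00000000-0000-0000-0000-000000000000",
    "config": {},
}

for entry in (scoped, with_docs, with_id):
    print(json.dumps(entry))
```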

On a practical level, I think it would be good to have a guide for people who want to make their own codecs / datatypes / chunk grids. Should they use a "raw name" or a URI name? That isn't clear in the text right now.

@normanrz
Member

normanrz commented Feb 20, 2025

It feels like we have given an explanation for the reasons of the URI names a few times now. Let me reiterate one more time:

  • Since we only want folks to use URIs they own, URIs can be used without any coordination and provide protection against naming conflicts
  • URIs can link to specification documents. We would like that but are not requiring that
  • URIs would tie in nicely with json-ld, which we are interested in exploring in the future
  • If folks want a short and expressive name, they should register a "raw name". It is very easy
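
A minimal sketch of how an implementation might tell the two kinds of name apart locally, with no network access. The function name `classify_extension_name` and the raw-name pattern are assumptions made for illustration; the spec text, not this sketch, decides what is valid.

```python
import re
from urllib.parse import urlsplit

# Assumed pattern for a "raw name": short, lowercase, no scheme or slashes.
# The actual rules are defined by the spec text, not by this sketch.
RAW_NAME = re.compile(r"^[a-z0-9][a-z0-9_.-]*$")


def classify_extension_name(name: str) -> str:
    """Return "url", "raw", or "invalid" for an extension name."""
    parts = urlsplit(name)
    # A URL name needs at least a scheme and a host the author controls.
    if parts.scheme in ("http", "https") and parts.netloc:
        return "url"
    if RAW_NAME.match(name):
        return "raw"
    return "invalid"
```

Under these assumptions, `zstd` and `vlen-utf8` classify as raw names and `https://example.com/my-codec` as a URL name. Whether a raw name is actually registered cannot be decided by a check like this one.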

We know that there might be other options here, but that is the design we landed on.

@d-v-b
Copy link
Contributor

d-v-b commented Feb 20, 2025

> It feels like we have given an explanation for the reasons of the URI names a few times now.

In this comment I asked several design questions which are all posed in the following form: "if we want <feature>, why is a URI-based name better than <alternative>?" I'm sorry if it feels repetitive but if I could get answers to those questions it would really help get me on board.

Comment on lines +19 to +20
Stores are *not* extension points since they define the mechanism
for loading metadata documents such that extensions can be loaded.
Contributor


If "extensibility" is defined as a property of a field in a metadata document, then we don't need this note, because stores are not defined in metadata.

@d-v-b
Copy link
Contributor

d-v-b commented Feb 20, 2025

another very basic question about, e.g., a new data type that uses a URI for its name. That new data type should have a standard handle (like "complex256") that can be used as an identifier in most programming languages, which rules out a URI. Should the URI be defined such that the final path component of the URI sans "."-delimited suffixes is the handle for the data type? e.g. "http://foo.com/complex256.schema.json" would be a URI for a data type called "complex256"?

Without a constraint like this, it's not clear that an extension has a human-and-software-friendly name, but I think this is an important feature.
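
The constraint floated above (final path component, minus "."-delimited suffixes) can be sketched as follows. `handle_from_url` is a hypothetical helper, not something the PR or any Zarr implementation defines.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def handle_from_url(url: str) -> str:
    """Return the final path component of `url` without "."-delimited suffixes."""
    path = urlsplit(url).path
    last = PurePosixPath(path).name  # e.g. "complex256.schema.json"
    return last.split(".", 1)[0]     # keep only the text before the first "."
```

With this helper, `handle_from_url("http://foo.com/complex256.schema.json")` returns `"complex256"`. Two different URLs can yield the same handle, so a rule like this gives a friendly name but not a unique one.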

@joshmoore
Copy link
Member Author

joshmoore commented Feb 20, 2025

Thanks for the various suggestions, @d-v-b. I've pushed a commit for the comments that I've resolved to this point.

Based on the discussion above, I went ahead and restricted this PR as phase 1 of ZEP9 to just discuss URLs and depending on that discussion phase 3 can address the issue of URIs if at all.

That leaves as the other major next steps:

@d-v-b
Copy link
Contributor

d-v-b commented Feb 20, 2025

> Based on the discussion above, I went ahead and restricted this PR as phase 1 of ZEP9 to just discuss URLs and depending on that discussion phase 3 can address the issue of URIs if at all.

I think my concerns in the discussion above apply equally to URLs and URIs alike.

@d-v-b
Copy link
Contributor

d-v-b commented Feb 20, 2025

How should implementations interpret the requirement that extension names be either a string registered on a zarr extensions github repo, or a URL?

Suppose a user is working on a new dtype. It's unpublished code on a single computer; it has no spec, and no URL. Should zarr-python query github if they try to register their new dtype for testing purposes? Should zarr-python require that they provide a dummy URL? Neither of these options seems good.

I think it's really important we support this use case, because solo tinkering and experimentation is where many new dtypes / codecs / etc come from. I think this argues against stating that extension identifiers MUST be registered on github or a URL. More broadly, I don't think the spec should make any MUST statements about things that cannot be locally evaluated at runtime, which excludes any dynamic online registry lookups.

Of course it's vital that extensions are discoverable, documented properly, maintained, etc. But IMO the rules for this process should be defined outside the core spec. Otherwise we will make normative statements that are very hard for implementations to work with.

@normanrz
Member

I don't see that as a practical issue. The spec defines spec-compliant metadata and behavior with the intention of organizing interoperability.
I think implementations should allow room for non-compliance, in particular when tinkering and developing. But, it is up to the implementations to decide how and when to enforce spec-compliance.
If people don't share their data and don't expect interoperability with other implementations, they can do whatever they want with their data. As soon as they want to share their data under the Zarr brand, the data needs to be spec-compliant.

Here is what I think zarr-python might do:

  • Should it check the registry when opening an array? That would be overkill. It could have some well-known extensions built-in, though, that are validated at build time.
  • Should it ship with non-compliant extensions? No
  • Should it try to prevent loading plugins (via entry points or similar) that implement non-compliant extensions? Probably not. It could check for naming conflicts with built-in extensions and already-loaded plugins, though.
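
The conflict check in the last bullet could be sketched like this. `ExtensionRegistry` is a hypothetical in-process registry written for illustration; it is not zarr-python's actual plugin API, and it performs no online lookup.

```python
class ExtensionRegistry:
    """Hypothetical in-process registry; not zarr-python's actual plugin API."""

    def __init__(self, builtin):
        # Names shipped with the implementation, fixed at build time.
        self._owners = {name: "builtin" for name in builtin}

    def register(self, name: str, owner: str) -> None:
        """Record a plugin-provided extension, rejecting name collisions."""
        existing = self._owners.get(name)
        if existing is not None:
            raise ValueError(
                f"extension name {name!r} is already provided by {existing}"
            )
        self._owners[name] = owner


registry = ExtensionRegistry(builtin=["zstd", "bfloat16"])
registry.register("https://example.com/my-codec", owner="plugin-a")  # accepted
```

A second plugin that tries to register `zstd`, or the same URL again, gets a `ValueError`. Names that are merely unregistered or experimental pass through untouched, which matches the "probably not" in the bullet above.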

5 participants