Are regexes good enough to validate literals? #407

Closed
stasm opened this issue Jul 3, 2023 · 14 comments · Fixed by #815
Labels
LDML46 LDML46 Release (Tech Preview - October 2024) registry Issue pertains to the function registry

Comments

@stasm
Collaborator

stasm commented Jul 3, 2023

A follow-up to #368.

The current draft of the registry uses named regex patterns to allow defining rules for validating literal arguments and option values.

<pattern id="positiveInteger" regex="[0-9]+"/>

And then:

<option name="minimumIntegerDigits" pattern="positiveInteger"/>
  • Are regexes enough for this task?
  • Should patterns be defined inline rather than referenced by id?
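To make the "referenced by id" design concrete, here is a minimal Python sketch of how a tool might resolve the `pattern` reference and validate an option literal. The `<registry>` and `<function>` wrapper elements and the helper names are assumptions for illustration; only the `<pattern>` and `<option>` elements come from the draft quoted above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical registry fragment. Only <pattern> and <option> are taken from
# the draft; the <registry> and <function> wrappers are assumed.
REGISTRY_XML = """
<registry>
  <pattern id="positiveInteger" regex="[0-9]+"/>
  <function name="number">
    <option name="minimumIntegerDigits" pattern="positiveInteger"/>
  </function>
</registry>
"""


def load_validators(xml_text):
    """Map each option name to the compiled regex its `pattern` id refers to."""
    root = ET.fromstring(xml_text)
    patterns = {p.get("id"): re.compile(p.get("regex")) for p in root.iter("pattern")}
    return {o.get("name"): patterns[o.get("pattern")] for o in root.iter("option")}


def is_valid_option(validators, name, literal):
    """Return True if the literal fully matches the regex declared for the option."""
    regex = validators.get(name)
    return regex is not None and regex.fullmatch(literal) is not None


VALIDATORS = load_validators(REGISTRY_XML)
```

With this sketch, `is_valid_option(VALIDATORS, "minimumIntegerDigits", "2")` is true and `"2.5"` is rejected. Note that the quoted regex `[0-9]+` also accepts `0`, so the name `positiveInteger` is looser than the pattern it labels.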
@catamorphism
Collaborator

Whether regexes are enough depends on the type language: what is the set of possible types that can appear in function signatures? If the type language includes numeric types, finite enumerations, and strings, then regexes are enough. If the set of possible types includes nested lists, for example, then regexes aren't enough to validate arguments.
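A rough Python illustration of that boundary, under assumed literal syntaxes (the enumeration values and the bracketed list notation are hypothetical, not taken from the registry): flat types fit a single regex, while arbitrarily nested lists need a recursive check, since Python's `re` module has no recursion construct.

```python
import re

# Flat types: a single regular expression is enough.
INTEGER = re.compile(r"-?[0-9]+")
ALIGNMENT = re.compile(r"left|center|right")  # hypothetical finite enumeration


def is_nested_int_list(text):
    """Validate literals such as "[1,[2,[3]]]" with a small recursive parser.

    Assumed grammar: value := integer | "[" [ value ("," value)* ] "]"
    """
    pos = 0

    def parse_value():
        nonlocal pos
        if pos < len(text) and text[pos] == "[":
            pos += 1
            if pos < len(text) and text[pos] == "]":  # empty list
                pos += 1
                return True
            while True:
                if not parse_value():
                    return False
                if pos < len(text) and text[pos] == ",":
                    pos += 1
                    continue
                if pos < len(text) and text[pos] == "]":
                    pos += 1
                    return True
                return False  # unterminated list or unexpected character
        match = INTEGER.match(text, pos)
        if match is None:
            return False
        pos = match.end()
        return True

    # The top-level value must be a list and must consume the whole input.
    return text.startswith("[") and parse_value() and pos == len(text)
```

The nesting depth is unbounded, which is exactly what a regular expression in the formal sense cannot track.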

@eemeli
Collaborator

eemeli commented Jul 5, 2023

I think regex pattern values are sufficient, given that they complement the explicit values list. As for their inlining vs. referentiality, I think I'd need to see what e.g. the JS Intl set of formatters would look like as a registry in order to really say.

However, I do have two other related thoughts on this:

  1. Right now, the pattern attribute of the <input> and <match> elements is an NMTOKEN, while <option> uses IDREF. Presumably they should all be IDREF values?
  2. We should ensure that there's a way to refer to an external source for match values. Specifically for plurals, it should be possible to refer to the CLDR supplemental/plurals.xml for locale-specific values.

@macchiati
Member

macchiati commented Jul 5, 2023 via email

@aphillips aphillips added the registry Issue pertains to the function registry label Jul 5, 2023
stasm added a commit to stasm/message-format-wg that referenced this issue Jul 23, 2023
We already use `pattern` to describe a sequence of text and placeholders in the message body. Let's not overload it with another, validation-related meaning. I had picked `pattern` in `registry.dtd` because of "regex patterns", but I think we can do better than that.

This PR also changes some `NMTOKEN` attributes to `IDREF`, as noticed by @eemeli in unicode-org#407 (comment).

This PR doesn't address the main question of unicode-org#407, however. It's only concerned with renaming `pattern` to prevent confusion.
stasm added a commit to stasm/message-format-wg that referenced this issue Aug 14, 2023
aphillips pushed a commit that referenced this issue Aug 14, 2023
@aphillips aphillips added Agenda+ Requested for upcoming teleconference LDML45 LDML45 Release (Tech Preview) labels Jan 11, 2024
@aphillips
Member

The specific question for LDML45 release is: do we need to define something other than regexes in order to deliver the default registry? If so, what?

@macchiati
Member

macchiati commented Jan 11, 2024 via email

@aphillips
Member

The unresolved tension here is that MF2.0 turned out to be untyped at the specification level. Implementations in strongly typed languages will want to make use of typing. Weakly typed or untyped implementations might want to use the serializations common to their runtime for objects such as dates and numbers; these are idiosyncratic and don't match the regex we provide in the default registry. A regex can really only describe a string serialization. What we do not want is for, say, :datetime to require all date and time values to be provided as some flavor of ISO8601/SEDATE/etc. (see this draft for details). In (let's say) Java, we want to accept Date, Temporal, Calendar (and maybe long) for this function.
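As a hedged sketch of that split, here is how a Python implementation might accept native date types directly and apply a regex only to string literals. The function name and the simplified literal syntax are assumptions for illustration, not the registry's definition.

```python
import re
from datetime import date, datetime

# Assumed, simplified literal syntax: YYYY-MM-DD with an optional THH:MM part.
DATETIME_LITERAL = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}(T[0-9]{2}:[0-9]{2})?")


def resolve_datetime_operand(value):
    """Turn an operand into a datetime (hypothetical helper).

    Native types are accepted as-is; the regex only gates string literals.
    """
    if isinstance(value, datetime):
        return value
    if isinstance(value, date):
        return datetime(value.year, value.month, value.day)
    if isinstance(value, str):
        if DATETIME_LITERAL.fullmatch(value) is None:
            raise ValueError(f"not a recognized date/time literal: {value!r}")
        # May still raise ValueError for impossible dates such as month 13.
        return datetime.fromisoformat(value)
    raise TypeError(f"unsupported operand type: {type(value).__name__}")
```

The regex never sees a `date` or `datetime` object, which mirrors the point that a regex can only describe the string serialization.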

@macchiati
Member

macchiati commented Jan 15, 2024 via email

@mihnita
Collaborator

mihnita commented Jan 29, 2024

At least in my mind, the regexes are intended to validate literal values that are already part of the placeholder, not input arguments.
Things like ... {|1234.56| :number}... and ... {|2023-12-30T21:37| :datetime ...} ...

@aphillips aphillips removed the Agenda+ Requested for upcoming teleconference label Feb 15, 2024
@aphillips
Member

I think this is being addressed in the default registry work.

A regex is insufficient to describe the implementation-defined input types, so we use text for that.

What a regex is sufficient for is defining the literal values that can be used in lieu of the input type. Functions are not required to accept a literal value for an operand or option value, but if they do, they must define what the literal's format is. Since this has to be a sequence of characters, a regex can do the job. What's more, a regex is extremely useful in ensuring interoperability between platforms in terms of the literal syntax.

For example, in LDML45, we accept (among other things) the XML Schema date syntax as a literal for date/time values. This means that the following expression is valid and has the same interpretation in every MF2 implementation, even though the implementation of the :datetime function may be wholly different on each platform:

{|2024-02-17| :datetime}

If this is what we mean, let's answer the question this issue poses as "yes" and close this issue.

Okay @stasm?

@macchiati
Member

macchiati commented Feb 18, 2024 via email

@aphillips
Member

@macchiati noted:

As long as the regex is an 'outer bound'. That is, the function won't
accept any literal that doesn't match the regex, but doesn't have to
accept everything that does match.

That's right. For example, this is "valid" but probably doesn't work for multiple reasons: {|2024-02-35| :date}

@macchiati
Member

macchiati commented Feb 18, 2024 via email

@gibson042
Collaborator

For example, this is "valid" but probably doesn't work for multiple reasons: {|2024-02-35| :date}

I think you meant something like {|2024-02-31| :date}. The former is not syntactically valid, because dayFrag is constrained to 01 through 31.
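The "outer bound" distinction can be sketched in Python. The regex below is a simplified stand-in for the XML Schema date syntax (no sign, time zone, or years longer than four digits), written for illustration only; the helper names are hypothetical.

```python
import re
from datetime import date

# Simplified date literal: the day fragment is constrained to 01 through 31,
# and the month fragment to 01 through 12.
DATE_LITERAL = re.compile(r"([0-9]{4})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])")


def matches_syntax(literal):
    """Outer bound: does the literal match the regex at all?"""
    return DATE_LITERAL.fullmatch(literal) is not None


def is_real_date(literal):
    """Inner check: the function may still reject a syntactically valid literal."""
    match = DATE_LITERAL.fullmatch(literal)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)
    except ValueError:  # e.g. February 31
        return False
    return True
```

Here `2024-02-35` fails `matches_syntax`, while `2024-02-31` passes it but fails `is_real_date`: the regex rules literals out, and the function decides which of the remaining ones it accepts.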

@aphillips aphillips added LDML46 LDML46 Release (Tech Preview - October 2024) and removed LDML45 LDML45 Release (Tech Preview) labels Apr 13, 2024
@aphillips
Member

Moved to v46

7 participants