Are regexes good enough to validate literals? #407
Comments
Whether regexes are enough depends on the type language: what is the set of possible types that can appear in function signatures? If the type language includes numeric types, finite enumerations, and strings, then regexes are enough. If the set of possible types includes nested lists, for example, then regexes aren't enough to validate arguments.
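To make the distinction above concrete, here is a minimal Python sketch. The literal grammars (`NUMBER`, `PLURAL_CATEGORY`) and the bracketed list syntax are assumptions made up for illustration; none of them come from the registry. Flat types are checked with a whole-string regex match, while arbitrarily nested lists need a small recursive parser, because Python's `re` module has no recursion construct for balanced brackets.

```python
import re

# Hypothetical literal grammars, for illustration only.
NUMBER = re.compile(r"-?[0-9]+(\.[0-9]+)?")
PLURAL_CATEGORY = re.compile(r"zero|one|two|few|many|other")
DIGITS = re.compile(r"[0-9]+")


def is_flat_literal(pattern: re.Pattern, literal: str) -> bool:
    """Regex validation: the whole literal must match the pattern."""
    return pattern.fullmatch(literal) is not None


def is_nested_list(literal: str) -> bool:
    """Validate a bracketed, arbitrarily nested list of integers, e.g. [1,[2,3]]."""

    def parse_item(i: int) -> int:
        # Returns the index just past one item starting at i, or -1 on failure.
        if i < len(literal) and literal[i] == "[":
            i += 1
            if i < len(literal) and literal[i] == "]":
                return i + 1  # empty list
            while True:
                i = parse_item(i)
                if i < 0 or i >= len(literal):
                    return -1
                if literal[i] == "]":
                    return i + 1
                if literal[i] != ",":
                    return -1
                i += 1
        m = DIGITS.match(literal, i)
        return m.end() if m else -1

    return literal.startswith("[") and parse_item(0) == len(literal)
```

The recursive function is the part a plain regex cannot express: each `[` has to be paired with its own `]` at any depth.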
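I think regex pattern values are sufficient, given that they complement the explicit values list. As for their inlining vs. referentiality, I think I'd need to see what e.g. the JS Intl set of formatters would look like as a registry in order to really say.
However, I do have two other related thoughts on this:
1. Right now, the `pattern` attribute of the `<input>` and `<match>` elements is an `NMTOKEN`, while `<option>` uses `IDREF`. Presumably they should all be `IDREF` values?
2. We should ensure that there's a way to refer to an external source for match values. Specifically for plurals, it should be possible to refer to the CLDR supplemental/plurals.xml (https://github.com/unicode-org/cldr/blob/main/common/supplemental/plurals.xml) for locale-specific values.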
Regexes are not sufficient for validity of all data types. Example: valid
locale identifiers.
Well-formed locale IDs can be verified by a (horrendous) regex, but not
valid
…On Wed, Jul 5, 2023, 04:52 Eemeli Aro ***@***.***> wrote:
I think regex pattern values are sufficient, given that they complement
the explicit values list. As for their inlining vs. referentiality, I
think I'd need to see what e.g. the JS Intl set of formatters would look
like as a registry in order to really say.
However, I do have two other related thoughts on this:
1. Right now, the pattern attribute of the <input> and <match>
elements is an NMTOKEN, while <option> uses IDREF. Presumably they
should all be IDREF values?
2. We should ensure that there's a way to refer to an external source
for match values. Specifically for plurals, it should be possible to refer
to the CLDR supplemental/plurals.xml
<https://github.com/unicode-org/cldr/blob/main/common/supplemental/plurals.xml>
for locale-specific values.
We already use `pattern` to describe a sequence of text and placeholders in the message body. Let's not overload it with another, validation-related meaning. I had picked `pattern` in `registry.dtd` because of "regex patterns", but I think we can do better than that. This PR also changes some `NMTOKEN` attributes to `IDREF`, as noticed by @eemeli in unicode-org#407 (comment). This PR doesn't address the main question of unicode-org#407, however. It's only concerned with renaming `pattern` to prevent confusion.
The specific question for the LDML45 release is: do we need to define something other than regexes in order to deliver the default registry? If so, what?
I think it is ok to just have regexes, if we document that the regex match
is necessary for well-formedness, but *not sufficient*. And definitely not
for validity.
…On Thu, Jan 11, 2024, 08:59 Addison Phillips ***@***.***> wrote:
The specific question for LDML45 release is: do we need to define
something other than regexes in order to deliver the default registry? If
so, what?
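The well-formed versus valid distinction raised for locale identifiers can be sketched in Python as follows. Everything here is an assumption for illustration: the pattern is a heavily simplified stand-in for the real locale identifier grammar, and the two sets are tiny hypothetical substitutes for real validity data. The script subtag is matched but not checked against any list.

```python
import re

# Simplified well-formedness pattern: language, optional script, optional region.
# The real grammar is far larger; this is only a sketch.
WELL_FORMED = re.compile(
    r"([a-z]{2,3})(?:-([A-Z][a-z]{3}))?(?:-([A-Z]{2}|[0-9]{3}))?"
)

# Tiny hypothetical stand-ins for real validity data.
KNOWN_LANGUAGES = {"en", "fr", "de", "ja"}
KNOWN_REGIONS = {"US", "GB", "FR", "DE", "JP"}


def is_well_formed(tag: str) -> bool:
    """True if the tag has the right shape; says nothing about validity."""
    return WELL_FORMED.fullmatch(tag) is not None


def is_valid(tag: str) -> bool:
    """True if the tag is well-formed AND its subtags are known."""
    m = WELL_FORMED.fullmatch(tag)
    if m is None:
        return False
    language, _script, region = m.groups()
    return language in KNOWN_LANGUAGES and (
        region is None or region in KNOWN_REGIONS
    )
```

In this sketch `xx-ZZ` passes the regex but fails the validity check, which is the sense in which a regex match is necessary but not sufficient.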
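The unresolved tension here is that MF2.0 turned out to be untyped at the specification level. Implementations in strongly typed languages will want to make use of typing. Weakly typed or untyped implementations might want to use serializations common to their runtime for objects such as dates, numbers, etc., which are idiosyncratic and don't match the regex we provide in the default registry. A regex can really only describe a string serialization. What we do not want is for, say, `:datetime` to require all date and time values to be provided as some flavor of ISO8601/SEDATE/etc. (see this draft, https://w3c.github.io/timezone, for details). In (let's say) Java, we want to accept `Date`, `Temporal`, `Calendar` (and maybe `long`) for this function.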
I completely agree. The regex can only constrain values *if and when*
datatypes are serialized.
In addition, an implementation of MF2.0 must be allowed to convert the
string format of a message into an equivalent internal structure that
replaces values by native datatypes.
…On Thu, Jan 11, 2024 at 5:08 PM Addison Phillips ***@***.***> wrote:
The unresolved tension here is that MF2.0 turned out to be untyped at the
specification level. Implementations in strongly typed languages will want
to make use of typing. Weakly or untyped implementations might want to use
serializations common to their runtime for objects such as dates, numbers,
etc. which are idiosyncratic and don't match the regex we provide in the
default registry. A regex can really only describe a string serialization.
What we do not want is for, say, :datetime to require all date and time
values to be provided as some flavor of ISO8601/SEDATE/etc. (see this
draft <https://w3c.github.io/timezone> for details). In (let's say) Java,
we want to accept Date, Temporal, Calendar (and maybe long) for this
function.
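A rough Python analogue of the point above: a function handler can accept native runtime types directly and apply the regex only when the operand arrives as a string literal. The function name `resolve_datetime_operand`, the date-only `DATE_LITERAL` pattern, and the choice to read integers as epoch milliseconds (mirroring a Java `long`) are all assumptions for this sketch, not anything the spec defines.

```python
import re
from datetime import date, datetime, timezone

# Hypothetical date-only literal pattern; applies to string operands only.
DATE_LITERAL = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def resolve_datetime_operand(value):
    """Turn an operand into a datetime, whatever form it arrived in."""
    # datetime is a subclass of date, so it must be checked first.
    if isinstance(value, datetime):
        return value
    if isinstance(value, date):
        return datetime(value.year, value.month, value.day)
    if isinstance(value, int):
        # Assumption: integers are epoch milliseconds, like a Java long.
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    if isinstance(value, str):
        # Only here, for serialized literals, does the regex come into play.
        if DATE_LITERAL.fullmatch(value) is None:
            raise ValueError(f"not a date literal: {value!r}")
        return datetime.strptime(value, "%Y-%m-%d")
    raise TypeError(f"unsupported operand type: {type(value).__name__}")
```

Native values never touch the regex; it constrains only the string form.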
At least in my mind, the regexes are intended to validate literal values that are already part of the placeholder, not input arguments.
I think this is being addressed in the default registry work. A regex is insufficient to describe the implementation-defined input types. So we use text for that. What a regex is sufficient for is defining the literal values that can be used in lieu of the input type. Functions are not required to accept a literal value for an operand or option value, but if they do, they must define what the literal's format is. Since this has to be a sequence of characters, a regex can do the job. What's more, a regex is extremely useful in ensuring interoperability between platforms in terms of the literal syntax. For example, in LDML45, we accept (among other things) the XMLSchema date syntax as a literal for date/time values. This means that the following expression is valid and has the same interpretation in every MF2 implementation, even though the implementation of the `:datetime` function is wholly different: `{|2024-02-17| :datetime}`
If this is what we mean, let's answer the question this issue poses as "yes" and close this issue. Okay @stasm?
As long as the regex is an 'outer bound'. That is, the function won't
accept any literal that doesn't match the regex, but *doesn't have to*
accept everything that does match.
…On Sun, Feb 18, 2024, 11:22 Addison Phillips ***@***.***> wrote:
I think this is being addressed in the default registry work.
A regex is insufficient to describe the implementation defined input
types. So we use text for that.
What a regex *is* sufficient for is defining the literal values that can
be used in lieu of the input type. Functions are not required to accept a
literal value for an operand or option value, but if they do, they must
define what the literal's format is. Since this has to be a sequence of
characters, a regex can do the job. What's more, a regex is extremely
useful in ensuring interoperability between platforms in terms of the
literal syntax.
For example, in LDML45, we accept (among other things) the XMLSchema date
syntax as a literal for date/time values. This means that the following
expression is valid and has the same interpretation in *every* MF2
implementation, even though the implementation of the :datetime function
is wholly different:
{|2024-02-17| :datetime}
If this is what we mean, let's answer the question this issue poses as
"yes" and close this issue.
Okay @stasm <https://github.com/stasm>?
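@macchiati noted: "As long as the regex is an 'outer bound'. That is, the function won't accept any literal that doesn't match the regex, but *doesn't have to* accept everything that does match."
That's right. For example, this is "valid" but probably doesn't work for multiple reasons: `{|2024-02-35| :date}`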
The problem is that the current wording in registry.md does not at all make
that clear. If anything, the wording (4 sentences with "regex", below) makes it seem that *all* and only the
matches to regex are valid. Until that wording is fixed, I think we should
leave this issue open.
Named <validationRule> elements can optionally define regex validation rules for literals, option values, and variant keys.
...
<validationRule id="anyNumber" regex="-?[0-9]+(\.[0-9]+)"/>
<validationRule id="positiveInteger" regex="[0-9]+"/>
<validationRule id="currencyCode" regex="[A-Z]{3}"/>
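As a quick check of what the three quoted rules accept, here is a small Python sketch. It assumes the regex must match the whole literal (hence `fullmatch`); the quoted registry text does not say how matching is anchored, so that is an assumption, as is the helper name `matches`.

```python
import re

# The three rules exactly as quoted above.
RULES = {
    "anyNumber": r"-?[0-9]+(\.[0-9]+)",
    "positiveInteger": r"[0-9]+",
    "currencyCode": r"[A-Z]{3}",
}


def matches(rule_id: str, literal: str) -> bool:
    """Assumption: a rule applies to the entire literal, not a substring."""
    return re.fullmatch(RULES[rule_id], literal) is not None
```

Under that assumption, `anyNumber` as written accepts `1.5` but rejects `42`, because the fraction group is not optional, and `positiveInteger` accepts `0`. `currencyCode` accepts any three uppercase letters, whether or not they name a real currency.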
…On Sun, Feb 18, 2024 at 11:48 AM Addison Phillips ***@***.***> wrote:
@macchiati <https://github.com/macchiati> noted:
As long as the regex is an 'outer bound'. That is, the function won't
accept any literal that doesn't match the regex, but *doesn't have to*
accept everything that does match.
That's right. For example, this is "valid" but probably doesn't work for
multiple reasons: {|2024-02-35| :date}
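The "outer bound" reading can be sketched in Python like this. The pattern `DATE_LITERAL` and the function name are assumptions for illustration; the point is only that the regex is a first gate, and the function may still reject a literal that passes it.

```python
import re
from datetime import date

# Hypothetical outer bound for date literals.
DATE_LITERAL = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def accepts_date_literal(literal: str) -> bool:
    """Reject anything outside the regex; then apply the function's own checks."""
    if DATE_LITERAL.fullmatch(literal) is None:
        return False  # outside the outer bound: never accepted
    try:
        date.fromisoformat(literal)  # raises ValueError for impossible dates
    except ValueError:
        return False  # matched the regex, but still not acceptable
    return True
```

Here `2024-02-35` matches the regex and is still rejected, while `2024-02-17` passes both steps.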
I think you meant something like |
Moved to v46 |
A follow-up to #368.
The current draft of the registry uses named regex patterns to allow defining rules for validating literal arguments and option values.
And then: