Add <when> to help select the right <match> #558

Closed
wants to merge 7 commits into from

eemeli
Collaborator

@eemeli eemeli commented Dec 9, 2023

As noted by @macchiati in #471 (comment):

One note: we might not want to treat :integer as simply an alias for :number. There is one difference in handling:

When you have a match, for translation the when clauses are expanded (or contracted) depending on the locale. There are some locales that have a plural category that is only present when the number is fractional, and in those locales you can limit the expansion by not including that category if the numeric value is known to be integral.

Thinking about this myself, I think we need a mechanism for picking the right set of keys based on multiple option values. Polish is a decent example here:

  • plural selection on integers uses one, few, and many
  • plural selection on all numbers uses one, few, many and other
  • ordinal selection uses only other

The way we have aliases set up, they set option values like maximumFractionDigits: 0 for their parent function, so if we start solving this via aliases, then we're likely to end up in a place where :integer and :number maximumFractionDigits=0 have different behaviour.

So here I propose adding <when> elements to the <matchSignature> as a way of resolving that. Continuing with the above example, they work like this:

<function name="number">
  ...
  <matchSignature>
    ...
    <option name="select" values="plural ordinal" default="plural" />
    ...
    <when option="select" values="ordinal">
      <match locales="en" values="one two few other" validationRule="anyNumber" />
      <match locales="pl" values="other" validationRule="anyNumber" />
    </when>
    <when option="select" values="plural">
      <when option="maximumFractionDigits" values="0">
        <match locales="pl" values="one few many" validationRule="anyNumber" />
      </when>
      <match locales="en" values="one other" validationRule="anyNumber" />
      <match locales="pl" values="one few many other" validationRule="anyNumber" />
    </when>
    <match values="zero one two few many other" validationRule="anyNumber" />
  </matchSignature>
  ...
  <alias name="integer">
    <setOption name="maximumFractionDigits" value="0" />
  </alias>
</function>

The intended use is as follows: the option/values combination of each <when> is tested in order, which selects a first set of <match> elements to check against the current locale. This is repeated until one of the sets provides a match.

So if we start with a selector expression {$x :integer}, its resolved options are:

{ maximumFractionDigits: '0', select: 'plural' }

Let's consider what happens with a few different locales en, pl, and fr (standing in for "any other locale"):

  • First, <when option="select" values="ordinal"> does not match, so its contents are ignored.
  • As <when option="select" values="plural"> does match, let's consider its contents:
    • The inner <when option="maximumFractionDigits" values="0"> also matches:
      • For pl, its set of <match> elements does provide a Lookup match, so we use that and don't consider any later ones.
      • For en and fr, its set of <match> elements does not provide a Lookup match.
    • Continuing for en and fr with the next set <match locales="en" ... />, <match locales="pl" ... />:
      • For en we do find a Lookup match, and use that.
      • For fr, no match at this level.
  • For fr, we ultimately fall back to the last <match> without a locales which implicitly matches.

So we end up with these sets of category keys to use for the selector:

  • en: one other
  • pl: one few many
  • fr: zero one two few many other (due to fallback)
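
The walkthrough above can be sketched in Python. This is only an illustration under stated assumptions, not part of the proposal: the `When` and `Match` classes, the `resolve` and `lookup` helpers, and the rule that adjacent <match> elements form one set checked in document order are all hypothetical, and `lookup` is a simplified stand-in for Lookup matching that just drops trailing subtags.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class Match:
    values: list[str]
    locales: Optional[list[str]] = None  # None: no locales attribute, matches implicitly


@dataclass
class When:
    option: str
    values: list[str]
    children: list[Union["When", Match]] = field(default_factory=list)


def lookup(locale: str, candidates: list[str]) -> bool:
    """Simplified Lookup (assumption): drop trailing subtags until one matches."""
    parts = locale.split("-")
    while parts:
        if "-".join(parts) in candidates:
            return True
        parts.pop()
    return False


def resolve(children, options, locale):
    """Return the first applicable list of category keys, or None."""
    i = 0
    while i < len(children):
        node = children[i]
        if isinstance(node, When):
            # Only descend into a <when> whose option/values test passes.
            if options.get(node.option) in node.values:
                found = resolve(node.children, options, locale)
                if found is not None:
                    return found
            i += 1
        else:
            # Treat a run of adjacent <match> elements as one set.
            while i < len(children) and isinstance(children[i], Match):
                m = children[i]
                if m.locales is None or lookup(locale, m.locales):
                    return m.values
                i += 1
    return None


# The <matchSignature> from the example, transcribed by hand.
SIGNATURE = [
    When("select", ["ordinal"], [
        Match(["one", "two", "few", "other"], ["en"]),
        Match(["other"], ["pl"]),
    ]),
    When("select", ["plural"], [
        When("maximumFractionDigits", ["0"], [
            Match(["one", "few", "many"], ["pl"]),
        ]),
        Match(["one", "other"], ["en"]),
        Match(["one", "few", "many", "other"], ["pl"]),
    ]),
    Match(["zero", "one", "two", "few", "many", "other"]),
]

# Resolved options of {$x :integer}
OPTIONS = {"maximumFractionDigits": "0", "select": "plural"}

for loc in ("en", "pl", "fr"):
    print(loc, resolve(SIGNATURE, OPTIONS, loc))
```

With these inputs the sketch reproduces the three key sets listed above. Whether a <match> without locales inside a set should be tried before or after the locale-specific ones is not settled by the text, so the document-order choice here is a guess.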

Please note that while the example here does perform selection on :number and that's still being discussed in #471, this approach would be equally valid on a non-alias selector like :plural, for which being able to account for integral input values would be just as useful.

@eemeli eemeli added the registry Issue pertains to the function registry label Dec 9, 2023
@eemeli eemeli requested review from stasm and aphillips December 9, 2023 11:04
@aphillips
Member

This mechanism looks decent, but I don't think we should incorporate it.

We do need to point to CLDR data (which this PR doesn't solve). I think that recreating plurals.xml in the registry is wasteful and will lead to a binding between the registry and CLDR releases. There is no reason for us to do that.

The registry defines:

  • functions
  • those functions' options
  • those functions' options' values

This PR is adding locale filters (or explosion rules, if you prefer) on the option values (but not the rules themselves). The formatter implementation will never read this or check it. This is for tools to use when generating target language variant matrices for translation.

I think we should work on the pointer-to-data mechanism and maybe something like this PR for defining locale option sets using an ancillary format. That would mean either a transform of plurals.xml or using plurals.xml's format for ancillary files to be pointed at by custom functions.

@eemeli
Collaborator Author

eemeli commented Dec 9, 2023

We do need to point to CLDR data (which this PR doesn't solve). I think that recreating plurals.xml in the registry is wasteful and will lead to a binding between the registry and CLDR releases. There is no reason for us to do that.

I'd like to push back a bit on that characterization. This is not about adding data to the registry, but about mapping specific sets of MF2 selector options to subsets of that data and providing a form in which that data can be expressed for tools.

To be explicit, I am not proposing to include any plurals.xml data within the default MF2 registry. I am, however, looking for the language of that registry to be able to express selection data in a useful and generic manner.

For example, the data on Polish plurals looks like this:

<pluralRules locales="pl">
  <pluralRule count="one">i = 1 and v = 0 @integer 1</pluralRule>
  <pluralRule count="few">v = 0 and i % 10 = 2..4 and i % 100 != 12..14 @integer 2~4, 22~24, 32~34, 42~44, 52~54, 62, 102, 1002, …</pluralRule>
  <pluralRule count="many">v = 0 and i != 1 and i % 10 = 0..1 or v = 0 and i % 10 = 5..9 or v = 0 and i % 100 = 12..14 @integer 0, 5~19, 100, 1000, 10000, 100000, 1000000, …</pluralRule>
  <pluralRule count="other">   @decimal 0.0~1.5, 10.0, 100.0, 1000.0, 10000.0, 100000.0, 1000000.0, …</pluralRule>
</pluralRules>

Currently, in order to figure out that in Polish a selector on :integer will never select other, you need to parse the syntax used within the <pluralRule> elements. That's certainly doable (esp. if you rely on the lack of @integer in that rule), but it's sufficiently heavy lifting that I'm not sure that anyone currently does so.
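
As a rough illustration of the shortcut mentioned above (relying on the lack of @integer samples), here is a minimal Python sketch. The `integer_categories` helper is hypothetical, and it assumes that the sample annotations in the CLDR data are a reliable signal, which is a heuristic rather than a real parse of the rule syntax.

```python
import xml.etree.ElementTree as ET

# The Polish rules quoted above, copied verbatim.
POLISH_RULES = """\
<pluralRules locales="pl">
  <pluralRule count="one">i = 1 and v = 0 @integer 1</pluralRule>
  <pluralRule count="few">v = 0 and i % 10 = 2..4 and i % 100 != 12..14 @integer 2~4, 22~24, 32~34, 42~44, 52~54, 62, 102, 1002, …</pluralRule>
  <pluralRule count="many">v = 0 and i != 1 and i % 10 = 0..1 or v = 0 and i % 10 = 5..9 or v = 0 and i % 100 = 12..14 @integer 0, 5~19, 100, 1000, 10000, 100000, 1000000, …</pluralRule>
  <pluralRule count="other">   @decimal 0.0~1.5, 10.0, 100.0, 1000.0, 10000.0, 100000.0, 1000000.0, …</pluralRule>
</pluralRules>
"""


def integer_categories(plural_rules_xml: str) -> list[str]:
    """Categories whose rule text carries @integer samples (heuristic)."""
    root = ET.fromstring(plural_rules_xml)
    return [
        rule.get("count")
        for rule in root.iter("pluralRule")
        if "@integer" in (rule.text or "")
    ]


print(integer_categories(POLISH_RULES))
```

For the Polish data this yields one, few and many, leaving out other. Every tool that wants this answer would need something similar, or a full rule parser, which is the duplication the proposal is trying to avoid.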

This PR is about providing an MF2-friendly language in which it's possible to express the selection data, so that the problems of "figure out the relevant data" and "write a good message validator" can be solved separately, rather than requiring everyone who wants to solve the problem to do it for themselves -- or not at all.

For a possible next step with the proposed <when> and still using CLDR data, I could see an implementation adding some custom function or options to express that a number will only ever be in a range x...y, and, separately, someone providing the registry data showing that selection on small numbers will never need the many variant of Romance languages. This PR makes those problems separable, and makes the same data available to all tools.

This PR is adding locale filters (or explosion rules, if you prefer) on the option values (but not the rules themselves). The formatter implementation will never read this or check it. This is for tools to use when generating target language variant matrices for translation.

Yes, you're right. None of the registry contents has any bearing on a formatter implementation; it's all for tools and validators.

I think we should work on the pointer-to-data mechanism and maybe something like this PR for defining locale option sets using an ancillary format. That would mean either a transform of plurals.xml or using plurals.xml's format for ancillary files to be pointed at by custom functions.

Agreed; that's what we have #538 for. The <matchRef> that I propose there could do some of the data transformation, but it still needs an MF2 "shape" for the data that we're transforming. Thus far we've been working on defining that within the <registry>, which already has the <matchSignature> and <match> elements. If we add the <when>, we could end up with this being synonymous (though more complete) with my earlier example:

<function name="number">
  ...
  <matchSignature>
    ...
    <option name="select" values="plural ordinal" default="plural" />
    ...
    <when option="select" values="ordinal">
      <matchRef href="path/to/ordinals.xml" transform="all-pluralRules.xsl" />
    </when>
    <when option="select" values="plural">
      <when option="maximumFractionDigits" values="0">
        <matchRef href="path/to/plurals.xml" transform="integer-pluralRules.xsl" />
      </when>
      <matchRef href="path/to/plurals.xml" transform="all-pluralRules.xsl" />
    </when>
    <match values="zero one two few many other" validationRule="anyNumber" />
  </matchSignature>
  ...
  <alias name="integer">
    <setOption name="maximumFractionDigits" value="0" />
  </alias>
</function>

Note there how two different data sources and two different transforms are used, and how the fallback is still defined directly. Each <matchRef> would resolve the same way as if multiple <match> statements were used in its place. But we still need a definition like that enabled by <when> connecting maximumFractionDigits=0 with its corresponding v=0 meaning in the plural rule syntax.

@aphillips
Member

Currently, in order to figure out that in Polish a selector on :integer will never select other, you need to parse the syntax used within the <pluralRule> elements. That's certainly doable (esp. if you rely on the lack of @integer in that rule), but it's sufficiently heavy lifting that I'm not sure that anyone currently does so.

Yes, although tools don't need to know which rule fires for a given value. They just need to know what rules can fire for a given selector. The :integer selector in the pl case won't fire *, but we require * as a fallback. I'm sure, though, that there are cases where a given selector can't fire a given rule for integer or ordinal or whatever, so your example isn't wrong.

What I'm getting at is: you're right that we could parse the data to produce an intermediate data file (and we should probably get CLDR to produce it going forwards). What's more, we need to handle the case of a custom function that doesn't use CLDR data directly. But I still don't think we should put any CLDR data into the registry itself. matchRef looks like a good approach?

@macchiati
Member

But I still don't think we should put any CLDR data into the registry itself. matchRef looks like a good approach?

I agree with Addison

@eemeli
Collaborator Author

eemeli commented Dec 11, 2023

Okay, let's fix all of it here then. I added <matches> as a grouping element for the <match> elements, with an option to include an href attribute instead; kinda like how in HTML a <script> can either have a body or a src attribute.

If the registry does include a <matches href="..." />, then its URL value should resolve to an XML document with a <matches> element at its root, which replaces the current one for later processing.

After prodding at this for quite a while and starting from the ideas of #538, I came to the realization that the transform of e.g. CLDR plural content to <match> elements can and should be defined within the XML document the href points to, rather than by a reference in the registry file. Trying to include a reference to that transform as another attribute on <matches> makes the processing much more complicated and custom.

So we end up with the <matches href> pointing at files like

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="cldr-plural-matches.xslt"?>
<cldrPluralMatches href="path/to/cldr/common/supplemental/plurals.xml" />

where the cldr-plural-matches.xslt transforms the document and its single node into the required shape.

To prove that this can work, here's an implementation of the transforms required for CLDR ordinals and plurals, including integer handling: https://gist.github.com/eemeli/75a0380e57adb237305ab4c480929a1f

You can test that locally by putting all the .xml files in a single directory, adjusting the path/to/cldr/... to point to a local copy, and running the command:

xsltproc match-plural-integers.xml

spec/registry.md Outdated

Each `<matches>` MAY contain either one or more `<match>` elements, or an `href` attribute.
If an `href` attribute is set, its URL value MUST resolve to an XML document
with a root `<matches>` element with no `href` attribute,
Member

Why not permit chaining?

Collaborator Author

Because we've no reason to, and this way we can rely on the external matches XML to resolve all of its dependencies completely.

Member

So there are no cases in which part of a <matches> tree in the external XML refers to (say) CLDR data? Once you've implemented resolving an external file, it's just a question of recursion, no?

Collaborator Author

So there are no cases in which part of a <matches> tree in the external XML refers to (say) CLDR data?

Correct. A <matches> can only contain <match> elements, which may only have locales and values attributes, and are otherwise empty. So there's no space for recursion.
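
To make the constraints from this thread concrete, here is a minimal Python sketch of how a tool might resolve a <matches> element. It is an illustration only: `resolve_matches`, the `DOCUMENTS` mapping and its file names are hypothetical, the mapping stands in for fetching the href URL, and the sketch assumes that any xml-stylesheet transform has already been applied to the referenced document.

```python
import xml.etree.ElementTree as ET

ALLOWED_MATCH_ATTRS = {"locales", "values"}

# Hypothetical stand-in for URL resolution: href value -> XML text.
DOCUMENTS = {
    "plural-integers.xml":
        '<matches><match locales="pl" values="one few many"/></matches>',
    "chained.xml": '<matches href="plural-integers.xml"/>',
}


def resolve_matches(element: ET.Element, documents: dict[str, str]) -> ET.Element:
    """Return the <matches> element to use, following at most one href."""
    href = element.get("href")
    if href is None:
        resolved = element
    else:
        if len(element) > 0:
            raise ValueError("<matches> has both href and child elements")
        resolved = ET.fromstring(documents[href])
        if resolved.tag != "matches":
            raise ValueError(f"{href}: root element is not <matches>")
        # No chaining: the external document must be fully resolved.
        if resolved.get("href") is not None:
            raise ValueError(f"{href}: chained href is not permitted")
    for child in resolved:
        if child.tag != "match" or len(child) > 0 or (child.text or "").strip():
            raise ValueError("<matches> may only contain empty <match> elements")
        if not set(child.attrib) <= ALLOWED_MATCH_ATTRS:
            raise ValueError("<match> may only have locales and values attributes")
    return resolved


inline = ET.fromstring('<matches href="plural-integers.xml"/>')
for match in resolve_matches(inline, DOCUMENTS):
    print(match.get("locales"), match.get("values"))
```

Because the replacement document cannot itself carry an href, resolution is a single step and needs no cycle detection.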

eemeli and others added 2 commits December 11, 2023 23:18
Co-authored-by: Addison Phillips <[email protected]>
Co-authored-by: Addison Phillips <[email protected]>
eemeli and others added 2 commits December 12, 2023 17:43
Co-authored-by: Addison Phillips <[email protected]>
Co-authored-by: Addison Phillips <[email protected]>
@eemeli eemeli requested a review from aphillips December 16, 2023 07:18
Comment on lines +185 to +187
- `<when option="select" values="plural"><matches><match locales="en" values="one other" ... />`
can be used in locales like `en` and `en-GB` if the selection type is known to be plural
to validate that only `one`, `other` or numeric keys are used for variants.
Member

I think we might want to make these into examples. And I think we might want to avoid "validate that only". Perhaps:

Suggested change
- `<when option="select" values="plural"><matches><match locales="en" values="one other" ... />`
can be used in locales like `en` and `en-GB` if the selection type is known to be plural
to validate that only `one`, `other` or numeric keys are used for variants.
> For example,
> `<when option="select" values="plural"><matches><match locales="en" values="one other" ... />`
> could be used when validating translations for locales such as `en` and `en-GB`
> to check that variant keys `one` and `other` have been provided
> (in addition to any numeric keys).

Collaborator

@stasm stasm left a comment

I like the idea of href for match, but that's #538, so let's discuss it there. I'm not a fan of other changes in this PR.

I think this is trying to model the registry data in a way that specifies the logic to be applied to the data. It thus couples the definition of the data with the usage of the data. The same can be said of the changes in #534, which I commented on a few minutes ago.

We already have a built-in way of achieving what this PR is trying to do, if I understand correctly: multiple signature elements.

This PR proposes to allow creating nested element hierarchies that imitate code, like this:

<when option="select" values="plural">
  <matches validationRule="anyNumber">
    <match locales="en" values="one other"/>
  </matches>
</when>
<when option="select" values="ordinal">
  <matches validationRule="anyNumber">
    <match locales="en" values="one two few other"/>
  </matches>
</when>
<matches validationRule="anyNumber">
  <match values="zero one two few many other"/>
</matches>

We should instead be able to just describe the data:

<matchSignature>
  <match validationRule="anyNumber"/>
  <match values="zero one two few many other"/>
</matchSignature>

<matchSignature>
  <option name="select" values="plural"/>
  <match locales="en" values="one other"/>
  <match locales="pl" values="one few many other"/>
</matchSignature>

<matchSignature>
  <option name="select" values="ordinal"/>
  <match locales="en" values="one two few other"/>
  <match locales="pl" values="other"/>
</matchSignature>

In fact, as I mention in #534, I don't think we even need the locales attributes on match elements. Compare:

<matchSignature>
  <match validationRule="anyNumber"/>
  <match values="zero one two few many other"/>
</matchSignature>

<matchSignature locales="en">
  <option name="select" values="plural"/>
  <match values="one other"/>
</matchSignature>

<matchSignature locales="en">
  <option name="select" values="ordinal"/>
  <match values="one two few other"/>
</matchSignature>

<matchSignature locales="pl">
  <option name="select" values="plural"/>
  <match values="one few many other"/>
</matchSignature>

<matchSignature locales="pl">
  <option name="select" values="ordinal"/>
  <match values="other"/>
</matchSignature>
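
For comparison, here is a minimal Python sketch of how a tool might read the flat form with locales on <match>. The comment above does not say how a tool should choose among several applicable signatures, so the precedence used here (most option constraints first, then the given locale, then a <match> without locales) is purely an assumption, as are the `Signature` class and the `keys_for` helper.

```python
from dataclasses import dataclass


@dataclass
class Signature:
    options: dict[str, str]        # option constraints, e.g. {"select": "plural"}
    matches: dict[str, list[str]]  # locale -> keys; "" means no locales attribute


# The middle example above, transcribed by hand.
SIGNATURES = [
    Signature({}, {"": ["zero", "one", "two", "few", "many", "other"]}),
    Signature({"select": "plural"}, {
        "en": ["one", "other"],
        "pl": ["one", "few", "many", "other"],
    }),
    Signature({"select": "ordinal"}, {
        "en": ["one", "two", "few", "other"],
        "pl": ["other"],
    }),
]


def keys_for(signatures, options, locale):
    """Pick category keys from the most constrained applicable signature."""
    applicable = [
        s for s in signatures
        if all(options.get(name) == value for name, value in s.options.items())
    ]
    # Assumed precedence: signatures with more option constraints win.
    applicable.sort(key=lambda s: len(s.options), reverse=True)
    for sig in applicable:
        for candidate in (locale, ""):
            if candidate in sig.matches:
                return sig.matches[candidate]
    return None


print(keys_for(SIGNATURES, {"select": "plural"}, "pl"))
print(keys_for(SIGNATURES, {"select": "plural"}, "fr"))
```

In this reading the data stays declarative, and the precedence rule lives in the tool instead of in the registry's element nesting.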

@aphillips
Member

@stasm I agree that the purpose of the registry data (as I've described it elsewhere) is to inform tools and such about available variant keys, but not to replicate either CLDR data or functionality about which key gets chosen when.

I like the middle example with <match locales=...> more than the bottom example with locales on matchSignature, since the former puts all of the locales next to one another, presumably with the root locale at the top. This makes the registry easy to understand and compact, with like things next to each other:

<matchSignature>
   <!-- all of plural in one place -->
   <option name="select" values="plural"/>
   <match values="zero one two few many other"/><!-- the root locale -->
   <match locales="ar" values="..."/>
   ... more locales...
   <match locales="zh" values="other"/>
</matchSignature>

@stasm
Collaborator

stasm commented Dec 16, 2023

I agree that the purpose of the registry data (as I've described it elsewhere) is to inform tools and such about available variant keys, but not to replicate either CLDR data or functionality about which key gets chosen when.

I fully agree. I used plural matching for the sake of example, and because #538 is still open.

I like the middle example with <match locales=...> more than the bottom example with locales on matchSignature, since the former puts all of the locales next to one another, presumably with the root locale at the top. This makes the registry easy to understand and compact, with like things next to each other.

I can see the appeal of the middle form; I like it as well. The bottom one is equally expressive but perhaps goes one step too far in trying to avoid extending the registry's schema, which ends up being inconvenient.

Even if we expect registries to be generated and consumed by tools, I think they may also be authored by people. I'd call the middle snippet reasonably convenient for that purpose.


To further explain why I think multiple signatures are more expressive than <when>:

  • <when> only allows a single predicate.
  • The predicate can only match a concrete set of option values (rather than regex rules, or even inputs).
  • Nesting match elements inside the matches element feels arbitrary and not needed.
  • Because of the nesting, it's unclear whether the validationRule on matches should also apply to match values inside it.

@eemeli
Collaborator Author

eemeli commented Dec 17, 2023

Should we hold a separate call on the registry, to align ourselves on its core user stories? We do have the registry's Goals section, but it's rather focused on end users. It leaves out library and implementation developer concerns like:

  • Updating an option's description
  • Generating API documentation from a registry definition
  • Adding a namespaced option to a default function

Considering these is leading me at least towards the thoughts expressed in #561, i.e. dropping the <formatSignature> and <matchSignature> elements. But we should discuss these matters together so that our views can align.

@aphillips aphillips added Future Deferred for future standardization Action-Item Action item assigned by the WG labels Jan 21, 2024
@aphillips
Member

In the 2024-01-15 call we agreed to tag this for Future and that I would set up a separate call about registry format.

@macchiati
Member

macchiati commented Jan 22, 2024 via email

@aphillips aphillips removed the Action-Item Action item assigned by the WG label Feb 18, 2024
@eemeli
Collaborator Author

eemeli commented Jul 29, 2024

Closing due to #815.

@eemeli eemeli closed this Jul 29, 2024
@eemeli eemeli deleted the registry-option-combos branch July 29, 2024 10:14