Address name and literal equality (#885)
* Address name and literal equality

This change defines equality as discussed in the 2024-09-09 teleconference in the following ways:

- It defines _name_ equality as being under NFC
- It defines _literal_ equality as explicitly **not** under NFC
- It moves _name_ before _identifier_ in that section of text to avoid a forward definition.

Note that this deviates from discussion in 2024-09-09's call in that we didn't discuss literals at length. It also doesn't discuss non-name/non-literal values, which I'll point out are limited to ASCII sequences such as keywords.

* Typo fix

* Add a note about not requiring implementations to actually normalize

* Implement changes discussed in 2024-09-16 call.

- Make _key_ require NFC for uniqueness/comparison
- Add a note about NFC
- Make _literal_ **_not_** define equality
- Make text in _name_ identical to that in _key_ for consistency

* Update formatting.md to include keys in NFC

* Address comments

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

---------

Co-authored-by: Eemeli Aro <[email protected]>
aphillips and eemeli authored Sep 17, 2024
1 parent 95ec6d5 commit 6f5ad39
Showing 2 changed files with 53 additions and 16 deletions.
5 changes: 4 additions & 1 deletion spec/formatting.md
@@ -502,7 +502,7 @@ Next, using `res`, resolve the preferential order for all message keys:
1. Let `key` be the `var` key at position `i`.
1. If `key` is not the catch-all key `'*'`:
1. Assert that `key` is a _literal_.
-1. Let `ks` be the resolved value of `key`.
+1. Let `ks` be the resolved value of `key` in Unicode Normalization Form C.
1. Append `ks` as the last element of the list `keys`.
1. Let `rv` be the resolved value at index `i` of `res`.
1. Let `matches` be the result of calling the method MatchSelectorKeys(`rv`, `keys`)
@@ -516,6 +516,9 @@ The returned list MAY be empty.
The most-preferred key is first,
with each successive key appearing in order by decreasing preference.
The resolved value of each _key_ MUST be in Unicode Normalization Form C ("NFC"),
even if the _literal_ for the _key_ is not.
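
As an illustration of this requirement, the following Python sketch builds the `keys` list for one selector position. The function name `resolve_keys`, the `CATCH_ALL` sentinel, and the use of plain strings for key literals are assumptions made for this example; they are not defined by the specification.

```python
import unicodedata

# Hypothetical sentinel standing in for the catch-all key `*`.
CATCH_ALL = object()

def resolve_keys(variant_keys):
    """Collect the resolved values of all non-catch-all keys, in NFC."""
    keys = []
    for key in variant_keys:
        if key is CATCH_ALL:
            continue  # the catch-all key is not passed to the matcher
        # The source literal is left untouched; only the resolved value
        # used for matching is normalized.
        keys.append(unicodedata.normalize("NFC", key))
    return keys

# "e" followed by U+0301 COMBINING ACUTE ACCENT is not in NFC.
decomposed = "e\u0301"
keys = resolve_keys([decomposed, CATCH_ALL])
```

Here `keys` holds the single precomposed code point U+00E9, while `decomposed` still holds the two code points that were written in the message source.
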
If calling MatchSelectorKeys encounters any error,
a _Bad Selector_ error is emitted
and an empty list is returned.
64 changes: 49 additions & 15 deletions spec/syntax.md
@@ -444,6 +444,12 @@ A _key_ can be either a _literal_ value or the "catch-all" key `*`.
The **_<dfn>catch-all key</dfn>_** is a special key, represented by `*`,
that matches all values for a given _selector_.

The value of each _key_ MUST be treated as if it were in
[Unicode Normalization Form C](https://unicode.org/reports/tr15/) ("NFC").
Two _keys_ are considered equal if they are canonically equivalent strings,
that is, if they consist of the same sequence of Unicode code points after
Unicode Normalization Form C has been applied to both.
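
A minimal sketch of this comparison in Python, using the standard `unicodedata` module. The helper name `keys_equal` is hypothetical and is used here only for illustration.

```python
import unicodedata

def keys_equal(a: str, b: str) -> bool:
    """Compare two key values as if both were in NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00e5"   # LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "a\u030a"   # "a" followed by COMBINING RING ABOVE

same = keys_equal(precomposed, decomposed)   # canonically equivalent
different = keys_equal("a", precomposed)     # distinct strings
```

The two spellings of "å" differ as raw code point sequences but compare equal under this rule, so a message using both as keys for the same selector would be treated as repeating one key.
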

## Expressions

An **_<dfn>expression</dfn>_** is a part of a _message_ that will be determined
@@ -690,6 +696,20 @@ except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF.

All code points are preserved.

> [!IMPORTANT]
> Most text, including that produced by common keyboards and input methods,
> is already encoded in the canonical form known as
> [Unicode Normalization Form C](https://unicode.org/reports/tr15) ("NFC").
> A few languages, legacy character encoding conversions, or operating environments
> can result in _literal_ values that are not in this form.
> Some uses of _literals_ in MessageFormat,
> notably as the value of _keys_,
> apply NFC to the _literal_ value during processing or comparison.
> While there is no requirement that the _literal_ value actually be entered
> in a normalized form,
> users are cautioned to employ the same character sequences
> for equivalent values and, whenever possible, ensure _literals_ are in NFC.

A **_<dfn>quoted literal</dfn>_** begins and ends with U+007C VERTICAL LINE `|`.
The characters `\` and `|` within a _quoted literal_ MUST be
escaped as `\\` and `\|`.
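
The escaping rule can be sketched in Python as follows. The helper name `quote_literal` is hypothetical, and the sketch assumes the input contains only code points that are permitted in a _literal_.

```python
def quote_literal(value: str) -> str:
    """Wrap a string in `|...|`, escaping backslashes and vertical lines."""
    # Escape backslashes first so the backslashes added for `|` are not doubled.
    escaped = value.replace("\\", "\\\\").replace("|", "\\|")
    return "|" + escaped + "|"

quoted = quote_literal("a|b\\c")  # input is the five characters a|b\c
```

No normalization is applied here, in line with the statement that all code points of a _literal_ are preserved.
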
@@ -714,21 +734,6 @@ number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["-" / "

### Names and Identifiers

An **_<dfn>identifier</dfn>_** is a character sequence that
identifies a _function_, _markup_, or _option_.
Each _identifier_ consists of a _name_ optionally preceded by
a _namespace_.
When present, the _namespace_ is separated from the _name_ by a
U+003A COLON `:`.
Built-in _functions_ and their _options_ do not have a _namespace_ identifier.

The _namespace_ `u` (U+0075 LATIN SMALL LETTER U)
is reserved for future standardization.

_Function_ _identifiers_ are prefixed with `:`.
_Markup_ _identifiers_ are prefixed with `#` or `/`.
_Option_ _identifiers_ have no prefix.

A **_<dfn>name</dfn>_** is a character sequence used in an _identifier_
or as the name for a _variable_
or the value of an _unquoted literal_.
@@ -740,6 +745,20 @@ when matching _name_ or _identifier_ strings or _unquoted literal_ values.

_Variable_ _names_ are prefixed with `$`.

Two _names_ are considered equal if they are canonically equivalent strings,
that is, if they consist of the same sequence of Unicode code points after
[Unicode Normalization Form C](https://unicode.org/reports/tr15/) ("NFC")
has been applied to both.

> [!NOTE]
> Implementations are not required to normalize all _names_.
> Comparisons of _name_ values only need be done "as-if" normalization
> has occurred.
> Since most text in the wild is already in NFC
> and since checking for NFC is fast and efficient,
> implementations can often substitute checking for actually applying normalization
> to _name_ values.
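
One way to realize the "as-if" comparison described in the note, sketched in Python. The helper name `names_equal` is hypothetical; `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata

def names_equal(a: str, b: str) -> bool:
    """Compare two names as if both had been normalized to NFC."""
    # Identical code point sequences are always canonically equivalent.
    if a == b:
        return True
    # Check first, and normalize only the operands that need it.
    if not unicodedata.is_normalized("NFC", a):
        a = unicodedata.normalize("NFC", a)
    if not unicodedata.is_normalized("NFC", b):
        b = unicodedata.normalize("NFC", b)
    return a == b

match = names_equal("caf\u00e9", "cafe\u0301")  # same name, two spellings
mismatch = names_equal("count", "total")        # different names
```

Because most names are already in NFC, the common path performs only the cheap check and a plain string comparison, without allocating normalized copies.
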

Valid content for _names_ is based on <cite>Namespaces in XML 1.0</cite>'s
[NCName](https://www.w3.org/TR/xml-names/#NT-NCName).
This is different from XML's [Name](https://www.w3.org/TR/xml/#NT-Name)
@@ -751,6 +770,21 @@ Otherwise, the set of characters allowed in a _name_ is large.
> Such variables cannot be referenced in a _message_,
> but are not otherwise errors.

An **_<dfn>identifier</dfn>_** is a character sequence that
identifies a _function_, _markup_, or _option_.
Each _identifier_ consists of a _name_ optionally preceded by
a _namespace_.
When present, the _namespace_ is separated from the _name_ by a
U+003A COLON `:`.
Built-in _functions_ and their _options_ do not have a _namespace_ identifier.

The _namespace_ `u` (U+0075 LATIN SMALL LETTER U)
is reserved for future standardization.

_Function_ _identifiers_ are prefixed with `:`.
_Markup_ _identifiers_ are prefixed with `#` or `/`.
_Option_ _identifiers_ have no prefix.

Examples:
> A variable:
>```