Add section on Uniqueness and Equality #869

eemeli · 2024-08-25T16:44:38Z

Adds a section to the spec intro requiring that strings be NFC-normalized before they are compared.

Explicitly allows, but does not require, an implementation to NFC-normalize all content.

aphillips

I think this is the right discussion, but the wrong place to fix it.

aphillips · 2024-08-25T18:00:00Z

spec/README.md

+Parts of the specification compare strings with each other.
+In all such cases, the comparison MUST be made
+after applying [Unicode Normalization Form C](https://unicode.org/reports/tr15/) (NFC)
+to each string being compared.


I oppose adding this, not because we shouldn't be explicit about how comparisons are done, but because this is not the right place to do it. We should be explicit about identifier equality and about string equality where these are actually used in the spec.

I also want to avoid requiring NFC at this level and in this way, because messages might not be in a Unicode encoding and because some implementers might object to being required to perform NFC inside of comparisons (vs. pre-normalizing values). There may also be functions that support non-normalized literals as operands or produce deliberately non-normalized output (for example, pseudo-translators that use combining accents to decorate ASCII might produce NFD output).

I would be more amenable to language such as:

Suggested change

-Parts of the specification compare strings with each other.
-In all such cases, the comparison MUST be made
-after applying [Unicode Normalization Form C](https://unicode.org/reports/tr15/) (NFC)
-to each string being compared.
+Except where otherwise defined or permitted by this specification,
+comparison of values for equality is case-sensitive, code point-by-code point
+comparison (Infra [is](https://infra.spec.whatwg.org/#string-is)).

I could see us adding:

Identifiers (including names, namespaces, and operands) MUST be considered equal
if the strings are identical after Unicode Normalization Form C (NFC) has been applied.
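
To make the difference between the two wordings concrete, here is a minimal Python sketch. The function names `code_point_equal` and `nfc_equal` are hypothetical and are not part of the spec or of any implementation; the sketch only assumes Python's standard `unicodedata` module.

```python
import unicodedata


def code_point_equal(a: str, b: str) -> bool:
    # Python compares str values code point by code point, case-sensitively,
    # which roughly corresponds to the Infra "is" comparison.
    return a == b


def nfc_equal(a: str, b: str) -> bool:
    # Equality after applying NFC to both sides; no case folding is applied.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9"      # é as the single code point U+00E9
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

print(code_point_equal(composed, decomposed))  # False
print(nfc_equal(composed, decomposed))         # True
print(nfc_equal("Name", "name"))               # False: still case-sensitive
```

Under the first wording the two spellings of the identifier are different; under the second they are equal.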

Note well: I have a long (twenty-five-plus year) history of wrestling with this issue, and you can read the results of that in String Matching. I have no problem with requiring NFC (without case folding) in our namespace, and would encourage it. But the cases need to be clear for implementers, and we should not require implementers to normalize messages and strings on the fly.

eemeli · 2024-08-26

As far as I can tell, the following are the "parts of the specification" that do string matching:

Data model error check for duplicate options

Data model error check for duplicate variant key lists

Data model error check for duplicate declarations

Variable resolution

Function implementation lookup

Standard function set option resolution

When putting this PR together, I did consider going into each of those and adding NFC normalization there. That is tricky, particularly for variable resolution and function lookup, because we have somewhat explicitly left their details out of the spec: an implementation can, for example, resolve $foo.bar by looking up the bar property of a foo object, or decide for itself how to look up the function for :html:img.

Hence this catch-all approach, which is intended to ensure that normalization is applied to comparisons without requiring it to be externally visible, along with explicit permission to solve the problem by normalizing everything.

Would it be sufficient to enumerate here, more explicitly, the parts of the spec where normalization happens?
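
As an illustration of what normalizing at the comparison point could look like for one of the checks listed above (duplicate options), here is a rough Python sketch. The helper `find_duplicate_option` is hypothetical and takes option names as plain strings; it is not taken from the spec or from any implementation.

```python
import unicodedata


def find_duplicate_option(option_names):
    """Return the first option name that repeats under NFC comparison, else None."""
    seen = set()
    for name in option_names:
        # Normalize only the comparison key; the original name is left untouched,
        # so the normalization is not externally visible.
        key = unicodedata.normalize("NFC", name)
        if key in seen:
            return name
        seen.add(key)
    return None


# The same option name spelled with composed and decomposed é.
print(find_duplicate_option(["\u00e9", "e\u0301"]) is not None)  # True
print(find_duplicate_option(["minimumFractionDigits", "style"]))  # None
```

An implementation that normalizes everything up front could drop the `normalize` call and compare the stored names directly.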

I fear that the text you suggest would be misleading, as "code point-by-code point comparison" does not allow for normalization unless the strings in question have been previously normalized.

aphillips · 2024-08-26 (edited)

Some history is probably warranted here. W3C I18N for many years championed a concept called "Early Uniform Normalization" (EUN), which, in a nutshell, said "normalize all data values to NFC near the point of input so that comparisons can be fast and efficient".

It turns out that there are practical problems ensuring this. There are lots of places where denormalized data can creep in, such that people end up having to at least check data values before comparison.
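
A small Python sketch of the EUN idea and of the "check before comparison" fallback described above. The function names are hypothetical; `unicodedata.is_normalized` is in the standard library as of Python 3.8.

```python
import unicodedata


def normalize_at_input(value: str) -> str:
    # EUN: normalize once, near the point of input.
    return unicodedata.normalize("NFC", value)


def checked_equal(a: str, b: str) -> bool:
    # If EUN held everywhere, a plain comparison would be enough. Because
    # denormalized data can creep in, verify both operands first.
    for s in (a, b):
        if not unicodedata.is_normalized("NFC", s):
            raise ValueError(f"value is not NFC-normalized: {s!r}")
    return a == b


stored = normalize_at_input("e\u0301")   # becomes U+00E9
print(checked_equal(stored, "\u00e9"))   # True
```

Passing a value that skipped `normalize_at_input`, such as `"e\u0301"`, makes `checked_equal` raise `ValueError` instead of silently reporting a mismatch.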

Normalization of an MF2 message cannot simply be applied to the whole message as a string. The grammar of MF2 contains literal text that needs to allow non-normalized code point sequences.

I misspoke in saying that each comparison point would be where to specify normalization. What I should have said (and meant to say) is that the grammar of MF2 is sufficiently tight that there is a single "choke point" where we need to talk about normalization of values and it is the production name.

Notice that name is used to create variable names, option names, function names, unquoted literals, namespace names (and thus identifiers), and everything that isn't either an ASCII word (.input, etc.) or some punctuation ({{, }}, {, etc.) or whitespace.

We can simply say that name MUST be NFC or, when converted from another character encoding, must be normalized to NFC. This ensures that matching never needs to worry about normalization.

What about literals? Unquoted literals use name, so they'll be NFC. But what about quoted literals? These can and should allow non-NFC sequences. We do not want to normalize these in order to allow non-normalized sequences or values, which are occasionally useful. Note well that the quotes (| or {{/}}) around literal sequences are not part of the literal. Thus |\u0300| does not treat the combining mark U+0300 as an extension of the | grapheme. (This is why you cannot normalize an MF2 message as a whole.)
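
A quick Python sketch of why whole-message normalization is a problem for quoted literals. The message, the `:fn` function, and the `opt` option are made up for illustration, and splitting on `|` is a shortcut that stands in for a real parser.

```python
import unicodedata

# A message whose quoted literal is deliberately non-normalized:
# e followed by U+0301 COMBINING ACUTE ACCENT.
message = "{$x :fn opt=|e\u0301|}"

# Normalizing the message as a whole rewrites the literal's content too.
normalized = unicodedata.normalize("NFC", message)

literal_before = message.split("|")[1]
literal_after = normalized.split("|")[1]

print(len(literal_before))              # 2
print(len(literal_after))               # 1
print(literal_before == literal_after)  # False
```

The author's non-normalized literal does not survive, which is why normalization would need to be scoped to name rather than applied to the message string.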

Non-normalized literals, when used in an MF2 message as a value of a key, option, etc. behave as non-normalized values. They may be visually indistinguishable from normalized values and not match, a fact that is also true of lots of strings in Unicode that are normalized. This is rarely a problem (self-spoofing is a Bad Idea). Processing of MF2 messages needs to understand the boundary conditions when parsing.
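
As a small illustration of the point that normalized strings can also look identical without matching, here is a Python sketch using Latin and Cyrillic lookalike letters. The variable names are arbitrary.

```python
import unicodedata

latin_a = "a"          # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A

# Both strings are already in NFC, so normalization changes nothing.
both_nfc = all(unicodedata.is_normalized("NFC", s) for s in (latin_a, cyrillic_a))

print(both_nfc)               # True
print(latin_a == cyrillic_a)  # False
```

NFC addresses canonically equivalent sequences only; it does not make visually confusable characters compare equal.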

> I fear that the text you suggest would be misleading, as "code point-by-code point comparison" does not allow for normalization unless the strings in question have been previously normalized.

This is not misleading at all: it explicitly does not allow normalization of the strings in question at the point of comparison. If we have enforced EUN (as described above), we've made it irrelevant. However, EUN of name imposes the cost of carrying around a normalizer and doing normalization checking on implementations.

The alternative to EUN of name is to do what we've currently done: ignore the problem. This turned out to be what the Web (and internet at large) did, which is why charmod-norm is the way that it is. If we adopt that approach (or rather, keep that approach), then it is the responsibility of the user to ensure that their names are normalized (or not) and match each other (or not), because our grammar is normalization sensitive (just as it is case sensitive). There is no cost or burden on implementations in such a case, except as a source of frustration for end-users when values that are visually and semantically indistinguishable don't match.

While the Unicadett in me thinks NFC is the answer (and I fought for 20 years to make it the answer!), in practice I lost that battle and stuff mostly seems to work. If we "accept defeat" here too, we should insert text into our spec about here that says basically "avoid non-normalized name values: they work bad magick"

aphillips · 2024-09-13T19:27:46Z

See #885 for a competing version...

aphillips · 2024-09-16T22:21:55Z

I'm going to close this PR since we have a replacement baking in #885.

Add section on Uniqueness and Equality

b681e27

eemeli added the normative label (Issue affects normative text in the specification) Aug 25, 2024

eemeli requested review from aphillips and macchiati August 25, 2024 16:44

aphillips requested changes Aug 25, 2024

aphillips mentioned this pull request Sep 9, 2024

Conformance with UAX #31 & UTS #55 #847

Closed

aphillips closed this Sep 16, 2024

eemeli deleted the nfc-equality branch September 17, 2024 00:22
