Add section on Uniqueness and Equality #869

Closed

Collaborator

@eemeli eemeli commented Aug 25, 2024

See #847

Adds a section to the spec intro requiring that string comparisons use NFC normalization when comparing matches.

Explicitly allows for but does not require all content to be NFC-normalized by an implementation.

@eemeli eemeli added the normative Issue affects normative text in the specification label Aug 25, 2024
@eemeli eemeli requested review from aphillips and macchiati August 25, 2024 16:44
Member

@aphillips aphillips left a comment

I think this is the right discussion, but the wrong place to fix it.

Comment on lines +79 to +82
Parts of the specification compare strings with each other.
In all such cases, the comparison MUST be made
after applying [Unicode Normalization Form C](https://unicode.org/reports/tr15/) (NFC)
to each string being compared.

I oppose adding this, not because we shouldn't be explicit about how comparisons are done, but because this is not the right place to do it. We should be explicit about identifier equality and about string equality where these are actually used in the spec.

I also want to avoid requiring NFC at this level and in this way, because messages might not be in a Unicode encoding and because some implementers might object to being required to perform NFC inside of comparisons (vs. pre-normalizing values). There may also be functions that support non-normalized literals as operands or produce deliberately non-normalized output (for example, pseudo-translators that use combining accents to decorate ASCII might produce NFD output).
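
To illustrate the pseudo-translator case, here is a minimal Python sketch. The `pseudo_translate` function is hypothetical and not part of any MessageFormat implementation. It decorates ASCII vowels with U+0301 COMBINING ACUTE ACCENT, so its output is deliberately decomposed, and a forced NFC pass would rewrite it.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"

def pseudo_translate(text: str) -> str:
    """Hypothetical pseudo-translator: append a combining acute to ASCII vowels."""
    return "".join(
        ch + COMBINING_ACUTE if ch in "aeiouAEIOU" else ch
        for ch in text
    )

decorated = pseudo_translate("hello")

# The output keeps base letters and combining marks as separate code points.
is_nfd = unicodedata.is_normalized("NFD", decorated)

# Normalizing to NFC would replace them with precomposed letters.
changed_by_nfc = unicodedata.normalize("NFC", decorated) != decorated
```

Under a blanket "always normalize" rule, a function like this could not return its intended output unchanged.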

I would be more amenable to language such as:

Suggested change
- Parts of the specification compare strings with each other.
- In all such cases, the comparison MUST be made
- after applying [Unicode Normalization Form C](https://unicode.org/reports/tr15/) (NFC)
- to each string being compared.
+ Except where otherwise defined or permitted by this specification,
+ comparison of values for equality is case-sensitive, code point-by-code point
+ comparison (Infra [is](https://infra.spec.whatwg.org/#string-is)).

I could see us adding:

Identifiers (including names, namespaces, and operands) MUST be considered equal
if the strings are identical after Unicode Normalization Form C (NFC) has been applied.
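
The difference between the two rules can be sketched in Python. The function names `is_same_string` and `identifiers_equal` are made up for this illustration; the only library call is `unicodedata.normalize` from the standard library.

```python
import unicodedata

def is_same_string(a: str, b: str) -> bool:
    """Code point-by-code point comparison, with no case folding or normalization."""
    return a == b

def identifiers_equal(a: str, b: str) -> bool:
    """Compare two identifiers after applying NFC to each of them."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

plain_match = is_same_string(composed, decomposed)          # False
normalized_match = identifiers_equal(composed, decomposed)  # True
```

Both rules stay case-sensitive: NFC does not fold case, so `Foo` and `foo` remain different under either function.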

Note well: I have a long (twenty-five plus) year history of wrestling with this issue and you can read the results of that in String Matching. I have no problem--and would encourage--requiring NFC (without case folding) in our namespace. But the cases need to be clear for implementers and we should not require implementers to be normalizing messages and strings on-the-fly.

Collaborator Author

As far as I can tell, the following are the "parts of the specification" that do string matching:

  • Data model error check for duplicate options
  • Data model error check for duplicate variant key lists
  • Data model error check for duplicate declarations
  • Variable resolution
  • Function implementation lookup
  • Standard function set option resolution
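
As one example of what normalizing at these points could look like, here is a rough Python sketch of the duplicate option check. The function name `duplicate_option_names` and the flat list of option names are assumptions made for illustration; the spec does not define this interface.

```python
import unicodedata

def duplicate_option_names(names: list[str]) -> list[str]:
    """Return the option names that repeat an earlier name once both are in NFC."""
    seen: set[str] = set()
    duplicates: list[str] = []
    for name in names:
        key = unicodedata.normalize("NFC", name)  # compare on the NFC form only
        if key in seen:
            duplicates.append(name)
        seen.add(key)
    return duplicates

# "café" spelled with a precomposed é, then with "e" + U+0301.
found = duplicate_option_names(["caf\u00e9", "cafe\u0301", "style"])
```

Here the second spelling is reported as a duplicate even though the two strings differ code point by code point.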

When putting this PR together, I did consider going into each of those and adding the NFC normalization there, but that's tricky in particular for the variable resolution and function lookup, as we've somewhat explicitly left their details out of the spec, so that an implementation can e.g. resolve $foo.bar by looking up the bar property of a foo object, or by deciding for itself how to look up the function for :html:img.

Would more explicitly enumerating here the list of spec parts where normalization happens be sufficient?

Hence this catch-all type of approach, which is intended to be sufficient to ensure that normalization is applied to comparisons, but it's not required to be externally visible, along with an explicit permission to solve the problem by normalizing everything.

I fear that the text you suggest would be misleading, as "code point-by-code point comparison" does not allow for normalization unless the strings in question have been previously normalized.

Member

@aphillips aphillips Aug 26, 2024

Some history is probably warranted here. W3C I18N for many years championed a concept called "Early Uniform Normalization" (EUN), which, in a nutshell, said "normalize all data values to NFC near the point of input so that comparisons can be fast and efficient".

It turns out that there are practical problems ensuring this. There are lots of places where denormalized data can creep in, such that people end up having to at least check data values before comparison.

Normalization of an MF2 message should not just be on the whole of the message as a string. The grammar of MF2 contains literal text that needs to allow non-normalized code point sequences.

I misspoke in saying that each comparison point would be where to specify normalization. What I should have said (and meant to say) is that the grammar of MF2 is sufficiently tight that there is a single "choke point" where we need to talk about normalization of values and it is the production name.

Notice that name is used to create variable names, option names, function names, unquoted literals, namespace names (and thus identifiers), and everything that isn't either an ASCII word (.input, etc.) or some punctuation ({{, }}, {, etc.) or whitespace.

We can simply say that name MUST be NFC or, when converted from another character encoding, must be normalized to NFC. This ensures that matching never needs to worry about normalization.
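
A minimal Python sketch of such a rule follows. The helpers `check_name` and `name_from_bytes` are hypothetical, and raising `ValueError` is only one possible way to report a non-NFC name; `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata

def check_name(name: str) -> str:
    """Accept a name only if it is already in NFC (hypothetical parser hook)."""
    if not unicodedata.is_normalized("NFC", name):
        raise ValueError(f"name is not NFC-normalized: {name!r}")
    return name

def name_from_bytes(raw: bytes, encoding: str) -> str:
    """Decode a name from another character encoding and normalize it to NFC."""
    return unicodedata.normalize("NFC", raw.decode(encoding))

# A decomposed name arriving as bytes comes out composed.
converted = name_from_bytes("cafe\u0301".encode("utf-8"), "utf-8")

try:
    check_name("cafe\u0301")
    rejected = False
except ValueError:
    rejected = True
```

Once every name has passed through one of these two paths, later lookups can use plain code point comparison.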

What about literals? Unquoted literals use name, so they'll be NFC. But what about quoted literals? These can and should allow non-NFC sequences. We do not want to normalize these in order to allow non-normalized sequences or values, which are occasionally useful. Note well that the quotes (| or {{/}}) around literal sequences are not part of the literal. Thus |\u0300| does not treat the combining mark U+0300 as an extension of the | grapheme. (This is why you cannot normalize an MF2 message as a whole.)
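
The effect on quoted literals can be sketched in Python. The message string and the naive `literal_contents` helper are illustration-only assumptions and do not represent a real MF2 parser.

```python
import unicodedata

# A placeholder whose quoted literal is deliberately decomposed: "e" + U+0301.
message = "{|e\u0301|}"

def literal_contents(msg: str) -> str:
    """Naive extraction of the text between the first pair of '|' delimiters."""
    return msg.split("|")[1]

before = literal_contents(message)
after = literal_contents(unicodedata.normalize("NFC", message))

# before is two code points; after is the single precomposed U+00E9.
lengths = (len(before), len(after))
```

Normalizing the message as one string silently changes the literal's value, which is what a rule scoped to name avoids.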

Non-normalized literals, when used in an MF2 message as a value of a key, option, etc. behave as non-normalized values. They may be visually indistinguishable from normalized values and not match, a fact that is also true of lots of strings in Unicode that are normalized. This is rarely a problem (self-spoofing is a Bad Idea). Processing of MF2 messages needs to understand the boundary conditions when parsing.

> I fear that the text you suggest would be misleading, as "code point-by-code point comparison" does not allow for normalization unless the strings in question have been previously normalized.

This is not misleading at all: it explicitly does not allow normalization of the strings in question at the point of comparison. If we have enforced EUN (as described above), we've made it irrelevant. However, EUN of name imposes the cost of carrying around a normalizer and doing normalization checking on implementations.

The alternative to EUN of name is to do what we've currently done: ignore the problem. This turned out to be what the Web (and internet at large) did, which is why charmod-norm is the way that it is. If we adopt that approach (or rather, keep that approach), then it is the responsibility of the user to ensure that their names are normalized (or not) and match each other (or not), because our grammar is normalization sensitive (just as it is case sensitive). There is no cost or burden on implementations in such a case, except as a source of frustration for end-users when values that are visually and semantically indistinguishable don't match.

While the Unicadett in me thinks NFC is the answer (and I fought for 20 years to make it the answer!), in practice I lost that battle and stuff mostly seems to work. If we "accept defeat" here too, we should insert text into our spec about here that says basically "avoid non-normalized name values: they work bad magick."

@aphillips
Member

See #885 for a competing version...

@aphillips
Member

I'm going to close this PR since we have a replacement baking in #885.

@aphillips aphillips closed this Sep 16, 2024
@eemeli eemeli deleted the nfc-equality branch September 17, 2024 00:22