Add section on Uniqueness and Equality #869
I think this is the right discussion, but the wrong place to fix it.
Parts of the specification compare strings with each other.
In all such cases, the comparison MUST be made
after applying [Unicode Normalization Form C](https://unicode.org/reports/tr15/) (NFC)
to each string being compared.
I oppose adding this, not because we shouldn't be explicit about how comparisons are done, but because this is not the right place to do it. We should be explicit about identifier equality and about string equality where these are actually used in the spec.
I also want to avoid requiring NFC at this level and in this way, because messages might not be in a Unicode encoding and because some implementers might object to being required to perform NFC inside of comparisons (vs. pre-normalizing values). There may also be functions that support non-normalized literals as operands or produce deliberately non-normalized output (for example, pseudo-translators that use combining accents to decorate ASCII might produce NFD output).
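To illustrate the pseudo-translator case, here is a minimal sketch in Python. The `pseudo_translate` helper is hypothetical and not part of any MF2 implementation; it only shows how decorating ASCII with combining accents yields output that is NFD but not NFC.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # U+0301 COMBINING ACUTE ACCENT


def pseudo_translate(text: str) -> str:
    """Hypothetical pseudo-translator: append a combining acute to ASCII vowels."""
    return "".join(
        ch + COMBINING_ACUTE if ch in "aeiouAEIOU" else ch for ch in text
    )


decorated = pseudo_translate("Hello")

# Each accented vowel stays as base letter + combining mark, so the
# output is already in NFD and is deliberately not in NFC.
is_nfd = unicodedata.is_normalized("NFD", decorated)
is_nfc = unicodedata.is_normalized("NFC", decorated)
```

A comparison that silently applied NFC to this output would treat it as equal to its precomposed form, which may not be what such a function intends.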
I would be more amenable to language such as:
Except where otherwise defined or permitted by this specification,
comparison of values for equality is case-sensitive, code point-by-code point
comparison (Infra [is](https://infra.spec.whatwg.org/#string-is)).
I could see us adding:
Identifiers (including names, namespaces, and operands) MUST be considered equal
if the strings are identical after Unicode Normalization Form C (NFC) has been applied.
Note well: I have a long (twenty-five plus) year history of wrestling with this issue and you can read the results of that in String Matching. I have no problem--and would encourage--requiring NFC (without case folding) in our namespace. But the cases need to be clear for implementers and we should not require implementers to be normalizing messages and strings on-the-fly.
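The difference between the two kinds of equality discussed here can be sketched in Python. The function names `code_point_equal` and `nfc_equal` are illustrative assumptions, not names from the spec or any implementation.

```python
import unicodedata


def code_point_equal(a: str, b: str) -> bool:
    """Case-sensitive, code point-by-code point comparison (no normalization)."""
    return a == b


def nfc_equal(a: str, b: str) -> bool:
    """Equality after applying NFC to both operands; no case folding."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

strict_result = code_point_equal(precomposed, decomposed)
nfc_result = nfc_equal(precomposed, decomposed)
case_result = nfc_equal("Name", "name")
```

Under these assumptions the two spellings differ code point by code point but are equal under NFC, while names that differ only in case remain distinct in both comparisons.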
As far as I can tell, the following are the "parts of the specification" that do string matching:
- Data model error check for duplicate options
- Data model error check for duplicate variant key lists
- Data model error check for duplicate declarations
- Variable resolution
- Function implementation lookup
- Standard function set option resolution
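As a rough sketch of what normalization at one of these points could look like, the following Python checks a list of option names for duplicates under NFC. The function `find_duplicate_option` and its list-of-strings input are assumptions made for illustration; the spec's data model is not represented here.

```python
import unicodedata
from typing import Iterable, Optional


def find_duplicate_option(option_names: Iterable[str]) -> Optional[str]:
    """Return the first name that repeats an earlier one under NFC, else None."""
    seen = set()
    for name in option_names:
        # Compare on the NFC form; the caller's original string is untouched.
        key = unicodedata.normalize("NFC", name)
        if key in seen:
            return name
        seen.add(key)
    return None


duplicate = find_duplicate_option(["caf\u00e9", "style", "cafe\u0301"])
no_duplicate = find_duplicate_option(["minimumFractionDigits", "style"])
```

The normalization is internal to the check, so it is not externally visible: nothing here requires the stored option names themselves to be rewritten.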
When putting this PR together, I did consider going into each of those and adding the NFC normalization there, but that's tricky in particular for the variable resolution and function lookup, as we've somewhat explicitly left their details out of the spec, so that an implementation can e.g. resolve `$foo.bar` by looking up the `bar` property of a `foo` object, or by deciding for itself how to look up the function for `:html:img`.
Would more explicitly enumerating here the list of spec parts where normalization happens be sufficient?
Hence this catch-all type of approach, which is intended to be sufficient to ensure that normalization is applied to comparisons, but it's not required to be externally visible, along with an explicit permission to solve the problem by normalizing everything.
I fear that the text you suggest would be misleading, as "code point-by-code point comparison" does not allow for normalization unless the strings in question have been previously normalized.
Some history is probably warranted here. W3C I18N for many years championed a concept called "Early Uniform Normalization" (EUN), which, in a nutshell, said "normalize all data values to NFC near the point of input so that comparisons can be fast and efficient".
It turns out that there are practical problems ensuring this. There are lots of places where denormalized data can creep in, such that people end up having to at least check data values before comparison.
Normalization of an MF2 message should not just be on the whole of the message as a string. The grammar of MF2 contains literal text that needs to allow non-normalized code point sequences.
I misspoke in saying that each comparison point would be where to specify normalization. What I should have said (and meant to say) is that the grammar of MF2 is sufficiently tight that there is a single "choke point" where we need to talk about normalization of values and it is the production `name`.
Notice that `name` is used to create variable names, option names, function names, unquoted literals, namespace names (and thus identifiers), and everything that isn't either an ASCII word (`.input`, etc.) or some punctuation (`{{`, `}}`, `{`, etc.) or whitespace.
We can simply say that `name` MUST be NFC or, when converted from another character encoding, must be normalized to NFC. This ensures that matching never needs to worry about normalization.
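A minimal sketch of what enforcing this at the `name` production might involve, in Python. Both helpers (`check_name`, `name_from_bytes`) are hypothetical; `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata


def check_name(name: str) -> str:
    """Accept a name only if it is already in NFC (hypothetical parser hook)."""
    if not unicodedata.is_normalized("NFC", name):
        raise ValueError(f"name is not NFC: {name!r}")
    return name


def name_from_bytes(raw: bytes, encoding: str) -> str:
    """Decode a name from another character encoding and normalize it to NFC."""
    return unicodedata.normalize("NFC", raw.decode(encoding))


accepted = check_name("caf\u00e9")
converted = name_from_bytes(b"caf\xe9", "latin-1")

try:
    check_name("cafe\u0301")
    rejected = False
except ValueError:
    rejected = True
```

With names guaranteed to be NFC at this point, later matching can use plain code point comparison. The cost is that the implementation has to carry normalization data to perform the check.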
What about literals? Unquoted literals use `name`, so they'll be NFC. But what about quoted literals? These can and should allow non-NFC sequences. We do not want to normalize these in order to allow non-normalized sequences or values, which are occasionally useful. Note well that the quotes (`|` or `{{`/`}}`) around literal sequences are not part of the literal. Thus `|\u0300|` does not treat the combining mark U+0300 as an extension of the `|` grapheme. (This is why you cannot normalize an MF2 message as a whole.)
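One effect of normalizing a whole message can be sketched in Python: the content of a quoted literal changes. The `quoted_literal` helper is a toy extraction written for this example, not a real MF2 parser, and the message string is an assumed sample.

```python
import unicodedata


def quoted_literal(message: str) -> str:
    """Return the text between the first pair of | quotes (toy extraction only)."""
    start = message.index("|") + 1
    end = message.index("|", start)
    return message[start:end]


# A quoted literal that deliberately holds 'e' followed by U+0301.
message = "Hello {|e\u0301|}"

# Normalizing the message as one string also rewrites the literal's content.
normalized_message = unicodedata.normalize("NFC", message)

original_literal = quoted_literal(message)            # two code points
altered_literal = quoted_literal(normalized_message)  # one code point, U+00E9
```

An author who wrote the decomposed sequence on purpose would no longer get it after whole-message normalization, which is one reason normalization would need to be scoped to `name` rather than applied to the raw message text.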
Non-normalized literals, when used in an MF2 message as a value of a key, option, etc. behave as non-normalized values. They may be visually indistinguishable from normalized values and not match, a fact that is also true of lots of strings in Unicode that are normalized. This is rarely a problem (self-spoofing is a Bad Idea). Processing of MF2 messages needs to understand the boundary conditions when parsing.
> I fear that the text you suggest would be misleading, as "code point-by-code point comparison" does not allow for normalization unless the strings in question have been previously normalized.
This is not misleading at all: it explicitly does not allow normalization of the strings in question at the point of comparison. If we have enforced EUN (as described above), we've made it irrelevant. However, EUN of `name` imposes the cost of carrying around a normalizer and doing normalization checking on implementations.
The alternative to EUN of name is to do what we've currently done: ignore the problem. This turned out to be what the Web (and internet at large) did, which is why charmod-norm is the way that it is. If we adopt that approach (or rather, keep that approach), then it is the responsibility of the user to ensure that their names are normalized (or not) and match each other (or not), because our grammar is normalization sensitive (just as it is case sensitive). There is no cost or burden on implementations in such a case, except as a source of frustration for end-users when values that are visually and semantically indistinguishable don't match.
While the Unicadett in me thinks NFC is the answer (and I fought for 20 years to make it the answer!), in practice I lost that battle and stuff mostly seems to work. If we "accept defeat" here too, we should insert text into our spec about here that says basically "avoid non-normalized name values: they work bad magick"
See #885 for a competing version...
I'm going to close this PR since we have a replacement baking in #885.
See #847
Adds a section to the spec intro requiring that string comparisons use NFC normalization when comparing matches.
Explicitly allows for but does not require all content to be NFC-normalized by an implementation.