Clarify the absolute nature of "any code point" #282

gibson042 · 2022-06-09T16:59:14Z

Adds explicit mention of cases that are often overlooked.

Ref #268

Adds explicit mention of cases that are often overlooked. Fixes unicode-org#268

spec/syntax.md

eemeli

Looks good; some nitpicks below.

eemeli · 2022-06-09T18:30:57Z

spec/syntax.md

+This includes line-breaking characters (such as U+000A LINE FEED and U+000D CARRIAGE RETURN),
+other control characters (such as U+0000 NULL and U+0009 TAB),
+permanently reserved noncharacters (U+FDD0 through U+FDEF and U+<i>n</i>FFFE and U+<i>n</i>FFFF where <i>n</i> is 0x0 through 0x10),
+surrogate code points (U+D800 through U+DFFF),
+private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and U+100000 through U+10FFFD),
+and unassigned code points.
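
For readers who want to check a value against the categories in the added text, here is a minimal Python sketch. The `classify` helper and its labels are hypothetical and not part of the spec or this PR. It does not try to detect unassigned code points, since that depends on the Unicode version, and it reports line-breaking characters such as LINE FEED simply as controls.

```python
def classify(cp: int) -> str:
    """Return a coarse label for a code point (hypothetical helper, not spec text)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    if (0xE000 <= cp <= 0xF8FF
            or 0xF0000 <= cp <= 0xFFFFD
            or 0x100000 <= cp <= 0x10FFFD):
        return "private-use"
    # C0 controls, DEL, and C1 controls (covers LF, CR, NULL, and TAB).
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return "control"
    return "other"  # assigned or unassigned; not distinguished in this sketch


for cp in (0x000A, 0x0000, 0xFDD0, 0x1FFFE, 0xD800, 0xE000, 0x2067):
    print(f"U+{cp:04X}: {classify(cp)}")
```

Under the current grammar every one of these values is allowed literally; the labels only show which of the easily overlooked groups a value falls into.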


Is there a benefit of including such an explicit and exhaustive list here? It feels like it doesn't bring any real benefit, while creating a potential maintenance burden if a theoretical future change were to alter any of these ranges.

I'm not sure why we don't exclude the non-characters (*FFFE and *FFFF as well as the surrogates)? The other characters are just characters, but those shouldn't appear?

@eemeli The benefit of this list is specifically calling attention to easily-overlooked consequences of allowing such a broad collection of code points to be expressed literally, as perfectly demonstrated by @aphillips. And it's not even exhaustive; I intentionally left out potentially troublesome but text-oriented characters such as U+2067 RIGHT-TO-LEFT ISOLATE.

Okay, point taken. I do not have particularly strong feelings about the exact range we choose to include here, but appreciate that there may well be real-world considerations that we ought to take into account. This particular proposed change does clarify the current situation, and hopefully provides a better platform for later conversations about restricting it.

@gibson042 This could become complicated. It depends on whether we intend the syntax to be enforced by implementations, validated in tests, and the like. "Potentially troublesome" characters, such as RLI, are "garbage in/garbage out" (let the user beware). Unassigned characters can become assigned. Etc. But my comment was that non-character code points ought to be excluded (since we are doing text processing).

This looks like a job for USVString?

> my comment was that non-character code points ought to be excluded (since we are doing text processing)

For the record, I agree with this. A particular negative consequence of the current "anything goes" stance just occurred to me: any Message Format literal including a surrogate code point cannot be represented in UTF-8, which on its own seems to justify excluding at least the same range as USVString.
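
The UTF-8 point can be illustrated with a short Python sketch. Python strings are sequences of code points and may hold a surrogate, which makes the failure easy to show. The helper names below are hypothetical, and `to_scalar_values` only approximates the USVString conversion by replacing each surrogate with U+FFFD.

```python
def has_surrogate(text: str) -> bool:
    """True if any code point in text is a surrogate (U+D800..U+DFFF)."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def to_scalar_values(text: str) -> str:
    """Replace each surrogate with U+FFFD, roughly what USVString conversion does."""
    return "".join("\uFFFD" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text)


literal = "a\ud800b"  # a literal containing a lone surrogate code point

try:
    literal.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    # Well-formed UTF-8 has no byte sequence for a surrogate code point.
    encodable = False

print(has_surrogate(literal), encodable)
print(to_scalar_values(literal).encode("utf-8"))
```

Excluding the surrogate range from the grammar would rule out the first case entirely, instead of leaving each implementation to pick its own replacement or error behavior.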

I think we should think about this syntax as "in memory representation": I loaded the string from somewhere, and it is now in memory.

So the encoding ends up being either a dedicated class that can support almost anything (including surrogates), or whatever the tech stack uses (UTF-8 in Linux/Mac C/C++ strings; wchar_t, which is UTF-16 on Windows and UTF-32 on Linux; always UTF-16 in JS / Java).

We don't really care what is there, and we don't require it to be correct Unicode.

This would mean that we should not make this change.

I don't have a strong opinion either way.
But we should understand the implications.

I have a slight preference for "anything goes, correct Unicode or not". Code units, not code points.
MessageFormat should not be in the business of validating / fixing incorrect surrogates.

So my approval for this PR does not mean "I like it", but "I am not against it" :-)

Hmmmm... Maybe we should have reached an agreement on this in an issue, before getting to a PR.

This PR is purely editorial, and I think entirely appropriate for prompting such discussion even if it ends up replaced before merging.

> I think we should think about this syntax as "in memory representation": I loaded the string from somewhere, and it is now in memory.

I strongly disagree. Syntax is something all parties must agree upon, and it constrains both what can be communicated and how. It is not an in-memory representation (as evidenced by e.g. the \\ and \" escape sequences), and even if it were, that would still not remove the need to address representation in all the various possibilities for getting it to memory.

> This would mean that we should not make this change.

For clarity, what change should we not make? The exclusion of surrogate code points (which I do support but have not included in this PR)?

> I have a slight preference for "anything goes, correct Unicode or not". Code units, not code points.

Now you have truly lost me, because code units only apply within an encoding (e.g., UTF-8 octets with special handling of leading and continuation bytes vs. UTF-16 hexadectets with special handling of surrogates).
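
To make the distinction concrete, here is a small Python sketch showing that a single code point yields different code units depending on the encoding. The `code_units` helper is hypothetical, and U+1F600 is just a convenient supplementary-plane example.

```python
def code_units(text: str, encoding: str, width: int) -> list:
    """Encode text and split the bytes into big-endian code units of `width` bytes."""
    data = text.encode(encoding)
    return [int.from_bytes(data[i:i + width], "big") for i in range(0, len(data), width)]


ch = "\U0001F600"  # one code point, U+1F600

utf8_units = code_units(ch, "utf-8", 1)       # four 8-bit code units
utf16_units = code_units(ch, "utf-16-be", 2)  # two 16-bit code units (a surrogate pair)
utf32_units = code_units(ch, "utf-32-be", 4)  # one 32-bit code unit

for name, units in (("UTF-8", utf8_units), ("UTF-16", utf16_units), ("UTF-32", utf32_units)):
    print(name, [f"{u:X}" for u in units])
```

A grammar written in terms of code units would therefore have to name an encoding, whereas one written in terms of code points does not.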

Clarify the absolute nature of "any code point"

96a7ecd

gibson042 mentioned this pull request Jun 9, 2022

Strings: Are all code points preserved? #268

Closed

mihnita approved these changes Jun 9, 2022

gibson042 commented Jun 9, 2022

Use consistent representations

b7f16a3

eemeli linked an issue Jun 9, 2022 that may be closed by this pull request

Strings: Are all code points preserved? #268

Closed

eemeli approved these changes Jun 9, 2022

Update message.ebnf

abca063

romulocintra approved these changes Jun 11, 2022

eemeli merged commit 752dc44 into unicode-org:develop Jun 13, 2022

echeran pushed a commit that referenced this pull request Sep 20, 2022

Clarify the absolute nature of "any code point" (#282)

dd5e580

gibson042 mentioned this pull request Dec 26, 2023

Forbid {�} as a valid expression #576

Merged

Review thread comments, by author and date:

eemeli Jun 9, 2022
aphillips Jun 9, 2022
gibson042 Jun 9, 2022
eemeli Jun 9, 2022
aphillips Jun 9, 2022
gibson042 Jun 9, 2022
mihnita Jun 10, 2022
mihnita Jun 10, 2022
gibson042 Jun 11, 2022