Clarify the absolute nature of "any code point" #282
Adds explicit mention of cases that are often overlooked. Fixes unicode-org#268
Looks good; some nitpicks below.
This includes line-breaking characters (such as U+000A LINE FEED and U+000D CARRIAGE RETURN),
other control characters (such as U+0000 NULL and U+0009 TAB),
permanently reserved noncharacters (U+FDD0 through U+FDEF and U+<i>n</i>FFFE and U+<i>n</i>FFFF where <i>n</i> is 0x0 through 0x10),
surrogate code points (U+D800 through U+DBFF),
private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and U+100000 through U+10FFFD),
and unassigned code points.
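For illustration only, here is a minimal Python sketch of how the categories in the proposed text could be told apart. The helper name `classify` and its labels are hypothetical and not part of the spec or this PR. It relies on the general categories reported by the standard `unicodedata` module, so which code points count as "unassigned" depends on the Unicode version bundled with the interpreter. Note that the surrogate block as a whole runs through U+DFFF; U+DBFF is where the high-surrogate half ends.

```python
import unicodedata


def classify(cp: int) -> str:
    """Return a coarse label for a code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    # Everything else is labeled from its general category.
    category = unicodedata.category(chr(cp))
    return {
        "Cs": "surrogate",    # U+D800..U+DFFF
        "Co": "private-use",  # BMP private use area and planes 15 and 16
        "Cc": "control",      # C0 and C1 controls, including LF, CR, NULL, TAB
        "Cn": "unassigned",   # depends on the bundled Unicode version
    }.get(category, "other")
```

All of these labels describe code points that the "any code point" wording currently allows to appear literally.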
Is there a benefit of including such an explicit and exhaustive list here? It feels like it doesn't bring any real benefit, while creating a potential maintenance burden if a theoretical future change were to alter any of these ranges.
I'm not sure why we don't exclude the non-characters (*FFFE and *FFFF as well as the surrogates)? The other characters are just characters, but those shouldn't appear?
@eemeli The benefit of this list is specifically calling attention to easily-overlooked consequences of allowing such a broad collection of code points to be expressed literally, as perfectly demonstrated by @aphillips. And it's not even exhaustive; I intentionally left out potentially troublesome but text-oriented characters such as U+2067 RIGHT-TO-LEFT ISOLATE.
Okay, point taken. I do not have particularly strong feelings about the exact range we choose to include here, but appreciate that there may well be real-world considerations that we ought to take into account. This particular proposed change does clarify the current situation, and hopefully provides a better platform for later conversations about restricting it.
@gibson042 This could become complicated. It depends on whether we intend the syntax to be enforced by implementations, validated in tests, and the like. "Potentially troublesome" characters, such as RLI, are "garbage in/garbage out" (let the user beware). Unassigned characters can become assigned. Etc. But my comment was that non-character code points ought to be excluded (since we are doing text processing).
This looks like a job for USVString?
> my comment was that non-character code points ought to be excluded (since we are doing text processing)
For the record, I agree with this. A particular negative consequence of the current "anything goes" stance just occurred to me: any Message Format literal including a surrogate code point cannot be represented in UTF-8, which on its own seems to justify excluding at least the same range as USVString.
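A small Python sketch of the UTF-8 point, for illustration. The helper names `is_scalar_value` and `utf8_or_none` are hypothetical. Python's `str` can hold a lone surrogate in memory, but its strict UTF-8 encoder rejects it. The sketch assumes that "the same range as USVString" means Unicode scalar values, that is, every code point except U+D800 through U+DFFF; under that reading noncharacters such as U+FFFE would still be allowed.

```python
from typing import Optional


def is_scalar_value(cp: int) -> bool:
    """True for any code point outside the surrogate block (hypothetical helper)."""
    return 0 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


def utf8_or_none(text: str) -> Optional[bytes]:
    """Encode text as UTF-8, or return None if it contains a lone surrogate."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


print(utf8_or_none("a\U0001F600b"))  # a supplementary character encodes fine
print(utf8_or_none("a\ud800b"))      # a lone surrogate does not
```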
I think we should think about this syntax as "in memory representation"
I loaded the string from somewhere, and it is now in memory.
So the encoding ends up being either a dedicated class that can support almost anything (including surrogates),
or whatever the tech stack uses (UTF-8 in Linux/Mac C/C++ strings; wchar_t, which is UTF-16 on Windows and UTF-32 on Linux; UTF-16 always in JS / Java).
We don't really care what is there, and we don't require it to be correct Unicode.
This would mean that we should not make this change.
I don't have a strong opinion either / or.
But we should understand the implications.
I have a slight preference for "anything goes, correct unicode or not". Coding units, not codepoints.
MessageFormat should not be in the business of validating / fixing incorrect surrogates.
So my approval for this PR does not mean "I like it", but "I am not against it" :-)
Hmmmm... Maybe we should have reached an agreement on this in an issue, before getting to a PR.
This PR is purely editorial, and I think entirely appropriate for prompting such discussion even if it ends up replaced before merging.
> I think we should think about this syntax as "in memory representation"
> I loaded the string from somewhere, and it is now in memory.
I strongly disagree. Syntax is something all parties must agree upon, and constrains both what can be communicated and how. It is not an in-memory representation (as evidenced by e.g. the \\ and \" escape sequences), and even if it were, that would still not absolve the need to address representation in all the various possibilities for getting it to memory.
> This would mean that we should not make this change.
For clarity, what change should we not make? The exclusion of surrogate code points (which I do support but have not included in this PR)?
> I have a slight preference for "anything goes, correct unicode or not". Coding units, not codepoints.
Now you have truly lost me, because code units only apply within an encoding (e.g., UTF-8 octets with special handling of leading and continuation bytes vs. UTF-16 hexadectets with special handling of surrogates).
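To make the code unit vs. code point distinction concrete, here is a short Python sketch. The helper `code_units` is hypothetical and only handles the three encodings named in its table. It shows one code point, U+1F600, becoming a different sequence of code units in each encoding.

```python
def code_units(text: str, encoding: str) -> list:
    """Split the encoded form of text into code units (hypothetical helper)."""
    # Width of one code unit, in bytes, for each supported encoding.
    width = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    data = text.encode(encoding)
    return [
        int.from_bytes(data[i:i + width], "little")
        for i in range(0, len(data), width)
    ]


for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, [hex(unit) for unit in code_units("\U0001F600", enc)])
```

The UTF-8 form has four code units, the UTF-16 form has two (a surrogate pair), and the UTF-32 form has one, so "code units" has no meaning until an encoding is chosen.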