Clarify the absolute nature of "any code point" #282
Is there a benefit of including such an explicit and exhaustive list here? It feels like it doesn't bring any real benefit, while creating a potential maintenance burden if a theoretical future change were to alter any of these ranges.
I'm not sure why we don't exclude the non-characters (`*FFFE` and `*FFFF` as well as the surrogates)? The other characters are just characters, but those shouldn't appear?
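To make the categories in the comment above concrete, here is a minimal Python sketch that classifies a code point as a surrogate or a noncharacter. The helper names are hypothetical and not part of the PR or the spec; the ranges follow the Unicode definitions (surrogates are U+D800 through U+DFFF, and noncharacters are U+FDD0 through U+FDEF plus the last two code points of every plane, which is what `*FFFE` and `*FFFF` refer to).

```python
def is_surrogate(cp: int) -> bool:
    """True for surrogate code points, U+D800 through U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for the Unicode noncharacters.

    These are U+FDD0..U+FDEF plus the last two code points of each of the
    17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
    """
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def count_matching(predicate) -> int:
    """Count code points in U+0000..U+10FFFF that satisfy the predicate."""
    return sum(1 for cp in range(0x110000) if predicate(cp))
```

Under these definitions, `count_matching(is_noncharacter)` gives 66 and `count_matching(is_surrogate)` gives 2048, so excluding both groups would remove a small, fixed set of code points.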
@eemeli The benefit of this list is specifically calling attention to easily-overlooked consequences of allowing such a broad collection of code points to be expressed literally, as perfectly demonstrated by @aphillips. And it's not even exhaustive; I intentionally left out potentially troublesome but text-oriented characters such as U+2067 RIGHT-TO-LEFT ISOLATE.
Okay, point taken. I do not have particularly strong feelings about the exact range we choose to include here, but appreciate that there may well be real-world considerations that we ought to take into account. This particular proposed change does clarify the current situation, and hopefully provides a better platform for later conversations about restricting it.
@gibson042 This could become complicated. It depends on whether we intend the syntax to be enforced by implementations, validated in tests, and the like. "Potentially troublesome" characters, such as RLI, are "garbage in/garbage out" (let the user beware). Unassigned characters can become assigned. Etc. But my comment was that non-character code points ought to be excluded (since we are doing text processing).
This looks like a job for `USVString`?
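For readers unfamiliar with `USVString`: Web IDL defines it as a string of Unicode scalar values, and conversion to it replaces unpaired surrogates with U+FFFD. The sketch below is only a rough Python analogue under one stated assumption: a Python `str` is a sequence of code points rather than UTF-16 code units, so any surrogate code point in it is treated as unpaired. The function name is hypothetical.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def to_scalar_values(text: str) -> str:
    """Replace every surrogate code point with U+FFFD.

    Assumption: in a Python str, a surrogate code point is never part of a
    valid pair, so each one is replaced individually.
    """
    return "".join(
        REPLACEMENT_CHARACTER if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )
```

For example, `to_scalar_values("a\ud800b")` yields `"a\ufffdb"`, while a string without surrogates is returned unchanged.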
For the record, I agree with this. A particular negative consequence of the current "anything goes" stance just occurred to me: any Message Format literal including a surrogate code point cannot be represented in UTF-8, which on its own seems to justify excluding at least the same range as USVString.
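The UTF-8 point can be checked directly. This is a small Python illustration, not anything from the PR; the helper name is hypothetical. Python's strict UTF-8 encoder rejects surrogate code points, whereas a noncharacter such as U+FFFE encodes without complaint.

```python
from typing import Optional


def utf8_or_none(text: str) -> Optional[bytes]:
    """Return the strict UTF-8 encoding of text, or None if it has none."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        # Raised for surrogate code points, which UTF-8 cannot represent.
        return None


lone_surrogate = utf8_or_none("\ud800")   # None
noncharacter = utf8_or_none("\ufffe")     # b"\xef\xbf\xbe"
```

So a literal containing a surrogate has no well-formed UTF-8 form, while noncharacters are a separate question of policy rather than of encodability.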
I think we should think about this syntax as an "in-memory representation":
I loaded the string from somewhere, and it is now in memory.
So the encoding ends up being either a dedicated class that can support almost anything (including surrogates),
or whatever the tech stack uses (UTF-8 in Linux/Mac C/C++ strings; wchar_t, which is UTF-16 on Windows and UTF-32 on Linux; UTF-16 always in JS / Java).
We don't really care what is there, and we don't require it to be correct Unicode.
This would mean that we should not make this change.
I don't have a strong opinion either way.
But we should understand the implications.
I have a slight preference for "anything goes, correct Unicode or not". Code units, not code points.
MessageFormat should not be in the business of validating / fixing incorrect surrogates.
So my approval for this PR does not mean "I like it", but "I am not against it" :-)
Hmmmm... Maybe we should have reached an agreement on this in an issue, before getting to a PR.
This PR is purely editorial, and I think entirely appropriate for prompting such discussion even if it ends up replaced before merging.
I strongly disagree. Syntax is something all parties must agree upon, and constrains both what can be communicated and how. It is not an in-memory representation (as evidenced by e.g. the `\\` and `\"` escape sequences), and even if it were, that would still not absolve the need to address representation in all the various possibilities for getting it to memory.
For clarity, what change should we not make? The exclusion of surrogate code points (which I do support but have not included in this PR)?
Now you have truly lost me, because code units only apply within an encoding (e.g., UTF-8 octets with special handling of leadings and continuations vs. UTF-16 hexadectets with special handling of surrogates).
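The distinction between code points and code units can be shown with a short Python sketch (the function name is hypothetical). A single code point stays one code point in every case, but the number of code units it occupies depends on the encoding form: 8-bit units in UTF-8, 16-bit units in UTF-16, and 32-bit units in UTF-32.

```python
def code_unit_counts(ch: str) -> dict:
    """Count the code units needed for one code point in each encoding form."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    return {
        "utf-8": len(ch.encode("utf-8")),            # 1 byte per unit
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 4 bytes per unit
    }


grinning_face = code_unit_counts("\U0001F600")
# {"utf-8": 4, "utf-16": 2, "utf-32": 1}
```

The two UTF-16 code units for U+1F600 are a surrogate pair, which is why surrogates are meaningful as code units inside UTF-16 but not as standalone code points in a syntax definition.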