Clarify the absolute nature of "any code point" #282

Merged on Jun 13, 2022 (3 commits)

gibson042
Collaborator

@gibson042 gibson042 commented Jun 9, 2022

Adds explicit mention of cases that are often overlooked.

Ref #268

Fixes unicode-org#268
spec/syntax.md (outdated, resolved)
@eemeli eemeli linked an issue Jun 9, 2022 that may be closed by this pull request
Collaborator

@eemeli eemeli left a comment

Looks good; some nitpicks below.

spec/syntax.md (resolved)
Comment on lines +449 to +454
This includes line-breaking characters (such as U+000A LINE FEED and U+000D CARRIAGE RETURN),
other control characters (such as U+0000 NULL and U+0009 TAB),
permanently reserved noncharacters (U+FDD0 through U+FDEF and U+<i>n</i>FFFE and U+<i>n</i>FFFF where <i>n</i> is 0x0 through 0x10),
surrogate code points (U+D800 through U+DFFF),
private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and U+100000 through U+10FFFD),
and unassigned code points.
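
To make the categories in the proposed text concrete, here is a minimal Python sketch that labels a code point with the first matching category. The function name `classify` and the label strings are illustrative assumptions, not part of the spec, and the "unassigned" result depends on the Unicode version bundled with the Python runtime. The sketch uses the full surrogate range, U+D800 through U+DFFF.

```python
import unicodedata


def classify(cp: int) -> str:
    """Return a coarse, illustrative label for a Unicode code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Surrogates: U+D800 through U+DFFF (high and low halves together).
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Co":
        return "private-use"
    if cp in (0x0A, 0x0D):
        return "line-breaking"
    if category == "Cc":
        return "control"
    if category == "Cn":
        # Reserved in the Unicode version known to this Python build.
        return "unassigned"
    return "other"


for cp in (0x0A, 0x00, 0xFFFE, 0xD800, 0xE000, 0x41):
    print(f"U+{cp:04X}: {classify(cp)}")
```

Every one of these labels, including "surrogate" and "noncharacter", is covered by "any code point" as the text currently stands.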
Collaborator

Is there a benefit of including such an explicit and exhaustive list here? It feels like it doesn't bring any real benefit, while creating a potential maintenance burden if a theoretical future change were to alter any of these ranges.

Member

I'm not sure why we don't exclude the non-characters (*FFFE and *FFFF) as well as the surrogates? The other characters are just characters, but those shouldn't appear?

Collaborator Author

@eemeli The benefit of this list is specifically calling attention to easily-overlooked consequences of allowing such a broad collection of code points to be expressed literally, as perfectly demonstrated by @aphillips. And it's not even exhaustive; I intentionally left out potentially troublesome but text-oriented characters such as U+2067 RIGHT-TO-LEFT ISOLATE.

Collaborator

Okay, point taken. I do not have particularly strong feelings about the exact range we choose to include here, but appreciate that there may well be real-world considerations that we ought to take into account. This particular proposed change does clarify the current situation, and hopefully provides a better platform for later conversations about restricting it.

Member

@gibson042 This could become complicated. It depends on whether we intend the syntax to be enforced by implementations, validated in tests, and the like. "Potentially troublesome" characters, such as RLI, are "garbage in/garbage out" (let the user beware). Unassigned characters can become assigned. Etc. But my comment was that non-character code points ought to be excluded (since we are doing text processing).

This looks like a job for USVString?

Collaborator Author

> my comment was that non-character code points ought to be excluded (since we are doing text processing)

For the record, I agree with this. A particular negative consequence of the current "anything goes" stance just occurred to me: any Message Format literal including a surrogate code point cannot be represented in UTF-8, which on its own seems to justify excluding at least the same range as USVString.
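
The UTF-8 point can be checked directly. The following is a small Python sketch, not anything taken from the spec: the `literal` value and the helper name `to_scalar_values` are hypothetical, and the helper only approximates USVString conversion by replacing every surrogate code point with U+FFFD.

```python
def to_scalar_values(text: str) -> str:
    """Replace each surrogate code point with U+FFFD (similar in spirit to USVString)."""
    return "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in text
    )


# Hypothetical literal text containing a surrogate code point.
literal = "a\ud800b"

try:
    literal.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    # Well-formed UTF-8 has no byte sequence for U+D800..U+DFFF.
    encodable = False

cleaned = to_scalar_values(literal)
print(encodable)                # False
print(cleaned.encode("utf-8"))  # b'a\xef\xbf\xbdb'
```

Python strings can hold surrogate code points in memory, but the strict UTF-8 codec refuses to serialize them, which is the interchange problem described above.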

Collaborator

I think we should think about this syntax as "in memory representation"
I loaded the string from somewhere, and it is now in memory.

So the encoding ends up being either a dedicated class that can support almost anything (including surrogates),
or whatever the tech stack uses (UTF-8 in Linux/Mac C/C++ strings; wchar_t, which is UTF-16 on Windows and UTF-32 on Linux; UTF-16 always in JS / Java).

We don't really care what is there, and we don't require it to be correct Unicode.

This would mean that we should not make this change.


I don't have a strong opinion either / or.
But we should understand the implications.


I have a slight preference for "anything goes, correct unicode or not". Coding units, not codepoints.
MessageFormat should not be in the business of validating / fixing incorrect surrogates.

So my approval for this PR does not mean "I like it", but "I am not against it" :-)

Collaborator

Hmmmm... Maybe we should have reached an agreement on this in an issue, before getting to a PR.

Collaborator Author

This PR is purely editorial, and I think entirely appropriate for prompting such discussion even if it ends up replaced before merging.

> I think we should think about this syntax as "in memory representation"
> I loaded the string from somewhere, and it is now in memory.

I strongly disagree. Syntax is something all parties must agree upon, and constrains both what can be communicated and how. It is not an in-memory representation (as evidenced by, e.g., the \\ and \" escape sequences), and even if it were, that would still not remove the need to address representation in all the various ways of getting it into memory.

> This would mean that we should not make this change.

For clarity, what change should we not make? The exclusion of surrogate code points (which I do support but have not included in this PR)?

> I have a slight preference for "anything goes, correct unicode or not". Coding units, not codepoints.

Now you have truly lost me, because code units only apply within an encoding (e.g., UTF-8 octets with special handling of lead and continuation bytes vs. UTF-16 hexadectets with special handling of surrogates).
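
As an illustration of why code units are encoding-specific, this short Python sketch counts the code units one code point needs in each encoding form. The helper name `code_unit_counts` is an assumption made for this example.

```python
def code_unit_counts(text: str) -> dict:
    """Count the code units needed for `text` in each Unicode encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit code units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# One code point, U+1F600, outside the Basic Multilingual Plane.
counts = code_unit_counts("\U0001F600")
print(counts)  # {'utf-8': 4, 'utf-16': 2, 'utf-32': 1}
```

The same single code point is four, two, or one code unit depending on the encoding, so a rule stated in code units has no meaning until an encoding is chosen.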

@eemeli eemeli merged commit 752dc44 into unicode-org:develop Jun 13, 2022
echeran pushed a commit that referenced this pull request Sep 20, 2022
Adds explicit mention of cases that are often overlooked.
Development

Successfully merging this pull request may close these issues.

Strings: Are all code points preserved?
5 participants