Strings: Are all code points preserved? #268

Closed
gibson042 opened this issue May 16, 2022 · 10 comments · Fixed by #282
Labels
syntax Issues related with MF Syntax

Comments

@gibson042
Collaborator

gibson042 commented May 16, 2022

Per develop syntax: Quoted Strings, the only code points that may not appear between a string's opening and closing " unescaped are " itself and \, which means that multi-line strings in particular are valid. So I'm considering the behavior of e.g.

$text = {"foo
bar

baz"}

[{$text: makeLineBreaksVisible}]

There are also the even thornier problems of control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters (U+FDD0 through U+FDEF and U+nFFFE and U+nFFFF where n is 0x0 through 0x10), and especially surrogate code points (U+D800 through U+DFFF), which cannot be encoded into well-formed UTF-8.
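As a rough illustration of why surrogates are the thorniest of these, here is a minimal Python sketch (not part of the original report) that checks which of the code points mentioned above a strict UTF-8 encoder accepts. The `samples` table and the `utf8_encodable` helper are hypothetical names chosen for this example.

```python
# Sketch: which of the code points discussed above can be encoded as UTF-8.
# The sample table and helper name are illustrative assumptions.
samples = {
    "U+0000 NULL": "\u0000",
    "U+0009 TAB": "\u0009",
    "U+FDD0 noncharacter": "\ufdd0",
    "U+FFFF noncharacter": "\uffff",
    "U+D800 surrogate": "\ud800",
    "U+DFFF surrogate": "\udfff",
}


def utf8_encodable(text: str) -> bool:
    """Return True if Python's strict UTF-8 codec can encode the text."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


results = {name: utf8_encodable(ch) for name, ch in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'encodable' if ok else 'not encodable'}")
```

Control characters and noncharacters pass through the encoder; only the surrogate code points are rejected.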

@gibson042 gibson042 changed the title Are all code points in strings preserved? Strings: Are all code points preserved? May 16, 2022
@eemeli eemeli added the syntax Issues related with MF Syntax label May 16, 2022
@markusicu
Member

I don't know what the question is here. If you provide a literal value that contains line feeds, then it contains line feeds. That may be desirable, or it may not be useful. But it's consistent with the behavior of other message formatting libraries, like printf() or ICU MessageFormat, or Java String.format() etc.

@gibson042
Collaborator Author

Many formats do not support multi-line string literals—their syntax does not permit e.g. unescaped ASCII control characters. Such constraints can simplify parsing, reporting (e.g. the line number of an error), and human comprehensibility, and also preclude compatibility issues from e.g. usually-invisible line feed vs. carriage return + line feed endings being distinguishable contents.

It's certainly possible to support literally any raw code point between enclosing "s other than the escape sequence initiator \, but if that is actually the intent here then I think it merits specific direct mention because it is not the case in many similar technologies.

@mihnita
Collaborator

mihnita commented Jun 8, 2022

Many formats do not support multi-line string literals—their syntax does not permit e.g. unescaped ASCII control characters.

But they allow for escaped ASCII controls.
I am not aware of any format (intended for localization) that doesn't do that.

This means it is all good. Escaping such things belongs in the file-specific rules (not in the MF2 syntax).

For example, Java .properties didn't support Unicode before JDK 9.
So one had to use \uXXXX escaping. And \n for newline.

Or you could use .xml files (also a native format for Java localization),
which support Unicode. And if one wanted to make some characters visible (for example BiDi controls, or nbsp), one used the &#xXXXX; or &#DDDDDD; escape, which is XML-specific. And xml:space="preserve" to not collapse spaces and newlines.

But all of these belong to the file storage layer.

So I think we should not concern ourselves with escape rules for the "raw, in memory string" that is parsed, after being loaded. We only need to escape characters that conflict with our own syntax.

In fact adding our own escapes for newline and such (if the syntax does not require it) can make things more difficult when storing the messages in existing l10n file formats.
Because we end up with double encoding, the file format one + MF2 one.
Same for mixing rules for space / newline collapsing, trimming spaces at the beginning / end, etc.

Separation of concerns...
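
A minimal Python sketch of the double-encoding concern described above. It assumes, purely hypothetically, a message syntax that spells a line break as backslash + n, and shows what happens when such a message is stored in JSON. The `message_source` and `stored` names are illustrative only.

```python
import json

# Hypothetical assumption: the message syntax itself requires "\n"
# (a backslash followed by the letter n) to represent a line break.
message_source = "first line\\nsecond line"  # contains a literal backslash

# Storing that message in a JSON container forces a second layer of escaping,
# because JSON must escape the backslash itself.
stored = json.dumps({"greeting": message_source})
print(stored)

# Loading the container undoes only the JSON layer; the message-level
# escape is still there for the message parser to handle.
loaded = json.loads(stored)["greeting"]
print(loaded == message_source)
```

The stored text contains two backslashes before the `n`, one per layer, which is the kind of stacking a translator or tool author would have to keep straight.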


Since we changed the syntax for literals from " to (...) the discussion changes a bit, but the principle remains the same.

gibson042 added a commit to gibson042/message-format-wg that referenced this issue Jun 9, 2022
Adds explicit mention of cases that are often overlooked.

Fixes unicode-org#268
@gibson042
Collaborator Author

I'm not sure why you mention technologies such as Java or XML, because this issue is specifically about Message Format syntax and corresponding behavior/semantics.

Regardless, I think my question is answered in the affirmative, and I have accordingly opened #282.

@eemeli eemeli linked a pull request Jun 9, 2022 that will close this issue
@mihnita
Collaborator

mihnita commented Jun 9, 2022

Sorry, I brought in Java / XML because I thought we were talking about file formats.

What threw me off was this:

Many formats do not support multi-line string literals—their syntax does not permit e.g. unescaped ASCII control characters

I think an example would help.

Because I can't think of any format that requires escaped ASCII for double quotes or newline,
as long as we talk about what I call "in memory format" and I assume you call "specifically about Message Format syntax".
These escapes (quotes / newline) are usually a file format issue (and that was my connection to XML and Java).


Note: if you want to close this discussion, I have nothing against it.

I've already approved the pull request #282.

I'm not opposing anything, I'm only trying to understand the comment.

@gibson042
Collaborator Author

Many formats do not support multi-line string literals—their syntax does not permit e.g. unescaped ASCII control characters

I think an example would help.

Because I can't think of any format that requires escaped ASCII for double quotes or newline.

The most obvious example is JSON requiring escaping for \, ", and all C0 control characters (0x00 through 0x1F, which includes both line feed and carriage return):

  string = quotation-mark *char quotation-mark

  char = unescaped /
      escape (
          %x22 /          ; "    quotation mark  U+0022
          %x5C /          ; \    reverse solidus U+005C
          %x2F /          ; /    solidus         U+002F
          %x62 /          ; b    backspace       U+0008
          %x66 /          ; f    form feed       U+000C
          %x6E /          ; n    line feed       U+000A
          %x72 /          ; r    carriage return U+000D
          %x74 /          ; t    tab             U+0009
          %x75 4HEXDIG )  ; uXXXX                U+XXXX

  escape = %x5C              ; \

  quotation-mark = %x22      ; "

  unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

To provide a concrete example, " " (U+0022 QUOTATION MARK, U+0009 CHARACTER TABULATION, U+0022 QUOTATION MARK) is invalid in JSON but valid in the current MessageFormat 2.0 develop branch.
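
The JSON half of that claim can be checked with a short Python sketch using the standard `json` module (which rejects raw control characters in strings by default). The `parses_as_json` helper is a hypothetical name for this example; the MessageFormat half is not exercised here.

```python
import json

raw_tab = '"\t"'       # QUOTATION MARK, literal U+0009, QUOTATION MARK
escaped_tab = '"\\t"'  # QUOTATION MARK, backslash, letter t, QUOTATION MARK


def parses_as_json(text: str) -> bool:
    """Return True if the text is accepted by the standard json parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_as_json(raw_tab))      # raw control character inside a string
print(parses_as_json(escaped_tab))  # same character written as an escape
```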

As long as we talk about what I call "in memory format" and I assume you call "specifically about Message Format syntax". These escapes (quotes / newline) are usually file format issue (and that was my connection to XML and Java)

I am talking about the format defined by spec/message.ebnf and described in spec/syntax.md, which per the latter document is expected to appear in many places (making this not a "file format issue"):

The syntax should make a single message easily embeddable inside many container formats:
.properties, YAML, XML, inlined as string literals in programming languages, etc.

Does that clarify things?

@mihnita
Collaborator

mihnita commented Jun 10, 2022

Thank you for the clarification.

Does that clarify things?

Yes and no :-)

Once we talk about JSON requiring escaping for \ and ", we have moved into file format territory.
Which is why I mentioned XML and Java.
In my mind I was talking at the same level as you.
That's why I was puzzled by the question "why did you bring in XML and Java": because you talk about JSON :-)

If we move the discussion to the level of what I call "in memory" then all this goes away.
When I parse a JSON document, the string in memory does not contain \", it contains "
And when I parse XML, the string in memory is again ", not &quot;

"In memory" is where the "real" MF2 syntax is, and the xml / json differences go away.
The MF2 parser (from string to data model) should not see these differences.
It should only escape things that interfere with its own syntax (like {).

That is the level where we should be at (I think) when we discuss the MF2 syntax
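
To make the "in memory" point concrete, here is a small Python sketch using only the standard library. The container snippets and variable names are made up for illustration; the idea is that the same message text comes out of two differently escaped containers.

```python
import json
import xml.etree.ElementTree as ET

# The same message stored in two containers, each with its own escaping.
json_container = '"say \\"hi\\""'                 # JSON escapes " as \"
xml_container = "<msg>say &quot;hi&quot;</msg>"   # XML may write " as &quot;

from_json = json.loads(json_container)
from_xml = ET.fromstring(xml_container).text

# After loading, the container-level escapes are gone and both strings match.
print(from_json)
print(from_json == from_xml)
```

Whatever the message parser does next, it starts from the identical unescaped string in both cases.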

@gibson042
Collaborator Author

gibson042 commented Jun 11, 2022

You may mean something different by "file format" than I do. Pulling straight from RFC 8259, JSON is "a text-based, language-independent data interchange format". That's also true of XML, and yes, of MF2—it's literally what's defined by the EBNF.

On the other hand, if by "in memory" you're referring to the structure and contents of local RAM for an implementation processing MF2 messages, then I don't see how that can possibly be in scope for specification. Strings can be UTF-8 or UTF-16 or UTF-32 or CESU-8 or other encodings, data structures can be byte-aligned in different ways (or not at all), pointers can be optimized for big-endian or little-endian or even mixed-endian processors... even bytes themselves need not be constrained to exactly 8 bits (and historically have not been). There are thousands of choices available for someone writing code to implement the required interfaces, semantics, and behavior of such a specification.

The data model is certainly in scope, as is the text-based format for representing instances thereof. And it's the latter that this issue concerns, in which escape sequences are necessary to represent at least characters that otherwise serve to indicate the framing and/or structure of message components (e.g., \\ and \"), and possibly others that are disallowed because including them causes problems for people and/or machines that read, write, or modify those representations (e.g., surrogate code points and control characters).

gibson042 added a commit to gibson042/message-format-wg that referenced this issue Jul 18, 2022
Surrogate code units (U+D800 through U+DBFF) cannot be encoded into UTF-8.
Ref unicode-org#268
@echeran
Collaborator

echeran commented Aug 11, 2022

PR #290 seems fine to me -- it is a minimal change that only restricts surrogates, which the Unicode spec defines as code points that cannot be assigned to any abstract character. So this change feels safe.

I happen to be plodding along through the Unicode book (aka "core spec"), and came across some verbiage that seems very relevant to the topic and to our previous discussions. I will include it here below so that we have it for future reference, and so that it stays with the overall discussion beyond PR #290.

Chapter 2, section 2.4, describes categories of code points (including Surrogates) and describes which of them can or cannot be interchanged, and which are or are not intended for interchange.

  • Surrogate code points cannot be conformantly interchanged using Unicode encoding forms. They do not correspond to Unicode scalar values and thus do not have well-formed representations in any Unicode encoding form. (See Section 3.8, Surrogates.)
  • Noncharacter code points are reserved for internal use, such as for sentinel values. They have well-formed representations in Unicode encoding forms and survive conversions between encoding forms. This allows sentinel values to be preserved internally across Unicode encoding forms, even though they are not designed to be used in open interchange.
  • All implementations need to preserve reserved code points because they may originate in implementations that use a future version of the Unicode Standard. For example...

Chapter 2, section 2.7, describes Unicode strings -- what is valid, how they are used in practice, and suggestions. Since the section is short, I reproduce it in its entirety below.

A Unicode string data type is simply an ordered sequence of code units. Thus a Unicode 8-bit string is an ordered sequence of 8-bit code units, a Unicode 16-bit string is an ordered sequence of 16-bit code units, and a Unicode 32-bit string is an ordered sequence of 32-bit code units.

Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. In normal processing, it can be far more efficient to allow such strings to contain code unit sequences that are not well-formed UTF-16, that is, isolated surrogates. Because strings are such a fundamental component of every program, checking for isolated surrogates in every operation that modifies strings can create significant overhead, especially because supplementary characters are extremely rare as a percentage of overall text in programs worldwide.

It is straightforward to design basic string manipulation libraries that handle isolated surrogates in a consistent and straightforward manner. They cannot ever be interpreted as abstract characters, but they can be internally handled the same way as noncharacters where they occur. Typically they occur only ephemerally, such as in dealing with keyboard events. While an ideal protocol would allow keyboard events to contain complete strings, many allow only a single UTF-16 code unit per event. As a sequence of events is transmitted to the application, a string that is being built up by the application in response to those events may contain isolated surrogates at any particular point in time.

Whenever such strings are specified to be in a particular Unicode encoding form—even one with the same code unit size—the string must not violate the requirements of that encoding form. For example, isolated surrogates in a Unicode 16-bit string are not allowed when that string is specified to be well-formed UTF-16. A number of techniques are available for dealing with an isolated surrogate, such as omitting it, converting it into U+FFFD replacement character to produce well-formed UTF-16, or simply halting the processing of the string with an error. (See Section 3.9, Unicode Encoding Forms.)
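
The three techniques named in that last paragraph (omitting, substituting U+FFFD, or halting with an error) can be sketched in Python roughly as follows. This is an illustration only, not anything proposed in the PR; all function names are hypothetical.

```python
def is_surrogate(ch: str) -> bool:
    """True for any code point in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF


def omit_surrogates(text: str) -> str:
    """Technique 1: drop surrogate code points."""
    return "".join(ch for ch in text if not is_surrogate(ch))


def replace_surrogates(text: str) -> str:
    """Technique 2: substitute U+FFFD REPLACEMENT CHARACTER."""
    return "".join("\ufffd" if is_surrogate(ch) else ch for ch in text)


def reject_surrogates(text: str) -> str:
    """Technique 3: halt processing with an error."""
    for index, ch in enumerate(text):
        if is_surrogate(ch):
            raise ValueError(f"surrogate U+{ord(ch):04X} at index {index}")
    return text


# A Python str is a sequence of code points, so a surrogate inside one is
# never combined with a neighbor into a supplementary character.
damaged = "a\ud800b"
print(omit_surrogates(damaged))
print(replace_surrogates(damaged).encode("utf-8"))
```

Which of the three is appropriate depends on the consumer; a syntax that forbids surrogates outright corresponds most closely to the third.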

eemeli pushed a commit that referenced this issue Aug 17, 2022
Surrogate code units (U+D800 through U+DBFF) cannot be encoded into UTF-8.

Ref #268
echeran pushed a commit that referenced this issue Sep 20, 2022
Surrogate code units (U+D800 through U+DBFF) cannot be encoded into UTF-8.

Ref #268
@stasm stasm added the resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. label Sep 23, 2022
@aphillips aphillips removed the resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. label Feb 27, 2023
@aphillips
Member

Resolved per 2023-02-27 call


7 participants