Strings: Are all code points preserved? #268
Comments
I don't know what the question is here. If you provide a literal value that contains line feeds, then it contains line feeds. That may be desirable, or it may not be useful. But it's consistent with the behavior of other message formatting libraries, like printf() or ICU MessageFormat, or Java String.format() etc. |
Many formats do not support multi-line string literals—their syntax does not permit e.g. unescaped ASCII control characters. Such constraints can simplify parsing, reporting (e.g. the line number of an error), and human comprehensibility, and also preclude compatibility issues from e.g. usually-invisible line feed vs. carriage return + line feed endings being distinguishable contents. It's certainly possible to support literally any raw code point between enclosing |
But they allow for escaped ASCII controls, so this is all good. Escaping such things belongs in the file-specific rules (not in the MF2 syntax). For example if Java Or you could use .xml files (also a native format for Java localization). But all of these are the file storage layer. So I think we should not concern ourselves with escape rules for the "raw, in-memory string" that is parsed after being loaded. We only need to escape characters that conflict with our own syntax. In fact, adding our own escapes for newline and such (if the syntax does not require them) can make things more difficult when storing the messages in existing l10n file formats. Separation of concerns... Since we changed the syntax for literals from |
Adds explicit mention of cases that are often overlooked. Fixes unicode-org#268
I'm not sure why you mention technologies such as Java or XML, because this issue is specifically about Message Format syntax and corresponding behavior/semantics. Regardless, I think my question is answered in the affirmative, and I have accordingly opened #282. |
Sorry, I brought in Java / XML because I thought we were talking about file formats. What threw me off was this:
I think an example would help, because I can't think of any format that requires escaped ASCII for double quotes or newline. Note: if you want to close this discussion, I have nothing against it. I've already approved pull request #282. I'm not opposing anything; I'm only trying to understand the comment. |
The most obvious example is JSON requiring escaping for
To provide a concrete example,
I am talking about the format defined by spec/message.ebnf and described in spec/syntax.md, which per the latter document is expected to appear in many places (making this not a "file format issue"):
Does that clarify things? |
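For reference, here is a minimal Python sketch of the JSON behavior mentioned above. It is only an illustration (the original example in this comment is not preserved here): RFC 8259 requires the quotation mark, the reverse solidus, and the control characters U+0000 through U+001F to be escaped inside a JSON string. The sample value and the helper name `raw_newline_rejected` are made up for this sketch.

```python
import json

# A message value containing double quotes and a line feed.
value = 'say "hi"\nbye'

# Serializing escapes both the quotes and the line feed.
encoded = json.dumps(value)

# Round-tripping restores the original code points exactly.
decoded = json.loads(encoded)


def raw_newline_rejected() -> bool:
    """Return True if the parser rejects an unescaped line feed in a JSON string."""
    try:
        # The Python literal below puts a real U+000A inside the JSON string.
        json.loads('"line one\nline two"')
    except json.JSONDecodeError:
        return True
    return False
```

So the container format decides which characters need escaping at its own layer, and the value that comes back out is unchanged.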
Thank you for the clarification.
Yes and no :-) Once we talk about json requiring escaping for If we move the discussion to the level of what I call "in memory", then all of this goes away. "In memory" is where the "real" MF2 syntax is, and the xml / json differences go away. That is the level we should be at (I think) when we discuss the MF2 syntax. |
You may mean something different by "file format" than I do. Pulling straight from RFC 8259, JSON is "a text-based, language-independent data interchange format". That's also true of XML, and yes, of MF2: it's literally what's defined by the EBNF. On the other hand, if by "in memory" you're referring to the structure and contents of local RAM for an implementation processing MF2 messages, then I don't see how that can possibly be in scope for specification. Strings can be UTF-8 or UTF-16 or UTF-32 or CESU-8 or other encodings, data structures can be byte-aligned in different ways (or not at all), pointers can be optimized for big-endian or little-endian or even mixed-endian processors... even bytes themselves need not be constrained to exactly 8 bits (and historically have not been). There are thousands of choices available for someone writing code to implement the required interfaces, semantics, and behavior of such a specification. The data model is certainly in scope, as is the text-based format for representing instances thereof. And it's the latter that this issue concerns, in which escape sequences are necessary to represent at least characters that otherwise serve to indicate the framing and/or structure of message components (e.g., |
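To make the point about in-memory representation concrete, here is a small Python sketch (an illustration only, with a made-up sample string): the same sequence of code points has different byte-level forms depending on the encoding an implementation happens to choose, while the code points themselves stay the same.

```python
# 'a' (U+0061), 'é' (U+00E9), and an emoji outside the BMP (U+1F600).
text = "a\u00e9\U0001F600"

utf8 = text.encode("utf-8")       # 1 + 2 + 4 bytes
utf16 = text.encode("utf-16-le")  # 2 + 2 + 4 bytes (the emoji is a surrogate pair)
utf32 = text.encode("utf-32-le")  # 4 bytes per code point

# The abstract content is identical regardless of encoding.
code_points = [ord(ch) for ch in text]
```

Decoding any of the three byte strings with its own codec gives back the same `text`, which is why a text-based syntax can be specified in terms of code points rather than bytes.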
Surrogate code units (U+D800 through U+DFFF) cannot be encoded into UTF-8. Ref unicode-org#268
PR #290 seems fine to me: it is a minimal change that only restricts surrogates, which the Unicode spec defines as code points that cannot be assigned to any abstract character. So this change feels safe. I happen to be plodding along through the Unicode book (aka "core spec"), and came across some verbiage that seems very relevant to the topic and to our previous discussions. I will include it below so that we have it for future reference, and so that it stays with the overall discussion beyond PR #290. Chapter 2, section 2.4, describes categories of code points (including surrogates) and which of them are or are not intended for interchange.
Chapter 2, section 2.7, describes Unicode strings -- what is valid, how they are used in practice, and suggestions. Since the section is short, I reproduce it in its entirety below.
|
Surrogate code units (U+D800 through U+DFFF) cannot be encoded into UTF-8. Ref #268
Resolved per 2023-02-27 call |
Per develop syntax: Quoted Strings, the only code points that may not appear unescaped between a string's opening and closing " are " itself and \, which means that multi-line strings in particular are valid. So I'm considering the behavior of e.g.

There's also the even thornier problem of control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters (U+FDD0 through U+FDEF, plus U+nFFFE and U+nFFFF where n is 0x0 through 0x10), and especially surrogate code points (U+D800 through U+DFFF), which are impossible to encode into UTF-8.
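The categories listed above can be sketched in Python as follows. This is a hedged illustration, not anything from the spec: the function names `classify` and `encodes_to_utf8` and the category labels are invented here, and the "needs-escape" rule reflects only the quoted-string syntax as described in this issue (the " and \ code points).

```python
def classify(cp: int) -> str:
    """Label a code point with one of the categories discussed in this issue."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF, plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    # C0 controls, DELETE, and C1 controls (general category Cc).
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return "control"
    # Assumption: only the quote and the backslash conflict with the string syntax.
    if cp in (0x22, 0x5C):
        return "needs-escape"
    return "other"


def encodes_to_utf8(cp: int) -> bool:
    """Return True if Python's strict UTF-8 codec can encode this code point."""
    try:
        chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True
```

With these helpers, controls and noncharacters such as U+0000 and U+FFFE still encode to UTF-8, while every surrogate from U+D800 through U+DFFF is rejected by the strict codec, which is what sets surrogates apart from the other two groups.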