From 461555f4b44524909e03e23fc717925de3e35264 Mon Sep 17 00:00:00 2001 From: Mihai Nita Date: Sun, 20 Oct 2024 13:20:29 -0700 Subject: [PATCH 01/10] Allow surrogates in content, issue #895 --- spec/appendices.md | 12 ++++++++---- spec/message.abnf | 3 +-- spec/syntax.md | 11 ++++++----- 3 files changed, 15 insertions(+), 11 deletions(-) diff --git a/spec/appendices.md b/spec/appendices.md index e94544596..2f5d6143a 100644 --- a/spec/appendices.md +++ b/spec/appendices.md @@ -14,17 +14,21 @@ host environments, their serializations and resource formats, that might be sufficient to prevent most problems. However, MessageFormat itself does not supply such a restriction. -MessageFormat _messages_ permit nearly all Unicode code points, -with the exception of surrogates, +MessageFormat _messages_ permit nearly all Unicode code points to appear in _literals_, including the text portions of a _pattern_. This means that it can be possible for a _message_ to contain invisible characters -(such as bidirectional controls, -ASCII control characters in the range U+0000 to U+001F, +(such as bidirectional controls, ASCII control characters in the range U+0000 to U+001F, or characters that might be interpreted as escapes or syntax in the host format) that abnormally affect the display of the _message_ when viewed as source code, or in resource formats or translation tools, but do not generate errors from MessageFormat parsers or processing APIs. +The localizable elements of a message (text and string literals) allow the presence of +unpaired surrogates (U+D800 to U+DFFF). This is for compatibility with existing formats +that are agnostic about them. \ +But their presence of unpaired surrogates is likely an indication of mistakes or bad tooling. +Their use is not recommended, and linting (if present) can be used to prevent them. + Bidirectional text containing right-to-left characters (such as used for Arabic or Hebrew) also poses a potential source of confusion for users. 
Since MessageFormat 2.0's syntax makes use of diff --git a/spec/message.abnf b/spec/message.abnf index 8ab7b5b23..a9293040c 100644 --- a/spec/message.abnf +++ b/spec/message.abnf @@ -76,8 +76,7 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A) / %x41-5B ; omit \ (%x5C) / %x5D-7A ; omit { | } (%x7B-7D) / %x7E-2FFF ; omit IDEOGRAPHIC SPACE (%x3000) - / %x3001-D7FF ; omit surrogates - / %xE000-10FFFF + / %x3001-10FFFF ; allowing surrogates is intentional ; Character escapes escaped-char = backslash ( backslash / "{" / "|" / "}" ) diff --git a/spec/syntax.md b/spec/syntax.md index a31c3f921..0cfd75542 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -60,7 +60,8 @@ The syntax specification takes into account the following design restrictions: control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters (U+FDD0 through U+FDEF and U+nFFFE and U+nFFFF where n is 0x0 through 0x10), private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and - U+100000 through U+10FFFD), unassigned code points, and other potentially confusing content. + U+100000 through U+10FFFD), unassigned code points, unpaired surrogates in messages and literals + only (U+D800 through U+DFFF), and other potentially confusing content. ## Messages and their Syntax @@ -274,8 +275,9 @@ A _quoted pattern_ MAY be empty. ### Text **_text_** is the translateable content of a _pattern_. -Any Unicode code point is allowed, except for U+0000 NULL -and the surrogate code points U+D800 through U+DFFF inclusive. +Any Unicode code point is allowed, except for U+0000 NULL. +Unpaired surrogates code points (U+D800 through U+DFFF inclusive) are allowed +in localizable elements, but using them is likely a mistake and not recommended. The characters U+005C REVERSE SOLIDUS `\`, U+007B LEFT CURLY BRACKET `{`, and U+007D RIGHT CURLY BRACKET `}` MUST be escaped as `\\`, `\{`, and `\}` respectively. 
@@ -691,8 +693,7 @@ A _literal_ can appear as a _key_ value, as the _operand_ of a _literal-expression_, or in the value of an _option_. -A _literal_ MAY include any Unicode code point -except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF. +A _literal_ MAY include any Unicode code point except for U+0000 NULL. All code points are preserved. From 617e39d4d2b7567d7727134c99cc6ef1947560c2 Mon Sep 17 00:00:00 2001 From: Mihai Nita Date: Sun, 20 Oct 2024 17:47:43 -0700 Subject: [PATCH 02/10] Grammar and typos, linkify terms, make into a note, and fix 2119 keywords Thanks Addison! Co-authored-by: Addison Phillips --- spec/appendices.md | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/spec/appendices.md b/spec/appendices.md index 2f5d6143a..417c7e6db 100644 --- a/spec/appendices.md +++ b/spec/appendices.md @@ -23,11 +23,21 @@ that abnormally affect the display of the _message_ when viewed as source code, or in resource formats or translation tools, but do not generate errors from MessageFormat parsers or processing APIs. -The localizable elements of a message (text and string literals) allow the presence of -unpaired surrogates (U+D800 to U+DFFF). This is for compatibility with existing formats -that are agnostic about them. \ -But their presence of unpaired surrogates is likely an indication of mistakes or bad tooling. -Their use is not recommended, and linting (if present) can be used to prevent them. +> [!IMPORTANT] +> _Text_ and _literals_ allow unpaired surrogate code points +> (`U+D800` to `U+DFFF`). +> This is for compatibility with formats or data structures +> that use the UTF-16 encoding +> and do not check for unpaired surrogates. +> (Strings in Java or JavaScript are examples of this.) +> These code points SHOULD NOT be used in a _message_. +> Unpaired surrogate code points are likely an indication of mistakes +> or errors in the creation, serialization, or processing of the _message_. 
+> Many processes will convert them to +> � U+FFFD REPLACEMENT CHARACTER +> during processing or display. +> Implementations not based on UTF-16 might not be able to represent +> a _message_ containing such code points. Bidirectional text containing right-to-left characters (such as used for Arabic or Hebrew) also poses a potential source of confusion for users. From 41ed95962e0ee39d2a2d49ba9841b6be18495cef Mon Sep 17 00:00:00 2001 From: Mihai Nita Date: Sun, 20 Oct 2024 17:50:00 -0700 Subject: [PATCH 03/10] Not using "localizable elements" Co-authored-by: Addison Phillips --- spec/syntax.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/spec/syntax.md b/spec/syntax.md index 0cfd75542..f90f6e18e 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -276,8 +276,10 @@ A _quoted pattern_ MAY be empty. **_text_** is the translateable content of a _pattern_. Any Unicode code point is allowed, except for U+0000 NULL. -Unpaired surrogates code points (U+D800 through U+DFFF inclusive) are allowed -in localizable elements, but using them is likely a mistake and not recommended. +> [!NOTE] +> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive) +> are allowed for compatibility with UTF-16 based implementations +> that do not check for this encoding error. The characters U+005C REVERSE SOLIDUS `\`, U+007B LEFT CURLY BRACKET `{`, and U+007D RIGHT CURLY BRACKET `}` MUST be escaped as `\\`, `\{`, and `\}` respectively. 
From 6fc5d68c13d0f79f9533fb7eca702a4ff3df1135 Mon Sep 17 00:00:00 2001 From: Mihai Nita Date: Sun, 20 Oct 2024 17:55:07 -0700 Subject: [PATCH 04/10] Keep syntax.md in sync with message.abnf --- spec/syntax.md | 1 - 1 file changed, 1 deletion(-) diff --git a/spec/syntax.md b/spec/syntax.md index f90f6e18e..321be1a89 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -305,7 +305,6 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A) / %x41-5B ; omit \ (%x5C) / %x5D-7A ; omit { | } (%x7B-7D) / %x7E-2FFF ; omit IDEOGRAPHIC SPACE (%x3000) - / %x3001-D7FF ; omit surrogates / %xE000-10FFFF ``` From 4a26e5dfbd1ed8d6bb48fd6a5625ee6ef7f3a66f Mon Sep 17 00:00:00 2001 From: Mihai Nita Date: Sun, 20 Oct 2024 18:03:56 -0700 Subject: [PATCH 05/10] Added note about surrogates to quoted literals --- spec/appendices.md | 2 +- spec/syntax.md | 11 ++++++++--- 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/spec/appendices.md b/spec/appendices.md index 417c7e6db..4d3e22a5e 100644 --- a/spec/appendices.md +++ b/spec/appendices.md @@ -24,7 +24,7 @@ when viewed as source code, or in resource formats or translation tools, but do not generate errors from MessageFormat parsers or processing APIs. > [!IMPORTANT] -> _Text_ and _literals_ allow unpaired surrogate code points +> _Text_ and _quoted literals_ allow unpaired surrogate code points > (`U+D800` to `U+DFFF`). 
> This is for compatibility with formats or data structures > that use the UTF-16 encoding diff --git a/spec/syntax.md b/spec/syntax.md index 321be1a89..74c546a88 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -60,8 +60,8 @@ The syntax specification takes into account the following design restrictions: control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters (U+FDD0 through U+FDEF and U+nFFFE and U+nFFFF where n is 0x0 through 0x10), private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and - U+100000 through U+10FFFD), unassigned code points, unpaired surrogates in messages and literals - only (U+D800 through U+DFFF), and other potentially confusing content. + U+100000 through U+10FFFD), unassigned code points, unpaired surrogates in messages and + quoted literals only (U+D800 through U+DFFF), and other potentially confusing content. ## Messages and their Syntax @@ -305,7 +305,7 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A) / %x41-5B ; omit \ (%x5C) / %x5D-7A ; omit { | } (%x7B-7D) / %x7E-2FFF ; omit IDEOGRAPHIC SPACE (%x3000) - / %xE000-10FFFF + / %x3001-10FFFF ; allowing surrogates is intentional ``` When a _pattern_ is quoted by embedding the _pattern_ in curly brackets, the @@ -716,6 +716,11 @@ A **_quoted literal_** begins and ends with U+005E VERTICAL BAR `|`. The characters `\` and `|` within a _quoted literal_ MUST be escaped as `\\` and `\|`. +> [!NOTE] +> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive) +> are allowed in quoted literals for compatibility with UTF-16 based +> implementations that do not check for this encoding error. + An **_unquoted literal_** is a _literal_ that does not require the `|` quotes around it to be distinct from the rest of the _message_ syntax. 
An _unquoted literal_ MAY be used when the content of the _literal_ From d7de9debf7b00fe24ad29d5caf4e3d6325225538 Mon Sep 17 00:00:00 2001 From: Mihai Nita Date: Mon, 21 Oct 2024 16:52:02 -0700 Subject: [PATCH 06/10] Moved the note about surrogates from Security Considerations to The Message --- spec/appendices.md | 16 ---------------- spec/syntax.md | 16 ++++++++++++++++ 2 files changed, 16 insertions(+), 16 deletions(-) diff --git a/spec/appendices.md b/spec/appendices.md index 4d3e22a5e..b65036c6c 100644 --- a/spec/appendices.md +++ b/spec/appendices.md @@ -23,22 +23,6 @@ that abnormally affect the display of the _message_ when viewed as source code, or in resource formats or translation tools, but do not generate errors from MessageFormat parsers or processing APIs. -> [!IMPORTANT] -> _Text_ and _quoted literals_ allow unpaired surrogate code points -> (`U+D800` to `U+DFFF`). -> This is for compatibility with formats or data structures -> that use the UTF-16 encoding -> and do not check for unpaired surrogates. -> (Strings in Java or JavaScript are examples of this.) -> These code points SHOULD NOT be used in a _message_. -> Unpaired surrogate code points are likely an indication of mistakes -> or errors in the creation, serialization, or processing of the _message_. -> Many processes will convert them to -> � U+FFFD REPLACEMENT CHARACTER -> during processing or display. -> Implementations not based on UTF-16 might not be able to represent -> a _message_ containing such code points. - Bidirectional text containing right-to-left characters (such as used for Arabic or Hebrew) also poses a potential source of confusion for users. 
Since MessageFormat 2.0's syntax makes use of diff --git a/spec/syntax.md b/spec/syntax.md index 74c546a88..0fc72c381 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -114,6 +114,22 @@ A **_local variable_** is a _variable_ created as the result of a _lo > In particular, it avoids using quote characters common to many file formats and formal languages > so that these do not need to be escaped in the body of a _message_. +> [!NOTE] +> _Text_ and _quoted literals_ allow unpaired surrogate code points +> (`U+D800` to `U+DFFF`). +> This is for compatibility with formats or data structures +> that use the UTF-16 encoding +> and do not check for unpaired surrogates. +> (Strings in Java or JavaScript are examples of this.) +> These code points SHOULD NOT be used in a _message_. +> Unpaired surrogate code points are likely an indication of mistakes +> or errors in the creation, serialization, or processing of the _message_. +> Many processes will convert them to +> � U+FFFD REPLACEMENT CHARACTER +> during processing or display. +> Implementations not based on UTF-16 might not be able to represent +> a _message_ containing such code points. + > [!NOTE] > In general (and except where required by the syntax), whitespace carries no meaning in the structure > of a _message_. While many of the examples in this spec are written on multiple lines, the formatting From 16b90b30535bf9a838c69ee7c446a6aef6a1ebfc Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Mon, 21 Oct 2024 16:54:13 -0700 Subject: [PATCH 07/10] Update spec/syntax.md --- spec/syntax.md | 1 + 1 file changed, 1 insertion(+) diff --git a/spec/syntax.md b/spec/syntax.md index 0fc72c381..b36f582a0 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -292,6 +292,7 @@ A _quoted pattern_ MAY be empty. **_text_** is the translateable content of a _pattern_. Any Unicode code point is allowed, except for U+0000 NULL. 
+ > [!NOTE] > Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive) > are allowed for compatibility with UTF-16 based implementations From 07d6fe55815a03f61ce6bdb5b4b179b3d5ef8d65 Mon Sep 17 00:00:00 2001 From: Addison Phillips Date: Mon, 21 Oct 2024 16:55:13 -0700 Subject: [PATCH 08/10] Update spec/syntax.md --- spec/syntax.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spec/syntax.md b/spec/syntax.md index b36f582a0..6fe046508 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -735,7 +735,7 @@ escaped as `\\` and `\|`. > [!NOTE] > Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive) -> are allowed in quoted literals for compatibility with UTF-16 based +> are allowed in _quoted literals_ for compatibility with UTF-16 based > implementations that do not check for this encoding error. An **_unquoted literal_** is a _literal_ that does not require the `|` From 17f2bbc5f36829cc78563673d0a41c477fdfbbff Mon Sep 17 00:00:00 2001 From: Mihai Nita Date: Mon, 21 Oct 2024 16:55:09 -0700 Subject: [PATCH 09/10] Italicize in a couple of places --- spec/syntax.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/spec/syntax.md b/spec/syntax.md index 6fe046508..2817e2658 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -61,7 +61,7 @@ The syntax specification takes into account the following design restrictions: (U+FDD0 through U+FDEF and U+nFFFE and U+nFFFF where n is 0x0 through 0x10), private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and U+100000 through U+10FFFD), unassigned code points, unpaired surrogates in messages and - quoted literals only (U+D800 through U+DFFF), and other potentially confusing content. + _quoted literals_ only (U+D800 through U+DFFF), and other potentially confusing content. 
 ## Messages and their Syntax From b7877a4df9d17cbcd8d21c5be7be3dedbf879dfe Mon Sep 17 00:00:00 2001 From: Mihai Nita Date: Mon, 21 Oct 2024 17:02:01 -0700 Subject: [PATCH 10/10] Implemented more (all?) feedback from review --- spec/syntax.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/spec/syntax.md b/spec/syntax.md index 2817e2658..38725a053 100644 --- a/spec/syntax.md +++ b/spec/syntax.md @@ -60,8 +60,8 @@ The syntax specification takes into account the following design restrictions: control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters (U+FDD0 through U+FDEF and U+nFFFE and U+nFFFF where n is 0x0 through 0x10), private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and - U+100000 through U+10FFFD), unassigned code points, unpaired surrogates in messages and - _quoted literals_ only (U+D800 through U+DFFF), and other potentially confusing content. + U+100000 through U+10FFFD), unassigned code points, unpaired surrogates (U+D800 through U+DFFF), + and other potentially confusing content. ## Messages and their Syntax @@ -293,10 +293,6 @@ A _quoted pattern_ MAY be empty. **_text_** is the translateable content of a _pattern_. Any Unicode code point is allowed, except for U+0000 NULL. -> [!NOTE] -> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive) -> are allowed for compatibility with UTF-16 based implementations -> that do not check for this encoding error. The characters U+005C REVERSE SOLIDUS `\`, U+007B LEFT CURLY BRACKET `{`, and U+007D RIGHT CURLY BRACKET `}` MUST be escaped as `\\`, `\{`, and `\}` respectively. @@ -325,6 +321,11 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A) / %x3001-10FFFF ; allowing surrogates is intentional ``` +> [!NOTE] +> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive) +> are allowed for compatibility with UTF-16 based implementations +> that do not check for this encoding error. 
+ When a _pattern_ is quoted by embedding the _pattern_ in curly brackets, the resulting _message_ can be embedded into various formats regardless of the container's whitespace trimming rules.
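Reviewer note: since this series makes the grammar accept surrogate code points while the accompanying note says they SHOULD NOT be used in a _message_, tooling that wants to flag them has to do so outside the parser. The following is a minimal sketch of such a lint check, not part of the spec or of any MessageFormat implementation. The function names `utf16_code_units` and `find_unpaired_surrogates` are hypothetical, and the sketch assumes the message source is available as a Python `str` and should be judged by UTF-16 pairing rules.

```python
import struct


def utf16_code_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text, keeping lone surrogates.

    The "surrogatepass" error handler lets the codec encode surrogate
    code points instead of raising UnicodeEncodeError.
    """
    data = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def find_unpaired_surrogates(units: list[int]) -> list[int]:
    """Return the indices of surrogate code units that have no partner."""
    unpaired = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:  # high (lead) surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            unpaired.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:  # low (trail) surrogate with no lead
            unpaired.append(i)
        i += 1
    return unpaired
```

A linter could call `find_unpaired_surrogates(utf16_code_units(source))` on the _text_ and _quoted literal_ portions of a message and report a warning, rather than a syntax error, for each index returned, which matches the "allowed but discouraged" status these patches give such code points.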