From 461555f4b44524909e03e23fc717925de3e35264 Mon Sep 17 00:00:00 2001
From: Mihai Nita <nmihai_2000@yahoo.com>
Date: Sun, 20 Oct 2024 13:20:29 -0700
Subject: [PATCH 01/10] Allow surrogates in content, issue #895

---
 spec/appendices.md | 12 ++++++++----
 spec/message.abnf  |  3 +--
 spec/syntax.md     | 11 ++++++-----
 3 files changed, 15 insertions(+), 11 deletions(-)
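
Reviewer note (not part of the applied diff): the sketch below is one way to
check what the `content-char` change in spec/message.abnf actually admits.
It models only the tail ranges visible in the hunk, and the names
`OLD_TAIL`, `NEW_TAIL`, `in_ranges`, and `newly_allowed` are hypothetical
helpers chosen for illustration, not anything defined by the spec.

```python
# Tail of the content-char production before and after this patch.
# Only the ranges shown in the hunk are modeled (assumption: the earlier
# alternatives of the production are unchanged and irrelevant here).
OLD_TAIL = [(0x3001, 0xD7FF), (0xE000, 0x10FFFF)]
NEW_TAIL = [(0x3001, 0x10FFFF)]


def in_ranges(cp, ranges):
    """Return True if code point cp falls inside any inclusive (lo, hi) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def newly_allowed():
    """List the code points accepted by NEW_TAIL but rejected by OLD_TAIL."""
    return [
        cp
        for cp in range(0x3001, 0x110000)
        if in_ranges(cp, NEW_TAIL) and not in_ranges(cp, OLD_TAIL)
    ]


added = newly_allowed()
print(f"U+{added[0]:04X}..U+{added[-1]:04X} ({len(added)} code points)")
```

Under these assumptions the only difference between the two grammars is the
surrogate block, which matches the intent stated in the ABNF comment.
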
diff --git a/spec/appendices.md b/spec/appendices.md
index e94544596..2f5d6143a 100644
--- a/spec/appendices.md
+++ b/spec/appendices.md
@@ -14,17 +14,21 @@ host environments, their serializations and resource formats,
 that might be sufficient to prevent most problems.
 However, MessageFormat itself does not supply such a restriction.
 
-MessageFormat _messages_ permit nearly all Unicode code points,
-with the exception of surrogates, 
+MessageFormat _messages_ permit nearly all Unicode code points
 to appear in _literals_, including the text portions of a _pattern_.
 This means that it can be possible for a _message_ to contain invisible characters
-(such as bidirectional controls, 
-ASCII control characters in the range U+0000 to U+001F,
+(such as bidirectional controls, ASCII control characters in the range U+0000 to U+001F,
 or characters that might be interpreted as escapes or syntax in the host format)
 that abnormally affect the display of the _message_
 when viewed as source code, or in resource formats or translation tools,
 but do not generate errors from MessageFormat parsers or processing APIs.
 
+The localizable elements of a message (text and string literals) allow the presence of
+unpaired surrogates (U+D800 to U+DFFF). This is for compatibility with existing formats
+that are agnostic about them. \
+But their presence of unpaired surrogates is likely an indication of mistakes or bad tooling.
+Their use is not recommended, and linting (if present) can be used to prevent them.
+
 Bidirectional text containing right-to-left characters (such as used for Arabic or Hebrew) 
 also poses a potential source of confusion for users. 
 Since MessageFormat 2.0's syntax makes use of 
diff --git a/spec/message.abnf b/spec/message.abnf
index 8ab7b5b23..a9293040c 100644
--- a/spec/message.abnf
+++ b/spec/message.abnf
@@ -76,8 +76,7 @@ content-char      = %x01-08        ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
                   / %x41-5B        ; omit \ (%x5C)
                   / %x5D-7A        ; omit { | } (%x7B-7D)
                   / %x7E-2FFF      ; omit IDEOGRAPHIC SPACE (%x3000)
-                  / %x3001-D7FF    ; omit surrogates
-                  / %xE000-10FFFF
+                  / %x3001-10FFFF  ; allowing surrogates is intentional
 
 ; Character escapes
 escaped-char = backslash ( backslash / "{" / "|" / "}" )
diff --git a/spec/syntax.md b/spec/syntax.md
index a31c3f921..0cfd75542 100644
--- a/spec/syntax.md
+++ b/spec/syntax.md
@@ -60,7 +60,8 @@ The syntax specification takes into account the following design restrictions:
    control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters
    (U+FDD0 through U+FDEF and U+<i>n</i>FFFE and U+<i>n</i>FFFF where <i>n</i> is 0x0 through 0x10),
    private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and
-   U+100000 through U+10FFFD), unassigned code points, and other potentially confusing content.
+   U+100000 through U+10FFFD), unassigned code points, unpaired surrogates in messages and literals
+   only (U+D800 through U+DFFF), and other potentially confusing content.
 
 ## Messages and their Syntax
 
@@ -274,8 +275,9 @@ A _quoted pattern_ MAY be empty.
 ### Text
 
 **_<dfn>text</dfn>_** is the translateable content of a _pattern_.
-Any Unicode code point is allowed, except for U+0000 NULL
-and the surrogate code points U+D800 through U+DFFF inclusive.
+Any Unicode code point is allowed, except for U+0000 NULL.
+Unpaired surrogates code points (U+D800 through U+DFFF inclusive) are allowed
+in localizable elements, but using them is likely a mistake and not recommended.
 The characters U+005C REVERSE SOLIDUS `\`,
 U+007B LEFT CURLY BRACKET `{`, and U+007D RIGHT CURLY BRACKET `}`
 MUST be escaped as `\\`, `\{`, and `\}` respectively.
@@ -691,8 +693,7 @@ A _literal_ can appear
 as a _key_ value,
 as the _operand_ of a _literal-expression_,
 or in the value of an _option_.
-A _literal_ MAY include any Unicode code point
-except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF.
+A _literal_ MAY include any Unicode code point except for U+0000 NULL.
 
 All code points are preserved.
 

From 617e39d4d2b7567d7727134c99cc6ef1947560c2 Mon Sep 17 00:00:00 2001
From: Mihai Nita <nmihai_2000@yahoo.com>
Date: Sun, 20 Oct 2024 17:47:43 -0700
Subject: [PATCH 02/10] Grammar and typos, linkify terms, make into a note, and
 fix 2119 keywords

Thanks Addison!

Co-authored-by: Addison Phillips <addisonI18N@gmail.com>
---
 spec/appendices.md | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)
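
Reviewer note (not part of the applied diff): the note added below says that
unpaired surrogates usually signal an upstream error, that many processes
replace them with U+FFFD, and that implementations not based on UTF-16 might
not be able to represent them. A minimal sketch of those three points follows,
assuming a message is available as a list of UTF-16 code units. The function
names are hypothetical and are not part of any MessageFormat API.

```python
def find_unpaired_surrogates(units):
    """Return the indices of unpaired surrogates in a list of UTF-16 code units."""
    bad = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: well-formed only if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate that was not consumed as part of a pair.
            bad.append(i)
        i += 1
    return bad


def replace_unpaired(units):
    """Replace each unpaired surrogate with U+FFFD REPLACEMENT CHARACTER."""
    bad = set(find_unpaired_surrogates(units))
    return [0xFFFD if i in bad else u for i, u in enumerate(units)]


def utf8_encodable(text):
    """Return True if text can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# 'H', a valid surrogate pair, a lone high surrogate, 'i', a lone low surrogate
sample = [0x48, 0xD83D, 0xDE00, 0xD800, 0x69, 0xDC00]
print(find_unpaired_surrogates(sample))
print([f"{u:04X}" for u in replace_unpaired(sample)])
print(utf8_encodable("a\ud800b"))
```

A linter could use a check like `find_unpaired_surrogates` to flag such
messages, while a parser that follows the updated grammar would still accept
them.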

diff --git a/spec/appendices.md b/spec/appendices.md
index 2f5d6143a..417c7e6db 100644
--- a/spec/appendices.md
+++ b/spec/appendices.md
@@ -23,11 +23,21 @@ that abnormally affect the display of the _message_
 when viewed as source code, or in resource formats or translation tools,
 but do not generate errors from MessageFormat parsers or processing APIs.
 
-The localizable elements of a message (text and string literals) allow the presence of
-unpaired surrogates (U+D800 to U+DFFF). This is for compatibility with existing formats
-that are agnostic about them. \
-But their presence of unpaired surrogates is likely an indication of mistakes or bad tooling.
-Their use is not recommended, and linting (if present) can be used to prevent them.
+> [!IMPORTANT]
+> _Text_ and _literals_ allow unpaired surrogate code points
+> (`U+D800` to `U+DFFF`).
+> This is for compatibility with formats or data structures 
+> that use the UTF-16 encoding 
+> and do not check for unpaired surrogates.
+> (Strings in Java or JavaScript are examples of this.)
+> These code points SHOULD NOT be used in a _message_.
+> Unpaired surrogate code points are likely an indication of mistakes
+> or errors in the creation, serialization, or processing of the _message_.
+> Many processes will convert them to 
+> &#xfffd; U+FFFD REPLACEMENT CHARACTER
+> during processing or display.
+> Implementations not based on UTF-16 might not be able to represent
+> a _message_ containing such code points.
 
 Bidirectional text containing right-to-left characters (such as used for Arabic or Hebrew) 
 also poses a potential source of confusion for users. 

From 41ed95962e0ee39d2a2d49ba9841b6be18495cef Mon Sep 17 00:00:00 2001
From: Mihai Nita <nmihai_2000@yahoo.com>
Date: Sun, 20 Oct 2024 17:50:00 -0700
Subject: [PATCH 03/10] Not using "localizable elements"

Co-authored-by: Addison Phillips <addisonI18N@gmail.com>
---
 spec/syntax.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/spec/syntax.md b/spec/syntax.md
index 0cfd75542..f90f6e18e 100644
--- a/spec/syntax.md
+++ b/spec/syntax.md
@@ -276,8 +276,10 @@ A _quoted pattern_ MAY be empty.
 
 **_<dfn>text</dfn>_** is the translateable content of a _pattern_.
 Any Unicode code point is allowed, except for U+0000 NULL.
-Unpaired surrogates code points (U+D800 through U+DFFF inclusive) are allowed
-in localizable elements, but using them is likely a mistake and not recommended.
+> [!NOTE]
+> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
+> are allowed for compatibility with UTF-16 based implementations
+> that do not check for this encoding error.
 The characters U+005C REVERSE SOLIDUS `\`,
 U+007B LEFT CURLY BRACKET `{`, and U+007D RIGHT CURLY BRACKET `}`
 MUST be escaped as `\\`, `\{`, and `\}` respectively.

From 6fc5d68c13d0f79f9533fb7eca702a4ff3df1135 Mon Sep 17 00:00:00 2001
From: Mihai Nita <nmihai_2000@yahoo.com>
Date: Sun, 20 Oct 2024 17:55:07 -0700
Subject: [PATCH 04/10] Keep syntax.md in sync with message.abnf

---
 spec/syntax.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/spec/syntax.md b/spec/syntax.md
index f90f6e18e..321be1a89 100644
--- a/spec/syntax.md
+++ b/spec/syntax.md
@@ -305,7 +305,6 @@ content-char      = %x01-08        ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
                   / %x41-5B        ; omit \ (%x5C)
                   / %x5D-7A        ; omit { | } (%x7B-7D)
                   / %x7E-2FFF      ; omit IDEOGRAPHIC SPACE (%x3000)
-                  / %x3001-D7FF    ; omit surrogates
                   / %xE000-10FFFF
 ```
 

From 4a26e5dfbd1ed8d6bb48fd6a5625ee6ef7f3a66f Mon Sep 17 00:00:00 2001
From: Mihai Nita <nmihai_2000@yahoo.com>
Date: Sun, 20 Oct 2024 18:03:56 -0700
Subject: [PATCH 05/10] Added note about surrogates to quoted literals

---
 spec/appendices.md |  2 +-
 spec/syntax.md     | 11 ++++++++---
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/spec/appendices.md b/spec/appendices.md
index 417c7e6db..4d3e22a5e 100644
--- a/spec/appendices.md
+++ b/spec/appendices.md
@@ -24,7 +24,7 @@ when viewed as source code, or in resource formats or translation tools,
 but do not generate errors from MessageFormat parsers or processing APIs.
 
 > [!IMPORTANT]
-> _Text_ and _literals_ allow unpaired surrogate code points
+> _Text_ and _quoted literals_ allow unpaired surrogate code points
 > (`U+D800` to `U+DFFF`).
 > This is for compatibility with formats or data structures 
 > that use the UTF-16 encoding 
diff --git a/spec/syntax.md b/spec/syntax.md
index 321be1a89..74c546a88 100644
--- a/spec/syntax.md
+++ b/spec/syntax.md
@@ -60,8 +60,8 @@ The syntax specification takes into account the following design restrictions:
    control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters
    (U+FDD0 through U+FDEF and U+<i>n</i>FFFE and U+<i>n</i>FFFF where <i>n</i> is 0x0 through 0x10),
    private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and
-   U+100000 through U+10FFFD), unassigned code points, unpaired surrogates in messages and literals
-   only (U+D800 through U+DFFF), and other potentially confusing content.
+   U+100000 through U+10FFFD), unassigned code points, unpaired surrogates in messages and
+   quoted literals only (U+D800 through U+DFFF), and other potentially confusing content.
 
 ## Messages and their Syntax
 
@@ -305,7 +305,7 @@ content-char      = %x01-08        ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
                   / %x41-5B        ; omit \ (%x5C)
                   / %x5D-7A        ; omit { | } (%x7B-7D)
                   / %x7E-2FFF      ; omit IDEOGRAPHIC SPACE (%x3000)
-                  / %xE000-10FFFF
+                  / %x3001-10FFFF  ; allowing surrogates is intentional
 ```
 
 When a _pattern_ is quoted by embedding the _pattern_ in curly brackets, the
@@ -716,6 +716,11 @@ A **_<dfn>quoted literal</dfn>_** begins and ends with U+005E VERTICAL BAR `|`.
 The characters `\` and `|` within a _quoted literal_ MUST be
 escaped as `\\` and `\|`.
 
+> [!NOTE]
+> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
+> are allowed in quoted literals for compatibility with UTF-16 based
+> implementations that do not check for this encoding error.
+
 An **_<dfn>unquoted literal</dfn>_** is a _literal_ that does not require the `|`
 quotes around it to be distinct from the rest of the _message_ syntax.
 An _unquoted literal_ MAY be used when the content of the _literal_

From d7de9debf7b00fe24ad29d5caf4e3d6325225538 Mon Sep 17 00:00:00 2001
From: Mihai Nita <nmihai_2000@yahoo.com>
Date: Mon, 21 Oct 2024 16:52:02 -0700
Subject: [PATCH 06/10] Moved the note about surrogates from Security
 Considerations to The Message

---
 spec/appendices.md | 16 ----------------
 spec/syntax.md     | 16 ++++++++++++++++
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/spec/appendices.md b/spec/appendices.md
index 4d3e22a5e..b65036c6c 100644
--- a/spec/appendices.md
+++ b/spec/appendices.md
@@ -23,22 +23,6 @@ that abnormally affect the display of the _message_
 when viewed as source code, or in resource formats or translation tools,
 but do not generate errors from MessageFormat parsers or processing APIs.
 
-> [!IMPORTANT]
-> _Text_ and _quoted literals_ allow unpaired surrogate code points
-> (`U+D800` to `U+DFFF`).
-> This is for compatibility with formats or data structures 
-> that use the UTF-16 encoding 
-> and do not check for unpaired surrogates.
-> (Strings in Java or JavaScript are examples of this.)
-> These code points SHOULD NOT be used in a _message_.
-> Unpaired surrogate code points are likely an indication of mistakes
-> or errors in the creation, serialization, or processing of the _message_.
-> Many processes will convert them to 
-> &#xfffd; U+FFFD REPLACEMENT CHARACTER
-> during processing or display.
-> Implementations not based on UTF-16 might not be able to represent
-> a _message_ containing such code points.
-
 Bidirectional text containing right-to-left characters (such as used for Arabic or Hebrew) 
 also poses a potential source of confusion for users. 
 Since MessageFormat 2.0's syntax makes use of 
diff --git a/spec/syntax.md b/spec/syntax.md
index 74c546a88..0fc72c381 100644
--- a/spec/syntax.md
+++ b/spec/syntax.md
@@ -114,6 +114,22 @@ A **_<dfn>local variable</dfn>_** is a _variable_ created as the result of a _lo
 > In particular, it avoids using quote characters common to many file formats and formal languages
 > so that these do not need to be escaped in the body of a _message_.
 
+> [!NOTE]
+> _Text_ and _quoted literals_ allow unpaired surrogate code points
+> (`U+D800` to `U+DFFF`).
+> This is for compatibility with formats or data structures 
+> that use the UTF-16 encoding 
+> and do not check for unpaired surrogates.
+> (Strings in Java or JavaScript are examples of this.)
+> These code points SHOULD NOT be used in a _message_.
+> Unpaired surrogate code points are likely an indication of mistakes
+> or errors in the creation, serialization, or processing of the _message_.
+> Many processes will convert them to 
+> &#xfffd; U+FFFD REPLACEMENT CHARACTER
+> during processing or display.
+> Implementations not based on UTF-16 might not be able to represent
+> a _message_ containing such code points.
+
 > [!NOTE]
 > In general (and except where required by the syntax), whitespace carries no meaning in the structure
 > of a _message_. While many of the examples in this spec are written on multiple lines, the formatting

From 16b90b30535bf9a838c69ee7c446a6aef6a1ebfc Mon Sep 17 00:00:00 2001
From: Addison Phillips <addisonI18N@gmail.com>
Date: Mon, 21 Oct 2024 16:54:13 -0700
Subject: [PATCH 07/10] Update spec/syntax.md

---
 spec/syntax.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/spec/syntax.md b/spec/syntax.md
index 0fc72c381..b36f582a0 100644
--- a/spec/syntax.md
+++ b/spec/syntax.md
@@ -292,6 +292,7 @@ A _quoted pattern_ MAY be empty.
 
 **_<dfn>text</dfn>_** is the translateable content of a _pattern_.
 Any Unicode code point is allowed, except for U+0000 NULL.
+
 > [!NOTE]
 > Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
 > are allowed for compatibility with UTF-16 based implementations

From 07d6fe55815a03f61ce6bdb5b4b179b3d5ef8d65 Mon Sep 17 00:00:00 2001
From: Addison Phillips <addisonI18N@gmail.com>
Date: Mon, 21 Oct 2024 16:55:13 -0700
Subject: [PATCH 08/10] Update spec/syntax.md

---
 spec/syntax.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/spec/syntax.md b/spec/syntax.md
index b36f582a0..6fe046508 100644
--- a/spec/syntax.md
+++ b/spec/syntax.md
@@ -735,7 +735,7 @@ escaped as `\\` and `\|`.
 
 > [!NOTE]
 > Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
-> are allowed in quoted literals for compatibility with UTF-16 based
+> are allowed in _quoted literals_ for compatibility with UTF-16 based
 > implementations that do not check for this encoding error.
 
 An **_<dfn>unquoted literal</dfn>_** is a _literal_ that does not require the `|`

From 17f2bbc5f36829cc78563673d0a41c477fdfbbff Mon Sep 17 00:00:00 2001
From: Mihai Nita <nmihai_2000@yahoo.com>
Date: Mon, 21 Oct 2024 16:55:09 -0700
Subject: [PATCH 09/10] Italicize  in a couple of places

---
 spec/syntax.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/spec/syntax.md b/spec/syntax.md
index 6fe046508..2817e2658 100644
--- a/spec/syntax.md
+++ b/spec/syntax.md
@@ -61,7 +61,7 @@ The syntax specification takes into account the following design restrictions:
    (U+FDD0 through U+FDEF and U+<i>n</i>FFFE and U+<i>n</i>FFFF where <i>n</i> is 0x0 through 0x10),
    private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and
    U+100000 through U+10FFFD), unassigned code points, unpaired surrogates in messages and
-   quoted literals only (U+D800 through U+DFFF), and other potentially confusing content.
+   _quoted literals_ only (U+D800 through U+DFFF), and other potentially confusing content.
 
 ## Messages and their Syntax
 

From b7877a4df9d17cbcd8d21c5be7be3dedbf879dfe Mon Sep 17 00:00:00 2001
From: Mihai Nita <nmihai_2000@yahoo.com>
Date: Mon, 21 Oct 2024 17:02:01 -0700
Subject: [PATCH 10/10] Implemented more (all?) feedback from review

---
 spec/syntax.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/spec/syntax.md b/spec/syntax.md
index 2817e2658..38725a053 100644
--- a/spec/syntax.md
+++ b/spec/syntax.md
@@ -60,8 +60,8 @@ The syntax specification takes into account the following design restrictions:
    control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters
    (U+FDD0 through U+FDEF and U+<i>n</i>FFFE and U+<i>n</i>FFFF where <i>n</i> is 0x0 through 0x10),
    private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and
-   U+100000 through U+10FFFD), unassigned code points, unpaired surrogates in messages and
-   _quoted literals_ only (U+D800 through U+DFFF), and other potentially confusing content.
+   U+100000 through U+10FFFD), unassigned code points, unpaired surrogates (U+D800 through U+DFFF),
+   and other potentially confusing content.
 
 ## Messages and their Syntax
 
@@ -293,10 +293,6 @@ A _quoted pattern_ MAY be empty.
 **_<dfn>text</dfn>_** is the translateable content of a _pattern_.
 Any Unicode code point is allowed, except for U+0000 NULL.
 
-> [!NOTE]
-> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
-> are allowed for compatibility with UTF-16 based implementations
-> that do not check for this encoding error.
 The characters U+005C REVERSE SOLIDUS `\`,
 U+007B LEFT CURLY BRACKET `{`, and U+007D RIGHT CURLY BRACKET `}`
 MUST be escaped as `\\`, `\{`, and `\}` respectively.
@@ -325,6 +321,11 @@ content-char      = %x01-08        ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
                   / %x3001-10FFFF  ; allowing surrogates is intentional
 ```
 
+> [!NOTE]
+> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
+> are allowed for compatibility with UTF-16 based implementations
+> that do not check for this encoding error.
+
 When a _pattern_ is quoted by embedding the _pattern_ in curly brackets, the
 resulting _message_ can be embedded into
 various formats regardless of the container's whitespace trimming rules.