Allow surrogates in content, issue #895 #906

Merged (10 commits) on Oct 22, 2024
6 changes: 2 additions & 4 deletions spec/appendices.md
@@ -14,12 +14,10 @@ host environments, their serializations and resource formats,
that might be sufficient to prevent most problems.
However, MessageFormat itself does not supply such a restriction.

-MessageFormat _messages_ permit nearly all Unicode code points,
-with the exception of surrogates,
+MessageFormat _messages_ permit nearly all Unicode code points
to appear in _literals_, including the text portions of a _pattern_.
This means that it can be possible for a _message_ to contain invisible characters
-(such as bidirectional controls,
-ASCII control characters in the range U+0000 to U+001F,
+(such as bidirectional controls, ASCII control characters in the range U+0000 to U+001F,
or characters that might be interpreted as escapes or syntax in the host format)
that abnormally affect the display of the _message_
when viewed as source code, or in resource formats or translation tools,
3 changes: 1 addition & 2 deletions spec/message.abnf
@@ -76,8 +76,7 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
/ %x41-5B ; omit \ (%x5C)
/ %x5D-7A ; omit { | } (%x7B-7D)
/ %x7E-2FFF ; omit IDEOGRAPHIC SPACE (%x3000)
-/ %x3001-D7FF ; omit surrogates
-/ %xE000-10FFFF
+/ %x3001-10FFFF ; allowing surrogates is intentional

; Character escapes
escaped-char = backslash ( backslash / "{" / "|" / "}" )
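
The effect of this grammar change on the last `content-char` alternative can be sketched in Python. This is only an illustration of the two ranges visible in this hunk, not a full `content-char` matcher, and the function names `in_tail_old` and `in_tail_new` are hypothetical.

```python
def in_tail_old(cp: int) -> bool:
    """Old rule: %x3001-D7FF / %xE000-10FFFF (surrogates omitted)."""
    return 0x3001 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def in_tail_new(cp: int) -> bool:
    """New rule: %x3001-10FFFF (surrogates intentionally allowed)."""
    return 0x3001 <= cp <= 0x10FFFF


# Code points that the two rules classify differently.
changed = [
    cp for cp in range(0x3001, 0x110000)
    if in_tail_old(cp) != in_tail_new(cp)
]
```

Under these assumptions, `changed` holds exactly the surrogate block U+D800 through U+DFFF, while U+3000 IDEOGRAPHIC SPACE stays excluded by both rules.
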
39 changes: 32 additions & 7 deletions spec/syntax.md
@@ -60,7 +60,8 @@ The syntax specification takes into account the following design restrictions:
control characters such as U+0000 NULL and U+0009 TAB, permanently reserved noncharacters
(U+FDD0 through U+FDEF and U+<i>n</i>FFFE and U+<i>n</i>FFFF where <i>n</i> is 0x0 through 0x10),
private-use code points (U+E000 through U+F8FF, U+F0000 through U+FFFFD, and
-U+100000 through U+10FFFD), unassigned code points, and other potentially confusing content.
+U+100000 through U+10FFFD), unassigned code points, unpaired surrogates (U+D800 through U+DFFF),
+and other potentially confusing content.

## Messages and their Syntax

@@ -113,6 +114,22 @@ A **_<dfn>local variable</dfn>_** is a _variable_ created as the result of a _lo
> In particular, it avoids using quote characters common to many file formats and formal languages
> so that these do not need to be escaped in the body of a _message_.

+> [!NOTE]
+> _Text_ and _quoted literals_ allow unpaired surrogate code points
+> (`U+D800` to `U+DFFF`).
+> This is for compatibility with formats or data structures
+> that use the UTF-16 encoding
+> and do not check for unpaired surrogates.
+> (Strings in Java or JavaScript are examples of this.)
+> These code points SHOULD NOT be used in a _message_.
+> Unpaired surrogate code points are likely an indication of mistakes
+> or errors in the creation, serialization, or processing of the _message_.
+> Many processes will convert them to
+> &#xfffd; U+FFFD REPLACEMENT CHARACTER
+> during processing or display.
+> Implementations not based on UTF-16 might not be able to represent
+> a _message_ containing such code points.
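
As a rough illustration of the note above, the following Python sketch finds unpaired surrogates in a sequence of UTF-16 code units and decodes the sequence, substituting U+FFFD for each one. It is not taken from the spec or from any implementation; the helper names `unpaired_surrogate_indexes` and `decode_lossy` are hypothetical, and the input is modelled as a plain list of integers rather than a real UTF-16 string type.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def unpaired_surrogate_indexes(units: list[int]) -> list[int]:
    """Return the indexes of UTF-16 code units that are unpaired surrogates."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high (lead) surrogate
            nxt = units[i + 1] if i + 1 < len(units) else None
            if nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:  # low (trail) surrogate with no lead
            bad.append(i)
        i += 1
    return bad


def decode_lossy(units: list[int]) -> str:
    """Decode UTF-16 code units, replacing unpaired surrogates with U+FFFD."""
    bad = set(unpaired_surrogate_indexes(units))
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if i in bad:
            out.append(REPLACEMENT)
        elif 0xD800 <= u <= 0xDBFF:
            # Combine a well-formed pair into one supplementary code point.
            lo = units[i + 1]
            out.append(chr(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)))
            i += 1
        else:
            out.append(chr(u))
        i += 1
    return "".join(out)
```

A tool that lints _messages_ could report the indexes returned by the first helper as likely mistakes, in line with the SHOULD NOT above, instead of silently replacing them.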

> [!NOTE]
> In general (and except where required by the syntax), whitespace carries no meaning in the structure
> of a _message_. While many of the examples in this spec are written on multiple lines, the formatting
@@ -274,8 +291,8 @@ A _quoted pattern_ MAY be empty.
### Text

**_<dfn>text</dfn>_** is the translatable content of a _pattern_.
-Any Unicode code point is allowed, except for U+0000 NULL
-and the surrogate code points U+D800 through U+DFFF inclusive.
+Any Unicode code point is allowed, except for U+0000 NULL.

The characters U+005C REVERSE SOLIDUS `\`,
U+007B LEFT CURLY BRACKET `{`, and U+007D RIGHT CURLY BRACKET `}`
MUST be escaped as `\\`, `\{`, and `\}` respectively.
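
The escaping rule for _text_ can be sketched as a small Python helper. The name `escape_text` is hypothetical, and the sketch covers only the three characters listed above; it does not validate the rest of the input against the grammar.

```python
def escape_text(raw: str) -> str:
    """Prefix each backslash and curly bracket with a backslash."""
    return "".join("\\" + ch if ch in "\\{}" else ch for ch in raw)
```

For example, the raw text `a{b}` becomes `a\{b\}`, and a single backslash becomes two.
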
@@ -301,10 +318,14 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
/ %x41-5B ; omit \ (%x5C)
/ %x5D-7A ; omit { | } (%x7B-7D)
/ %x7E-2FFF ; omit IDEOGRAPHIC SPACE (%x3000)
-/ %x3001-D7FF ; omit surrogates
-/ %xE000-10FFFF
+/ %x3001-10FFFF ; allowing surrogates is intentional
```

+> [!NOTE]
+> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
+> are allowed for compatibility with UTF-16 based implementations
+> that do not check for this encoding error.

When a _pattern_ is quoted by embedding the _pattern_ in curly brackets, the
resulting _message_ can be embedded into
various formats regardless of the container's whitespace trimming rules.
@@ -691,8 +712,7 @@ A _literal_ can appear
as a _key_ value,
as the _operand_ of a _literal-expression_,
or in the value of an _option_.
-A _literal_ MAY include any Unicode code point
-except for U+0000 NULL or the surrogate code points U+D800 through U+DFFF.
+A _literal_ MAY include any Unicode code point except for U+0000 NULL.

All code points are preserved.

@@ -714,6 +734,11 @@ A **_<dfn>quoted literal</dfn>_** begins and ends with U+007C VERTICAL LINE `|`.
The characters `\` and `|` within a _quoted literal_ MUST be
escaped as `\\` and `\|`.

+> [!NOTE]
+> Unpaired surrogate code points (`U+D800` through `U+DFFF` inclusive)
+> are allowed in _quoted literals_ for compatibility with UTF-16 based
+> implementations that do not check for this encoding error.

An **_<dfn>unquoted literal</dfn>_** is a _literal_ that does not require the `|`
quotes around it to be distinct from the rest of the _message_ syntax.
An _unquoted literal_ MAY be used when the content of the _literal_