Add bidi support and address UAX31/UTS55 requirements #884

Merged
merged 28 commits into from
Sep 16, 2024
Changes from 13 commits
82fcef3
Add bidi support and address UAX31/UTS55 requirements
aphillips Sep 11, 2024
c5baba6
Update syntax.md including text from previous PR
aphillips Sep 11, 2024
ca63819
Repair the guidance on strongly directional marks
aphillips Sep 11, 2024
1e172fd
Fix formatting of the "important"
aphillips Sep 11, 2024
afd5ef0
Add bidi characters to description of whitespace.
aphillips Sep 11, 2024
c7a41fc
Permit bidi in a few more places
aphillips Sep 11, 2024
b0cd0a5
Update syntax.md ABNF
aphillips Sep 11, 2024
cacc5e9
Update formatting.md
aphillips Sep 11, 2024
1fb0f92
Address comment about name/identifier
aphillips Sep 11, 2024
a79fb8d
Address comments related to bidi in `name`
aphillips Sep 11, 2024
86a20f8
Fix variable's location
aphillips Sep 11, 2024
768a8a8
Address comment about the list of LRI/PDI targets
aphillips Sep 11, 2024
fd9fc57
One character typo :-P
aphillips Sep 11, 2024
734ef49
Update spec/syntax.md
aphillips Sep 12, 2024
4541758
Address comments about rule R3a-1
aphillips Sep 12, 2024
d751181
Update spec/syntax.md
aphillips Sep 12, 2024
cbd0457
Address comment about U+061C
aphillips Sep 12, 2024
0df963e
Change [o]wsp => `o` or `s`
aphillips Sep 12, 2024
be8fa43
Match syntax spec to abnf
aphillips Sep 12, 2024
f110af7
Remove *
aphillips Sep 12, 2024
d8c6d0f
Update syntax.md
aphillips Sep 12, 2024
d5fb3bb
Update spec/syntax.md
aphillips Sep 12, 2024
82af41f
Update spec/message.abnf
aphillips Sep 12, 2024
d9d79bc
Update spec/message.abnf
aphillips Sep 12, 2024
7858961
Update syntax.md
aphillips Sep 12, 2024
e7aa24c
Update spec/message.abnf
aphillips Sep 12, 2024
86fc1d4
Update spec/syntax.md
aphillips Sep 12, 2024
d5303c2
Update spec/syntax.md
aphillips Sep 12, 2024
11 changes: 10 additions & 1 deletion spec/formatting.md
@@ -768,7 +768,16 @@ That is, the text can can consist of a mixture of left-to-right and right-to-lef
The display of bidirectional text is defined by the
[Unicode Bidirectional Algorithm](http://www.unicode.org/reports/tr9/) [UAX9].

The directionality of the message as a whole is provided by the _formatting context_.
The directionality of the formatted _message_ as a whole is provided by the _formatting context_.

> [!NOTE]
> Keep in mind the difference between the formatted output of a _message_,
> which is the topic of this section,
> and the syntax of _message_ prior to formatting.
> The processing of a _message_ depends on the logical sequence of Unicode code points,
> not on the presentation of the _message_.
> Affordances to allow users appropriate control over the appearance of the
> _message_'s syntax have been provided.

When a _message_ is formatted, _placeholders_ are replaced
with their formatted representation.
52 changes: 31 additions & 21 deletions spec/message.abnf
@@ -1,41 +1,41 @@
message = simple-message / complex-message

simple-message = [s] [simple-start pattern]
simple-message = owsp [simple-start pattern]
simple-start = simple-start-char / escaped-char / placeholder
pattern = *(text-char / escaped-char / placeholder)
placeholder = expression / markup

complex-message = [s] *(declaration [s]) complex-body [s]
complex-message = owsp *(declaration owsp) complex-body owsp
declaration = input-declaration / local-declaration
complex-body = quoted-pattern / matcher

input-declaration = input [s] variable-expression
local-declaration = local s variable [s] "=" [s] expression
input-declaration = input owsp variable-expression
local-declaration = local wsp variable owsp "=" owsp expression

quoted-pattern = "{{" pattern "}}"
quoted-pattern = owsp "{{" pattern "}}" owsp

matcher = match-statement s variant *([s] variant)
match-statement = match 1*(s selector)
matcher = match-statement wsp variant *(owsp variant)
match-statement = match 1*(wsp selector)
selector = variable
variant = key *(s key) [s] quoted-pattern
variant = owsp key *(wsp key) quoted-pattern
key = literal / "*"

; Expressions
expression = literal-expression
/ variable-expression
/ function-expression
literal-expression = "{" [s] literal [s function] *(s attribute) [s] "}"
variable-expression = "{" [s] variable [s function] *(s attribute) [s] "}"
function-expression = "{" [s] function *(s attribute) [s] "}"
literal-expression = "{" owsp literal [wsp function] *(wsp attribute) owsp "}"
variable-expression = "{" owsp variable [wsp function] *(wsp attribute) owsp "}"
function-expression = "{" owsp function *(wsp attribute) owsp "}"

markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone
/ "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close
markup = "{" owsp "#" identifier *(wsp option) *(wsp attribute) owsp ["/"] "}" ; open and standalone
/ "{" owsp "/" identifier *(wsp option) *(wsp attribute) owsp "}" ; close

; Expression and literal parts
function = ":" identifier *(s option)
option = identifier [s] "=" [s] (literal / variable)
function = ":" identifier *(wsp option)
option = identifier owsp "=" owsp (literal / variable)

attribute = "@" identifier [[s] "=" [s] (literal / variable)]
attribute = "@" identifier [owsp "=" owsp (literal / variable)]

variable = "$" name

@@ -52,13 +52,13 @@ match = %s".match"

; Names and identifiers
; identifier matches https://www.w3.org/TR/REC-xml-names/#NT-QName
; name matches https://www.w3.org/TR/REC-xml-names/#NT-NCName but excludes U+FFFD
; name matches https://www.w3.org/TR/REC-xml-names/#NT-NCName but excludes U+FFFD and U+061C
identifier = [namespace ":"] name
namespace = name
name = name-start *name-char
name = [bidi] name-start *name-char [bidi]
name-start = ALPHA / "_"
/ %xC0-D6 / %xD8-F6 / %xF8-2FF
/ %x370-37D / %x37F-1FFF / %x200C-200D
/ %x370-37D / %x37F-61B / %x61D-1FFF / %x200C-200D
/ %x2070-218F / %x2C00-2FEF / %x3001-D7FF
/ %xF900-FDCF / %xFDF0-FFFC / %x10000-EFFFF
name-char = name-start / DIGIT / "-" / "."
@@ -83,5 +83,15 @@ content-char = %x01-08 ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
escaped-char = backslash ( backslash / "{" / "|" / "}" )
backslash = %x5C ; U+005C REVERSE SOLIDUS "\"

; Whitespace
s = 1*( SP / HTAB / CR / LF / %x3000 )
; Optional whitespace
owsp = *(s / bidi)

; Required whitespace
wsp = (owsp) 1*s (owsp)

; Bidirectional marks and isolates
; ALM / LRM / RLM / LRI, RLI, FSI & PDI
bidi = %x061C / %x200E / %x200F / %x2066-2069

; Whitespace characters
s = SP / HTAB / CR / LF / %x3000
162 changes: 127 additions & 35 deletions spec/syntax.md
@@ -134,17 +134,23 @@ A **_<dfn>local variable</dfn>_** is a _variable_ created as the result of a _lo
> > An exception to this is: whitespace inside a _pattern_ is **always** significant.

> [!NOTE]
> The syntax assumes that each _message_ will be displayed with a left-to-right display order
> The MessageFormat 2 syntax assumes that each _message_ will be displayed
> with a left-to-right display order
> and be processed in the logical character order.
> The syntax also permits the use of right-to-left characters in _identifiers_,
> The syntax permits the use of right-to-left characters in _identifiers_,
> _literals_, and other values.
> This can result in confusion when viewing the _message_.
> This can result in confusion when viewing the message,
> and users might incorrectly insert bidi controls or marks that negatively affect the output
> of the message.
>
> To assist with this, the syntax permits the use of various controls and
> strongly-directional markers in both optional and required _whitespace_
> in a _message_, as well as encouraging the use of isolating controls
> with _expressions_ and _quoted patterns_.
> See: [whitespace](#whitespace) (below) for more information.
>
> Additional restrictions or requirements,
> such as permitting the use of certain bidirectional control characters in the syntax,
> might be added during the Tech Preview to better manage bidirectional text.
> Feedback on the creation and management of _messages_
> containing bidirectional tokens is strongly desired.
> Additional restrictions or requirements might be added during the
> Tech Preview to better manage bidirectional text.

A _message_ can be a _simple message_ or it can be a _complex message_.

@@ -160,7 +166,7 @@ Whitespace at the start or end of a _simple message_ is significant,
and a part of the _text_ of the _message_.

```abnf
simple-message = [s] [simple-start pattern]
simple-message = owsp [simple-start pattern]
simple-start = simple-start-char / escaped-char / placeholder
```

@@ -176,7 +182,7 @@ Whitespace at the start or end of a _complex message_ is not significant,
and does not affect the processing of the _message_.

```abnf
complex-message = [s] *(declaration [s]) complex-body [s]
complex-message = owsp *(declaration owsp) complex-body owsp
```

### Declarations
@@ -193,8 +199,8 @@ A **_<dfn>local-declaration</dfn>_** binds a _variable_ to the resolved value of

```abnf
declaration = input-declaration / local-declaration
input-declaration = input [s] variable-expression
local-declaration = local s variable [s] "=" [s] expression
input-declaration = input owsp variable-expression
local-declaration = local wsp variable owsp "=" owsp expression
```

_Variables_, once declared, MUST NOT be redeclared.
@@ -254,7 +260,7 @@ A _quoted pattern_ starts with a sequence of two U+007B LEFT CURLY BRACKET `{{`
and ends with a sequence of two U+007D RIGHT CURLY BRACKET `}}`.

```abnf
quoted-pattern = "{{" pattern "}}"
quoted-pattern = owsp "{{" pattern "}}" owsp
```

A _quoted pattern_ MAY be empty.
@@ -352,8 +358,8 @@ otherwise, a corresponding _Data Model Error_ will be produced during processing
_Literal_ _keys_ are compared by their contents, not their syntactical appearance.

```abnf
matcher = match-statement s variant *([s] variant)
match-statement = match 1*(s selector)
matcher = match-statement wsp variant *(owsp variant)
match-statement = match 1*(wsp selector)
```

> A _message_ with a _matcher_:
@@ -425,7 +431,7 @@ Each _key_ is separated from each other by whitespace.
Whitespace is permitted but not required between the last _key_ and the _quoted pattern_.

```abnf
variant = key *(s key) [s] quoted-pattern
variant = owsp key *(wsp key) owsp quoted-pattern
key = literal / "*"
```

@@ -461,9 +467,9 @@ A **_<dfn>function-expression</dfn>_** contains a _function_ without an _operand
```abnf
expression = literal-expression
/ variable-expression
/ function-expression
literal-expression = "{" [s] literal [s function] *(s attribute) [s] "}"
variable-expression = "{" [s] variable [s function] *(s attribute) [s] "}"
function-expression = "{" [s] function *(s attribute) [s] "}"
literal-expression = "{" owsp literal [wsp function] *(wsp attribute) owsp "}"
variable-expression = "{" owsp variable [wsp function] *(wsp attribute) owsp "}"
function-expression = "{" owsp function *(wsp attribute) owsp "}"
```

There are several types of _expression_ that can appear in a _message_.
@@ -520,7 +526,7 @@ The _identifier_ MAY be followed by one or more _options_.
_Options_ are not required.

```abnf
function = ":" identifier *(s option)
function = ":" identifier *(wsp option)
```

> A _message_ with a _function_ operating on the _variable_ `$now`:
@@ -549,7 +555,7 @@ and will produce a _Duplicate Option Name_ error during processing.
The order of _options_ is not significant.

```abnf
option = identifier [s] "=" [s] (literal / variable)
option = identifier owsp "=" owsp (literal / variable)
```

> Examples of _functions_ with _options_
@@ -594,8 +600,8 @@ It MAY include _options_.
is a _pattern_ part ending a span.

```abnf
markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone
/ "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close
markup = "{" owsp "#" identifier *(wsp option) *(wsp attribute) owsp ["/"] "}" ; open and standalone
/ "{" owsp "/" identifier *(wsp option) *(wsp attribute) owsp "}" ; close
```

> A _message_ with one `button` markup span and a standalone `img` markup element:
@@ -637,7 +643,7 @@ all but the last _attribute_ with the same _identifier_ are ignored.
The order of _attributes_ is not otherwise significant.

```abnf
attribute = "@" identifier [[s] "=" [s] literal]
attribute = "@" identifier [owsp "=" owsp literal]
```

> Examples of _expressions_ and _markup_ with _attributes_:
@@ -727,7 +733,13 @@ A **_<dfn>name</dfn>_** is a character sequence used in an _identifier_
or as the name for a _variable_
or the value of an _unquoted literal_.

_Variable_ names are prefixed with `$`.
A _name_ can be preceded or followed by bidirectional marks or isolating controls
to aid in presenting names that contain right-to-left or neutral characters.
These characters are **not** part of the _name_ and MUST be treated as if they were not present
when matching _name_ or _identifier_ strings or _unquoted literal_ values.
Implementations MAY remove these characters from a _message_.
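
As a rough illustration of the matching rule above, the following Python sketch removes at most one leading and one trailing `bidi` character before comparing two _names_. The helper names `strip_name_bidi` and `names_match` are hypothetical and not part of the specification; the character set is copied from the `bidi` production in this PR.

```python
# Bidi marks and isolates from the `bidi` ABNF production:
# ALM, LRM, RLM, LRI, RLI, FSI, PDI.
BIDI_CHARS = "\u061C\u200E\u200F\u2066\u2067\u2068\u2069"


def strip_name_bidi(raw: str) -> str:
    """Drop one optional bidi character from each end of `raw`.

    Mirrors `name = [bidi] name-start *name-char [bidi]`: the bidi
    characters are not part of the name itself.
    """
    start = 1 if raw and raw[0] in BIDI_CHARS else 0
    end = len(raw)
    if end > start and raw[-1] in BIDI_CHARS:
        end -= 1
    return raw[start:end]


def names_match(a: str, b: str) -> bool:
    """Compare two names as if surrounding bidi characters were absent."""
    return strip_name_bidi(a) == strip_name_bidi(b)
```

Under these assumptions, `names_match("\u2066var\u2069", "var")` is true, because the LRI and PDI around the first name are ignored.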

_Variable_ _names_ are prefixed with `$`.

Valid content for _names_ is based on <cite>Namespaces in XML 1.0</cite>'s
[NCName](https://www.w3.org/TR/xml-names/#NT-NCName).
@@ -763,14 +775,14 @@ in this release.

```abnf
variable = "$" name
option = identifier [s] "=" [s] (literal / variable)
option = identifier owsp "=" owsp (literal / variable)

identifier = [namespace ":"] name
namespace = name
name = name-start *name-char
name = [bidi] name-start *name-char [bidi]
name-start = ALPHA / "_"
/ %xC0-D6 / %xD8-F6 / %xF8-2FF
/ %x370-37D / %x37F-1FFF / %x200C-200D
/ %x370-37D / %x37F-61B / %x61D-1FFF / %x200C-200D
/ %x2070-218F / %x2C00-2FEF / %x3001-D7FF
/ %xF900-FDCF / %xFDF0-FFFC / %x10000-EFFFF
name-char = name-start / DIGIT / "-" / "."
```
@@ -803,24 +815,104 @@ and inside _patterns_ only escape `{` and `}`.

### Whitespace

**_<dfn>Whitespace</dfn>_** is defined as one or more of
U+0009 CHARACTER TABULATION (tab),
U+000A LINE FEED (new line),
U+000D CARRIAGE RETURN,
U+3000 IDEOGRAPHIC SPACE,
or U+0020 SPACE.
The syntax limits whitespace characters outside of a _pattern_ to the following:
`U+0009 CHARACTER TABULATION` (tab),
`U+000A LINE FEED` (new line),
`U+000D CARRIAGE RETURN`,
`U+3000 IDEOGRAPHIC SPACE`,
or `U+0020 SPACE`.

Inside _patterns_ and _quoted literals_,
whitespace is part of the content and is recorded and stored verbatim.
Whitespace is not significant outside translatable text, except where required by the syntax.

There are two whitespace productions in the syntax.
**_<dfn>Optional whitespace</dfn>_** is whitespace that is not required by the syntax,
but which users might want to include to increase the readability of a _message_.
**_<dfn>Required whitespace</dfn>_** is whitespace that is required by the syntax.

Both types of whitespace optionally permit the use of the bidirectional isolate controls
and certain strongly directional marks.
These can assist users in presenting _messages_ that contain right-to-left
text, _literals_, or _names_ (including those for _functions_, _options_,
_option values_, and _keys_).

_Messages_ that contain right-to-left (aka RTL) characters SHOULD use one of the
following mechanisms to make messages display intelligibly in plain-text editors:

1. Use paired isolating bidi controls `U+2066 LEFT-TO-RIGHT ISOLATE` ("LRI")
and `U+2069 POP DIRECTIONAL ISOLATE` ("PDI") as permitted by the ABNF around
parts of any _message_ containing RTL characters:
- _inside_ of _placeholder_ markers `{` and `}`
- _outside_ _quoted-pattern_ markers `{{` and `}}`
- _outside_ of _literals_, paying particular attention to _keys_ in a _variant_
- _outside_ of _variable_, _function_, _markup_, or _attribute_ _names_/_identifiers_,
including the identifying sigil (e.g. `<LRI>$var</PDI>` or `<LRI>:ns:name</PDI>`)
2. Use the 'local-effect' bidi marks
`U+061C ARABIC LETTER MARK`, `U+200E LEFT-TO-RIGHT MARK` or
`U+200F RIGHT-TO-LEFT MARK` as permitted by the ABNF before or after _identifiers_,
_names_, unquoted _literals_, or _option_ values,
especially when the values contain a mix of neutral, weakly directional, and
strongly directional characters.
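
The placement rules in the list above might be sketched as follows. This is only an illustration, assuming Python and its standard `unicodedata` module; the function names are hypothetical, and a real serializer would still have to check that each position is one where the ABNF permits `bidi` characters.

```python
import unicodedata

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def contains_rtl(text: str) -> bool:
    """True if any character has a strong right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def isolate_placeholder(inner: str) -> str:
    """Put the isolates inside the `{` and `}` of a placeholder."""
    return "{" + LRI + inner + PDI + "}"


def isolate_quoted_pattern(pattern: str) -> str:
    """Put the isolates outside the `{{` and `}}` of a quoted pattern."""
    return LRI + "{{" + pattern + "}}" + PDI


def isolate_name(sigil_and_name: str) -> str:
    """Wrap a name together with its sigil, e.g. `$var` or `:ns:name`."""
    return LRI + sigil_and_name + PDI
```

A tool could call `contains_rtl` on each syntax element and apply the matching wrapper only where it returns true, leaving purely left-to-right messages unchanged.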

> [!IMPORTANT]
> Always take care **not** to add bidirectional controls or marks
> where they would be semantically significant
> or where they would unintentionally become part of the _message_'s output:
> - do not put them inside of a _literal_ except when they are part of the value,
> (instead put them outside of _literal_ quotes, such as `<LRM>|...|<LRM>`)
> - do not put them inside quoted _patterns_ except when they are part of the text,
> (instead put them outside of quoted _patterns_, such as `<LRI>{{...}}<PDI>`)
> - do not put them outside _placeholders_,
> (instead put them inside the _placeholder_, such as `{<LRI>$foo :number<PDI>}`)
>
> Controls placed inside _literal_ quotes or quoted _patterns_ are part of the _literal_
> or _pattern_.
> Controls in a _pattern_ will appear in the output of the message.
> Controls inside _literal_ quotes are part of the _literal_ and
> will be considered in operations such as matching a _key_ to a _selector_.

> [!NOTE]
> Users cannot be expected to create or manage bidirectional controls or
> marks in _messages_, since the characters are invisible and difficult
> to work with.
> Tools (such as resource editors or translation editors)
> and other implementations of MessageFormat 2 serialization are strongly
> encouraged to provide paired isolates around any right-to-left
> syntax as described above so that _messages_ display appropriately as plain text.

These definitions of _whitespace_ implement
[UAX#31 Requirement R3a-2](https://www.unicode.org/reports/tr31/#R3a-2).
It is a profile of R3a-1 in that specification because
the following pattern whitespace characters are not allowed:
`U+000B VERTICAL TABULATION`,
`U+000C FORM FEED`,
`U+0085 NEXT LINE`,
`U+2028 LINE SEPARATOR` and
`U+2029 PARAGRAPH SEPARATOR`;
the character `U+3000 IDEOGRAPHIC SPACE`
_is_ interpreted as whitespace,
and the directional isolates U+2066..U+2069
are treated as ignorable format controls.
Member

So is U+061C, I think?

Member Author

Yes, and LRM/RLM

Member

Right, but LRM and RLM are part of R3a-1, see the first note under https://www.unicode.org/reports/tr31/#R3a.

The profile includes adding U+061C to the ignorable format controls, not just the isolates.

Member Author

The second note talks about ALM, including:

> If it is added to the set of whitespace characters by a profile, it is interpreted as an ignorable format control.

In any case, I now have a list of ignorable format controls, which might be overkill, but saves reading rule R3a 😉.

Member

> If it is added to the set of whitespace characters by a profile, it is interpreted as an ignorable format control.

Indeed, but you still need to say that the profile adds it!

I agree that listing the set is probably better than the diff at this point.

Collaborator

We could also not allow ALM as a bidi character, as there should be no place in the syntax where an RLM couldn't be used just as well.

Member Author

I feel uncomfortable removing bidi marks. The ALM was added many years after RLM/LRM and its differences with RLM are minor. But we want bidi language users to have the tools they need to make things look right (and still be functional). I would hate to remove it because we need to add a couple of words to the spec.

Collaborator

My main concern here is that treating it as an ignorable format control requires us to deviate further from the XML name production.

Member Author

I think this change is the right thing to do.

ALM is an invisible, default-ignorable, non-spacing code point. As noted elsewhere, it was added to Unicode after XML/XMLName were defined. According to XML's rules, an ALM all-by-itself is a valid identifier. That seems like a bug, not a feature. Maybe we should call out the deviation more clearly and maybe (wearing my other chair hat) W3C should be called on to do an erratum.

@macchiati Any thoughts?

Member

I strongly agree; it should be added.


> [!NOTE]
> The character U+3000 IDEOGRAPHIC SPACE is included in whitespace for
> compatibility with certain East Asian keyboards and input methods,
> in which users might accidentally create these characters in a _message_.

```abnf
s = 1*( SP / HTAB / CR / LF / %x3000 )
; Optional whitespace
owsp = *(s / bidi)

; Required whitespace
wsp = (owsp) 1*s (owsp)

; Bidirectional marks and isolates
; ALM / LRM / RLM / LRI, RLI, FSI & PDI
bidi = %x061C / %x200E / %x200F / %x2066-2069

; Whitespace characters
s = SP / HTAB / CR / LF / %x3000
```
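
For readers who want to experiment with the new productions, here is a minimal sketch that transcribes `s`, `bidi`, `owsp`, and `wsp` into Python regular expressions. The function names `is_owsp` and `is_wsp` are hypothetical; this is one possible translation of the ABNF above, not a reference implementation.

```python
import re

# s = SP / HTAB / CR / LF / %x3000
S = r"[ \t\r\n\u3000]"
# bidi = %x061C / %x200E / %x200F / %x2066-2069
BIDI = r"[\u061C\u200E\u200F\u2066-\u2069]"

# owsp = *(s / bidi)
OWSP = re.compile(rf"(?:{S}|{BIDI})*")
# wsp = (owsp) 1*s (owsp)
WSP = re.compile(rf"(?:{S}|{BIDI})*{S}+(?:{S}|{BIDI})*")


def is_owsp(text: str) -> bool:
    """True if `text` is valid optional whitespace (may be empty)."""
    return OWSP.fullmatch(text) is not None


def is_wsp(text: str) -> bool:
    """True if `text` is valid required whitespace.

    At least one `s` character must be present; bidi characters
    alone do not satisfy the requirement.
    """
    return WSP.fullmatch(text) is not None
```

The difference between the two productions shows up with input made only of bidi characters: such a string satisfies `owsp` but not `wsp`, so an isolate or mark cannot replace the space that separates, for example, two _keys_.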

## Complete ABNF