Fix whitespace conformance to match UAX31 (including permitting LRM/RLM) #673

Closed
wants to merge 26 commits into from
Changes from 23 commits
Commits
26 commits
b387301
Allow LRM in whitespace
aphillips Feb 19, 2024
ae712d7
Update syntax.md
aphillips Feb 19, 2024
c556ecf
Make 3a-2 definition consistent with requirements
aphillips Feb 19, 2024
8558eef
Update spec/syntax.md
aphillips Feb 19, 2024
d830550
Replace `[s]` with owsp production
aphillips Feb 19, 2024
42b7744
`s` != `wsp`
aphillips Feb 19, 2024
af75420
Update syntax.md
aphillips Feb 19, 2024
6a6dc70
Update message.abnf
aphillips Feb 19, 2024
db5e97c
Update spec/syntax.md
aphillips Feb 19, 2024
3fede7a
Update spec/syntax.md
aphillips Feb 20, 2024
586a4d6
Update spec/syntax.md
aphillips Feb 20, 2024
502e1fe
Update spec/syntax.md
aphillips Feb 20, 2024
d7af5eb
Address literals
aphillips Feb 20, 2024
68f32f8
Fix converting some `s` productions to `wsp`
aphillips Feb 21, 2024
e36d01d
Add bidi isolate support
aphillips Feb 21, 2024
5609dc0
Missing one "x"
aphillips Feb 21, 2024
e74b34c
Add a warning/tech preview feedback note
aphillips Feb 21, 2024
9557849
Update spec/syntax.md
aphillips Feb 21, 2024
ec369ec
Update spec/syntax.md
aphillips Feb 21, 2024
2532206
Address @macchiati's comment
aphillips Feb 21, 2024
f985860
Update spec/syntax.md
aphillips Feb 21, 2024
52c5c86
Update spec/syntax.md
aphillips Feb 21, 2024
4a6c96c
Fix up @machiatti's suggested text
aphillips Feb 21, 2024
19864c3
Address @mihnita's suggestion, fix expression brackets
aphillips Feb 22, 2024
27b9e42
Make syntax.md make abnf
aphillips Feb 26, 2024
00e3796
Merge branch 'main' into aphillips-allow-lrm
aphillips Jun 9, 2024
46 changes: 27 additions & 19 deletions spec/message.abnf
@@ -3,43 +3,45 @@ message = simple-message / complex-message
simple-message = [simple-start pattern]
simple-start = simple-start-char / text-escape / placeholder
pattern = *(text-char / text-escape / placeholder)
placeholder = expression / markup
placeholder = "{" (expression / markup) "}"
/ "{" %x2066 (expression / markup) %x2069 "}"

complex-message = *(declaration [s]) complex-body
complex-message = *(declaration owsp) complex-body
declaration = input-declaration / local-declaration / reserved-statement
complex-body = quoted-pattern / matcher

input-declaration = input [s] variable-expression
local-declaration = local s variable [s] "=" [s] expression
input-declaration = input owsp variable-expression
local-declaration = local wsp variable owsp "=" owsp expression

quoted-pattern = "{{" pattern "}}"
/ %x2066 "{{" pattern "}}" %x2069

matcher = match-statement 1*([s] variant)
match-statement = match 1*([s] selector)
matcher = match-statement 1*(owsp variant)
match-statement = match 1*(owsp selector)
selector = expression
variant = key *(s key) [s] quoted-pattern
variant = key *(wsp key) owsp quoted-pattern
key = literal / "*"

; Expressions
expression = literal-expression
/ variable-expression
/ annotation-expression
literal-expression = "{" [s] literal [s annotation] *(s attribute) [s] "}"
variable-expression = "{" [s] variable [s annotation] *(s attribute) [s] "}"
annotation-expression = "{" [s] annotation *(s attribute) [s] "}"
literal-expression = owsp literal [wsp annotation] *(wsp attribute) owsp
variable-expression = owsp variable [wsp annotation] *(wsp attribute) owsp
annotation-expression = owsp annotation *(wsp attribute) owsp

annotation = function
/ private-use-annotation
/ reserved-annotation

markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone
/ "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close
markup = owsp "#" identifier *(wsp option) *(wsp attribute) owsp ["/"] ; open and standalone
/ owsp "/" identifier *(wsp option) *(wsp attribute) owsp ; close

; Expression and literal parts
function = ":" identifier *(s option)
option = identifier [s] "=" [s] (literal / variable)
function = ":" identifier *(wsp option)
option = identifier owsp "=" owsp (literal / variable)
; Attributes are reserved for future standardization
attribute = "@" identifier [[s] "=" [s] (literal / variable)]
attribute = "@" identifier [owsp "=" owsp (literal / variable)]

variable = "$" name
literal = quoted / unquoted
@@ -54,7 +56,7 @@ local = %s".local"
match = %s".match"

; Reserve additional .keywords for use by future versions of this specification.
reserved-statement = reserved-keyword [s reserved-body] 1*([s] expression)
reserved-statement = reserved-keyword [s reserved-body] 1*(owsp expression)
; Note that the following production is a simplification,
; as this rule MUST NOT be considered to match existing keywords
; (`.input`, `.local`, and `.match`).
@@ -67,7 +69,7 @@ reserved-annotation-start = "!" / "%" / "*" / "+" / "<" / ">" / "?" / "~"
; Reserve sigils for private-use by implementations.
private-use-annotation = private-start reserved-body
private-start = "^" / "&"
reserved-body = *([s] 1*(reserved-char / reserved-escape / quoted))
reserved-body = *(owsp 1*(reserved-char / reserved-escape / quoted))

; Names and identifiers
; identifier matches https://www.w3.org/TR/REC-xml-names/#NT-QName
@@ -104,5 +106,11 @@ quoted-escape = backslash ( backslash / "|" )
reserved-escape = backslash ( backslash / "{" / "|" / "}" )
backslash = %x5C ; U+005C REVERSE SOLIDUS "\"

; Whitespace
s = 1*( SP / HTAB / CR / LF / %x3000 )
; optional whitespace
owsp = *( s / %x200E / %x200F / %x2066-2069 )

; required whitespace
wsp = [ (%x200E / %x200F / %x2066-2069 ) ] 1*s [ (%x200E / %x200F / %x2066-2069 ) ]

; whitespace characters
s = ( SP / HTAB / CR / LF / %x3000 )
102 changes: 93 additions & 9 deletions spec/syntax.md
@@ -97,14 +97,14 @@ Attempting to parse a _message_ that is not _valid_ will result in a _Data Model

A **_<dfn>message</dfn>_** is the complete template for a specific message formatting request.

> **Note**
> [!NOTE]
> This syntax is designed to be embeddable into many different programming languages and formats.
> As such, it avoids constructs, such as character escapes, that are specific to any given file
> format or processor.
> In particular, it avoids using quote characters common to many file formats and formal languages
> so that these do not need to be escaped in the body of a _message_.

> **Note**
> [!NOTE]
> In general (and except where required by the syntax), whitespace carries no meaning in the structure
> of a _message_. While many of the examples in this spec are written on multiple lines, the formatting
> shown is primarily for readability.
@@ -124,6 +124,27 @@ A **_<dfn>message</dfn>_** is the complete template for a specific message forma
> >
> > An exception to this is: whitespace inside a _pattern_ is **always** significant.

> [!NOTE]
> The MessageFormat 2 syntax assumes that each _message_ will be displayed
> with a left-to-right display order
> and be processed in the logical character order
> while permitting the use of right-to-left characters in _identifiers_,
> _literals_, and other values.
> This can result in confusion when viewing the message
> or in users incorrectly inserting controls that negatively affect the output
> of the message.
>
> To assist with this, the syntax permits the use of various controls and
> strongly-directional markers in both optional and required _whitespace_
> in a _message_, as well as encouraging the use of isolating controls
> with _expressions_ and _quoted patterns_.
> See: [whitespace](#whitespace) (below) for more information.
>
> Additional restrictions or requirements might be added during the
> Tech Preview to better manage bidirectional text.
> Feedback on the creation and management of _messages_
> containing bidirectional tokens is strongly desired.

A _message_ can be a _simple message_ or it can be a _complex message_.

```abnf
@@ -905,24 +926,87 @@ backslash = %x5C ; U+005C REVERSE SOLIDUS "\"

### Whitespace

**_<dfn>Whitespace</dfn>_** is defined as one or more of
U+0009 CHARACTER TABULATION (tab),
U+000A LINE FEED (new line),
U+000D CARRIAGE RETURN,
U+3000 IDEOGRAPHIC SPACE,
or U+0020 SPACE.
The syntax limits whitespace characters outside of a _pattern_ to the following:
`U+0009 CHARACTER TABULATION` (tab),
`U+000A LINE FEED` (new line),
`U+000D CARRIAGE RETURN`,
`U+3000 IDEOGRAPHIC SPACE`,
or `U+0020 SPACE`.

Inside _patterns_ and _quoted literals_,
whitespace is part of the content and is recorded and stored verbatim.
Whitespace is not significant outside translatable text, except where required by the syntax.

There are two whitespace productions in the syntax.
**_<dfn>Optional whitespace</dfn>_** is whitespace that is not required by the syntax,
but which users might want to include to increase the readability of a _message_.
**_<dfn>Required whitespace</dfn>_** is whitespace that is required by the syntax.

_Messages_ that contain right-to-left (aka RTL) characters SHOULD use one of the
following mechanisms to make messages display intelligibly in plain-text editors:

1. Use paired isolating bidi controls `U+2066 LEFT-TO-RIGHT ISOLATE`
and `U+2069 POP DIRECTIONAL ISOLATE` as permitted by the ABNF around
parts of any _message_ containing RTL characters:
- _inside_ of _placeholder_ markers `{` and `}`
- _outside_ _quoted-pattern_ markers `{{` and `}}`
- _identifiers_
- _literals_ (This is especially important for individual _keys_ in a _variant_)
- _option_ values
2. Use the 'local-effect' bidi controls `U+200E LEFT-TO-RIGHT MARK` or
`U+200F RIGHT-TO-LEFT MARK` as permitted by the ABNF around
parts of any _message_ containing RTL characters:
- _identifiers_
- _literals_ (taking care not to include the mark inside any quotes)
- _option_ values
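
As a rough illustration of the first mechanism, the following Python sketch wraps a quoted pattern and a placeholder body in paired isolates at the positions the ABNF in this PR permits. The helper names `isolate_quoted_pattern` and `isolate_placeholder` are hypothetical and are not part of any MessageFormat implementation.

```python
LRI = "\u2066"  # U+2066 LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # U+2069 POP DIRECTIONAL ISOLATE


def isolate_quoted_pattern(pattern: str) -> str:
    """Hypothetical helper: place the isolates outside the `{{` and `}}` markers."""
    return LRI + "{{" + pattern + "}}" + PDI


def isolate_placeholder(body: str) -> str:
    """Hypothetical helper: place the isolates inside the `{` and `}` markers."""
    return "{" + LRI + body + PDI + "}"


# The pattern text is untouched, so no control leaks into the formatted output.
wrapped = isolate_quoted_pattern("שלום")
print([f"U+{ord(ch):04X}" for ch in (wrapped[0], wrapped[-1])])
```

Because the isolates sit outside the pattern delimiters, they are consumed as syntax rather than stored as message content.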

> [!IMPORTANT]
> Always take care **not** to add a bidi control where it is semantically significant:
> - put them outside of _literal_ quotes, such as `<LRM>|...|<LRM>`
> - put them outside of quoted _patterns_, such as `<LRI>{{...}}<PDI>`
Comment on lines +1002 to +1003
Collaborator:

I suggest adding words just to make it clear what the list is telling people not to do (vs. what they are supposed to do). Also, I think you want a newline to ensure that the following sentences of the note don't get collapsed into the second bullet point.

Suggested change
> - put them outside of _literal_ quotes, such as `<LRM>|...|<LRM>`
> - put them outside of quoted _patterns_, such as `<LRI>{{...}}<PDI>`
> - do not put them outside of _literal_ quotes, such as `<LRM>|...|<LRM>`
> - do not put them outside of quoted _patterns_, such as `<LRI>{{...}}<PDI>`
>

Member, Author:

Good point. I am making changes to this text as I incorporate it into a new PR that replaces this one and implements the bidi design.

> Controls placed inside _literal_ quotes or quoted _patterns_ are part of the literal
> or pattern.
> Controls in a _pattern_ will appear in the output of the message.
> Controls inside _literal_ quotes are part of the _literal_ and
> will be considered in operations such as matching a _key_ to a _selector_.
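
A small sketch of why the placement matters, using plain string comparison as a stand-in for key matching. This is an assumption for illustration only: actual matching behaviour depends on the selector function in use.

```python
LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK

selector_value = "abc"

# Key written as `<LRM>|abc|<LRM>`: the marks are outside the quotes,
# so they are whitespace syntax and the literal's value is just "abc".
key_mark_outside = "abc"

# Key written as `|<LRM>abc|`: the mark is inside the quotes,
# so it is part of the literal's value.
key_mark_inside = LRM + "abc"

print(selector_value == key_mark_outside)  # True
print(selector_value == key_mark_inside)   # False
print(len(key_mark_inside))                # 4
```

The two keys look identical in most editors, which is why the note above asks authors and tools to keep the controls outside the quotes.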

> [!NOTE]
> Users cannot be expected to create or manage bidirectional controls or
> marks in _messages_, since the characters are invisible and can be difficult
> to manage.
> Tools (such as resource editors or translation editors)
> and other implementations of MessageFormat 2 serialization are strongly
> encouraged to provide paired isolates around any right-to-left
> syntax as described above so that _messages_ display appropriately as plain text.

These definitions of _whitespace_ implement
[UAX#31 Requirement R3a-2](https://www.unicode.org/reports/tr31/#R3a-2).
They are a profile of R3a-1 in that specification because
the following pattern whitespace characters are not allowed:
`U+000B LINE TABULATION` (vertical tab),
`U+000C FORM FEED`,
`U+0085 NEXT LINE`,
`U+2028 LINE SEPARATOR` and
`U+2029 PARAGRAPH SEPARATOR`;
the character `U+3000 IDEOGRAPHIC SPACE`
_is_ interpreted as whitespace;
and the directional isolates `U+2066..U+2069`
are treated as ignorable format controls.
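
The profile can be checked mechanically with set arithmetic. In this sketch the `PATTERN_WHITE_SPACE` set is transcribed by hand from UAX #31 and should be verified against the standard. The variable names are illustrative only.

```python
# Pattern_White_Space code points, transcribed by hand from UAX #31 (assumption: verify against the standard).
PATTERN_WHITE_SPACE = {
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020,
    0x0085, 0x200E, 0x200F, 0x2028, 0x2029,
}

# Characters matched by the `s` production in this PR.
MF2_S = {0x0009, 0x000A, 0x000D, 0x0020, 0x3000}

# LRM and RLM are not in `s`, but `owsp` and `wsp` permit them.
BIDI_MARKS = {0x200E, 0x200F}


def fmt(code_points):
    """Render a set of code points as sorted U+XXXX labels."""
    return [f"U+{cp:04X}" for cp in sorted(code_points)]


excluded = PATTERN_WHITE_SPACE - MF2_S - BIDI_MARKS
added = MF2_S - PATTERN_WHITE_SPACE

print(fmt(excluded))  # ['U+000B', 'U+000C', 'U+0085', 'U+2028', 'U+2029']
print(fmt(added))     # ['U+3000']
```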

> [!NOTE]
> The character U+3000 IDEOGRAPHIC SPACE is included in whitespace for
> compatibility with certain East Asian keyboards and input methods,
> in which users might accidentally create these characters in a _message_.

```abnf
s = 1*( SP / HTAB / CR / LF / %x3000 )
; optional whitespace
owsp = *( s / %x200E / %x200F / %x2066-2069 )

; required whitespace
wsp = [ (%x200E / %x200F / %x2066-2069 ) ] 1*s [ (%x200E / %x200F / %x2066-2069 ) ]

; whitespace characters
s = ( SP / HTAB / CR / LF / %x3000 )
```
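
For readers who want to experiment with the new productions, here is a minimal sketch that transcribes `s`, `owsp`, and `wsp` into Python regular expressions. It is a direct reading of the ABNF above, not a tested parser component. The names `is_owsp` and `is_wsp` are hypothetical.

```python
import re

# s = ( SP / HTAB / CR / LF / %x3000 )
S = r"[ \t\r\n\u3000]"
# %x200E / %x200F / %x2066-2069
BIDI = r"[\u200E\u200F\u2066-\u2069]"

# owsp = *( s / %x200E / %x200F / %x2066-2069 )
OWSP = re.compile(rf"(?:{S}|{BIDI})*")
# wsp = [ bidi ] 1*s [ bidi ]
WSP = re.compile(rf"{BIDI}?{S}+{BIDI}?")


def is_owsp(text: str) -> bool:
    """True if the whole string matches the optional-whitespace production."""
    return OWSP.fullmatch(text) is not None


def is_wsp(text: str) -> bool:
    """True if the whole string matches the required-whitespace production."""
    return WSP.fullmatch(text) is not None


print(is_owsp(""))               # True: optional whitespace may be empty
print(is_wsp("\u200e \u200e"))   # True: marks may flank the required space
print(is_wsp("\u200e"))          # False: a mark alone is not required whitespace
```

Note that `wsp` still requires at least one character from `s`, so bidi controls cannot replace the separating space between tokens.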

## Complete ABNF