Fix whitespace conformance to match UAX31 (including permitting LRM/RLM) #673

Closed
wants to merge 26 commits into from
Changes from 9 commits
Commits
b387301
Allow LRM in whitespace
aphillips Feb 19, 2024
ae712d7
Update syntax.md
aphillips Feb 19, 2024
c556ecf
Make 3a-2 definition consistent with requirements
aphillips Feb 19, 2024
8558eef
Update spec/syntax.md
aphillips Feb 19, 2024
d830550
Replace `[s]` with owsp production
aphillips Feb 19, 2024
42b7744
`s` != `wsp`
aphillips Feb 19, 2024
af75420
Update syntax.md
aphillips Feb 19, 2024
6a6dc70
Update message.abnf
aphillips Feb 19, 2024
db5e97c
Update spec/syntax.md
aphillips Feb 19, 2024
3fede7a
Update spec/syntax.md
aphillips Feb 20, 2024
586a4d6
Update spec/syntax.md
aphillips Feb 20, 2024
502e1fe
Update spec/syntax.md
aphillips Feb 20, 2024
d7af5eb
Address literals
aphillips Feb 20, 2024
68f32f8
Fix converting some `s` productions to `wsp`
aphillips Feb 21, 2024
e36d01d
Add bidi isolate support
aphillips Feb 21, 2024
5609dc0
Missing one "x"
aphillips Feb 21, 2024
e74b34c
Add a warning/tech preview feedback note
aphillips Feb 21, 2024
9557849
Update spec/syntax.md
aphillips Feb 21, 2024
ec369ec
Update spec/syntax.md
aphillips Feb 21, 2024
2532206
Address @macchiati's comment
aphillips Feb 21, 2024
f985860
Update spec/syntax.md
aphillips Feb 21, 2024
52c5c86
Update spec/syntax.md
aphillips Feb 21, 2024
4a6c96c
Fix up @machiatti's suggested text
aphillips Feb 21, 2024
19864c3
Address @mihnita's suggestion, fix expression brackets
aphillips Feb 22, 2024
27b9e42
Make syntax.md make abnf
aphillips Feb 26, 2024
00e3796
Merge branch 'main' into aphillips-allow-lrm
aphillips Jun 9, 2024
42 changes: 24 additions & 18 deletions spec/message.abnf
@@ -5,41 +5,41 @@ simple-start = simple-start-char / text-escape / placeholder
 pattern = *(text-char / text-escape / placeholder)
 placeholder = expression / markup

-complex-message = *(declaration [s]) complex-body
+complex-message = *(declaration owsp) complex-body
 declaration = input-declaration / local-declaration / reserved-statement
 complex-body = quoted-pattern / matcher

-input-declaration = input [s] variable-expression
-local-declaration = local s variable [s] "=" [s] expression
+input-declaration = input owsp variable-expression
+local-declaration = local s variable owsp "=" owsp expression

 quoted-pattern = "{{" pattern "}}"

-matcher = match-statement 1*([s] variant)
-match-statement = match 1*([s] selector)
+matcher = match-statement 1*(owsp variant)
+match-statement = match 1*(owsp selector)
 selector = expression
-variant = key *(s key) [s] quoted-pattern
+variant = key *(wsp key) owsp quoted-pattern
 key = literal / "*"

 ; Expressions
 expression = literal-expression
 / variable-expression
 / annotation-expression
-literal-expression = "{" [s] literal [s annotation] *(s attribute) [s] "}"
-variable-expression = "{" [s] variable [s annotation] *(s attribute) [s] "}"
-annotation-expression = "{" [s] annotation *(s attribute) [s] "}"
+literal-expression = "{" owsp literal [s annotation] *(wsp attribute) owsp "}"
+variable-expression = "{" owsp variable [s annotation] *(wsp attribute) owsp "}"
+annotation-expression = "{" owsp annotation *(wsp attribute) owsp "}"

 annotation = function
 / private-use-annotation
 / reserved-annotation

-markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone
-/ "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close
+markup = "{" owsp "#" identifier *(wsp option) *(wsp attribute) owsp ["/"] "}" ; open and standalone
+/ "{" owsp "/" identifier *(wsp option) *(wsp attribute) owsp "}" ; close

 ; Expression and literal parts
-function = ":" identifier *(s option)
-option = identifier [s] "=" [s] (literal / variable)
+function = ":" identifier *(wsp option)
+option = identifier owsp "=" owsp (literal / variable)
 ; Attributes are reserved for future standardization
-attribute = "@" identifier [[s] "=" [s] (literal / variable)]
+attribute = "@" identifier [owsp "=" owsp (literal / variable)]

 variable = "$" name
 literal = quoted / unquoted
@@ -54,7 +54,7 @@ local = %s".local"
 match = %s".match"

 ; Reserve additional .keywords for use by future versions of this specification.
-reserved-statement = reserved-keyword [s reserved-body] 1*([s] expression)
+reserved-statement = reserved-keyword [s reserved-body] 1*(owsp expression)
 ; Note that the following production is a simplification,
 ; as this rule MUST NOT be considered to match existing keywords
 ; (`.input`, `.local`, and `.match`).
@@ -67,7 +67,7 @@ reserved-annotation-start = "!" / "%" / "*" / "+" / "<" / ">" / "?" / "~"
 ; Reserve sigils for private-use by implementations.
 private-use-annotation = private-start reserved-body
 private-start = "^" / "&"
-reserved-body = *([s] 1*(reserved-char / reserved-escape / quoted))
+reserved-body = *(owsp 1*(reserved-char / reserved-escape / quoted))

 ; Names and identifiers
 ; identifier matches https://www.w3.org/TR/REC-xml-names/#NT-QName
@@ -104,5 +104,11 @@ quoted-escape = backslash ( backslash / "|" )
 reserved-escape = backslash ( backslash / "{" / "|" / "}" )
 backslash = %x5C ; U+005C REVERSE SOLIDUS "\"

-; Whitespace
-s = 1*( SP / HTAB / CR / LF / %x3000 )
+; optional whitespace
+owsp = *( s / %x200E / %x200F )
+
+; required whitespace
+wsp = [ (%x200E / %x200F) ] 1*s [ (%x200E / %x200F) ]
+
+; whitespace characters
+s = ( SP / HTAB / CR / LF / %x3000 )
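
To make the three whitespace productions in this file easier to reason about, here is a minimal Python sketch that transcribes them into regular expressions. It assumes the ABNF exactly as it appears in this diff; the names `S`, `MARK`, `OWSP`, `WSP`, `is_owsp` and `is_wsp` are illustrative and do not come from the specification or any implementation.

```python
import re

# Assumption: direct transcription of the productions in this diff.
S = r"[ \t\r\n\u3000]"        # s    = ( SP / HTAB / CR / LF / %x3000 )
MARK = r"[\u200E\u200F]"      # U+200E LRM / U+200F RLM
OWSP = rf"(?:{S}|{MARK})*"    # owsp = *( s / %x200E / %x200F )
WSP = rf"{MARK}?{S}+{MARK}?"  # wsp  = [ mark ] 1*s [ mark ]


def is_owsp(text: str) -> bool:
    """True if text is a complete match for optional whitespace."""
    return re.fullmatch(OWSP, text) is not None


def is_wsp(text: str) -> bool:
    """True if text is a complete match for required whitespace."""
    return re.fullmatch(WSP, text) is not None
```

Under this reading, a bare mark satisfies `owsp` but not `wsp`, and `wsp` accepts at most one mark on each side of the run of `s` characters, so a mark in the middle of the run is rejected.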
45 changes: 38 additions & 7 deletions spec/syntax.md
@@ -905,24 +905,55 @@ backslash = %x5C ; U+005C REVERSE SOLIDUS "\"

 ### Whitespace

-**_<dfn>Whitespace</dfn>_** is defined as one or more of
-U+0009 CHARACTER TABULATION (tab),
-U+000A LINE FEED (new line),
-U+000D CARRIAGE RETURN,
-U+3000 IDEOGRAPHIC SPACE,
-or U+0020 SPACE.
+The syntax limits whitespace characters outside of a _pattern_ to the following:
+`U+0009 CHARACTER TABULATION` (tab),
+`U+000A LINE FEED` (new line),
+`U+000D CARRIAGE RETURN`,
+`U+3000 IDEOGRAPHIC SPACE`,
+or `U+0020 SPACE`.

 Inside _patterns_ and _quoted literals_,
 whitespace is part of the content and is recorded and stored verbatim.
 Whitespace is not significant outside translatable text, except where required by the syntax.

+There are two whitespace productions in the syntax.
+**_<dfn>Optional whitespace</dfn>_** is whitespace that is not required by the syntax,
+but which users might want to include to increase the readability of a _message_.
+**_<dfn>Required whitespace</dfn>_** is whitespace that is required by the syntax.
+
+Tools SHOULD generate `U+200E LEFT-TO-RIGHT MARK` or `U+200F RIGHT-TO-LEFT MARK`
+characters where permitted by the syntax before or following _identifiers_,
+_unquoted literals_, or _option_ values that use right-to-left characters
Member

I don't see why there is the restriction on unquoted literals. Shouldn't it be any literal? That is, anywhere an unquoted literal can appear, a quoted one can. So it seems like either both (aka just 'literal') or neither can appear.

I could match on:

X y ⎨⎨$count⎬⎬

Where X is a RTL character, or on

⎸X⎸ y ⎨⎨$count⎬⎬

In both cases the result is jumbled (disregard the direction of the fake braces, the tool just reorders).

⎬	⎬	y		⎨	⎨	$	c	o	u	n	t		X

⎬	⎬	y		⎨	⎨	$	c	o	u	n	t		⎸	X	⎸

Now, in this case I could put LRMs before the first ⎸ and after the second, because both positions allow whitespace. Are there circumstances around literals where it makes a difference in the insertability of LRM/RLM because of the quoting?
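
One way to see why the display above gets jumbled is to look at the Bidi_Class of each character. The sketch below uses Python's standard `unicodedata` module; `bidi_classes` is a hypothetical helper, and U+05D0 HEBREW LETTER ALEF is an assumed stand-in for the "X" in the example.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Return the Unicode Bidi_Class abbreviation of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


if __name__ == "__main__":
    # Assumption: ALEF plays the role of the right-to-left key "X".
    sample = "\u05D0 y {{$count}}"
    for ch, cls in zip(sample, bidi_classes(sample)):
        print(f"U+{ord(ch):04X} {cls}")
```

The right-to-left letter is strong (`R`), while the spaces, braces and `|` are neutral (`WS`, `ON`) and take their direction from neighbouring strong characters. LRM has class `L` and RLM has class `R`, which is why placing a mark in a whitespace position can pin down the direction of the neutrals next to it.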

Member

Other than that, the changes look good to me. Note that if there are any issues in the WG about this, we refrain from these changes until after the v45 release, just leaving a note that we're looking at the bidi ordering issues...

Member Author

You're right. The key thing I think is to remind tool writers not to quote the mark onto the value.

Member Author

> Note that if there are any issues in the WG about this, we refrain from these changes until after the v45 release, just leaving a note that we're looking at the bidi ordering issues...

The changes are to the syntax and I think important enough to merit doing the change now--the better to stabilize the syntax. It does represent a relaxation of what is allowed in free whitespace. I would like to avoid having a lot of Tech Preview implementations reject bidi-friendlier messages in the fall.

OTOH, it does represent a departure from how we set up the `s` production.

Member
macchiati Feb 20, 2024

Right, for an LTR reading, you want to put LRM around any 'element' that could contain RTL characters: each literal being matched, literals in option values, etc. So in

{{STUFF {$value option=|JUNK| ...} TO READ}}

You want to insert like:

{{<LRM>STUFF {<LRM>$value option=<LRM>|JUNK|<LRM> ...<LRM>} TO READ<LRM>}}

Of course:

  • With LRM/RLM you can't get the RTL message parts to reorder around the {$value}, but the order is predictable and far better than the raw message.
  • In practice you want tooling to do this, not humans.

Member Author

> {{<LRM>STUFF {<LRM>$value option=<LRM>|JUNK|<LRM> ...<LRM>} TO READ<LRM>}}

You definitely do not want the LRMs in the text part of the message, which is where the ones after the {{ and before the }} are. Instead there should be an LRM following the pattern so that any keys in the next variant aren't reordered:

<LRM>KEY1<LRM> KEY2<LRM> {{STUFF {<LRM>$value option=<LRM>|JUNK|<LRM>...<LRM>} TO READ}}
<LRM>key1<LRM> key2<LRM> {{ ... next variant...}}

Agreed that this is a job for tools.
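
As a rough illustration of what such tooling might do for variant keys, the following Python sketch serializes one variant with the marks placed as in the example above: before the first key and after every key, never inside the pattern. `format_variant` and `LRM` are hypothetical names, and the function assumes its inputs are already valid, escaped syntax.

```python
from typing import Sequence

LRM = "\u200E"  # U+200E LEFT-TO-RIGHT MARK


def format_variant(keys: Sequence[str], pattern: str) -> str:
    """Serialize one variant with LRM before the first key and after each key.

    Assumptions: each key is already a serialized literal (quoted or
    unquoted) or "*", and pattern is already-escaped pattern text.
    The marks land only in whitespace positions outside "{{" and "}}".
    """
    if not keys:
        raise ValueError("a variant needs at least one key")
    # LRM + " " between keys fits the wsp production (mark, then one s).
    head = LRM + (LRM + " ").join(keys) + LRM + " "
    return head + "{{" + pattern + "}}"
```

For example, `format_variant(["KEY1", "KEY2"], "STUFF")` places the marks in the same positions as the `<LRM>KEY1<LRM> KEY2<LRM> {{...}}` line quoted above.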

Member

Whoops, yes, immediately outside {{ and }}, not inside.

+so that the _message_ displays intelligibly.
+
+These definitions of _whitespace_ implement
+[UTR#31 Rule 3a-2](https://www.unicode.org/reports/tr31/#R3a-2).
aphillips marked this conversation as resolved.
+It is a profile of R3a-1 in that specification because:
+the following pattern whitespace characters are not allowed:
+`U+000B FORM FEED`,
+`U+000C VERTICAL TABULATION`,
+`U+0085 NEXT LINE`,
+`U+2028 LINE SEPARATOR` and
+`U+2029 PARAGRAPH SEPARATOR`;
+the character `U+3000 IDEOGRAPHIC SPACE`
+_is_ interpreted as whitespace;
+and the following character is not included in ignorable format controls:
aphillips marked this conversation as resolved.
+`U+200F RIGHT-TO-LEFT MARK`.
Member
eggrobin Feb 19, 2024

I had suggested more things; they are now obsolete.

```
the following character is not interpreted as whitespace
(in particular, it is not treated as an ignorable format control):
`U+200F RIGHT-TO-LEFT MARK`;
the ignorable format control U+200E LEFT-TO-RIGHT mark is only allowed
in contexts UAX31-I1 and UAX31-I3,
further restricted to the beginning of a nonempty sequence of horizontal spaces and line terminators.
```

> Technically, we don't have lines

You don’t, but the Unicode Standard does :-)
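
For readers comparing the profile with UAX #31, the difference can be computed directly from code points. This is a small Python sketch, assuming the ABNF as it stands in this diff (both marks permitted in `owsp` and `wsp`); the set and function names are illustrative only.

```python
# Pattern_White_Space as listed in UAX #31.
PATTERN_WHITE_SPACE = {
    0x0009,  # CHARACTER TABULATION
    0x000A,  # LINE FEED
    0x000B,  # LINE TABULATION (vertical tab)
    0x000C,  # FORM FEED
    0x000D,  # CARRIAGE RETURN
    0x0020,  # SPACE
    0x0085,  # NEXT LINE
    0x200E,  # LEFT-TO-RIGHT MARK
    0x200F,  # RIGHT-TO-LEFT MARK
    0x2028,  # LINE SEPARATOR
    0x2029,  # PARAGRAPH SEPARATOR
}

# Assumption: taken from the `s`, `owsp` and `wsp` productions in this diff.
SYNTAX_S = {0x0009, 0x000A, 0x000D, 0x0020, 0x3000}
SYNTAX_MARKS = {0x200E, 0x200F}


def profile_delta():
    """Return (excluded, added) code points relative to Pattern_White_Space."""
    allowed = SYNTAX_S | SYNTAX_MARKS
    excluded = sorted(PATTERN_WHITE_SPACE - allowed)
    added = sorted(allowed - PATTERN_WHITE_SPACE)
    return excluded, added
```

Calling `profile_delta()` gives the five excluded code points listed in the spec text (U+000B, U+000C, U+0085, U+2028, U+2029) and the single addition, U+3000.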


 > [!NOTE]
 > The character U+3000 IDEOGRAPHIC SPACE is included in whitespace for
 > compatibility with certain East Asian keyboards and input methods,
 > in which users might accidentally create these characters in a _message_.

 ```abnf
-s = 1*( SP / HTAB / CR / LF / %x3000 )
+; optional whitespace
+owsp = *( s / %x200E / %x200F )
+
+; required whitespace
+wsp = [ (%x200E / %x200F) ] 1*s [ (%x200E / %x200F) ]
+
+; whitespace characters
+s = ( SP / HTAB / CR / LF / %x3000 )
aphillips marked this conversation as resolved.
 ```

 ## Complete ABNF