[DESIGN] Bidi usability #754

Merged
merged 21 commits into from
Apr 15, 2024
Changes from 20 commits
350 changes: 350 additions & 0 deletions exploration/bidi-usability.md
@@ -0,0 +1,350 @@
# Bidi Usability

Status: **Proposed**

<details>
<summary>Metadata</summary>
<dl>
<dt>Contributors</dt>
<dd>@aphillips</dd>
<dt>First proposed</dt>
<dd>2024-03-27</dd>
<dt>Pull Requests</dt>
<dd>#000</dd>
</dl>
</details>

## Objective

_What is this proposal trying to achieve?_

The MessageFormat 2 syntax uses whitespace as a required delimiter
as well as permitting the use of whitespace to make _messages_ easier to read.
In addition, a _message_ can include bidirectional text in identifiers and literals.

MessageFormat's syntax also uses a variety of "sigils" and markers to form the structure of a _message_.
These sigils are ASCII punctuation characters that have neutral directionality.
This means that the inclusion of right-to-left ("RTL") identifiers or literals in a _message_
can result in the syntax looking "scrambled" or, in extreme cases, appearing to have a different meaning
due to [spillover](https://www.w3.org/TR/i18n-glossary/#dfn-spillover-effects).

To prevent spillover effects and to allow users (particularly RTL language users)
to author _messages_ in a straightforward way, we want to allow the syntax to include appropriate
bidirectional support and to recommend to tool and translation technology implementers
mechanisms to make _messages_ that include RTL characters easy to work with
without introducing spoofing or "Trojan Source" attack vectors.

## Background

_What context is helpful to understand this proposal?_

If you are unfamiliar with bidirectional or right-to-left text, there is a basic introduction
[here](https://www.w3.org/International/articles/inline-bidi-markup/uba-basics).

MessageFormat _message_ strings are created and edited primarily by humans.
The original _message_ is often written by a software developer or user experience designer.
Translators need to work with the target-language versions of each _message_.
Like many templating or domain-specific languages, MF2 uses neutrally-directional symbols
to form portions of the syntax.
When the _message_ contains right-to-left (RTL) characters in translations or
in portions of the syntax,
the plain text of the message and the Unicode Bidirectional Algorithm (UBA, UAX#9)
can interact in ways that make the _message_ unintelligible or difficult to parse visually.

Machines do not have a problem parsing _messages_ that contain RTL characters,
but users need to be able to discern what a _message_ does,
what _variant_ will be selected,
or what a _placeholder_ will evaluate to.

In addition, it is possible to construct messages that use bidi characters to spoof
users into believing that a _message_ does something different than what it actually does.

The current syntax does not permit bidi controls in _name_ tokens,
_unquoted_ literals,
or in the whitespace portions of a _message_.

Permitting the **isolate** controls and the standalone strongly-directional markers
would enable tools, including translation tools, and users who are writing in RTL languages
to format a _message_ so that its plain-text representation and its function
are unambiguous.

The isolate controls are paired invisible control characters inserted around a portion of a string.
The start of an isolate sequence is one of:
- U+2066 LEFT-TO-RIGHT ISOLATE (LRI)
- U+2067 RIGHT-TO-LEFT ISOLATE (RLI)
- U+2068 FIRST-STRONG ISOLATE (FSI)

The end of an isolate sequence is U+2069 POP DIRECTIONAL ISOLATE (PDI).

The characters inside an isolate sequence have the initial string (paragraph) direction
corresponding to the starting control (LTR for LRI, RTL for RLI, auto for FSI).
Comment on lines +79 to +80
Collaborator

Do all editors reset the paragraph direction after a newline? For example, if there's a newline between an LRI and an FSI, how is the paragraph direction of the second line determined?

Member

The normal application of the bidi algorithm requires a reset on each paragraph, wherein a newline breaks paragraphs.

"The algorithm reorders text only within a paragraph; characters in one paragraph have no effect on characters in a different paragraph. Paragraphs are divided by the Paragraph Separator or appropriate Newline Function (for guidelines on the handling of CR, LF, and CRLF, see Section 4.4, Directionality, and Section 5.8, Newline Guidelines of [Unicode]). Paragraphs may also be determined by higher-level protocols: for example, the text in two different cells of a table will be in different paragraphs."

Member Author

@macchiati is correct. That's why it's called "paragraph direction". Note that newlines don't help us that much: they are optional in our syntax (outside literals) and technically normalize to space (or nothing). That is, the newline doesn't help us if we end up writing the message as a single-line.

Collaborator

Ok, so given that we allow for newlines within "code" and, specifically, expressions, I think we need to account for that so that we can keep the direction of the code as left-to-right, even when the first strongly directional character on the line is RTL.

As I understand it, not even an LRI/FSI pair inside the braces is always enough to keep the $ on the left side of its name if it's preceded by a newline:

a = 'אחד'
b = 'שתיים'
s = a + '{\u2066\n$' + b + '\u2069}'

Member Author

That's correct. Getting the sigils to stay on the left side needs a base direction of LTR. An LRM doesn't help in your example either (except to prevent spillover with the following annotation if there were any). My proposal is not 100% bulletproof (and requires some action on the part of tools or users).

A bulletproof design would require more isolates and would probably be limited to using LRI/PDI pairs. It would be difficult to work with, given that there would be a lot of invisible control characters inside subcomponents of an expression, e.g.:

<LRI><LRI>option<PDI>[whitespace]=[whitespace]<LRI>value<PDI><PDI>

The isolate sequence is **isolated** from surrounding text.
This means that the surrounding text treats it as if the sequence were a single neutral character.
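
As a minimal sketch of what an isolate sequence looks like in code, the following Python helper wraps a string in one of the three opening controls and always closes it with PDI. The function name `isolate` and its `direction` parameter are hypothetical and are not part of MF2 or of any library.

```python
# Isolate controls described above.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST-STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in a paired isolate sequence (illustrative helper)."""
    openers = {"ltr": LRI, "rtl": RLI, "auto": FSI}
    # Every opening control is closed by the same character, PDI.
    return openers[direction] + text + PDI
```

Because the sequence is paired, a caller never has to track which opener was used in order to close it.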

> [!NOTE]
> One of the side-effects of using `{`/`}` and `{{`/`}}` to delimit _expressions_
> and _patterns_ is that these paired enclosing punctuations provide a measure of
> isolation in UBA.
> This is an additional reason not to change over to quote marks (which are not enclosing)
> around patterns.

This design also allows for the use of strongly directional marker characters.
These include:
- U+200E LEFT-TO-RIGHT MARK (LRM)
- U+200F RIGHT-TO-LEFT MARK (RLM)
- U+061C ARABIC LETTER MARK (ALM)

These characters are invisible strongly-directional characters used in bidirectional
text to coerce certain directional behavior (usually to mark the end of
a sequence of characters that would otherwise be ambiguous or interact with
neutrals or opposite direction runs in an unhelpful way).
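
The "invisible but strongly directional" nature of these marks can be checked with Python's standard `unicodedata` module. This is only a sketch for inspection; the `MARKS` table and the `describe` helper are names invented for this example.

```python
import unicodedata

# The three marks listed above (labels are for readability only).
MARKS = {"LRM": "\u200e", "RLM": "\u200f", "ALM": "\u061c"}


def describe(ch: str) -> tuple:
    """Return (bidi class, general category) for a single character."""
    return unicodedata.bidirectional(ch), unicodedata.category(ch)


for label, ch in MARKS.items():
    # Each mark is a format character (Cf) with a strong bidi class.
    print(label, describe(ch))
```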

## Use-Cases

_What use-cases do we see? Ideally, quote concrete examples._

1. Presentation of _keys_ can change if the text of the _key's_ _literal_ is not isolated:
```
.match {$م2صر :string}{$num :integer}
م2صر 0 {{The {$م2صر} is actually the first key}}
م2صر * {{This one appears okay}}
```

> [!NOTE]
>
> The first _variant_ in the use case above is actually:
>```
> \u06452\u0635\u0631 0 {{The {$\u06452\u0635\u0631} is actually the first key}}
>```


2. Presentation in an expression can change if portions of the expression
are not isolated or do not restore LTR order:
> In the following example, we use the same string with a number inserted into the middle of
> the string to make the bidi effects visible.
> The numbers correspond to:
> 1. operand
> 2. function
> 3. option name
> 4. option value

```
You have {$م1صر :م2صر م3صر=م4صر} <- no controls
You have {$م1صر‎ :م2صر‎ م3صر‎=م4صر‎} <- LRM after each RTL token
```

3. As a developer or translator, I want to make RTL literals or names appear correctly
in my plain-text editing environment.
I don't want to have to manage a lot of paired controls when I can get the right effect using
strongly directional mark characters (LRM, RLM, ALM).

4. As a translation tool or MF2 implementation, I want to automatically generate
_messages_ that display correctly when they contain RTL text or substrings, with minimal user intervention.
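
The tool-side use case can be sketched roughly as follows: a serializer appends an LRM to any token that contains RTL characters, reproducing the "LRM after each RTL token" line from the second use case. The helper names are hypothetical, and treating bidi classes `R` and `AL` as "RTL" is an assumption of this sketch.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK


def contains_rtl(token: str) -> bool:
    """True if any character has a strong right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in token)


def mark_rtl_tokens(tokens: list) -> list:
    """Append LRM to each token that contains RTL characters."""
    return [t + LRM if contains_rtl(t) else t for t in tokens]


# Operand, function, option name, option value, as in the use case above.
operand, function, opt_name, opt_value = mark_rtl_tokens(
    ["\u06451\u0635\u0631", "\u06452\u0635\u0631",
     "\u06453\u0635\u0631", "\u06454\u0635\u0631"]
)
placeholder = "{$" + operand + " :" + function + " " + opt_name + "=" + opt_value + "}"
```

Tokens that are purely LTR pass through unchanged, so the user does not have to intervene for ordinary messages.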

## Requirements

_What properties does the solution have to manifest to enable the use-cases above?_

To prevent RTL _literals_ from having spillover effects with surrounding syntax,
it should be possible to bidi isolate a _quoted_ or _unquoted_ _literal_.

>```
> .local $title = {|البحرين مصر الكويت!|}
> .local $egypt = {مصر :string}
>```

To prevent _patterns_ from having spillover effects with other parts of a _message_,
particularly with _keys_ in a _variant_,
it should be possible to bidi-isolate a _quoted-pattern_.

>```
> .match {$foo :string}
> isolate {{البحرين مصر الكويت!}}
>```

To prevent _markup_, _placeholders_, or _expressions_ from having spillover effects
Collaborator

Suggested change
To prevent _markup_, _placeholders_, or _expressions_ from having spillover effects
To prevent _placeholders_ (_markup_ or _expressions_) from having spillover effects

Collaborator

The spillover can also occur in declarations and the .match statement. It won't have an effect on the parsing, but the appearance to a user.

with other parts of a _message_,
it should be possible to bidi isolate the contents of a _markup_ or an _expression_.

>```
> You can find it in {$مصر}.
>```

To prevent RTL identifiers from having spillover effects with other parts of an _expression_,
it should be possible to include "local effect" bidi controls following an _identifier_,
Collaborator

Can't we omit "identifier" since an identifier ends with a name?

Member Author

Identifiers end with names, but also contain names in the namespace position. I wanted to be clear that we meant the end of an identifier in this case.

_name_,
_option value_,
or _literal_.
These controls must not become part of the _identifier_, _name_, _option value_, or _literal_;
that is, it must be possible to distinguish these characters from the identifier,
name, option value, or literal in question.

>```
> You can use {$م1صر‎ :م2صر‎ م3صر‎=م4صر‎}
>```
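
One way to read this requirement is that a consumer can always separate a trailing mark from the token it follows. The sketch below assumes a token has already been cut out of the message text; `split_trailing_mark` is a hypothetical helper, not an MF2 API.

```python
# Strongly directional marks permitted after a token in this design.
BIDI_MARKS = ("\u200e", "\u200f", "\u061c")  # LRM, RLM, ALM


def split_trailing_mark(token: str) -> tuple:
    """Return (value, mark); the mark is never part of the value."""
    if token.endswith(BIDI_MARKS):
        return token[:-1], token[-1]
    return token, ""
```

With this split, the name used for variable lookup is identical whether or not the author added a mark.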

To prevent RTL _namespace_ names from having spillover effects with _function_ names,
it should be possible to include "local effect" strongly directional marks in an _identifier_:
> In this example, the _namespace_ is `:م2` and the _name_ is `:ن⁩3`, but the sequence is displayed
> with a spillover effect.
> (Note that the number in each name _trails_ the Arabic letter: it appears to the left because the
> string is RTL!).
>```
> {$a1 :b2:c3}
> {$م1 :م2:ن3} spillover effects
> {⁦$م1‎ :م2‎:ن3‎⁩} with isolates and LRMs
>```

Newlines inside of messages should not harm later syntax.

```
* * {{\u0645<br>\u0646}} 123 456 {{ No LRM==bad }}
* * {{م
ن}} 123 456 {{ No LRM==bad }}

* * {{\u0645<br>\u0646}}\u200e 123 456 {{ LRM }}
* * {{م
ن}}‎ 123 456 {{ LRM }}
```

## Constraints

_What prior decisions and existing conditions limit the possible design?_

Users cannot be expected to create or manage bidirectional controls or
marks in _messages_, since the characters are invisible and can be difficult
to manage.
Tools (such as resource editors or translation editors)
and other implementations of MessageFormat 2 serialization are strongly
encouraged to provide paired isolates around any right-to-left
syntax as described in this design so that _messages_ display appropriately as plain text.
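
Since the controls are invisible, a tool might offer a way to reveal them. The following is a small sketch of such a debugging aid. The name `reveal_controls` is hypothetical, and the `\uXXXX` output form assumes the controls of interest are in the Basic Multilingual Plane, which holds for the isolates and marks discussed here.

```python
import unicodedata


def reveal_controls(message: str) -> str:
    """Replace invisible format characters (category Cf) with \\uXXXX escapes."""
    out = []
    for ch in message:
        if unicodedata.category(ch) == "Cf":
            # Isolates (U+2066..U+2069) and LRM/RLM/ALM all fall in Cf.
            out.append(f"\\u{ord(ch):04x}")
        else:
            out.append(ch)
    return "".join(out)
```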

Ideally we do not want RLM/LRM/ALM to be part of the parsed
`name`, `variable`, `reserved-keyword`, `unquoted`, or any other term
defined in terms of `name`.
This is complicated to do in ABNF because each of these tokens is followed either by
whitespace or by some closing marker such as `}`.
The workaround in #763 was to permit these characters _before_ or _after_ whitespace
using the various whitespace productions.
This works at the cost of allowing spurious markers.

## Proposed Design

_Describe the proposed solution. Consider syntax, formatting, errors, registry, tooling, interchange._

To start with, we should establish that _message_ editing should always use a left-to-right
base direction.
Further, each _line_ of a message should be displayed for editing with a base paragraph direction of LTR.
This is because the syntax of a _message_ depends on LTR word tokens,
as well as token ordering (as in a placeholder or with variant keys).
This is not the disadvantage to RTL languages that it might first appear:
- Bidi inside of patterns works normally;
only placeholders/markup have special usage of bidi controls and this usage is isolated
so that placeholders and markup are treated as neutrals.

Permit isolating bidi controls to be used on the **outside** of the following:
- unquoted literals
- quoted literals
- quoted patterns
Comment on lines +249 to +252
Collaborator

Should we also allow for an LRI/FSI pair immediately inside expressions and markup, or is there a reason not to do so?

Member Author

We could do that also. It doesn't solve the problem of expression/markup internal bidi, though.

Collaborator
@eemeli, Mar 27, 2024

I'm mostly here thinking of content like:

a = 'אחד'
b = 'שתיים'
s = a + '{$' + b + '}'

where we have an RTL variable name inside a placeholder in an RTL pattern.

How, except with an LRI/FSI pair inside the braces, can we get that to render so that the $ is to the left of the name?

Member

See #discussion_r1542105763
For those implementations, RLM/LRM are the best one can do.


This would change the ABNF as follows:
(Notice that this change includes a production `bidi` described further down
in this document)
```abnf
literal = ( open-isolate (quoted / (unquoted [bidi])) close-isolate)
/ (quoted / (unquoted [bidi]))
quoted-pattern = ( open-isolate "{{" pattern "}}" close-isolate)
/ ("{{" pattern "}}")

open-isolate = %x2066-2068
close-isolate = %x2069
```
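
The ABNF above permits either a complete isolate pair around the token or no isolate at all. A rough sketch of how a consumer might enforce that pairing on an already-extracted token follows. `unwrap_isolate` is a hypothetical helper, and raising `ValueError` for an unpaired control is an assumption of this example, not specified behavior.

```python
OPEN_ISOLATES = ("\u2066", "\u2067", "\u2068")  # LRI, RLI, FSI
CLOSE_ISOLATE = "\u2069"                        # PDI


def unwrap_isolate(token: str) -> str:
    """Strip a matching outer isolate pair; reject a lone open or close."""
    opened = token.startswith(OPEN_ISOLATES)
    closed = token.endswith(CLOSE_ISOLATE)
    if opened and closed:
        return token[1:-1]
    if opened or closed:
        raise ValueError("unpaired isolate control")
    return token
```

Anything inside the delimiters (`{{`...`}}` or `|`...`|`) is left untouched, since controls there are part of the textual content.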

> [!IMPORTANT]
> The isolating controls go on the **_outside_** of the various _literal_ and _pattern_
> productions because characters on the **_inside_** of these are part of the _literal_'s
> or _pattern_'s textual content.
> We need to allow users to include bidi controls in the output of MF2.

Collaborator

Is this worth adding to the "constraints" section? Constraint: we must allow bidi controls as either literal (interpreted by the MF2 parser) or escaped (treated as regular text). (Their position introduces an implicit escape.)

Member Author

Not sure I follow? The point of the note is to show that bidi controls are just normal text inside literal contexts (the body of a pattern or inside of quoted literals)


Permit isolating bidi controls to be used **immediately inside** the following:
- expressions
- markup

This would change the ABNF as follows (assuming the above changes are also incorporated):
```abnf
expression = "{" open-isolate (literal-expression / variable-expression / annotation-expression) close-isolate "}"
/ "{" (literal-expression / variable-expression / annotation-expression) "}"
literal-expression = [s] literal [s annotation] *(s attribute) [s]
variable-expression = [s] variable [s annotation] *(s attribute) [s]
annotation-expression = [s] annotation *(s attribute) [s]
markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone
/ "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close
/ "{" open-isolate [s] "#" identifier *(s option) *(s attribute) [s] ["/"] close-isolate "}" ; open and standalone
/ "{" open-isolate [s] "/" identifier *(s option) *(s attribute) [s] close-isolate "}" ; close
```
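
A serializer could apply this production by placing the isolate pair immediately inside the braces whenever the expression body contains RTL characters. The sketch below makes two assumptions that the design does not mandate: it always uses LRI as the opener, and it decides based on bidi classes `R` and `AL`. As the review discussion notes, an inner isolate alone may not be enough when a newline precedes the sigil.

```python
import unicodedata

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def serialize_expression(body: str) -> str:
    """Emit "{...}", isolating the body only when it contains RTL text."""
    has_rtl = any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in body)
    if has_rtl:
        # Isolate goes immediately inside the braces, per the ABNF above.
        return "{" + LRI + body + PDI + "}"
    return "{" + body + "}"
```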

Permit the use of LRM, RLM, or ALM strongly directional marks immediately following any of the items that
**end** with the `name` production in the ABNF.
Comment on lines +300 to +301
Collaborator

I still think this is too messy, and doesn't solve the problem as well as isolation, but I'm ok with considering that separately.

My preferred overall solution would be to:

  1. Optionally LR/RL/FS -isolate quoted-pattern, quoted, and name;
  2. Optionally LR-isolate expression; and
  3. Allow for a single LRM after a newline in code, or maybe at the end of whitespace containing a newline.

Put together, that should allow for rendering all code as LTR, and all possibly-RTL content as RTL.

This includes _identifiers_ found in the names of
_functions_
and _options_,
plus the names of _variables_,
as well as the contents of _unquoted_ literals.

> [!NOTE]
> Notice that _unquoted_ literals can also be surrounded by bidi isolates
> using the previous syntax modification just above.

> [!NOTE]
> Notice that `reserved-annotation` is not in the ABNF changes because it already
> permits the marks in question.
> Any syntax derived from `reserved-annotation`
> (i.e. when unreserving a new statement in a future addition)
> would need to handle bidi explicitly using the model already established here.

```abnf
variable-expression = "{" [s] variable [bidi] [s annotation] *(s attribute) [s] "}"
function = ":" identifier [bidi] *(s option)
option = identifier [bidi] [s] "=" [s] (literal / (variable [bidi]))
attribute = "@" identifier [bidi] [[s] "=" [s] (literal / (variable [bidi]))]
markup = "{" [s] "#" identifier [bidi] *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone
/ "{" [s] "/" identifier [bidi] *(s option) *(s attribute) [s] "}" ; close
identifier = [(namespace [bidi] ":")] name
bidi = [ %x200E-200F / %x061C ]
```
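
To illustrate how the optional `bidi` production keeps marks out of the parsed names, here is a rough regular-expression sketch of the `function` and `identifier` rules. It is not a conforming parser: `\w+` is a stand-in for the real `name` production (which has different start-character rules and also allows `-` and `.`), options are ignored, and `parse_function` is a hypothetical helper.

```python
import re

BIDI = "[\u200e\u200f\u061c]?"  # optional LRM, RLM, or ALM
NAME = r"\w+"                   # assumption: simplified stand-in for `name`

# function   = ":" identifier [bidi]
# identifier = [(namespace [bidi] ":")] name
FUNCTION = re.compile(rf":(?:({NAME}){BIDI}:)?({NAME}){BIDI}")


def parse_function(text: str):
    """Return (namespace, name) with bidi marks excluded, or None."""
    match = FUNCTION.fullmatch(text)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

The marks sit outside the capture groups, so `:م2:ن3` resolves to the same namespace and name with or without an LRM after each part.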

## Alternatives Considered

_What other solutions are available?_
_How do they compare against the requirements?_
_What other properties do they have?_

### Nothing
We could do nothing.

A likely outcome of doing nothing is that RTL users would insert bidi controls into
_messages_ in an attempt to make the _pattern_ and/or _placeholders_ display correctly.
These controls would become part of the output of the _message_,
showing up inappropriately at runtime.
Because these characters are invisible, users might be very frustrated trying to manage
the results or debug what is wrong with their messages.

By contrast, if users insert too many or the wrong controls using the recommended design,
the _message_ would still be functional and would emit no undesired characters.

### Deeper Syntax Changes
We could alter the syntax to make it more "bidi robust",
such as by using strongly directional characters instead of neutrals.

### Forbid RTL characters in `name` and/or `unquoted`
We could alter the syntax to forbid using RTL characters in names and unquoted literals.
This would make the syntax consist solely of LTR and neutral characters.
One flavor of this would be to restrict tokens to US ASCII.

Cons:
- This would break compatibility with NCName/QName; we would be back to
defining our own idiosyncratic namespace
- Unicode could define more RTL characters in the future, making the syntax
brittle
- This is not friendly to non-English/non-Latin users and represents a usability
restriction in environments in which names can be non-ASCII values