Update design doc for message pattern quoting #503

Merged
merged 19 commits into from
Oct 30, 2023

Conversation

Collaborator

@echeran echeran commented Oct 27, 2023

An update to the design doc, specifically on the topic of: "Do we allow unquoted variant patterns?"

Very much a WIP while in draft mode (not suitable for "drive-by reviews" until officially ready for review).

@mihnita please take an initial look and provide suggestions.

@echeran echeran requested a review from mihnita October 27, 2023 01:59
Member

@aphillips aphillips left a comment

Looking good. Thank you both for working on this (especially during "maximum crunch time"). I've made a number of suggestions below. Please have a look. They are mostly editorial in nature.

Comment on lines 123 to 124
Rarely do messages that need to include leading or trailing whitespace do so due to
how they will be concatenated with other text,
Member

I think this change made the text far less clear.

To be honest, I would have guessed that you would have replaced the somewhat biased word "Rarely" with something more neutral such as "Some messages need..."

Collaborator

  • After the resource file gets parsed as XML, the Android string resource format requiring
  • After the resource file gets parsed as XML, the Android resource compiler requires

Collaborator

somewhat biased word "Rarely" with something more neutral such as "Some messages need..."

In all fairness, 0.3% is indeed rare.
True, that number comes from an HTML oriented corpus, but I don't have access to much code using Windows / MacOS native formats.

Comment on lines 135 to 136
Also importantly, we cannot make assumptions about the validity of leading or trailing whitespace in a message,
especially since their usage may be entirely unrelated to internationalization issues (ex: sentence agreement disruption by concatenation).
Member

This doesn't seem clear to me?

I think what you might be trying to say is:

Suggested change
Also importantly, we cannot make assumptions about the validity of leading or trailing whitespace in a message,
especially since their usage may be entirely unrelated to internationalization issues (ex: sentence agreement disruption by concatenation).
Also, importantly, whether the intentional inclusion of whitespace by a
message author might be considered "desirable" or might be interpreted
as "an internationalization bug",
we need to provide the ability of an author to control the content of a given pattern without ambiguity.

Collaborator

I would change this whole point a bit. Something along these lines:


All common OSes (Windows, MacOS, Linux, iOS, Android) have "plain text" widgets and "rich formatting widgets" (usually Web).
The "Web widgets" usually drag a whole HTML engine along with them, which is slow and memory-consuming.
So the most commonly used widgets are plain text.
And when that is all you have, spaces and newlines are used to create "fake" formatting.
Things like paragraphs, indents, lists (bulleted or numeric).

Some examples (pick and choose):

Even in HTML there are sometimes reasons to force the space preserving.

TLDR: trailing spaces are not necessarily an i18n bug, so it is not the job of MF2 to discourage them, or to get in the way.


Messages themselves are "simple strings" and must be considered to be a single
line of text. In many containing formats, newlines will be represented as the local
equivalent of `\n`.
Messages themselves are "simple strings" and must be considered to be WYSIWYG.
Member

This is incorrect and the change greatly reduces the impact of the document IMO. The thing that is WYSIWYG is the pattern. In the case of simple messages, this is the whole message. But in the case of complex messages, what you see ({#input $foo :number minimumFractionDigits=11}) is not exactly what you get 😉

The most important part of the original statement here is removed, which reminds readers that this message:

myMessage = {{
   {#input $var :number}
   {{You have {$var} message(s)}}
}}

Is actually this message in many storage formats:

myMessage = {{\n   {#input $var :number}\n   {{You have {$var} message(s)}}\n}}
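As a rough illustration of that point, here is a minimal Python sketch. It assumes JSON as the containing format, which is only a choice made for this example; the comment above does not name one. It shows the multi-line message being stored as a single physical line with `\n` escapes and coming back unchanged when loaded.

```python
import json

# The multi-line message from the comment above, as an in-memory string.
message = (
    "{{\n"
    "   {#input $var :number}\n"
    "   {{You have {$var} message(s)}}\n"
    "}}"
)

# JSON stands in for "a storage format" here; that choice is an assumption.
stored = json.dumps({"myMessage": message})

# The stored form is one physical line: each newline is written as the
# two characters backslash + "n".
print(stored)

# Loading it back restores the real newline characters.
restored = json.loads(stored)["myMessage"]
```

The message processor never sees the escapes; it receives the restored string, newlines and indentation included.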

Collaborator

True. But the storage format might remove spaces / newlines, and not only from the beginning

For example I can do this in properties file:

myMessage = {{\
    input {$var :number}\
    {{You have {$var} message(s)}}\
}}

What MF2 sees (once loaded from the file) is a single line, with leading spaces trimmed (from each line) and newlines removed:

{{input {$var :number}{{You have {$var} message(s)}}}}
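The folding described above can be approximated in a few lines of Python. This is only a sketch of the line-continuation rule of Java `.properties` files (a trailing backslash joins the next line, whose leading whitespace is dropped); the helper name `fold_continuations` is made up, and comments, Unicode escapes and other details of the real format are ignored.

```python
def fold_continuations(raw: str) -> str:
    """Join backslash-continued physical lines into one logical line."""
    parts = []
    continuing = False
    for line in raw.split("\n"):
        if continuing:
            # Leading whitespace of a continuation line is discarded.
            line = line.lstrip(" \t\f")
        # An odd number of trailing backslashes marks a continuation.
        trailing = len(line) - len(line.rstrip("\\"))
        continuing = trailing % 2 == 1
        if continuing:
            line = line[:-1]  # drop the continuation backslash
        parts.append(line)
        if not continuing:
            break  # this sketch handles a single logical line only
    return "".join(parts)


# Raw string, so each backslash + newline pair stays in the text as written.
source = r"""myMessage = {{\
    input {$var :number}\
    {{You have {$var} message(s)}}\
}}"""

key, value = fold_continuations(source).split("=", 1)
key, value = key.strip(), value.strip()
print(value)
```

Under these assumptions `value` is the single-line string shown above, so any indentation that was meant to be part of a pattern is gone before MF2 ever sees the message.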

Collaborator Author

It sounds like there are 2 points here, and that we can and want them both:

  1. patterns are WYSIWYG and have no restrictions on newline or most other characters within
  2. messages are treated as just a string ("simple strings") in the containing format

Our text for point 1 needs to be corrected to say "pattern" in cases where it incorrectly said "message".

We ended up writing what we wrote because of the incorrect detail in the original text, which said that messages "must be considered to be a single line of text". That phrase should not be preserved.

Messages themselves are "simple strings" and must be considered to be WYSIWYG.
The WYSIWYG nature of representing a message pattern is independent of whether the message is a single line or contains multiple lines.

There is no restriction that a message must only contain a single line (that is, not contain any newline characters),
Member

I would s/message/pattern/g this section (carefully because my suggestion is not always true). Only talk about the message when you intend to include the code.

when there is 1+ declarations in a `match` (selection) message,
or when there are 2+ declarations in a non-`match` complex message.

Cons:
Member

Consider adding that the message closing pattern characters add no value.

Collaborator

About closing brackets, pros and cons:


  • Closing brackets are ingrained in developers.
    {{ something something feels broken because of the missing }}

  • The closing brackets might assist in some storage formats (maybe to be designed), especially some that might be minimized by tools.
    Example:
msg1 = {{ some complex message }}
when = Type your name here.

Minimized: msg1={{ some complex message }}when=Type your name here.


  • One can use the space after closing to add comments, metadata for linters or other tools.
{{
... when ...
}} lint_rules: { maxlen:"80 chars" } ref : { screenshot: "https://example.com/foo.jpg", glossaryId: 1234 }

Not strong arguments. But there is some value.
Even if all it does is prevent the "what the heck is this" reaction.

Collaborator

Addition to the pros (a new bullet, or a change to an existing one): visibility (?)


"No trimming / Always delimit" also makes it clear what spaces are rendered.
Example:

{#when one}   This is a message (when one) condition

vs

{{
when one {   This is a message (when one) condition }
}}

In the second case it is clear what is rendered: leading spaces, no matter what kind.

In the first case it is not clear.
"We trim the ASCII spaces" rule does not help, since the spaces might be non-breaking spaces, or em-space, en-space, ideographic space, and all the other characters that look like space on screen, but are not ASCII space.

So visually I don't know where the message starts / stops.
Even in edit mode (when I translate) I don't know.
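A small Python sketch of why an "ASCII-only trim" rule is hard to reason about visually. The `trim_ascii` helper and the exact set of trimmed characters are assumptions made for illustration, not a rule taken from any proposal.

```python
import unicodedata

# Hypothetical trimming rule: only these ASCII characters are removed.
ASCII_WS = " \t\r\n"


def trim_ascii(pattern: str) -> str:
    """Strip leading/trailing ASCII whitespace, leaving everything else."""
    return pattern.strip(ASCII_WS)


samples = [
    "   Hello",       # ASCII spaces
    "\u00a0Hello",    # no-break space
    "\u2003Hello",    # em space
    "\u3000Hello",    # ideographic space
]

for text in samples:
    trimmed = trim_ascii(text)
    first = trimmed[0]
    # Show what actually survives at the start of the pattern.
    print(repr(trimmed), "starts with", unicodedata.name(first))
```

All four samples look alike on screen, but only the first one is changed by the trim; the other three keep a leading character that merely looks like a space.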

Collaborator

A bit worse: as a translator you often get strings with no context.
There is no way to know what kind of API the string will go through.

So if I get this string "Bill Gates did something" and I have to translate, and I want to put the "honorific space" in front of the name, I don't know how to do it.
Is the message going through MF2 with trimming? Then I have to wrap it in { Bill ....}
But maybe it is not going through MF2.
Then if I wrap it the { ... } will render on screen "as is", not what you want.

We've seen this:
"The bee" => "L'abeille" : if the message goes through MF1, the apostrophe needs to be escaped
"I'm 1 in 100" => "Eu sunt 1%" : if the string goes through a printf-like API then I need to escape the %

And as a translator I have no way to know what API the dev uses, and I'm not familiar with the escape rules for myriads of APIs.
I am in fact faced with more APIs than a developer.
A dev might do "java + html + js".
A translator often works on many projects, for many companies, so they are exposed to strings consumed by native Windows apps, PHP, some Ruby stuff, C#, SQL, and others. And they may even switch between them several times per day.

These days you don't pay the bills as a translator if you only handle one single format for one product from one company.
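The printf case is easy to reproduce. The sketch below uses Python's `%` operator as a stand-in for "a printf-like API" (an assumption; the comment does not say which API was involved) and runs the translated string through it with and without escaping.

```python
# Translation of "I'm 1 in 100", as a translator would naturally write it.
naive = "Eu sunt 1%"
# The same translation with the percent sign doubled for a printf-like API.
escaped = "Eu sunt 1%%"

# Unescaped: the formatter sees a dangling "%" and rejects the string.
try:
    naive % ()
    naive_ok = True
except ValueError as err:
    naive_ok = False
    print("unescaped translation failed:", err)

# Escaped: "%%" is rendered as a single literal percent sign.
rendered = escaped % ()
print(rendered)
```

Nothing in the source string "I'm 1 in 100" hints that the translation needs escaping, which is exactly the problem when the consuming API is unknown.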

Collaborator Author

+1 to @mihnita 's point about the possibility "to add comments, metadata for linters or other tools."

We can discuss as a workgroup whether we leave the space after the closing delimiter as a free-for-all, or reserve it for ourselves as a standard for future extensions.

Note: we are losing the possibility to do the same for simple messages because we are moving away from our current syntax.

Collaborator

@stasm stasm left a comment

I know you said this wasn't ready for reviews yet, so please consider this my early feedback on an early draft, which I'm sharing now because the day here is soon over :)


—Rico Mariani, MS Research MindSwap Oct 2003. (<a href="https://learn.microsoft.com/en-us/archive/blogs/brada/the-pit-of-success">restated by Brad Adams</a>, MS CLR and .Net team cofounder)
</blockquote>
</details>

Developers and translators should be able to read and write the syntax easily in a text editor.

Translators (and their tools) are not software engineers, so we want our syntax
Collaborator

I think it would be good to prioritize the requirements. For example, while I agree that we should, in general terms, make the syntax simple and robust, I would also suggest that the primary consumers of the message syntax are developers. Translators will oftentimes work with just the pattern syntax, through CAT tools.

Collaborator Author

I agree. See the "Evaluation" portion of the "Proposed Design" section below.

However, prioritizing requirements to me is tantamount to defining our value system to evaluate. We don't yet have alignment as a group on what our requirements are, let alone our prioritization of those requirements (value system). This is another reason why I wanted to keep the values (prioritization of requirements) only alongside the area explaining how we chose to propose an option via evaluation.

Before we can do anything further towards implementing your suggestion, you/we have to get the group to be self-aware and precise on their values, and then maybe after that, alignment. :-)

Within a complex message, patterns are always quoted with `{{...}}` or other choice of delimiter.

The entire complex message is also wrapped with `{{...}}` or other choice of delimiter.
This allows interior "code mode" of message to have flexible whitespace in between tokens
Collaborator

What do you think should happen to whitespace outside the entire code block? Should we specify it here as well?

Collaborator

@mihnita mihnita Oct 27, 2023

We think it should be specified, if not already.
(It will probably be covered anyway when we get to update the EBNF.)

My choice would be to say:

  1. nothing before {#. If there is something, then we are in simple text mode, and {# will be an error

  2. after closing #} we have several options:

  • no closing, we drop it as a requirement
  • closing is optional. If as a developer you are bothered to see unclosed brackets, feel free to close it. Does nothing
  • closing is mandatory, and the message ends there
    • nothing allowed after it => unnecessarily rigid?
    • allowed only spaces / newlines after it
    • reserve it for us. We can extend the standard later to add comments, lint directives, links to images, etc
    • allow for developers to do what they want, "free for all". The message ended, we ignore the rest.
      They can add comments, lint directives, links to images, whatever

On the WG to discuss and decide.
I like the idea to reserve it for us.
It is a non-breaking change.

Collaborator Author

I would like to initially say anything outside the complex message-wide delimiters is an error (invalid). In the future, if we want to relax requirements and say annotations & message description notes are allowed in message syntax, you have the freedom to do so (after the closing delimiter).

In general with API design, you can always relax requirements and narrow outputs, but the reverse is not possible (causes breaking changes).

Member

If there is anything "outside" the complex message-wide delimiters, then:

  1. It's whitespace that we trim
    a. Or produces an error (worth discussion)
  2. or the message is actually a simple message that produces a lot of errors
This {{ match {$var} when * {{{$var}}}}} is an interesting message.

This evaluates, I think, as:

This {�} is an interesting message.

Or possibly (with $var==123) as:

This {�}{�} match 123 when * 123{�}{�} is an interesting message.

(both emit a syntax error)

* The rule about whether leading and trailing whitespace is included is simple and unambiguous.
* This matches the WYSIWYG behavior that simple messages preserve.
* The patterns can be detected within the pattern more easily due to the delimiters serving as a visual anchor.
* Requiring all patterns to be quoted minimizes the number of characters that need to be escaped within a pattern to 3:
Collaborator

I think this also holds for at least some of the other proposals.

In fact, if we always quote variant patterns, we must make the closing delimiter special. Theoretically, in other proposals, we could only special-case the opening delimiter, because both code and placeholders would be wrapped in the same delimiter. (Although that would require agreeing to a different way of preserving whitespace than the currently agreed {{ ... }}.)

Collaborator

In fact, if we always quote variant patterns, we must make the closing delimiter special.

But we have to do it for all the other proposals too.
Because all allow for optional wrapping.

Collaborator Author

I think this also holds for at least some of the other proposals.

You're right, it's true for 1a and 2a. Not for 3a. Since it's not a universal aspect of the proposals, it's not a wash (redundant), and thus a point worth mentioning.

I'm open to suggestions for rephrasing "Requiring all patterns" to whatever it is that results in the minimal possible number of characters needing escaping.

while complex messages use the aforementioned delimiter to quote patterns (ex: `{{...}}`).
* Another potential drawback, specifically in the case of non-`match` complex messages with exactly 1 declaration,
is that this option adds 2 extra delimiters compared to an alternative syntax that doesn't require quoted patterns
and is designed to minimize delimiter usage only to code mode introducers.
Collaborator

We should add some of the other previously discussed drawbacks:

  • If the code block is delimited from both sides, users may be tempted to insert text around it.
  • If the code block is delimited from both sides, it may be easy to forget the }} closing the entire block.
  • If we use curlies for patterns and for placeholders, then they serve double duty, which may make the syntax harder to understand, and also harder to make the pattern out visually.

Collaborator

adding text after closing is not necessarily a drawback (I have a comment somewhere)

If the code block is delimited from both sides, it may be easy to forget the }} closing the entire block.

Fair enough.
Maybe make it optional?

I think that {{ is to make this intentionally ugly. I would probably go with {# to enter code mode, and { to wrap the patterns.
And yes, it is double duty. But choosing anything else means that we escape one extra thing.
If we use <<< and I want (for some reason) <<< Hello world as simple text, then we need to escape it.

So we add another escaping rule.
Pros and cons :-)

Collaborator Author

  • If the code block is delimited from both sides, users may be tempted to insert text around it.

Not a moving argument to me. As you pointed out in a different comment, the audience of message authoring is developers. If we say it's not possible to add extraneous text, and they do so and the MF2 implementation rejects the message, they'll figure it out quickly.

  • If the code block is delimited from both sides, it may be easy to forget the }} closing the entire block.

Forgetting to type syntax can happen in any alternative option, so this feels like a weaker argument than the previous. Also, even though developers learn instincts early on to always balance delimiters, we create linters & other tooling to double-check.

(Some languages that use delimiters in a simple and regular way have the ability to evolve powerful tools that always keep everything balanced while being easy to use -> easy to do the right thing, impossible to do the wrong thing. But I digress...)

  • If we use curlies for patterns and for placeholders, then they serve double duty, which may make the syntax harder to understand, and also harder to make the pattern out visually.

Sure. Option 3a solves for that with the tradeoff of taking on other costs as a result. The next question then becomes how do we prioritize our requirements in order to create a value system that we use to evaluate the tradeoffs (choose)?

Member

If the code block is delimited from both sides, it may be easy to forget the }} closing the entire block.
Forgetting to type syntax can happen in any alternative option, so this feels like a weaker argument than the previous.

I disagree a bit. Any enclosing syntax is prone to errors, but the argument here is that the more levels of nesting (and the further apart the enclosure bits), the more opportunities for error exist because the user is keeping track of more things.

Note that one of the proposals for "2a" was to have just a starter sigil. This would eliminate the enclosure for the message.


Cons:

* This comes at the cost of an inconsistency in the WYSIWYG patterns are quoted between simple and complex messages.
Collaborator

We should discuss here the risk of "two syntaxes in a trench coat." For someone who's only ever seen simple messages, the only syntax rule they can infer is that {} is used for placeholders. A whole separate complex-message mode cannot be easily "guessed".

(But see also my comment about considering developers the primary audience for the complex message syntax, so perhaps this is acceptable.)

Collaborator

@mihnita mihnita Oct 27, 2023

"two syntaxes in a trench coat."

The expression is designed to sound ugly :-)

But there are already templating systems doing this, and that didn't prevent adoption or trigger many complaints.
Heck, when I write HTML with stylesheets and code I have 3 syntaxes in a trench coat :-)

Or if you write C/C++/C# or something else, you have one syntax in the main code, another syntax in strings, and another in printf strings, etc.

int foo = 10%3; // syntax one, math, result is 1
puts("10%3"); // syntax 2, in string, output, result is "10%3"
printf("10%3"); // syntax 3, in string, error, need to double the `%%` to get "10%3" in output

our value system places to the requirements met by the pro aspects compared to the con aspects. Namely:

* [high] Unsurprising WYSIWYG behavior from patterns
* [high] Easy recognition of patterns, even for non-developers
Collaborator

To be fair, I've never found ICU MF patterns easy to spot—specifically because they use {}.

Collaborator

Fair point.

But I've never seen any questions asked for single selections.
(on StackOverflow or on internal sites, here and at previous companies)

It gets ugly, even for developers, when you have multiple selections (plural-in-plural, select in plural in ???)

...
<?php
if (true) {
echo '<p>Hello World</p>';
Collaborator

Just a nit, but I think both PHP and Freemarker would actually prefer the second method:

<?php if (true): ?>
    <p>Hello, world!</p>
<?php endif ?>

Collaborator

@mihnita mihnita Oct 27, 2023

I didn't even know that is possible.

If I check https://www.php.net/manual/en/control-structures.if.php and https://www.w3schools.com/php/php_if_else.asp the option is not even mentioned.

It is mentioned in the "User Contributed Notes" of the manual.

But I kind of doubt that something that is not even mentioned in the official manual is the preferred way.

If anything this makes the point that having more than one way to do things, some more recommended than others, is not a good thing.

("Simple messages" refers to messages consisting solely of a pattern, and thus are not complex messages.)

Because the simple message pattern consists of the entire message,
the pattern includes any leading or trailing whitespace.
Collaborator

and newlines

Collaborator Author

Already included.

s = 1*( SP / HTAB / CR / LF )
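For readers less used to ABNF, the rule above can be read as the regular expression below. This is just a Python sketch; the name `S` is made up for the example.

```python
import re

# s = 1*( SP / HTAB / CR / LF )
# One or more of: space, horizontal tab, carriage return, line feed.
S = re.compile(r"[ \t\r\n]+")

# Newlines are part of the rule, so they count as whitespace here.
print(bool(S.fullmatch(" \n\t")))

# The rule is limited to these four ASCII characters; a no-break space
# does not match.
print(bool(S.fullmatch("\u00a0")))
```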

when there is 1+ declarations in a `match` (selection) message,
or when there are 2+ declarations in a non-`match` complex message.

Cons:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About closing brackets, pros and cons:


  • Closing brackets are ingrained in developers.
    {{ something something feels broken because of the missing }}

  • The closing brackets might assist in some storage formats (maybe to be designed), especially some that might be minimized by tools.
    Example:
msg1 = {{ some complex message }}
when = Type your name here.

Minimized: msg1={{ some complex message }}when=Type your name here.


  • One can use the space after closing to add comments, metadata for linters or other tools.
{{
... when ...
}} lint_rules: { maxlen:"80 chars" } ref : { screenshot: "https://example.com/foo.jpg", glossaryId: 1234 }

Not strong arguments. But there is some value.
Even if all it does is prevent the "what the heck is this" reaction.

when there is 1+ declarations in a `match` (selection) message,
or when there are 2+ declarations in a non-`match` complex message.

Cons:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addition to the pros (new bullet of changing an existing one): visibility (?)


"No trimming / Always delimit" also makes it clear what spaces are rendered.
Example:

{#when one}   This is a message (when one) condition

vs

{{
when one {   This is a message (when one) condition }
}}

In the second case it is clear what is rendered: leading spaces, no matter what kind.

In the first case it is not clear.
"We trim the ASCII spaces" rule does not help, since the spaces might be non-breaking spaces, or em-space, en-space, ideographic space, and all the other characters that look like space on screen, but are not ASCII space.

So visually I don't know where the message starts / stops.
Even in edit mode (when I translate) I don't know.

when there is 1+ declarations in a `match` (selection) message,
or when there are 2+ declarations in a non-`match` complex message.

Cons:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit worse: as a translator you often get strings with no context.
There is no way to know what kind of API the string will go through.

So if I get this string "Bill Gates did something" and I have to translate, and I want to put the "honorific space" in front of the name, I don't know how to do it.
Is the message going through MF2 with trimming? Then I have to wrap it in { Bill ....}
But maybe it is not going through MF2.
Then if I wrap it the { ... } will render on screen "as is", not what you want.

We've seen this:
"The bee" => "L'abeille" : if the message goes through MF1, the apostrophe needs to be escaped
"I'm 1 in 100" => "Eu sunt 1%" : if the string goes through a printf-like API then I need to escape the %

And as a translator I have no way to know what API the dev uses, and I'm not familiar with the escape rules for myriads of APIs.
I am in fact faced with more APIs than a developer.
A dev might do "java + html + js".
A translator often works on many projects, for many companies, so they are exposed to strings consumed by native Windows apps, PHP, some Ruby stuff, C#, SQL, and others, and may even switch between them several times per day.

These days you don't pay the bills as a translator if you only handle one single format for one product from one company.
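
The printf case mentioned above can be reproduced with Python's `%` string formatting, used here only as a stand-in for "a printf-like API". The wrapper name `render_printf` is hypothetical.

```python
def render_printf(template: str, *args) -> str:
    """Format a translated string with printf-style `%` formatting."""
    return template % args


# The translator escaped the percent sign: renders as intended.
print(render_printf("Eu sunt 1%%"))

# The translator did not know the string goes through a printf-like API.
try:
    print(render_printf("Eu sunt 1%"))
except ValueError as err:
    print("formatting failed:", err)
```

The same translated text would be correct unescaped if the string were displayed directly, which is exactly the information the translator does not have.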

Co-authored-by: Addison Phillips <[email protected]>
Within a complex message, patterns are always quoted with `{{...}}` or other choice of delimiter.

The entire complex message is also wrapped with `{{...}}` or other choice of delimiter.
This allows interior "code mode" of message to have flexible whitespace in between tokens
Collaborator Author

I would like to initially say anything outside the complex message-wide delimiters is an error (invalid). In the future, if we want to relax requirements and say annotations & message description notes are allowed in message syntax, you have the freedom to do so (after the closing delimiter).

In general with API design, you can always relax requirements and narrow outputs, but the reverse is not possible (causes breaking changes).

* The rule about whether leading and trailing whitespace is included is simple and unambiguous.
* This matches the WYSIWYG behavior that simple messages preserve.
* The patterns can be detected within the pattern more easily due to the delimiters serving as a visual anchor.
* Requiring all patterns to be quoted minimizes the number of characters that need to be escaped within a pattern to 3:
Collaborator Author

I think this also holds for at least some of the other proposals.

You're right, it's true for 1a and 2a. Not for 3a. Since it's not a universal aspect of the proposals, it's not a wash (redundant), and thus a point worth mentioning.

I'm open to suggestions for rephrasing "Requiring all patterns" to whatever it is that results in the minimal possible number of characters needing escaping.

when there is 1+ declarations in a `match` (selection) message,
or when there are 2+ declarations in a non-`match` complex message.

Cons:
Collaborator Author

+1 to @mihnita 's point about the possibility "to add comments, metadata for linters or other tools."

We can discuss as a workgroup whether we leave the space after the closing delimiter as a free-for-all, or reserve it for ourselves as a standard for future extensions.

Note: we are losing the possibility to do the same for simple messages because we are moving away from our current syntax.

while complex messages use the aforementioned delimiter to quote patterns (ex: `{{...}}`).
* Another potential drawback, specifically in the case of non-`match` complex messages with exactly 1 declaration,
is that this option adds 2 extra delimiters compared to an alternative syntax that doesn't require quoted patterns
and is designed to minimize delimiter usage only to code mode introducers.
Collaborator Author

  • If the code block is delimited from both sides, users may be tempted to insert text around it.

Not a moving argument to me. As you pointed out in a different comment, the audience of message authoring is developers. If we say it's not possible to add extraneous text, and they do so and the MF2 implementation rejects the message, they'll figure it out quickly.

  • If the code block is delimited from both sides, it may be easy to forget the }} closing the entire block.

Forgetting to type syntax can happen in any alternative option, so this feels like a weaker argument than the previous. Also, even though developers learn instincts early on to always balance delimiters, we create linters & other tooling to double-check.

(Some languages that use delimiters in a simple and regular way have the ability to evolve powerful tools that always keep everything balanced while being easy to use -> easy to do the right thing, impossible to do the wrong thing. But I digress...)

  • If we use curlies for patterns and for placeholders, then they serve double duty, which may make the syntax harder to understand, and also harder to make the pattern out visually.

Sure. Option 3a solves for that with the tradeoff of taking on other costs as a result. The next question then becomes how do we prioritize our requirements in order to create a value system that we use to evaluate the tradeoffs (choose)?

@eemeli
Collaborator

eemeli commented Oct 29, 2023

On my part, I've spent much more time than I'd wish over the past few days and weeks thinking about and looking into messages with external whitespace. In the absence of any better place to put down some of my thoughts on this, these are the aspects and arguments that I find important to account for:

Localizable external whitespace is really rare, while mistakes are common

Sometimes a leading or trailing space could be localizable, but you can't necessarily tell without looking through the code which is using the message. So I did that. This is what I'd mentioned previously via email:

In total, [Mozilla's Pontoon] system has so far handled about 175k translatable messages, of which 568 had exterior whitespace in the source locale. These I've manually categorised as follows, starting from the most common:

  • 180 incorrect segmentation / string concatenation, e.g. " for this version." or "Download the "
  • 122 contents wrapped in a <tag>, so the whitespace is almost certainly an insignificant segmentation artifact.
  • 108 ends in colon+space, such as "Description: "
  • 106 unlocalized markup, such as leading or trailing newlines
  • 52 potentially localizable space between clauses

So overall that's perhaps 0.3% of all messages, of which (generously) up to a third might be localizable. If you'd like to perform your own analysis, please reach out to me separately for the source data.
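
As a quick check of the arithmetic, the category counts listed above can be tallied in a few lines of Python. The total of 175,000 is an assumption standing in for "about 175k", so the percentages are approximate.

```python
# Counts copied from the categorisation above.
categories = {
    "incorrect segmentation / string concatenation": 180,
    "contents wrapped in a tag": 122,
    "ends in colon+space": 108,
    "unlocalized markup": 106,
    "potentially localizable space between clauses": 52,
}
TOTAL_MESSAGES = 175_000  # assumption: "about 175k"

exterior = sum(categories.values())
share_of_all = exterior / TOTAL_MESSAGES

print(f"messages with exterior whitespace: {exterior}")
print(f"share of all messages: {share_of_all:.2%}")
for name, count in categories.items():
    print(f"{count:4d}  {count / exterior:6.1%}  {name}")
```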

As a next step, I filtered all of the above to the 41 potentially localizable strings which are currently in production (looks like a sentence, has one leading or trailing space), and found where they're coming from in code, and how they're used: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40.

Of the above, only the 0th message actually contains a localizable space; the other 40 are all bugs. So that's exactly one localizable external space in about 66k messages currently in production. This space was incorrectly dropped in 15 of the 35 locales to which it's translated; I've now submitted corrections for all of them.

Real localizable external whitespace is so rare that it gets drowned by the noise.

All of the above bugs are in formats that explicitly delimit patterns, and thereby make it too easy to include leading or trailing whitespace. About 28% of the messages currently in production use Fluent, which does not quote patterns. None of these were similarly buggy.

From this I would conclude that using a syntax which requires external whitespace to be explicitly intentional would make it much less likely for it to be ignored, and would lead to better localizations.

We're not actually talking about quoting patterns

To be precise, we are talking about delimiting patterns with {braces}, which rather explicitly are not 'quotes' or "quotes". The distinction here matters, because we are looking to assign a novel meaning to a pair of characters that no other syntax than ICU MessageFormat uses to delimit localizable text. Every other syntax which uses braces in text uses them to delimit code. Which MF2 is also doing.

When we talk of the syntax being "WYSIWYG", we are asking for its readers to not see the {braces}, a symbol so prevalent in our syntax that we've jokingly incorporated it into our logo {�}. Humans are not trained for that the way they are with "quotes", or with empty spaces acting as content separators. In other words, if I see the braces but I don't see the empty space, how is the syntax "WYSIWYG"?

We are also asking for MF2 authors and editors to somehow know that the spaces within the { braces } are significant -- but only if they're delimiting patterns. Within expressions, MF2 syntax ignores whitespace, so {{ Hello {world }}} formats to contain a leading space, but no trailing space. I mean, go back to the first sentence of this paragraph and consider how you saw the " braces ": Was it truly obvious that its padding spaces were a part of the string it represents the same way they are with the double quotes?
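
The asymmetry can be illustrated with a toy splitter. This is a sketch only: `split_pattern` is a hypothetical helper, it ignores escapes and nested braces, and it is not the MF2 grammar. It assumes, as described above, that whitespace is kept in pattern text and ignored inside an expression.

```python
def split_pattern(quoted: str):
    """Split a `{{...}}`-delimited pattern into text and expression parts."""
    if not (quoted.startswith("{{") and quoted.endswith("}}")):
        raise ValueError("pattern must be wrapped in {{...}}")
    body = quoted[2:-2]
    parts = []
    text = ""
    i = 0
    while i < len(body):
        if body[i] == "{":
            end = body.index("}", i)  # no nesting or escapes in this toy
            if text:
                parts.append(("text", text))  # text keeps its whitespace
                text = ""
            # whitespace inside an expression is dropped
            parts.append(("expr", body[i + 1:end].strip()))
            i = end + 1
        else:
            text += body[i]
            i += 1
    if text:
        parts.append(("text", text))
    return parts


print(split_pattern("{{ Hello {world }}}"))
```

The space after `{{` ends up in the text part, while the space before the closing braces belongs to the expression and disappears, even though both sit next to a brace.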

The overlap of external whitespace and variants is really truly tiny

In a very real sense, the discussion of whether messages with variants should always be delimited is a discussion asking if this string is ok being represented as

{#match :platform}
{#when macos} ⇧ ⌘{| |}
{#when *} Ctrl+Shift+

or if that is sufficiently problematic that we need every pattern of every message with variants to be {{delimited}}.

I have been actively seeking examples of messages with leading or trailing whitespace for the last month, and the above is the one actual, current message with variants and external whitespace that has been identified. We are talking about such a rare situation it should not be driving our whole syntax.

We should choose to do with patterns what we're doing with literals, where for common values we allow them to be delimited by whitespace, but also permit |vertical pipes| to be used as "quotes" in the rare cases where they're required.
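
A minimal sketch of what "trimmed unless explicitly delimited" could look like, by analogy with literals. The function name `resolve_pattern`, the choice of `{{...}}` as the optional delimiter, and the trimmed character set are all assumptions for illustration.

```python
TRIMMED = " \t\r\n"  # assumption: which whitespace counts as insignificant


def resolve_pattern(raw: str) -> str:
    """Return the pattern value under a hypothetical optional-delimiter rule."""
    stripped = raw.strip(TRIMMED)
    if stripped.startswith("{{") and stripped.endswith("}}"):
        # Explicitly delimited: everything inside is intentional.
        return stripped[2:-2]
    # Common case: surrounding whitespace is just layout.
    return stripped


print(repr(resolve_pattern(" Ctrl+Shift+ ")))
print(repr(resolve_pattern(" {{ Hello }} ")))
```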

aphillips added a commit that referenced this pull request Oct 29, 2023
Fast tracked from #503.
@aphillips
Member

@eemeli Thanks for the long comment.

To be precise, we are talking about delimiting patterns with {braces}, which rather explicitly are not 'quotes' or "quotes".

I agree that we are talking about delimiting patterns and that this is what our technical decision is about. Quoting would be one mechanism for delimiting patterns, but is not the only one. I would call out that quoting does not require the use of "quote" characters. I think that referring to {{ and }} as pattern quotes is fine (although we could be pedantic and call them "pattern delimiters" if you prefer)

The overlap of external whitespace and variants is really truly tiny

I think this isn't quite on the nose. The real problem we're dealing with here is intentionality. There has to be a way for users to intentionally include various kinds of character sequence into their pattern. This includes invisible Unicode whitespace that is not MF2 whitespace. For example, non-breaking space or NNBSP or ZWNJ or what have you. There are a lot of characters that have no "ink" but which a user might intend to be part of the message. Some of those characters will be MF2 whitespace.

MF2 intentionally does not include a general purpose character escaping mechanism (because we expect the host environment or file format to include one and we are avoiding the double-escaping mess). If the boundary between "pattern" and "not pattern" is between invisible characters, that's pretty difficult to work with.
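
As an illustration of why a boundary between invisible characters is hard to work with, the following sketch makes such characters visible for inspection. The helper `reveal_invisibles` and its output notation are hypothetical; it relies only on Python's standard `unicodedata` categories (`Zs` for space separators, `Cf` for format characters such as ZWNJ).

```python
import unicodedata


def reveal_invisibles(text: str) -> str:
    """Replace non-ASCII space separators and format characters with <U+XXXX>."""
    out = []
    for ch in text:
        if ch != " " and unicodedata.category(ch) in {"Zs", "Cf"}:
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


# NBSP at the start, NNBSP in the middle, ZWNJ at the end.
print(reveal_invisibles("\u00a0Hello\u202fworld\u200c"))
```

A tool could use something like this to show a translator what a pattern actually contains, but it does not by itself say which of those characters the author intended to be part of the pattern.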


I think there are four audiences that need to be served:

  1. Developers/message authors. The creator of the source message needs to understand how the pattern is delimited and be able to clearly include any character sequence into a pattern.
  2. Translators. Translators, like developers, need to know whether whitespace is part of the pattern or not and to intentionally include whitespace into patterns where necessary.
  3. MF2 parsers. The parser obviously needs to be able to determine the boundary for the pattern if the message is to be rendered.
  4. Tools. CAT tools could treat the entire message as a single segment. But they could also produce separate segments for each when case (including generating necessary additional segments for the target locale). Either way, tools generally protect syntax to allow translators to focus on content. Non-CAT tools also need to preserve intentional whitespace and not interfere with pattern delimiters.

What I think is interesting is that pattern delimiters are probably syntax. If pattern delimiters are optional, it might be unclear to translators whether a given pattern is already quoted (delimited) or to tools as to whether delimiters are needed. It's hard for machines to guess people's intentions. I agree that PEWS is rare, which is why we need to be especially clear about how to handle the rare cases.

@eemeli
Collaborator

eemeli commented Oct 29, 2023

[...] I think that referring to {{ and }} as pattern quotes is fine (although we could be pedantic and call them "pattern delimiters" if you prefer)

I don't want to insist on pedantry, as long as we recognise that the pattern "quote" characters we're considering are rather explicitly not generally used as quote characters. For anyone coming to MF2 not via MF1, this will be an additional weird thing to learn. Very much comparable to |quoted literals|, which I've heard argued not to be so bad because they'll be so rare.

[...] There has to be a way for users to intentionally include various kinds of character sequence into their pattern. This includes invisible Unicode whitespace that is not MF2 whitespace. For example, non-breaking space or NNBSP or ZWNJ or what have you. There are a lot of characters that have no "ink" but which a user might intend to be part of the message. Some of those characters will be MF2 whitespace.

Yes, and this will be supported no matter which way we decide to go, by the optional {{pattern quoting}} ability. Which unfortunately will mean that the closing brace } may need to be considered a syntax character, and require special escaping. Pontoon has seen 17 messages with } as a pattern character, an order of magnitude more than messages with localizable external whitespace.

@echeran echeran marked this pull request as ready for review October 30, 2023 04:05
@echeran
Collaborator Author

echeran commented Oct 30, 2023

Ready for review now. I made edits mostly based on #504, with some additions, and also responded to comments that came in before the PR was ready. Previous info from the doc (background, use cases, stats, examples) is kept pre-hidden at the bottom, like appendices.

Comment on lines +177 to +183
>{{
> match {$var}
> {when *} This pattern has a space in front (it's between \} and This)
> {when other}
> This pattern has a newline and six spaces in front of it
> {when moo}This pattern has no spaces in front of it, but an invisible space at the end
>}}
Collaborator

Suggested change
>{{
> match {$var}
> {when *} This pattern has a space in front (it's between \} and This)
> {when other}
> This pattern has a newline and six spaces in front of it
> {when moo}This pattern has no spaces in front of it, but an invisible space at the end
>}}
>{match {$var}}
>{when *} This pattern has a space in front (it's between } and This)
>{when other}
> This pattern has a newline and six spaces in front of it
>{when moo}This pattern has no spaces in front of it, but an invisible space at the end


Pros:
- WYSIWYG (on steroids)

Collaborator

Suggested change
- Avoids as many escape sequences as possible,
as `}` does not need escaping in patterns.

Comment on lines +190 to +191
- Probably not a serious alternative: the example
includes any number of obvious footguns that have to be addressed
Collaborator

This seems really rather opinionated. In many ways, this is the same as the "Always quote" solution, except that the pattern delimiters are }…{ instead of {{…}}. So the unnamed footguns probably apply to that alternative as well.

Member

It is opinionated. When I wrote it, I was being lazy by not enumerating the issues.

this is the same as the "Always quote" solution, except that the pattern delimiters are }…{ instead of {{…}}. So the unnamed footguns probably apply to that alternative as well.

No, this is incorrect.

{{...}} encloses all and only the whitespace that is intentional in the pattern, with {{ and }} forming the pattern boundary. These boundary characters are visible.

}...{ makes all whitespace in the variant block meaningful. It effectively prohibits a multiline representation of a message, because the newlines are always meaningful. It also means that trailing spaces (which are invisible) have meaning.

To make a message multiline, you have to put the whitespace inside the key:

{{
match {$var}
{when 0
}This has no newline or space.{
when one}
This has a newline at the start.{
when *} This has a space at the start and six spaces and a newline at the end.      
}}
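
A rough way to see what the `}...{` boundary captures is to extract the variants from the example above with a regular expression. This is a sketch under assumptions: the regex is a hypothetical stand-in for a real parser, it does not handle escapes or placeholders inside patterns, and the message is written as concatenated string literals so the trailing spaces stay visible.

```python
import re

MESSAGE = (
    "{{\n"
    "match {$var}\n"
    "{when 0\n"
    "}This has no newline or space.{\n"
    "when one}\n"
    "This has a newline at the start.{\n"
    "when *} This has a space at the start and six spaces and a newline at the end.      \n"
    "}}"
)

# Key: `{ when KEY }` with free whitespace inside the braces.
# Pattern: everything up to the next `{` or the final `}}`.
VARIANT = re.compile(r"\{\s*when\s+(\S+?)\s*\}(.*?)(?=\{|\}\}\Z)", re.DOTALL)


def variants(message: str) -> dict:
    """Map each variant key to its pattern, whitespace included."""
    return dict(VARIANT.findall(message))


for key, pattern in variants(MESSAGE).items():
    print(f"{key!r}: {pattern!r}")
```

Printing with `repr` shows that every newline and trailing space between `}` and `{` ends up in the pattern, which is the behavior being objected to.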

Collaborator

If we take your example above, and after the match replace each } with {{ and each { with }}, we get this:

{{
match {$var}
}}when 0
{{This has no newline or space.}}
when one{{
This has a newline at the start.}}
when *{{ This has a space at the start and six spaces and a newline at the end.      
}}

Ignoring the specifics of what's happening with the preamble, that seems pretty similar to me. It's just that we're conditioned to look at the { and } a certain way.

Cons:
- Requires one of the alternate syntaxes
- Has two ways to represent a pattern.
- May be difficult for translators to add quotes when needed.
Collaborator

As far as I've been able to determine, there are exactly three scenarios in which a translator may need to add leading or trailing spaces to a pattern that starts out without them:

  1. When translating a whole-sentence message from a CJK script to a non-CJK script, such that the sentences are concatenated into a single paragraph and need spaces between them. As with all other string concatenations, I would expect for this to be explicitly called out to the translator, so that they may know whether to add the space at the start or end of the pattern.
  2. When translating a pattern to Chinese which ends up requiring a leading honorific space. As far as I can tell, this is really rare in dynamic message strings.
  3. When the message is expected to be output using a monospace font and fakes either centering or right-alignment by using in-message spaces for indentation, and the first line of the pattern happens to be exactly the maximum length, and so does not need leading spaces. This is sufficiently rare that I'm pretty sure this is only a theoretical possibility, and in any case I'd expect it to be rather clearly called out to the translator.

Given that each of the above only has an impact on the pattern delimiting if the message also has multiple variants and if the translator is not using any tooling that'd take care of the delimiting and if the developer has not pre-emptively delimited the pattern, I would be ok accepting this negative, especially as the downside would be a single missing space in the translation.

- Easy to use (best of both worlds?)

Cons:
- Requires one of the alternate syntaxes
Collaborator

Given that we just ran a "beauty contest" in which these "alternate syntaxes" were preferred over the current main syntax by an absolute majority of the participants, this could also be listed as one of the "Pros".

Pros:
- Code is special, whitespace is not.
- Makes PEWS into a "special event", alerting developers to the non-I18N aspects of it?

Collaborator

Suggested change
- Avoids as many escape sequences as possible,
as `}` does not need escaping in patterns.

Collaborator

It needs escaping if it is used as the ending of the "wrapping" of the string.

Co-authored-by: Eemeli Aro <[email protected]>
@echeran
Collaborator Author

echeran commented Oct 30, 2023

The distinction here matters, because we are looking to assign a novel meaning to a pair of characters that no other syntax than ICU MessageFormat uses to delimit localizable text. Every other syntax which uses braces in text uses them to delimit code. Which MF2 is also doing.

Remember our previous discussions from last year about this. This MF2 group chose curly braces precisely because they are the least likely to occur in other syntaxes and in message patterns themselves.

@aphillips
Member

For the purposes of reading the document in our 2023-10-30 call, I'm merging this work now. This does not make the comments above right/wrong, relevant/irrelevant, or anything else. It's just to enable "easy reading".

@aphillips aphillips merged commit a07d972 into unicode-org:main Oct 30, 2023
aphillips added a commit that referenced this pull request Oct 30, 2023
- Rename the design doc.
- Cross out rejected options 2 and 5
- Add notes to 2 and 5 calling this out

(other changes may be added from the previous thread in #503 and WG call notes from 2023-10-30)
aphillips added a commit that referenced this pull request Oct 30, 2023
* Prepare design doc ahead of balloting

- Rename the design doc.
- Cross out rejected options 2 and 5
- Add notes to 2 and 5 calling this out

(other changes may be added from the previous thread in #503 and WG call notes from 2023-10-30)

* Prepare balloting instructions

* Update exploration/delimiting-variant-patterns.md

Co-authored-by: Eemeli Aro <[email protected]>

* Apply suggestions from code review

Co-authored-by: Eemeli Aro <[email protected]>

---------

Co-authored-by: Eemeli Aro <[email protected]>
5 participants