Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix whitespace conformance to match UAX31 (including permitting LRM/RLM) #673

Closed
wants to merge 26 commits into from
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
b387301
Allow LRM in whitespace
aphillips Feb 19, 2024
ae712d7
Update syntax.md
aphillips Feb 19, 2024
c556ecf
Make 3a-2 definition consistent with requirements
aphillips Feb 19, 2024
8558eef
Update spec/syntax.md
aphillips Feb 19, 2024
d830550
Replace `[s]` with owsp production
aphillips Feb 19, 2024
42b7744
`s` != `wsp`
aphillips Feb 19, 2024
af75420
Update syntax.md
aphillips Feb 19, 2024
6a6dc70
Update message.abnf
aphillips Feb 19, 2024
db5e97c
Update spec/syntax.md
aphillips Feb 19, 2024
3fede7a
Update spec/syntax.md
aphillips Feb 20, 2024
586a4d6
Update spec/syntax.md
aphillips Feb 20, 2024
502e1fe
Update spec/syntax.md
aphillips Feb 20, 2024
d7af5eb
Address literals
aphillips Feb 20, 2024
68f32f8
Fix converting some `s` productions to `wsp`
aphillips Feb 21, 2024
e36d01d
Add bidi isolate support
aphillips Feb 21, 2024
5609dc0
Missing one "x"
aphillips Feb 21, 2024
e74b34c
Add a warning/tech preview feedback note
aphillips Feb 21, 2024
9557849
Update spec/syntax.md
aphillips Feb 21, 2024
ec369ec
Update spec/syntax.md
aphillips Feb 21, 2024
2532206
Address @macchiati's comment
aphillips Feb 21, 2024
f985860
Update spec/syntax.md
aphillips Feb 21, 2024
52c5c86
Update spec/syntax.md
aphillips Feb 21, 2024
4a6c96c
Fix up @machiatti's suggested text
aphillips Feb 21, 2024
19864c3
Address @mihnita's suggestion, fix expression brackets
aphillips Feb 22, 2024
27b9e42
Make syntax.md make abnf
aphillips Feb 26, 2024
00e3796
Merge branch 'main' into aphillips-allow-lrm
aphillips Jun 9, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions spec/message.abnf
Original file line number Diff line number Diff line change
Expand Up @@ -104,5 +104,5 @@ quoted-escape = backslash ( backslash / "|" )
reserved-escape = backslash ( backslash / "{" / "|" / "}" )
backslash = %x5C ; U+005C REVERSE SOLIDUS "\"

; Whitespace
s = 1*( SP / HTAB / CR / LF / %x3000 )
; Whitespace, optionally prepended with LRM
s = [%x200E] 1*( SP / HTAB / CR / LF / %x3000 )
Copy link
Collaborator

@gibson042 gibson042 Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that LRM is not valid after a line terminator or when not followed by a whitespace character, e.g. before the quoted patterns in a message like

.match {$count :number}
one
	{{You have {$count} notification.}}
*
	{{You have {$count} notifications.}}

or

.match {$count :number}
one
{{You have {$count} notification.}}
*
{{You have {$count} notifications.}}

Is that intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not really.

Probably what needs to happen here is a distinguishing of optional and non-optional whitespace. Everywhere we have [s] should use a production that can be just LRM (or RLM, fwiw) or nothing, e.g.:

; optional whitespace
owsp = *( SP/ HTAB / CR / LF / %x3000 / %x200E / %x200F )

And everywhere that requires positive whitespace (i.e. just s) permit controls to either side:

; required whitespace
wsp = [ (%x200E / %x200F) ] 1*( SP / HTAB / CR / LF / %x3000) [ (%x200E / %x200F) ]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be an improvement (and then you can drop the convoluted explanation from the conformance statement, and the profile limits itself to changing the sets of characters).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did the hard change. We now need to get WG approval.

22 changes: 17 additions & 5 deletions spec/syntax.md
Original file line number Diff line number Diff line change
Expand Up @@ -906,16 +906,28 @@ backslash = %x5C ; U+005C REVERSE SOLIDUS "\"
### Whitespace

**_<dfn>Whitespace</dfn>_** is defined as one or more of
U+0009 CHARACTER TABULATION (tab),
U+000A LINE FEED (new line),
U+000D CARRIAGE RETURN,
U+3000 IDEOGRAPHIC SPACE,
or U+0020 SPACE.
`U+0009 CHARACTER TABULATION` (tab),
`U+000A LINE FEED` (new line),
`U+000D CARRIAGE RETURN`,
`U+3000 IDEOGRAPHIC SPACE`,
or `U+0020 SPACE`,
optionally prepended with `U+200E LEFT-TO-RIGHT MARK`.

Inside _patterns_ and _quoted literals_,
whitespace is part of the content and is recorded and stored verbatim.
Whitespace is not significant outside translatable text, except where required by the syntax.

The character `U+200E LEFT-TO-RIGHT MARK` (LRM) MAY be prepended to _whitespace_ outside
_patterns_ and _quoted literals_ to assist with presentation to users.
Tools SHOULD generate these LRM characters following _identifiers_, _unquoted literals_, or
_option_ values that use right-to-left characters so that the _message_ displays
intelligibly in a left-to-right context.

This definition of _whitespace_ implements
[UTR#31 Rule 3a-2](https://www.unicode.org/reports/tr31/#R3a-2).
aphillips marked this conversation as resolved.
Show resolved Hide resolved
It is a profile of R3a-1 in that specification because only the
whitespace characters listed are permitted as whitespace.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UAX31-R2a-2 says you need to

define that profile with a precise specification of the characters that are added to or removed from the set of code points defined by the Pattern_White_Space property, and of any changes to the criteria under which a character or sequence of characters is interpreted as an end of line, as ignorable format controls, or as horizontal space.

The point is that the reader of such a conformance statement sees the difference from the default, which are the things that may need special attention when interoperating with an implementation based on the defaults.

So, if I am reading this right:

  • Form feed, next line, line separator, paragraph separator, and right-to-left mark are removed from the set of white space characters;
  • ideographic space is added to the set of white space characters;
  • the sole ignorable format control (LRM) is allowed only in contexts UAX31-I1 and UAX31-I3 (not UAX31-I2), where UAX31-I1 is further restricted to the beginning of a sequence of horizontal spaces, and UAX31-I3 is further restricted to only the end and not the beginning of a line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

further restricted to only the end and not the beginning of a line

Also it is disallowed on a blank line, I think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eggrobin. Technically, we don't have lines. Message whitespace can be normalized to a single space (in cases where whitespace is required) or to nothing (in cases where the whitespace is optional).

Your reading is correct. Note that UAX31-I2 is not permitted because that would break our sigil-identifier syntax. In all of the other cases in our syntax, there is required whitespace.

Note that quoted literals or pattern text can contain bidi controls that might cause Trojan source effects unless/until we address placeholder isolation.

I can correct the description to list the differences.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that quoted literals or pattern text can contain bidi controls that might cause Trojan source effects unless/until we address placeholder isolation.

Mostly that is something to be dealt with at a higher level (in editors and tooling); the main thing languages need to do is to treat the right things in the right way so that standardized tooling can deal with the issue, see UTS55.

Note that UAX31-I2 is not permitted because that would break our sigil-identifier syntax.

I don’t really understand what you mean here; UAX31-I2 does not say you should allow $⟨LRM⟩identifier, it says that you should allow it if you allow $ identifier.

The reason why you do not have UAX31-I2 is that you do not allow an LRM wherever you have [s], so for instance in

markup = "{" [s] "#" identifier *(s option) *(s attribute) [s] ["/"] "}" ; open and standalone
/ "{" [s] "/" identifier *(s option) *(s attribute) [s] "}" ; close

you do not allow {⟨LRM⟩#whatever⟨LRM⟩}, only {⟨LRM⟩ #whatever⟨LRM⟩ }, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you do not allow {⟨LRM⟩#whatever⟨LRM⟩}, only {⟨LRM⟩ #whatever⟨LRM⟩ }, right?

Note that this probably throws a wrench into the UTS55 conversion to plain text algorithm.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. See my response to @gibson042's comment.


> [!NOTE]
> The character U+3000 IDEOGRAPHIC SPACE is included in whitespace for
> compatibility with certain East Asian keyboards and input methods,
Expand Down