Fix whitespace conformance to match UAX31 (including permitting LRM/RLM) #673
@@ -906,16 +906,28 @@ backslash = %x5C ; U+005C REVERSE SOLIDUS "\"
 ### Whitespace

 **_<dfn>Whitespace</dfn>_** is defined as one or more of
-U+0009 CHARACTER TABULATION (tab),
-U+000A LINE FEED (new line),
-U+000D CARRIAGE RETURN,
-U+3000 IDEOGRAPHIC SPACE,
-or U+0020 SPACE.
+`U+0009 CHARACTER TABULATION` (tab),
+`U+000A LINE FEED` (new line),
+`U+000D CARRIAGE RETURN`,
+`U+3000 IDEOGRAPHIC SPACE`,
+or `U+0020 SPACE`,
+optionally prepended with `U+200E LEFT-TO-RIGHT MARK`.

 Inside _patterns_ and _quoted literals_,
 whitespace is part of the content and is recorded and stored verbatim.
 Whitespace is not significant outside translatable text, except where required by the syntax.

+The character `U+200E LEFT-TO-RIGHT MARK` (LRM) MAY be prepended to _whitespace_ outside
+_patterns_ and _quoted literals_ to assist with presentation to users.
+Tools SHOULD generate these LRM characters following _identifiers_, _unquoted literals_, or
+_option_ values that use right-to-left characters so that the _message_ displays
+intelligibly in a left-to-right context.
+
+This definition of _whitespace_ implements
+[UTR#31 Rule 3a-2](https://www.unicode.org/reports/tr31/#R3a-2).
+
+It is a profile of R3a-1 in that specification because only the
+whitespace characters listed are permitted as whitespace.
UAX31-R2a-2 says you need to

The point is that the reader of such a conformance statement sees the differences from the default, which are the things that may need special attention when interoperating with an implementation based on the defaults. So, if I am reading this right:

Also it is disallowed on a blank line, I think?

Thanks @eggrobin. Technically, we don't have lines. Message whitespace can be normalized to a single space (in cases where whitespace is required) or to nothing (in cases where the whitespace is optional).

Your reading is correct. Note that UAX31-I2 is not permitted because that would break our sigil-identifier syntax. In all of the other cases in our syntax, there is required whitespace.

Note that quoted literals or pattern text can contain bidi controls that might cause Trojan source effects unless/until we address placeholder isolation.

I can correct the description to list the differences.

Mostly that is something to be dealt with at a higher level (in editors and tooling); the main thing languages need to do is to treat the right things in the right way so that standardized tooling can deal with the issue, see UTS55.

I don’t really understand what you mean here; UAX31-I2 does not say you should allow
The reason why you do not have UAX31-I2 is that you do not allow an LRM wherever you have
message-format-wg/spec/message.abnf Lines 35 to 36 in ae712d7
you do not allow

Note that this probably throws a wrench into the UTS55 conversion to plain text algorithm.

Yes. See my response to @gibson042's comment.
 > [!NOTE]
 > The character U+3000 IDEOGRAPHIC SPACE is included in whitespace for
 > compatibility with certain East Asian keyboards and input methods,
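As a rough illustration of the whitespace definition added in the hunk above, the following Python sketch encodes one possible reading of the prose: at most one leading `U+200E LEFT-TO-RIGHT MARK`, followed by one or more of the five listed whitespace characters. The names `WHITESPACE` and `is_whitespace` are hypothetical, and the actual production in `message.abnf` may be written differently.

```python
import re

# Assumed reading of the prose definition (not taken from message.abnf):
# an optional U+200E LEFT-TO-RIGHT MARK, then one or more of
# tab, line feed, carriage return, U+3000 IDEOGRAPHIC SPACE, or space.
WHITESPACE = re.compile(r"\u200E?[\t\n\r\u3000 ]+")


def is_whitespace(text: str) -> bool:
    """Return True if the whole string is a single whitespace run under this reading."""
    return WHITESPACE.fullmatch(text) is not None
```

Under this reading, an LRM is accepted only when at least one whitespace character follows it, so a lone LRM, or an LRM placed after the last whitespace character, is rejected.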
This means that LRM is not valid after a line terminator or when not followed by a whitespace character, e.g. before the quoted patterns in a message like
or
Is that intentional?

No, not really.

Probably what needs to happen here is to distinguish optional from non-optional whitespace. Everywhere we have `[s]` should use a production that can be just LRM (or RLM, fwiw) or nothing, e.g.:

And everywhere that requires positive whitespace (i.e. just `s`) permit controls to either side:

I think that would be an improvement (and then you can drop the convoluted explanation from the conformance statement, and the profile limits itself to changing the sets of characters).

I did the hard change. We now need to get WG approval.
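The split between optional and required whitespace suggested in the thread above can be sketched in Python as follows. This is only one possible interpretation of the comment, not the grammar that was committed: the names `OPTIONAL_WS`, `REQUIRED_WS`, and the two helper functions are hypothetical, and the choice to accept ordinary whitespace in optional positions is an assumption based on what `[s]` already allows.

```python
import re

# Hypothetical optional whitespace: any run (possibly empty) of the listed
# whitespace characters and the bidi marks U+200E (LRM) and U+200F (RLM).
OPTIONAL_WS = re.compile(r"[\t\n\r\u3000 \u200E\u200F]*")

# Hypothetical required whitespace: at least one real whitespace character,
# with bidi marks permitted on either side of the whitespace run.
REQUIRED_WS = re.compile(r"[\u200E\u200F]*[\t\n\r\u3000 ]+[\u200E\u200F]*")


def is_optional_ws(text: str) -> bool:
    """True if the whole string fits the assumed optional-whitespace shape."""
    return OPTIONAL_WS.fullmatch(text) is not None


def is_required_ws(text: str) -> bool:
    """True if the whole string fits the assumed required-whitespace shape."""
    return REQUIRED_WS.fullmatch(text) is not None
```

In this sketch a bare LRM or RLM satisfies an optional position but not a required one, and an LRM that directly follows a line terminator is accepted in both.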