Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Control characters in Char and String literals #11478

Open
straight-shoota opened this issue Nov 19, 2021 · 1 comment
Open

Control characters in Char and String literals #11478

straight-shoota opened this issue Nov 19, 2021 · 1 comment

Comments

@straight-shoota
Copy link
Member

straight-shoota commented Nov 19, 2021

The following Char literals are currently valid:

# Char
'�' # => '\a'
'�' # => '\b'
'
' # => '\r'
'
' # => '\n'
'	' # => '\t'
'�' # => '\v'
'' # => '\f'

# String
"" # => '\a'
"" # => '\b'
"
" # Error: expecting '\n' after '\r'
"
" # => '\n'
"	" # => '\t'
"" # => '\v'
"" # => '\f'

In the example, I'm using replacement characters for \a, \b, \v, \f because they don't make sense as a display character. The replacements may show up as question marks, depending on font support.

And that's the point really. Using these control characters (and there are more of them) in literals is pretty useless and unreadable. We have escape sequences and those should be used. Even if they're not as short as the ones described here and require \u escapes.

For string literals, some might still be useful, like \n for multiline literals. IMO they could be entirely replaced by heredocs, but apparently, they're quite commonly used 🤷

Even the formatter does not transform literal control characters to escape sequences.

I think that should be the first step: Have the formatter replace control characters (or maybe other non-printables as well) in literals by their respective escape sequences. At some point later, we can have the parser reject literal control characters.

For Char literals, I think all control characters should be transformed and eventually become invalid. Probably also some or even all non-printables (i.e. white space characters).

For String literals, there could be a bigger margin. \n, \r\n and \t are probably fine. I'm not entirely convinced about the other control or non-printable characters. But having the formatter transform is probably a good idea for most.

@straight-shoota straight-shoota changed the title Restrict valid characters in Char and String literals Control characters in Char and String literals Nov 19, 2021
@straight-shoota
Copy link
Member Author

The original implementation of #11520 showed problems with grapheme clusters after being merged and was subsequently reverted (#11603).

Grapheme clusters are glued together by non-printable characters like zero-width-joiner (U+200D). Escaping them would tear the graphemes apart. The semantics stay the same, but the source code would no longer show a single grapheme. Instead, you get a deconstructed series of characters, with non-printables escaped.

-"🏳️‍🌈"
+🏳️\u200D🌈

- "ܐ܏ܒܓܕ "
+"ܐ\u070Fܒܓܕ"

That's obviously not good. When used as such in graphemes, non-printable characters can actually have a visible effect and escaping them would destroy that. We shouldn't do that.

I think this is relatively easy to improve by not escaping non-printables when they are part of a multi-codepoint grapheme clusters. The new String::Grapheme API allows to do that. As a result, only single non-printables would be escaped.
And probably also graphemes such consiting of only non-printables, such as \r\n which has no visual difference from \n.

There is a slight complication with a text renderer's grapheme support. A text renderer that does not recognize a grapheme cluster just shows the individual characters. The first example would be indistinguishable from a string without the zero-width joiner between the emojis ("🏳️🌈"). This can't be fixed by this change or any other way, really. But that's okay and it's the text renderer's fault. However, there's little practical relevance because modern text renderers should have decent support for most use cases of graphemes. Escaping non-printables outside of grapheme clusters is definitely an improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant