Control characters in `Char` and `String` literals #11478

straight-shoota · 2021-11-19T13:47:59Z

The following Char literals are currently valid:

# Char
'�' # => '\a'
'�' # => '\b'
'
' # => '\r'
'
' # => '\n'
'	' # => '\t'
'�' # => '\v'
'' # => '\f'

# String
"�" # => '\a'
"�" # => '\b'
"
" # Error: expecting '\n' after '\r'
"
" # => '\n'
"	" # => '\t'
"�" # => '\v'
"" # => '\f'

In the example, I'm using replacement characters for \a, \b, \v, \f because they don't make sense as a display character. The replacements may show up as question marks, depending on font support.

And that's the point really. Using these control characters (and there are more of them) in literals is pretty useless and unreadable. We have escape sequences and those should be used. Even if they're not as short as the ones described here and require \u escapes.

For string literals, some might still be useful, like \n for multiline literals. IMO they could be entirely replaced by heredocs, but apparently, they're quite commonly used 🤷

Even the formatter does not transform literal control characters to escape sequences.

I think that should be the first step: Have the formatter replace control characters (or maybe other non-printables as well) in literals by their respective escape sequences. At some point later, we can have the parser reject literal control characters.

For Char literals, I think all control characters should be transformed and eventually become invalid. Probably also some or even all non-printables (i.e. white space characters).

For String literals, there could be a bigger margin. \n, \r\n and \t are probably fine. I'm not entirely convinced about the other control or non-printable characters. But having the formatter transform is probably a good idea for most.

The text was updated successfully, but these errors were encountered:

straight-shoota · 2021-12-17T09:23:22Z

The original implementation of #11520 showed problems with grapheme clusters after being merged and was subsequently reverted (#11603).

Grapheme clusters are glued together by non-printable characters like zero-width-joiner (U+200D). Escaping them would tear the graphemes apart. The semantics stay the same, but the source code would no longer show a single grapheme. Instead, you get a deconstructed series of characters, with non-printables escaped.

-"🏳️‍🌈"
+🏳️\u200D🌈

- "ܐ܏ܒܓܕ "
+"ܐ\u070Fܒܓܕ"

That's obviously not good. When used as such in graphemes, non-printable characters can actually have a visible effect and escaping them would destroy that. We shouldn't do that.

I think this is relatively easy to improve by not escaping non-printables when they are part of a multi-codepoint grapheme clusters. The new String::Grapheme API allows to do that. As a result, only single non-printables would be escaped.
And probably also graphemes such consiting of only non-printables, such as \r\n which has no visual difference from \n.

There is a slight complication with a text renderer's grapheme support. A text renderer that does not recognize a grapheme cluster just shows the individual characters. The first example would be indistinguishable from a string without the zero-width joiner between the emojis ("🏳️🌈"). This can't be fixed by this change or any other way, really. But that's okay and it's the text renderer's fault. However, there's little practical relevance because modern text renderers should have decent support for most use cases of graphemes. Escaping non-printables outside of grapheme clusters is definitely an improvement.

straight-shoota added status:discussion topic:compiler:parser labels Nov 19, 2021

straight-shoota changed the title ~~Restrict valid characters in Char and String literals~~ Control characters in Char and String literals Nov 19, 2021

straight-shoota mentioned this issue Dec 1, 2021

Formatter: Escape non-printable characters in literals #11520

Merged

This was referenced Dec 17, 2021

Discussion about Grapheme API #11610

Open

Escaping non-printables in grapheme clusters with String#inspect #11630

Open

straight-shoota mentioned this issue Jan 25, 2023

Concentrate formatter changes in next release #13002

Open

16 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Control characters in `Char` and `String` literals #11478

Control characters in `Char` and `String` literals #11478

straight-shoota commented Nov 19, 2021 •

edited

Loading

straight-shoota commented Dec 17, 2021

Control characters in Char and String literals #11478

Control characters in Char and String literals #11478

Comments

straight-shoota commented Nov 19, 2021 • edited Loading

straight-shoota commented Dec 17, 2021

Control characters in `Char` and `String` literals #11478

Control characters in `Char` and `String` literals #11478

straight-shoota commented Nov 19, 2021 •

edited

Loading