Escaping non-printables in grapheme clusters with String#inspect
#11630
Comments
The peers seem to be undecided.

Ruby:
"👨‍👩".inspect # => "👨‍👩"
"".inspect # => ""

Python:
repr('👨‍👩') # => '👨\u200d👩'
repr('') # => '\u200d\u200d'

Swift:
dump("👨‍👩") // => "👨‍👩"
dump("") // => ""

Julia:
repr("👨‍👩") # => "👨\u200d👩"
repr("") # => "\u200d\u200d"

Let Elixir break the tie :-)

This is misleading. A grapheme cluster may or may not be represented by a single glyph, and the purpose of text segmentation is not to determine glyph boundaries. For emojis, whether a single glyph or multiple glyphs should be rendered is determined by the list of emoji ZWJ sequences, and since

@asterite Okay, here's Elixir:
inspect "👨‍👩" # => "👨‍👩"
inspect "" # => ""

@HertzDevil Thanks for pointing that out. However, the definition of ZWJ sequences is independent of the grapheme cluster algorithm. And the form of visualization is ultimately driven by the text rendering engine. Most implementations I've seen show

This is a follow-up on #11406 which introduced escaping for all non-printable characters in String#inspect. While that change is an improvement, it has a negative effect on grapheme clusters comprising non-printable characters.
Consider a string of the code points U+1F468 (Man), U+200D (Zero Width Joiner (ZWJ)), U+1F469 (Woman). They form a grapheme cluster that renders two persons as a single grapheme: 👨‍👩 (for reference, without the ZWJ the two emojis render as separate graphemes: 👨👩).

String#inspect escapes all non-printable characters since #11452. Zero Width Joiner is a non-printable character, so it gets escaped. For the above string, this means the grapheme cluster gets broken: the zero width joiner no longer glues the surrounding characters together.

Both formats are technically correct. They describe the same string, since a literal character is equivalent to its escape sequence. They are just two different representations which also result in different rendering.
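
To make the equivalence concrete, here is a small sketch in Python, whose repr already escapes the joiner as shown in the comparison in this thread. It only illustrates literal versus escaped forms of the same string; it is not Crystal's implementation, and the variable names are made up for the example.

```python
# The same three code points, spelled once with escape sequences and once
# built from their code point numbers.
escaped_literal = "\U0001F468\u200D\U0001F469"
from_code_points = "".join(map(chr, (0x1F468, 0x200D, 0x1F469)))

# Both spellings denote the identical string of three code points.
same_string = escaped_literal == from_code_points

# repr() escapes the non-printable joiner, so the printed form shows the
# two emoji separately with \u200d between them.
shown = repr(escaped_literal)
print(same_string, len(escaped_literal), shown)
```

The string itself is unchanged in both cases; only the textual representation, and therefore the rendering, differs.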
I believe the intuitive expectation is that grapheme clusters should not break apart. The reasoning for escaping non-printable characters is to avoid having them go unnoticed because they are not visible. That does not apply as part of a bigger grapheme cluster because they typically have a visible effect there. So we should only escape non-printable characters that stand alone.
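
A rough sketch of that rule, in Python for illustration only. The helpers `escape_char` and `inspect_keep_joined` are hypothetical, and the neighbour check is a crude stand-in for real grapheme cluster segmentation (UAX #29), which Python's standard library does not provide: a ZWJ is left alone only when it sits directly between two printable characters.

```python
ZWJ = "\u200d"

def escape_char(ch: str) -> str:
    """Return a \\uXXXX or \\UXXXXXXXX escape for a single character."""
    cp = ord(ch)
    return f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"

def inspect_keep_joined(s: str) -> str:
    """Escape non-printable characters, except a ZWJ between two printable ones.

    Assumption: "between two printable characters" approximates "part of a
    visible grapheme cluster". A real implementation would segment properly.
    """
    out = []
    for i, ch in enumerate(s):
        if ch in '"\\':
            # Keep the result a valid double-quoted literal.
            out.append("\\" + ch)
        elif ch.isprintable():
            out.append(ch)
        else:
            joined = (
                ch == ZWJ
                and 0 < i < len(s) - 1
                and s[i - 1].isprintable()
                and s[i + 1].isprintable()
            )
            out.append(ch if joined else escape_char(ch))
    return '"' + "".join(out) + '"'

print(inspect_keep_joined("\U0001F468\u200D\U0001F469"))  # joiner kept literal
print(inspect_keep_joined("\u200d\u200d"))                # both joiners escaped
```

Note that this heuristic would also keep a joiner between two ordinary letters, where it has no visible effect, so it errs on the side of readability.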
That's assuming the employed text renderer supports the respective grapheme cluster, which is impossible to detect or infer. But it's probably okay to assume grapheme cluster support? 🤔
A problem with that is that some grapheme clusters don't actually have a visual representation. For example, two consecutive Zero Width Joiners are considered a grapheme cluster. Escaping that seems like a good idea.
A grapheme cluster consisting of only non-printable code points would be relatively easy to detect. Zero Width Joiner also attaches to most other code points to form a grapheme cluster, even if it has no meaning or effect there. That should still be relatively easy to detect as well. But realistically I expect other problematic combinations of code points. It's really a complex matter.
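
The two "easy" checks could look roughly like this, again in Python as an illustration. Both functions are hypothetical and assume the input is one grapheme cluster that has already been segmented by some other means; segmentation itself is not shown.

```python
ZWJ = "\u200d"

def is_invisible_cluster(cluster: str) -> bool:
    """True if the cluster is non-empty and none of its code points is printable."""
    return len(cluster) > 0 and not any(ch.isprintable() for ch in cluster)

def is_dangling_joiner(cluster: str) -> bool:
    """True if the cluster ends in a ZWJ, i.e. the joiner glues nothing after it.

    Assumption: with UAX #29 segmentation, a joiner that really connects two
    emoji is followed by the second emoji inside the same cluster.
    """
    return cluster.endswith(ZWJ)

print(is_invisible_cluster("\u200d\u200d"))  # two joiners, nothing visible
print(is_dangling_joiner("a\u200d"))         # joiner attached to a letter
```

Clusters flagged by either check would be candidates for escaping; everything else would be left as written.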
So I suppose we can prefer either readability or sanity with regard to non-printable characters in grapheme clusters. We could also try to find a middle ground that draws a more precise line between the two. Not sure how far we can get with that.
String#dump is an alternative for escaping all non-ASCII characters if you need that.

We could consider adding a configuration option which determines the handling of grapheme clusters. But I doubt that would be very useful, and I would definitely defer it to a future enhancement discussion. We should find a good default behaviour first.
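
For comparison with the String#dump behaviour mentioned above, Python's built-in ascii() is a rough analogue (not the same API, and its escape format differs in detail): it escapes every non-ASCII character, so the question of grapheme clusters never arises.

```python
# Man + Zero Width Joiner + Woman, written with escapes to stay ASCII-only.
s = "\U0001F468\u200D\U0001F469"

# ascii() escapes every non-ASCII code point, printable or not.
dumped = ascii(s)
print(dumped)
```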
A similar challenge exists with formatting literals (#11478). A change to escape non-printable characters has been reverted for now because of missing grapheme support (#11603).