Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New unicode-graphemes addon. #4519
New unicode-graphemes addon. #4519
Changes from 44 commits
2f7fe11
67e9689
41760df
8e730cd
9c34bb4
466501c
284cde8
68abf10
ebfa201
3003185
e87fa16
1a8c7da
fa89e17
cf403bc
693f6b2
dc6818d
5568233
7baff4e
08a0914
6b24593
4204215
dccfc11
a0dc881
88bfc6b
988b3b1
7202979
f103540
b9abbc0
4fbc623
302da18
b9a4760
3bfe5d8
9b21786
fb2c680
aac5d47
cf243ad
5994e25
e3d3cdb
f1bc956
2070cc4
6c1eb97
21fd545
00fd2b8
e21e525
2204375
e5a2ebd
5ccf4d3
dc6dd6d
44e1977
630d213
930e5c4
9963c58
427c472
bddff60
a003034
1b5756c
17c4d09
9e536ca
24b932d
9ab0cd6
8183d1d
32c6165
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should stick to the major version of unicode to separate the unicode support. Thats easier to grasp for ppl.
For the grapheme vs not-grapheme support I suggest to expose 2 providers from the addon, one with and one w'o support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do you draw the line? The non-grapheme variant should still handle simple clusters that consist of just a single base character followed by some modifiers. Otherwise it's a regression.
Without having examined the issue, the main differentiator is whether to ignore ZWJ (zero-width-joiner), thus treating true compounds as separate clusters. Handling of Regional Indicators (flags) is another issue. It is certainly possible to add a flag that restores the old non-standard behavior.. I'd rather not (I'm not sure it's needed), but I'll do whatever you and @Tyriar agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well the problem here is, that many cmdline apps are simply not segmentation aware, unless the system provides better unicode support for them, and instead just add
wcwidth
values to guess their content width.Esp. macos is likely to create issues here, as it delivers a very old
wcwidth
impl (not sure, which unicode version it is created from, when I tried to deal with it 3ys ago, it still was v6)So to separate grapheme support from non grapheme support, I would not deactivate certain segmentation rules, but the rule handling completely falling back to old wcwidth measuring. Note that grapheme handling was not even a thing in terminals until recently, so apps are very likely to get this wrong until they adopted it as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the question is: If you paste a string containing a compound cluster into an input field in an application that just uses wcwidth and doesn't know anything about clusters, what will happen? Assume the application knows enough that wcwidth==0 means a combination. What should happen? Is this a use-case we should worry about?
In addition to spacing and alignment issues, consider what happens when editing the input field: The application thinks the cursor is one place, while the terminal thinks it is another place.
Is this something we should support? Is suggesting that people fall back to unicode6 or unicode11 unacceptable?
It is probably not too difficult for the addon to provide two implementations of
IUnicodeVersionProvider
which would differ in their implementation ofcharProperties
. Instead of callingUC.shouldJoin
you just check if the width of the current character is 0. You select one provider by setting activeVersion to"15"
and the other to"15-grpahemes"
. Presumably the latter would the default if you just load the addon.I haven't looked into how much would be involved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whatever is defined in the selected unicode version + ruleset, e.g. for no-grapheme support the old wcwidth calc.
Yes, it misses a lot of codepoints. V6 is even worse, as codepoints changed metrics in between.
Isn't it just a bypass for the joining, as the width has to be pulled anyway?
I know the situation with unicode is really messed up in terminals, even worse for xtermjs, as it cannot even determine a sane default from the system it runs on (other desktop TEs can do that). Regarding grapheme support in terminals we are in an "infrastructure trap" - someone has to make the first move (here either TE or app side), and the other side should follow. Thus I think grapheme handling should be on by default. But making it mandatory is imho wrong, as it is still pretty new in TE context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hot path optimization:
Can
UnicodeGraphemeProvider._plainNarrowProperties
be a const enum? This way the TS compiler can place the static number here at compile time omitting the lookup later on.(Note thats only possible, if
UnicodeService.createPropertyValue(UC.GRAPHEME_BREAK_Other, 1, false)
is really global const data.)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While my TypeScript knowledge is limited, I'd be surprised if you could initialize a const enum value to that of an expression involving a function call. However, it is "const global data" in that it is a value that would not depend on Unicode version.
We can manually pre-compute the value, and use that to define an enum const. We can depend on the test suite to confirm that we didn't mess up the calculation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need an explicit handling of VS15 (U+FE0E) as well? Imho VS15 should always turn emojis into text repr with only 1 cell width.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We really needed an update to the ancient East Asian Widths TR to deal with grapheme clusters and more. In addition to VS15, it would be nice to have a recommendation on how wide en/em dashes/spaces should be.
@Tyriar If perchance you have a contact that is involved with Unicode standardization, perhaps check with them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PerBothner I think they dont feel obligated to solve the monospace issue for us:
That note got added to TR-11 with unicode version 11. It basically says - "it is much more complicated, and we dont care to spec things out for the terminal". 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah well, I am not sure anymore, if VS15 should reduce emojis to 1 cell width. For VS16 it is explicitly mentioned that pictograms should be wide. There no such notion for VS15, which makes them somewhat unclear again (if width property should be applied, which can be narrow or wide for different symbols, or whether they still stick to wide). What a nonsense...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No contact 🙁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To not block on a silly thing like VS15, maybe just add a comment somewhere, that VS15 is not treated special in width handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am somewhat unhappy with
extractWidth
andextractCharKind
being implemented onUnicodeService
. It leaks facts about the impl in this addon into the core code. Even worse - it forces anyone implementing that interface to use the sameUnicodeCharProperties
layout. More about this below.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it's not the cleanest approach. However, if we support multiple UnicodeCharProperties layouts then these functions can no longer be static, which is likely to have performance implications. A static function can potentially be inlined - though I don't know of a way to have TypeScript actually do inlining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, as far as I am aware TS does not do that kind of optimization for us. I've never seen it inlining code blocks, I only know of the const enum trick for simple const value inlining. For functions/methods it will always keep the call, which might or might not be inlined by the JS JIT later on.
(@Tyriar plz correct me, if I miss here something).
I have not followed all call traces of these functions in detail, but maybe reshaping
UnicodeCharProperties
into a ref passed object as described below can help here? It would also omit the additional bit packing & extraction steps, which are known to create instruction pressure (at least in C).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my understanding but I'd need to look up the conditions in which a function is inlined. I know there's some V8 tool that lets you inspect optimizations like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again possibly affected by VS15?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: Register 2 providers at once (
15
for w'o grapheme support,15-grapheme
with it) and default the autoloader to the grapheme one. This way ppl can opt-out graphemes, if they get too many issues with graphemes stacked.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like this? (This of course is only the UnicodeGraphemeProvider part of the change.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops - I meant:
I.e. checking the current width
w
, notoldWidth
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think that should do. Then all that is needed, is to register both versions during addon loading (so both would appear under
unicode.versions
), and set the grapheme one by default. This way ppl can still opt-out the grapheme handling.