Grapheme support #1468
Comments
That's tricky 😄 We could collect all the chars in a cluster into one. It could get tricky when redrawing the screen, though.
We basically supported them when the DOM renderer was added. I'm sure it wasn't perfect (empty padding on the right?), but I think it worked. Terminal.app also seems to have similar support: #701 (comment). @jerch unrelated: I sent you an email (rockborn), not sure if you got it?
Whoops, it ended up in spam and was broken in Thunderbird, Roundcube to the rescue lol.
Yes, this would be straightforward from what we have atm.
Yeah, this should be handled before RTL. I believe this issue is the bulk of the work needed for proper RTL support, though.
I started to work on a compact representation of the grapheme class of each character. The parsed raw data from https://www.unicode.org/Public/10.0.0/ucd/auxiliary/GraphemeBreakProperty.txt is ~12 kB; I was able to get it down to 7 kB as a base64 string. That is a lot of additional data that needs to be transferred before initializing the terminal, and I'm not sure it can be packed further without using a zip algorithm. The data structure also needs to be fast to convert into some kind of lookup table, to avoid a noticeable loading lag and to lower the runtime penalty later on. Seems to be more challenging than I thought to get this right lol. Edit: @Tyriar
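To illustrate the kind of compact range data discussed here, below is a minimal sketch that stores grapheme classes as flat `[start, end, class]` triples and resolves a code point with a binary search. The `GClass` enum, the `SAMPLE_RANGES` name, and `lookupGraphemeClass` are hypothetical, and the table holds only a handful of ranges from GraphemeBreakProperty.txt; it is not the representation actually used in xterm.js.

```typescript
// Hypothetical class ids; the real property file defines more values than these.
enum GClass { Other, CR, LF, Extend, L, ZWJ, RegionalIndicator }

// Flat triples: start, end (inclusive), class. Sorted by start, non-overlapping.
// Only a small sample of the real data, chosen for illustration.
const SAMPLE_RANGES = new Uint32Array([
  0x000A, 0x000A, GClass.LF,
  0x000D, 0x000D, GClass.CR,
  0x0300, 0x036F, GClass.Extend,
  0x1100, 0x115F, GClass.L,
  0x200D, 0x200D, GClass.ZWJ,
  0x1F1E6, 0x1F1FF, GClass.RegionalIndicator,
]);

// Binary search over the triples; code points outside all ranges are "Other".
function lookupGraphemeClass(codePoint: number): GClass {
  let lo = 0;
  let hi = SAMPLE_RANGES.length / 3 - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const start = SAMPLE_RANGES[mid * 3];
    const end = SAMPLE_RANGES[mid * 3 + 1];
    if (codePoint < start) {
      hi = mid - 1;
    } else if (codePoint > end) {
      lo = mid + 1;
    } else {
      return SAMPLE_RANGES[mid * 3 + 2] as GClass;
    }
  }
  return GClass.Other;
}
```

A range table like this stays small but costs a search per character; expanding it into a direct-indexed array trades memory and startup time for faster lookups, which is the tradeoff raised in the comment above.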
@jerch is it just the ranges and the thing after the semicolon that we need from that? I.e. just:
I don't think it's typical to import JSON files in TS projects, so we could pull it into a TS file and export what we want. Also, should the lookup table be lazily initialized somehow?
@Tyriar
OK, I will put it into a TS file. And yes, the lookup table creation can be postponed until the first chars come in. It can even be split into 3 major Unicode ranges (one below 12k, a second for 42k-65k and a third above 65k), since the character type distribution differs a lot between them and each would get a slightly different lookup table layout.
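A minimal sketch of the lazy initialization idea, assuming the packed data is a small list of `[start, end, classId]` triples and that a single direct-indexed table covers the BMP. The names `PACKED_SAMPLE`, `getBmpTable`, `classOfBmpChar` and `buildCount` are made up for illustration, and the split into three ranges mentioned above is left out to keep it short.

```typescript
// Hypothetical packed source data: [start, end (inclusive), classId].
const PACKED_SAMPLE: Array<[number, number, number]> = [
  [0x0300, 0x036F, 1], // Extend (combining diacritical marks)
  [0x200D, 0x200D, 2], // ZWJ
];

let bmpTable: Uint8Array | undefined; // stays undefined until the first lookup
let buildCount = 0;                   // only here to show that the table is built once

// Expand the packed ranges into a direct-indexed table on first use.
function getBmpTable(): Uint8Array {
  if (bmpTable === undefined) {
    buildCount++;
    const table = new Uint8Array(0x10000); // all entries default to 0 ("other")
    for (let r = 0; r < PACKED_SAMPLE.length; r++) {
      const start = PACKED_SAMPLE[r][0];
      const end = PACKED_SAMPLE[r][1];
      const cls = PACKED_SAMPLE[r][2];
      for (let cp = start; cp <= end; cp++) {
        table[cp] = cls;
      }
    }
    bmpTable = table;
  }
  return bmpTable;
}

// Code points above the BMP are not covered by this sketch and map to 0.
function classOfBmpChar(codePoint: number): number {
  return codePoint < 0x10000 ? getBmpTable()[codePoint] : 0;
}
```

With this shape, a terminal that never receives any input pays nothing for the table, and the one-time build cost is moved to the first character that needs a lookup.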
Scratch that, this is an option; JSON is typed automatically when you import it in TS 2.9 https://blogs.msdn.microsoft.com/typescript/2018/05/31/announcing-typescript-2-9/
Just read some specs around Unicode: to get RTL working with Unicode we will have to support the bidi algorithm. No clue yet how this is going to fit into the cell/cursor movement paradigm. Early VTs had an explicit RTL mode, but I think we can't rely on that anymore, since Unicode bidi can mix LTR/RTL just by character occurrence (the old VT RTL mode would hard-switch the cursor movement to right-to-left). This is for sure a topic/issue of its own.
So go with JSON or stick to the TS file?
@jerch up to you 😄
Not sure if this is only a Linux bug: while Firefox does something closer to the right thing, it seems Chrome (v65) can't render these grapheme clusters correctly on canvas:
Technically this is a string of 6 perceived characters built from 76 chars with grapheme clusters (although they are nonsense chars, they are valid spec-wise). Output in xterm.js with Firefox: Output in xterm.js with Chrome: The cursor is in the right position, but the combining chars don't combine at all; they simply get printed to the right. Seems
@jerch could it be that the font Chrome chooses doesn't know how to build the character?
@Tyriar Not sure yet what causes it; at least changing the font changes the faulty output to some other faulty output. Does anyone know if Chrome uses the system font renderer or draws stuff itself?
Not sure
Closing this for now, as grapheme support is not (yet) worth implementing.
I think grapheme support needs to be discussed before going down that road.
Grapheme clusters are the Unicode way to represent user-perceived characters built from several consecutive Unicode chars in a string. They are needed by some writing systems to draw the correct glyph, e.g. drawing the chars independently leads to a different glyph output than drawing them at once.
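As a small illustration of "one perceived character, several chars", the sketch below counts UTF-16 code units and code points for two strings that each display as a single character. `countCodePoints` is a hypothetical helper written for this example; it does not perform grapheme segmentation, it only shows why neither count matches what the user sees.

```typescript
// Count Unicode code points by treating a valid surrogate pair as one unit.
function countCodePoints(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms one code point.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++;
      }
    }
    count++;
  }
  return count;
}

// 'e' followed by U+0301 COMBINING ACUTE ACCENT: one perceived character.
const eWithCombiningAcute = "e\u0301";
// U+1F1E9 U+1F1EA (two regional indicators) written as surrogate pairs: one perceived flag.
const regionalIndicatorPair = "\uD83C\uDDE9\uD83C\uDDEA";

const eUnits = eWithCombiningAcute.length;                      // 2 UTF-16 code units
const eCodePoints = countCodePoints(eWithCombiningAcute);       // 2 code points
const flagUnits = regionalIndicatorPair.length;                 // 4 UTF-16 code units
const flagCodePoints = countCodePoints(regionalIndicatorPair);  // 2 code points
```

In both cases the renderer has to receive the whole sequence at once to produce the intended glyph.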
The problem with a terminal environment: the terminal typically has a grid-based layout where a single fixed-size cell represents a character and/or a possible cursor position. In a monospaced ASCII world this is perfect: every printable char moves the cursor by one cell and can be rendered at that position. It does not work that easily as soon as Unicode enters the stage. The first problems are characters defined as fullwidth chars, used by many Asian languages, and combining chars, used all over the place in many writing systems. xterm.js can handle this atm with a typical wcwidth implementation and the cell-based model, just as well or badly as most other terminal emulators do.
Grapheme clusters are kind of the next level of this problem: they typically join multiple cells, i.e. the several cells that wcwidth plus the combining char handling would output have to be combined into ONE perceived character. Compared to the current fullwidth char handling:
and the current combining handling:
and the current fullwidth + combining handling:
it is not that easy with grapheme support anymore. They can be built from any combination of the above (limited by the grapheme breaking algorithm of course), which raises several questions:
char+combining
) to make sure they end up together in one string and do not get rendered separately (which is also a requirement for grapheme clusters). Currently this is possible because the combining chars always have a wcwidth of 0. In a grapheme cluster a following char might have a wcwidth != 0; adding those up in the first cell will break the cursor movement. How to handle the cursor here is still unclear to me.
It seems I can't find a terminal that supports grapheme clusters yet; I'm not sure whether this has been done before for a grid-based monospace environment at all. Maybe some code editors have implemented this before. How about iTerm2?
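To make the cursor problem described above concrete, here is a minimal sketch that sums per-char widths over one cluster. `toyWcwidth` is a deliberately tiny, hypothetical stand-in for a real wcwidth table (it knows only two ranges and treats everything else as width 1), so the numbers only hold under that assumption.

```typescript
// Toy stand-in for wcwidth: only two well-known ranges, NOT a complete table.
function toyWcwidth(codePoint: number): number {
  if (codePoint >= 0x0300 && codePoint <= 0x036F) return 0; // combining diacritical marks
  if (codePoint >= 0x4E00 && codePoint <= 0x9FFF) return 2; // CJK unified ideographs
  return 1; // assumption: everything else takes a single cell
}

// Width the first cell would claim if all code points of a cluster were added up.
function summedWidth(cluster: number[]): number {
  let width = 0;
  for (let i = 0; i < cluster.length; i++) {
    width += toyWcwidth(cluster[i]);
  }
  return width;
}

const charPlusCombining = [0x65, 0x0301];        // 'e' + combining acute
const fullwidthPlusCombining = [0x4E2D, 0x0301]; // fullwidth char + combining mark
const regionalIndicators = [0x1F1E9, 0x1F1EA];   // two chars, one perceived flag

const widthA = summedWidth(charPlusCombining);      // 1: the combining char adds 0
const widthB = summedWidth(fullwidthPlusCombining); // 2: still driven by the base char
const widthC = summedWidth(regionalIndicators);     // 2 here: both chars have width != 0
```

The first two cases work today because everything after the base char contributes 0. In the third case the sum is no longer determined by the first char alone, and whether the cursor should advance by that sum or by the width of the rendered glyph is exactly the open question.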
Maybe someone could share some experience regarding this topic, so we can avoid basic misconceptions and flaws from the beginning.