Grapheme support #1468

jerch · 2018-05-23T09:43:50Z

I think grapheme support needs to be discussed before going down that road.

Grapheme clusters are the Unicode way to represent user-perceived characters built from several consecutive Unicode chars in a string. Some writing systems need them to draw the correct glyph, e.g. drawing the chars independently leads to a different glyph output than drawing them at once.

The problem with a terminal environment: the terminal typically has a grid-based layout where a single cell of fixed size represents a character and/or a possible cursor position. In a monospaced ASCII world this is perfect, every printable char moves the cursor by one cell and can be rendered at that position. It is not that easy as soon as Unicode enters the stage. The first problems are fullwidth chars, used by many Asian languages, and combining chars, used all over the place in many writing systems. xterm.js can handle these atm with a typical wcwidth implementation and the cell-based model, just as well/badly as most other terminal emulators do.

Grapheme clusters are kinda the next level of this problem: they typically join multiple cells that wcwidth + the combining char handling would output, so several combined cells become ONE perceived character. Compared to the current fullwidth char handling:

[..., fullwidth_char, null, ...]  --> 2 cells for fullwidth char

and the current combining handling:

[..., char+combining, ...] --> 1 cell for char with combining

and the current fullwidth + combining handling:

[..., fullwidth_char+combining, null, ...]  --> 2 cells for fullwidth char with combining

it is not that easy with grapheme support anymore. They can be built from any combination of the above (limited by the grapheme breaking algorithm of course), which raises several questions:

Where to add the chars up? The current combining handling adds modifier characters to the first cell (char+combining) to make sure they end up together in one string and do not get rendered separately (which is also a requirement for grapheme clusters). Currently this is possible because the combining chars always have a wcwidth of 0. In a grapheme cluster a following char might have a wcwidth != 0, and adding those up in the first cell will break the cursor movement. How to handle the cursor here is still unclear to me.
How to deal with the sum of wcwidths? In a grapheme cluster the sum of individual wcwidths is likely to be bigger than the wcwidth of the final user-perceived character (they typically get merged into something new). Here the grid-based monospace environment might create ugly space between grapheme clusters if we enforce grid alignment. A word processor with variable font width does not suffer from this, it can just align the stuff as needed. No clue yet what to do here.
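
To make the three current layouts above concrete, here is a minimal sketch of the cell model. It is not the xterm.js code: `Cell`, `toyWcwidth` and `layoutCells` are hypothetical names, the width lookup only knows a handful of ranges, and an empty cell of width 0 stands in for the `null` placeholder.

```typescript
// One grid cell. An empty cell with width 0 plays the role of the `null`
// placeholder that follows a fullwidth char in the layouts above.
interface Cell {
  chars: string;
  width: number;
}

// Toy width lookup (assumption, NOT the xterm.js wcwidth): only a few ranges.
function toyWcwidth(codepoint: number): number {
  // Combining Diacritical Marks block
  if (codepoint >= 0x0300 && codepoint <= 0x036f) return 0;
  // CJK Unified Ideographs and Hangul Syllables
  if ((codepoint >= 0x4e00 && codepoint <= 0x9fff) ||
      (codepoint >= 0xac00 && codepoint <= 0xd7a3)) return 2;
  return 1;
}

function layoutCells(input: string): Cell[] {
  const cells: Cell[] = [];
  let base = -1; // index of the cell holding the most recent base char
  for (const ch of input) {
    const width = toyWcwidth(ch.codePointAt(0)!);
    if (width === 0) {
      // Combining char: append to the base cell, the cursor does not move.
      // A combining char without any preceding base char is dropped here.
      if (base >= 0) cells[base].chars += ch;
      continue;
    }
    base = cells.length;
    cells.push({ chars: ch, width });
    // A fullwidth char occupies a second, empty cell.
    if (width === 2) cells.push({ chars: "", width: 0 });
  }
  return cells;
}
```

For example, `layoutCells("\u4f60\u0301x")` yields three cells: the fullwidth char with its combining mark, the empty placeholder, and `x`. The sketch works only because every appended char has width 0, which is exactly the assumption grapheme clusters break.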

It seems I can't find a terminal that supports grapheme clusters yet, and I'm not sure if this has been done before for a grid-based monospace environment at all. Maybe some code editors have implemented this before, how about iTerm2?

Maybe someone could share some experience regarding this topic so we can avoid basic misconceptions and flaws from the beginning.


Tyriar · 2018-05-23T14:30:07Z

> In a grapheme cluster the sum of individual wcwidths is likely to be bigger than the wcwidth of the final user-perceived character (they typically get merged into something new). Here the grid-based monospace environment might create ugly space between grapheme clusters if we enforce grid alignment.

That's tricky 😄

We could collect all the chars in a cluster into one CharData and set the width as appropriate, then follow that with width - 1 zero width chars. Basically extending the CJK method to work with n width chars.
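
A rough sketch of that idea, under assumptions: `CharDataSketch` and `writeCluster` are hypothetical names and not the actual xterm.js `CharData` layout, and the caller is assumed to have already segmented the cluster and decided its display width.

```typescript
// Simplified stand-in for a cell entry (assumption, not the real CharData).
interface CharDataSketch {
  chars: string;
  width: number;
}

// Stores a whole cluster in one cell and pads with width - 1 empty cells,
// the same way a fullwidth CJK char is followed by one empty cell.
// Returns the new cursor column.
function writeCluster(
  row: CharDataSketch[],
  col: number,
  cluster: string,
  width: number
): number {
  if (width < 1) throw new RangeError("a cluster must occupy at least one cell");
  if (col < 0 || col + width > row.length) {
    throw new RangeError("cluster does not fit into the row");
  }
  row[col] = { chars: cluster, width };
  for (let i = 1; i < width; i++) {
    row[col + i] = { chars: "", width: 0 };
  }
  return col + width;
}
```

The sketch deliberately leaves out what happens when a later write lands on one of the padding cells, which is presumably where redrawing gets tricky.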

It could get tricky when redrawing the screen though.

> It seems I can't find a terminal that supports grapheme clusters yet, and I'm not sure if this has been done before for a grid-based monospace environment at all.

We basically supported them when the DOM renderer was added, I'm sure it wasn't perfect (empty padding on right?), but I think it worked. Terminal.app also seems to have similar support #701 (comment)

@jerch unrelated: I sent you an email (rockborn), not sure if you got it?

jerch · 2018-05-23T15:06:44Z

> @jerch unrelated: I sent you an email (rockborn), not sure if you got it?

Whoops, it ended up in spam and was broken in Thunderbird, Roundcube to the rescue lol.

> We could collect all the chars in a cluster into one CharData and set the width as appropriate, then follow that with width - 1 zero width chars. Basically extending the CJK method to work with n width chars.

Yes, this would be straightforward from what we have atm.

jerch · 2018-05-23T16:03:19Z

@Tyriar Ah yes #701 is full of good refs to get something started. And Terminal.app does some magic, where is the code hosted again? Just kidding...

Edit: To keep things comprehensible I would not mess around with RTL for now.

Tyriar · 2018-05-23T18:29:46Z

Yeah this should be handled before RTL, I believe this issue is the bulk of the work needed for proper RTL support though.

jerch · 2018-05-28T17:48:29Z

I started to work on a compact representation of the grapheme classes for a given character. The parsed raw data from https://www.unicode.org/Public/10.0.0/ucd/auxiliary/GraphemeBreakProperty.txt is ~12kB, and I was able to get it down to 7kB as a base64 string representation.

That is a lot of additional data that needs to be transferred before initializing the terminal, not sure if this can be packed further without using a zip algo. The data structure also needs to be fast to convert into some lookup table thingy to avoid a weird loading lag and to lower the runtime penalty later on. Seems to be more challenging than I thought to get this right lol.

Edit:
Got it down to 4kB for BMP chars as base64 string. For browser setups it zips down to 800 bytes - small enough to not worry about it imho.

@Tyriar
Do we have a "best practice" how to include data files into the project? As a json file? Or wrap the data into some .ts file?

Tyriar · 2018-05-29T13:37:08Z

@jerch is it just the ranges and the thing after the semicolon we need from that? I.e. just 0000..0009 and Control, not all of 0000..0009 ; Control # Cc [10] <control-0000>..<control-0009>?
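
For illustration, extracting just those two fields could look like the sketch below. `BreakRange` and `parseBreakProperty` are hypothetical names, and the code assumes the usual UCD line layout of `range ; property # comment`, with single codepoints written without `..`.

```typescript
interface BreakRange {
  start: number;
  end: number;
  type: string;
}

// Keeps only the codepoint range and the property name of each data line.
function parseBreakProperty(text: string): BreakRange[] {
  const ranges: BreakRange[] = [];
  for (const rawLine of text.split("\n")) {
    // Everything after '#' is a comment; blank and comment-only lines are skipped.
    const line = rawLine.split("#")[0].trim();
    if (line === "") continue;
    const fields = line.split(";").map(part => part.trim());
    if (fields.length < 2) continue; // not a data line
    const bounds = fields[0].split("..");
    const start = parseInt(bounds[0], 16);
    // A single codepoint is a range of length one.
    const end = bounds.length > 1 ? parseInt(bounds[1], 16) : start;
    ranges.push({ start, end, type: fields[1] });
  }
  return ranges;
}
```

Feeding it the line quoted above gives `{ start: 0, end: 9, type: "Control" }`.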

> Do we have a "best practice" how to include data files into the project? As a json file? Or wrap the data into some .ts file?

I don't think it's typical to import json files in TS projects, so we could pull it into a TS file and export what we want. Also should the lookup table be lazily initialized somehow?

jerch · 2018-05-29T19:05:27Z

@Tyriar
Yup, we only need the codepoint ranges and the "character type" (those Control, ... entries) from that file to have all needed information. I ended up encoding the information into 256th codepoint chunks with a length and a type attribute (is about 3kB, as base64 ~4k). I would not pack it further with a zip algo since 4kB is almost negligible for local builds, and browser-based builds tend to have a transport zipper anyway.

> I don't think it's typical to import json files in TS projects, so we could pull it into a TS file and export what we want. Also should the lookup table be lazily initialized somehow?

Ok I will put it into a ts file. And yes, the lookup table creation can be postponed until the first chars fly in. It can be even split into 3 major unicode ranges, one for lower than 12k, 2nd for 42k - 65k and 3rd for >65k due to very different character type distribution with slightly different lookup table layout.
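
A minimal sketch of the postponed table creation, assuming a single flat table for the BMP instead of the three-range split described above. The sample ranges, type codes and the name `graphemeType` are made up for illustration; the real data would come from GraphemeBreakProperty.txt.

```typescript
// Numeric type codes (assumption: any compact numbering would do).
const OTHER = 0;
const CONTROL = 1;
const EXTEND = 2;

// Hypothetical sample ranges as [start, end, type], both ends inclusive.
const BMP_RANGES: Array<[number, number, number]> = [
  [0x0000, 0x0009, CONTROL],
  [0x0300, 0x036f, EXTEND],
];

// Stays undefined until the first lookup, so loading the module costs nothing.
let bmpTable: Uint8Array | undefined;

function graphemeType(codepoint: number): number {
  // This sketch covers the BMP only; everything else is reported as OTHER.
  if (codepoint < 0 || codepoint > 0xffff) return OTHER;
  if (bmpTable === undefined) {
    // Typed arrays are zero-filled, so every entry starts out as OTHER.
    const table = new Uint8Array(0x10000);
    for (const [start, end, type] of BMP_RANGES) {
      table.fill(type, start, end + 1);
    }
    bmpTable = table;
  }
  return bmpTable[codepoint];
}
```

One byte per BMP codepoint costs 64kB of memory at runtime, which trades space for a branch-free lookup; a chunked layout would be smaller but needs an extra indirection.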

Tyriar · 2018-05-31T21:33:06Z

> I don't think it's typical to import json files in TS projects

Scratch that, this is an option; JSON is typed automatically when you import in TS 2.9 https://blogs.msdn.microsoft.com/typescript/2018/05/31/announcing-typescript-2-9/

jerch · 2018-06-04T05:22:36Z

> Yeah this should be handled before RTL, I believe this issue is the bulk of the work needed for proper RTL support though.

Just read some specs around Unicode: to get RTL working with Unicode we will have to support the bidi algorithm. No clue yet how this is gonna fit into the cell/cursor movement paradigm. Early VTs had an explicit RTL mode, but I think we can't rely on that anymore since Unicode bidi can mix LTR/RTL just by character occurrence (the old VT-RTL mode would hard-switch the cursor movement to right to left). This is for sure a topic/issue of its own.

> Scratch that, this is an option; JSON is typed automatically when you import in TS 2.9 https://blogs.msdn.microsoft.com/typescript/2018/05/31/announcing-typescript-2-9/

So go with JSON or stick to the ts file?

Tyriar · 2018-06-04T07:21:30Z

> So go with JSON or stick to the ts file?

@jerch up to you 😄

jerch · 2018-06-04T14:24:59Z

Not sure if this is only a Linux bug: while Firefox does something closer to the right thing, it seems Chrome (v65) can't render these grapheme clusters correctly on canvas:

"Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"

Technically this is a string of 6 perceived characters built from 76 chars with grapheme clusters (nonsense chars, but valid spec-wise).
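
The gap between perceived characters and stored chars can be checked with a rough approximation: treat every base char plus its trailing combining marks (Unicode general category M) as one cluster. This is only a sketch under that assumption, not the full grapheme breaking algorithm, and `countPerceived` is a hypothetical helper.

```typescript
// One non-mark codepoint followed by any number of combining marks.
// Built via the RegExp constructor; needs a runtime with Unicode property
// escapes (ES2018). Marks at the very start of the string are not counted.
const CLUSTER = new RegExp("\\P{M}\\p{M}*", "gu");

function countPerceived(text: string): number {
  const clusters = text.match(CLUSTER);
  return clusters === null ? 0 : clusters.length;
}
```

For a short sample such as `"Z\u0351\u036bA\u0334"` this counts 2 perceived characters, while `.length` reports 5 chars.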

Output in xterm.js with firefox:

Output in xterm.js with chrome:

The cursor is in the right position, but the combining chars don't combine at all, they get simply printed to the right. Seems fillText in chrome can't handle this (you can test it with the string above in the testbed here https://developer.mozilla.org/en-US/docs/Web/API/CanvasRenderingContext2D/fillText).

Tyriar · 2018-06-09T10:19:19Z

@jerch could it be that the font Chrome chooses to use doesn't know how to build the character?

jerch · 2018-06-09T16:37:52Z

@Tyriar Not sure yet what causes it, at least changing the font changes the faulty output to some other faulty output. Does anyone know if chrome uses the system font renderer or draws stuff itself?

Tyriar · 2018-06-10T19:15:00Z

> Does anyone know if chrome uses the system font renderer or draws stuff itself?

Not sure

jerch · 2019-11-15T00:38:49Z

Closing this for now as grapheme support is not (yet) worth implementing.

jerch mentioned this issue May 23, 2018

revamp wcwidth #1467

Closed

Tyriar added the type/enhancement Features or improvements to existing features label May 23, 2018

Tyriar mentioned this issue May 28, 2018

Indic Texts microsoft/vscode#50552

Closed

jerch mentioned this issue Sep 24, 2018

Update wcwidth/CharWidth.ts #1707

Closed

shreevatsa mentioned this issue Nov 27, 2018

Wrong width for Hindi on macOS, but correct width on Linux jquast/wcwidth#25

Closed

Tyriar changed the title ~~grapheme support~~ Grapheme support Oct 7, 2019

jerch self-assigned this Nov 15, 2019

jerch closed this as completed Nov 15, 2019
