Grapheme support #1468

Closed
jerch opened this issue May 23, 2018 · 15 comments

@jerch
Member

jerch commented May 23, 2018

I think grapheme support needs to be discussed before going down that road.

Grapheme clusters are the Unicode way to represent user-perceived characters built from several consecutive Unicode chars in a string. Some writing systems need them to draw the correct glyph, e.g. drawing the chars independently leads to a different glyph output than drawing them at once.

The problem with a terminal environment: the terminal typically has a grid-based layout where a single cell of fixed size represents a character and/or a possible cursor position. In a monospaced ASCII world this is perfect, every printable char moves the cursor by one cell and can be rendered at that position. It does not work that easily as soon as Unicode enters the stage. The first problems are characters defined as fullwidth, used by many Asian languages, and combining chars, used in many writing systems. xterm.js can handle these atm with a typical wcwidth implementation and the cell-based model, just as well (or as badly) as most other terminal emulators do.

Grapheme clusters are kinda the next level of this problem: they typically span multiple cells. What wcwidth plus the combining char handling would output as several cells has to be combined into ONE perceived character. Compared to the current fullwidth char handling:

[..., fullwidth_char, null, ...]  --> 2 cells for fullwidth char

and the current combining handling:

[..., char+combining, ...] --> 1 cell for char with combining

and the current fullwidth + combining handling:

[..., fullwidth_char+combining, null, ...]  --> 2 cells for fullwidth char with combining

it is not that easy with grapheme support anymore. They can be built from any combination of the above (limited by the grapheme breaking algorithm of course), which raises several questions:

  • Where to add the chars up? The current combining handling adds modifier characters to the first cell (char+combining) to make sure they end up together in one string and do not get rendered separately (which is also a requirement for grapheme clusters). Currently this is possible because the combining chars always have a wcwidth of 0. In a grapheme cluster a following char might have a wcwidth != 0, and adding those up in the first cell will break the cursor movement. How to handle the cursor here is still unclear to me.
  • How to deal with the sum of wcwidths? In a grapheme cluster the sum of the individual wcwidths is likely to be bigger than the wcwidth of the final user-perceived character (they typically get merged into something new). Here the grid-based monospace environment might create ugly space between grapheme clusters if we enforce grid alignment. A word processor with variable font width does not suffer from this, it can just align the stuff as needed. No clue yet what to do here.
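To make the three current cases and the first open question more concrete, here is a minimal sketch of the cell model. It is not xterm.js's actual buffer code: `LayoutCell`, `toyWcwidth` and `layout` are hypothetical names, and the width function is a heavily simplified stand-in for wcwidth that only knows two BMP ranges.

```typescript
// Toy model of the cell layout described above (not xterm.js's real buffer).
// A cell holds a string plus its width; null marks the second half of a
// fullwidth char.
type LayoutCell = { chars: string; width: number } | null;

// Hypothetical, heavily simplified stand-in for wcwidth (BMP code units only).
function toyWcwidth(code: number): number {
  if (code >= 0x0300 && code <= 0x036f) return 0; // combining diacritical marks
  if (code >= 0x4e00 && code <= 0x9fff) return 2; // CJK unified ideographs
  return 1;
}

function layout(text: string): LayoutCell[] {
  const cells: LayoutCell[] = [];
  let last = -1; // index of the last non-null cell
  for (let i = 0; i < text.length; i++) {
    const width = toyWcwidth(text.charCodeAt(i));
    const prev = last >= 0 ? cells[last] : null;
    if (width === 0 && prev) {
      // Width 0: append to the previous cell, the cursor does not move.
      prev.chars += text.charAt(i);
      continue;
    }
    // Everything else opens a new cell (a leading combining char too).
    last = cells.length;
    cells.push({ chars: text.charAt(i), width: width });
    if (width === 2) cells.push(null); // placeholder for the second column
  }
  return cells;
}
```

With this model `layout("a\u0301")` gives one cell and `layout("\u4e2d\u0301")` gives a cell plus a `null` placeholder. `layout("\u0915\u0940")` gives two separate cells, although U+0940 is a SpacingMark and both chars form one extended grapheme cluster. That is the "following char with wcwidth != 0" case from the first bullet.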

Seems I can't find a terminal that supports grapheme clusters yet, not sure if this was done before for a grid-based monospace environment at all. Maybe some code editors have implemented this before, how about iTerm2?

Maybe someone could share some experience regarding this topic so we can avoid basic misconceptions and flaws from the beginning.

@jerch jerch mentioned this issue May 23, 2018
@Tyriar
Member

Tyriar commented May 23, 2018

> In a grapheme cluster the sum of individual wcwidths is likely to be bigger than the wcwidth of the final user perceived character (they get merged into something new typically). Here the grid based monospace environment might create ugly space between grapheme clusters if we enforce grid alignment.

That's tricky 😄

We could collect all the chars in a cluster into one CharData and set the width as appropriate, then follow that with width - 1 zero width chars. Basically extending the CJK method to work with n width chars.
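A rough sketch of that idea, assuming the cluster boundaries and the final width of the cluster are already known from somewhere else. `ClusterCell` and `writeCluster` are hypothetical names for illustration, the real CharData layout in xterm.js looks different.

```typescript
// Hypothetical cell type; xterm.js's real CharData looks different.
interface ClusterCell {
  chars: string; // all chars of the cluster, or '' for a placeholder cell
  width: number; // n for the first cell of a cluster, 0 for placeholders
}

// Writes one grapheme cluster at column `col` and returns the new cursor
// column. `width` is the number of cells the whole cluster should occupy
// (assumed to be computed elsewhere).
function writeCluster(row: ClusterCell[], col: number, cluster: string, width: number): number {
  if (width < 1 || col + width > row.length) {
    return col; // sketch only: no wrapping, nothing is written
  }
  row[col] = { chars: cluster, width: width };
  for (let i = 1; i < width; i++) {
    row[col + i] = { chars: '', width: 0 }; // the "width - 1" zero width cells
  }
  return col + width;
}
```

For width 2 this degenerates to the existing fullwidth handling, a redraw would have to start at the first cell of a cluster and skip its placeholders.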

It could get tricky when redrawing the screen though.

> Seems I can't find a terminal that supports grapheme clusters yet, not sure if this was done before for a grid based monospace environment at all.

We basically supported them when the DOM renderer was added, I'm sure it wasn't perfect (empty padding on right?), but I think it worked. Terminal.app also seems to have similar support #701 (comment)


@jerch unrelated: I sent you an email (rockborn), not sure if you got it?

@Tyriar Tyriar added the type/enhancement Features or improvements to existing features label May 23, 2018
@jerch
Member Author

jerch commented May 23, 2018

> @jerch unrelated: I sent you an email (rockborn), not sure if you got it?

Whoops, it ended up in spam and was broken in Thunderbird, Roundcube to the rescue lol.

> We could collect all the chars in a cluster into one CharData and set the width as appropriate, then follow that with width - 1 zero width chars. Basically extending the CJK method to work with n width chars.

Yes, this would be straightforward from what we have atm.

@jerch
Member Author

jerch commented May 23, 2018

@Tyriar Ah yes #701 is full of good refs to get something started. And Terminal.app does some magic, where is the code hosted again? Just kidding...

Edit: To keep things comprehensible I would not mess around with RTL for now.

@Tyriar
Member

Tyriar commented May 23, 2018

Yeah this should be handled before RTL, I believe this issue is the bulk of the work needed for proper RTL support though.

@jerch
Member Author

jerch commented May 28, 2018

I started to work on a compact representation of the grapheme classes of the characters. The parsed raw data from https://www.unicode.org/Public/10.0.0/ucd/auxiliary/GraphemeBreakProperty.txt is ~12kB, and I was able to get it down to 7kB as a base64 string representation.

That is a lot of additional data that needs to be transferred before initializing the terminal, not sure if this can be packed further without using a zip algo. The data structure also needs to be fast to convert into some lookup table thingy to avoid a weird loading lag and to lower the runtime penalty later on. Seems to be more challenging than I thought to get this right lol.

Edit:
Got it down to 4kB for BMP chars as base64 string. For browser setups it zips down to 800 bytes - small enough to not worry about it imho.

@Tyriar
Do we have a "best practice" for how to include data files in the project? As a JSON file? Or wrap the data in some .ts file?

@Tyriar
Member

Tyriar commented May 29, 2018

@jerch is it just the ranges and the thing after the semicolon that we need from that? I.e. just 0000..0009 and Control, not all of 0000..0009 ; Control # Cc [10] <control-0000>..<control-0009>?

> Do we have a "best practice" for how to include data files in the project? As a JSON file? Or wrap the data in some .ts file?

I don't think it's typical to import json files in TS projects, so we could pull it into a TS file and export what we want. Also should the lookup table be lazily initialized somehow?

@jerch
Member Author

jerch commented May 29, 2018

@Tyriar
Yup, we only need the codepoint ranges and the "character type" (those Control, ... entries) from that file to have all the needed information. I ended up encoding the information into 256th codepoint chunks with a length and a type attribute (about 3kB, as base64 ~4kB). I would not pack it further with a zip algo since 4kB is almost negligible for local builds, and browser-based builds tend to have a transport zipper anyway.
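Pulling just those two pieces out of a line of GraphemeBreakProperty.txt could look roughly like this. This is a sketch of the extraction step only, not the packing described above, and `BreakRange` and `parseBreakLine` are made-up names.

```typescript
interface BreakRange {
  start: number; // first codepoint of the range
  end: number;   // last codepoint of the range (inclusive)
  type: string;  // e.g. "Control", "Extend", "LF"
}

// Parses one line like
//   "0000..0009    ; Control # Cc  [10] <control-0000>..<control-0009>"
// or a single codepoint line like "000A ; LF # Cc <control-000A>".
// Returns null for comment lines and empty lines.
function parseBreakLine(line: string): BreakRange | null {
  const data = line.split('#')[0]; // drop the trailing comment
  const m = /^\s*([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)/.exec(data);
  if (!m) return null;
  const start = parseInt(m[1], 16);
  const end = m[2] ? parseInt(m[2], 16) : start;
  return { start: start, end: end, type: m[3] };
}
```

Everything after the `#` is only a human-readable annotation, so dropping it loses nothing that the break algorithm needs.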

> I don't think it's typical to import json files in TS projects, so we could pull it into a TS file and export what we want. Also should the lookup table be lazily initialized somehow?

Ok I will put it into a ts file. And yes, the lookup table creation can be postponed until the first chars fly in. It can be even split into 3 major unicode ranges, one for lower than 12k, 2nd for 42k - 65k and 3rd for >65k due to very different character type distribution with slightly different lookup table layout.
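The postponed table creation could be sketched like this. It is a simplified version that only covers the BMP with one flat table instead of the three ranges mentioned above. The `PackedRange` format, the type ids and `createLazyLookup` are assumptions for illustration, not the actual encoding.

```typescript
// Hypothetical packed format: [first codepoint, length, type id].
type PackedRange = [number, number, number];

interface LazyLookup {
  lookup(code: number): number; // 0 means "no special type"
  isBuilt(): boolean;
}

function createLazyLookup(ranges: PackedRange[]): LazyLookup {
  let table: Uint8Array | null = null;

  function build(): Uint8Array {
    const t = new Uint8Array(0x10000); // one byte per BMP codepoint, zero filled
    for (let i = 0; i < ranges.length; i++) {
      const first = ranges[i][0];
      const end = Math.min(first + ranges[i][1], 0x10000);
      for (let c = first; c < end; c++) t[c] = ranges[i][2];
    }
    return t;
  }

  return {
    lookup: function (code: number): number {
      if (code < 0 || code > 0xffff) return 0; // sketch: BMP only
      if (table === null) table = build(); // built when the first char flies in
      return table[code];
    },
    isBuilt: function (): boolean {
      return table !== null;
    },
  };
}

// Usage with a single range, 0300..036F as type id 2 (ids are made up here).
const lazy = createLazyLookup([[0x0300, 0x70, 2]]);
console.log(lazy.isBuilt(), lazy.lookup(0x0301), lazy.isBuilt());
```

The 64kB flat table trades memory for a branch-free lookup. A binary search over the ranges would need no table at all but costs more per char.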

@Tyriar
Member

Tyriar commented May 31, 2018

> I don't think it's typical to import json files in TS projects

Scratch that, this is an option; JSON is typed automatically when you import in TS 2.9 https://blogs.msdn.microsoft.com/typescript/2018/05/31/announcing-typescript-2-9/

@jerch
Member Author

jerch commented Jun 4, 2018

> Yeah this should be handled before RTL, I believe this issue is the bulk of the work needed for proper RTL support though.

Just read some specs around Unicode: to get RTL working with Unicode we will have to support the bidi algorithm. No clue yet how this is gonna fit into the cell/cursor movement paradigm. Early VTs had an explicit RTL mode, but I think we can't rely on that anymore since Unicode bidi can mix LTR/RTL just by character occurrence (the old VT RTL mode would hard-switch the cursor movement to right-to-left). This is for sure a topic/issue of its own.

> Scratch that, this is an option; JSON is typed automatically when you import in TS 2.9 https://blogs.msdn.microsoft.com/typescript/2018/05/31/announcing-typescript-2-9/

So go with JSON or stick to the ts file?

@Tyriar
Member

Tyriar commented Jun 4, 2018

> So go with JSON or stick to the ts file?

@jerch up to you 😄

@jerch
Member Author

jerch commented Jun 4, 2018

Not sure if this is only a Linux bug: while Firefox does something closer to the right thing, it seems Chrome (v65) can't render these grapheme clusters correctly on canvas:

"Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞"

Technically this is a string of 6 perceived characters built from 76 chars with grapheme clusters (though nonsense chars, they are valid spec-wise).
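The difference between the two counts can be shown with a toy splitter. It is only an approximation: it treats a base UTF-16 unit plus any directly following combining diacritical marks (U+0300..U+036F) as one cluster, which is far less than the real grapheme breaking rules. `isCombiningMark` and `splitSimpleClusters` are hypothetical helpers.

```typescript
// Toy approximation of grapheme clustering, only good enough for strings
// made of BMP base chars followed by marks from U+0300..U+036F.
function isCombiningMark(code: number): boolean {
  return code >= 0x0300 && code <= 0x036f;
}

function splitSimpleClusters(text: string): string[] {
  const clusters: string[] = [];
  for (let i = 0; i < text.length; i++) {
    if (isCombiningMark(text.charCodeAt(i)) && clusters.length > 0) {
      // A mark sticks to the cluster before it.
      clusters[clusters.length - 1] += text.charAt(i);
    } else {
      clusters.push(text.charAt(i));
    }
  }
  return clusters;
}

// "Z" with two marks and "A" with one mark: 5 chars, 2 perceived characters.
const sample = 'Z\u0351\u036bA\u0334';
console.log(sample.length, splitSimpleClusters(sample).length); // 5 2
```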

Output in xterm.js with firefox:
[screenshot]

Output in xterm.js with chrome:
[screenshot]

The cursor is in the right position, but the combining chars don't combine at all, they get simply printed to the right. Seems fillText in chrome can't handle this (you can test it with the string above in the testbed here https://developer.mozilla.org/en-US/docs/Web/API/CanvasRenderingContext2D/fillText).

@Tyriar
Member

Tyriar commented Jun 9, 2018

@jerch could it be that the font Chrome chooses to use doesn't know how to build the character?

@jerch
Member Author

jerch commented Jun 9, 2018

@Tyriar Not sure yet what causes it, at least changing the font changes the faulty output to some other faulty output. Does anyone know if chrome uses the system font renderer or draws stuff itself?

@Tyriar
Member

Tyriar commented Jun 10, 2018

> Does anyone know if chrome uses the system font renderer or draws stuff itself?

Not sure

@Tyriar Tyriar changed the title grapheme support Grapheme support Oct 7, 2019
@jerch jerch self-assigned this Nov 15, 2019
@jerch
Member Author

jerch commented Nov 15, 2019

Closing this for now as grapheme support is not (yet) worth implementing.

@jerch jerch closed this as completed Nov 15, 2019
2 participants