Support RTL languages #701

Open
Tyriar opened this issue Jun 13, 2017 · 29 comments

Comments

@Tyriar
Member

Tyriar commented Jun 13, 2017

Downstream issue: microsoft/vscode#28571

When we enforced unicode character width in #467, this broke RTL language characters, as they are now rendered in reverse (LTR). We could revert that for RTL character ranges only, but we should do the right fix and reverse the strings so they're actually on the character grid, as the new selection model (#670) relies on all characters lining up perfectly on the grid.

Ideally line reflow #622 would be done before this so it's easier to change the contents of multiple lines.

Terminal.app:

image

VS Code 1.13 (notice sentences are reversed):

image

@mostafa69d @CherryDT a little info on the languages in question would be handy:

  1. Where should the strings be flipped for Hebrew/Arabic/Persian? Do I reverse entire continuous sequences of characters in between ASCII characters?
  2. How are the characters meant to interact with characters like 0-9 or punctuation?

Useful references:

@CherryDT

CherryDT commented Jun 13, 2017

It is actually a whole lot more complicated and includes statefulness and even mirroring certain characters. I'd say it's a science of its own. (And I have the deepest respect for those people who wrote robust text rendering libraries that handle all the BiDi issues properly, so I don't have to mess around with it, to be honest.)

See also:
https://en.wikipedia.org/wiki/Bi-directional_text (good overview)
https://www.w3.org/International/articles/inline-bidi-markup/uba-basics
https://www.w3.org/International/tutorials/svg-tiny-bidi/ (the initial premise is not related but it explains a few things better than the previous link)
https://github.com/fevangelou/doctype-mirror/tree/master/bidihowto/bidi-support-in-a-ui

EDIT: I think the way the new selection works may actually be unexpected because it is going to behave differently than VSCode itself. For example, given the text "The song מדינת קומבינה makes me think", when I start selecting at "The" and end between the two Hebrew words, I will have selected "The song מדינת", while in the console I will have selected "The song קומבינה".

See example:
Image

However it will still be better than how Sublime Text "works" last time I checked, because there you will see one thing selected but copy another, which is very annoying.
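
To make the selection mismatch above concrete, here is a rough TypeScript sketch. It is not the Unicode Bidirectional Algorithm: it assumes an LTR base direction, treats only Hebrew/Arabic-range letters as RTL and only spaces as neutral, and ignores digits, mirroring and directional marks. `visualToLogical` and `toVisual` are hypothetical helper names, not xterm.js APIs.

```typescript
// Simplified sketch, NOT the full Unicode Bidirectional Algorithm.
// Assumptions: LTR base direction, spaces are the only neutrals,
// digits/mirroring/explicit directional marks are ignored.
type Dir = "L" | "R" | "N";

function classify(ch: string): Dir {
  const cp = ch.charCodeAt(0);
  const rtl =
    (cp >= 0x0590 && cp <= 0x08ff) || // Hebrew, Arabic and neighbouring RTL blocks
    (cp >= 0xfb1d && cp <= 0xfdff) || // Hebrew/Arabic presentation forms A
    (cp >= 0xfe70 && cp <= 0xfeff);   // Arabic presentation forms B
  if (rtl) return "R";
  return ch === " " ? "N" : "L";
}

// Returns, for every visual column, the index of the logical character shown there.
function visualToLogical(text: string): number[] {
  const dirs: Dir[] = [];
  for (let n = 0; n < text.length; n++) dirs.push(classify(text.charAt(n)));

  // A run of neutrals becomes RTL only when it is enclosed by RTL on both sides.
  let i = 0;
  while (i < dirs.length) {
    if (dirs[i] !== "N") { i++; continue; }
    let j = i;
    while (j < dirs.length && dirs[j] === "N") j++;
    const before = i > 0 ? dirs[i - 1] : "L";
    const after = j < dirs.length ? dirs[j] : "L";
    const resolved: Dir = before === "R" && after === "R" ? "R" : "L";
    for (let k = i; k < j; k++) dirs[k] = resolved;
    i = j;
  }

  // Reverse every maximal RTL run; LTR characters keep their position.
  const order: number[] = [];
  i = 0;
  while (i < dirs.length) {
    if (dirs[i] !== "R") { order.push(i); i++; continue; }
    let j = i;
    while (j < dirs.length && dirs[j] === "R") j++;
    for (let k = j - 1; k >= i; k--) order.push(k);
    i = j;
  }
  return order;
}

function toVisual(text: string): string {
  return visualToLogical(text).map((idx) => text.charAt(idx)).join("");
}
```

For a logical string made of "ab", two two-letter Hebrew words and "cd", the first five visual columns map to logical indexes 0, 1, 2, 7, 6: the LTR prefix plus the second Hebrew word. A selection made on the grid therefore picks up different text than a selection made in logical order.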

@mostafa-drz

mostafa-drz commented Jun 15, 2017

@Tyriar
First of all, I'll give you a very brief overview of Arabic and Persian; maybe it will help (I'm not sure whether Hebrew works the same way).
In Arabic and Persian the letters are like "آ" "ب" "س" and so on. Words are built from these letters (obviously), but by a very different rule compared with, for example, English.
The difference is that some letters have more than one shape, for example "س". The first shape is "س", the second is " سـ", another is "ـسـ", and the last is "ـس". What are these shapes for? The shape we use depends on where the letter appears in a word. For example, for the letter "س" we use the shape "سـ" when a word starts with it, as in "سلام". This is the actual difference between a language like English and Persian or Arabic: we form words by concatenating the different shapes of the letters (in some cases they are joined together). To highlight the rule again: words are formed by concatenating the shapes, not the bare letters (in English it is always the letters themselves). You can see some examples below:
We have the letters "ک" "ن" "ا" "د" "ی".
I can make these words from just those letters: نادان , یاد,دکان
So, to wrap up and explain what happened in the screenshots I posted: the terminal breaks the words into letters and reverses them (so it's not just about reversing). Look at the words I created and the letters I listed; the VS Code terminal now shows them "separated" and "reversed".

Correct format: نادان Terminal: ن ا د ا ن
Correct format:یاد Terminal: د ا ی
Correct format: دکان Terminal: ن ا ک د
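
The positional shapes described above can be sketched in TypeScript as a lookup from a letter plus its neighbours to a presentation form. This is only an illustration under stated assumptions: `FORMS` and `shapeWord` are hypothetical names, the table covers just the five letters used in the examples (with code points taken from the Unicode Arabic Presentation Forms blocks), and real renderers leave this work to the font and shaping engine rather than a hand-written table.

```typescript
// "dual" letters connect to both neighbours; "right" letters connect only to
// the letter that precedes them in the word (e.g. alef and dal).
interface Forms {
  joining: "dual" | "right";
  isolated: number;
  final: number;
  initial?: number;
  medial?: number;
}

// Hypothetical mini-table: only the letters from the examples above.
const FORMS: Record<number, Forms> = {
  0x06a9: { joining: "dual", isolated: 0xfb8e, final: 0xfb8f, initial: 0xfb90, medial: 0xfb91 }, // keheh
  0x0646: { joining: "dual", isolated: 0xfee5, final: 0xfee6, initial: 0xfee7, medial: 0xfee8 }, // noon
  0x06cc: { joining: "dual", isolated: 0xfbfc, final: 0xfbfd, initial: 0xfbfe, medial: 0xfbff }, // farsi yeh
  0x0627: { joining: "right", isolated: 0xfe8d, final: 0xfe8e }, // alef
  0x062f: { joining: "right", isolated: 0xfea9, final: 0xfeaa }, // dal
};

// Maps each letter of a word (in logical order) to the code point of the
// shape it takes in that position. Unknown characters are passed through.
function shapeWord(word: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < word.length; i++) {
    const cp = word.charCodeAt(i);
    const cur: Forms | undefined = FORMS[cp];
    if (cur === undefined) { out.push(cp); continue; }
    const prev: Forms | undefined = i > 0 ? FORMS[word.charCodeAt(i - 1)] : undefined;
    const next: Forms | undefined = i + 1 < word.length ? FORMS[word.charCodeAt(i + 1)] : undefined;
    // The previous letter must be able to connect forward.
    const joinsPrev = prev !== undefined && prev.joining === "dual";
    // The current letter must be able to connect forward.
    const joinsNext = next !== undefined && cur.joining === "dual";
    if (joinsPrev && joinsNext) out.push(cur.medial ?? cur.isolated);
    else if (joinsPrev) out.push(cur.final);
    else if (joinsNext) out.push(cur.initial ?? cur.isolated);
    else out.push(cur.isolated);
  }
  return out;
}
```

When every letter is drawn in its own cell with no knowledge of its neighbours, each one falls into the last branch and gets its isolated shape, which is the "separated" rendering shown in the terminal column above.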

Now your questions:
Where should the strings be flipped for Hebrew/Arabic/Persian? Do I reverse entire continuous sequences of characters in between ASCII characters?
I don't know about Hebrew, but in Arabic and Persian a sequence of characters should be flipped when it reaches a space character (space is the word separator), like this: " من در حال نوشتن هستم", while still keeping the "shapes" and the necessary joining.

How are the characters meant to interact with characters like 0-9 or punctuation?
For numbers and punctuation the rules are the same as in English: numbers and punctuation marks follow the characters, like this:
?من در سال "۱۳۶۹" به دنیا آمدم.
من در سال "1369" به دنیا آمدم.
Actually, a sequence of characters containing both RTL and non-RTL characters is a whole different story; if you need more information, I can elaborate.

P.S 1:
This link is source code written to solve the same problem in PHP (for old versions, for sure); you can take a look:
https://github.com/slashmili/php-gd-persian/blob/master/phpgd/fagd.php

P.S 2:
Here is a resource on wikipedia about the Persian characters
https://en.wikipedia.org/wiki/Persian_alphabet

P.S 3:
Again, I have to mention that in the previous version of VS Code, everything was fine.

P.S 4:
About the problem with selecting a word next to some LTR characters, like
<p>اینجا را بخوانید</p>, which @CherryDT mentioned: there are some minor bugs, which I don't have a problem with and for which I found quick workarounds. (But if you need more detail about those, let me know.)

@saeedhei

saeedhei commented Oct 31, 2017

After updating my VS Code, everything is reversed. That is very bad; please solve this problem.
I want to downgrade. Which version is OK?

@amitbeck

@mostafa69d luckily enough, in Hebrew that barely exists. Hebrew letters mostly stay the same in any position inside a word, apart from a few letters: כ, which turns into ך; מ, which turns into ם; נ, which turns into ן; פ, which turns into ף; and finally צ, which turns into ץ. This makes Hebrew easier to format, I guess.

@CherryDT

However these are still separate characters (in terms of character encoding) and always display the same. They do not change appearance when moved around. (It's the writer's job to use the right letter - sofit or not - at the right position.)

@MortadaAK

The problem with splitting characters is that when they are wrapped in a span one by one, they lose the connection they require, which misrepresents the shape (Arabic letters).

To fix the problem, these characters must be within one span, or not wrapped at all.

The Unicode ranges for all of these letters are:
Arabic (0600–06FF, 255 characters)
Arabic Supplement (0750–077F, 48 characters)
Arabic Extended-A (08A0–08FF, 73 characters)
Arabic Presentation Forms-A (FB50–FDFF, 611 characters)
Arabic Presentation Forms-B (FE70–FEFF, 141 characters)
Rumi Numeral Symbols (10E60–10E7F, 31 characters)
Arabic Mathematical Alphabetic Symbols (1EE00–1EEFF, 143 characters)
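
As a rough illustration, the ranges listed above can be turned into a lookup that a renderer could use to decide which characters must stay together. `ARABIC_BLOCKS` and `isInArabicBlock` are hypothetical names for this sketch, and the table contains only the blocks from the list, not every RTL script.

```typescript
// Block boundaries copied from the list above (inclusive on both ends).
const ARABIC_BLOCKS: ReadonlyArray<[number, number]> = [
  [0x0600, 0x06ff],   // Arabic
  [0x0750, 0x077f],   // Arabic Supplement
  [0x08a0, 0x08ff],   // Arabic Extended-A
  [0xfb50, 0xfdff],   // Arabic Presentation Forms-A
  [0xfe70, 0xfeff],   // Arabic Presentation Forms-B
  [0x10e60, 0x10e7f], // Rumi Numeral Symbols
  [0x1ee00, 0x1eeff], // Arabic Mathematical Alphabetic Symbols
];

// True when the code point falls inside one of the blocks above.
function isInArabicBlock(codePoint: number): boolean {
  return ARABIC_BLOCKS.some(([lo, hi]) => codePoint >= lo && codePoint <= hi);
}
```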
screen shot 2017-11-29 at 11 45 00 pm

This was referenced May 20, 2018
Tyriar added the area/i18n, area/renderer and type/enhancement labels and removed the type/bug label on Jun 3, 2018
@wis

wis commented Nov 15, 2018

required reading: https://opensource.com/life/16/3/twisted-road-right-left-language-support

from microsoft/vscode#28571 (comment)

do you have an example of another terminal that handles this well?

mlterm seems to be better than the average (non-web based) terminal.
2018-11-15-023232_577x981_scrot
It is cursive but in some cases cut off; I think that can be solved by changing the font. This paragraph was copied from Wikipedia. The blue characters are the RTL mark; that's how vim is outputting them, and mlterm renders them in blue.

@Tyriar
Member Author

Tyriar commented Jan 11, 2019

The character joiner API might be able to solve this, we could probably make all adjacent arabic/hebrew/etc. unicode characters join and be drawn in the same glyph.
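
A minimal sketch of what such a joiner could look like, in TypeScript. xterm.js's `registerCharacterJoiner` takes a handler that receives a line's text and returns `[start, end)` index ranges to draw as one unit; `rtlJoiner` and the character ranges it matches are assumptions made for illustration, not the code from any PR. Note that this only groups cells so the browser can shape and order the run; it does no reordering itself.

```typescript
// Returns [start, end) ranges (end exclusive) for every run of two or more
// adjacent Hebrew/Arabic-range characters in a line of text.
function rtlJoiner(text: string): [number, number][] {
  // Assumed ranges: Hebrew, Arabic, Arabic Supplement, Arabic Extended-A,
  // and the Hebrew/Arabic presentation forms.
  const rtlRun = /[\u0590-\u05ff\u0600-\u06ff\u0750-\u077f\u08a0-\u08ff\ufb1d-\ufdff\ufe70-\ufeff]{2,}/g;
  const ranges: [number, number][] = [];
  let match: RegExpExecArray | null;
  while ((match = rtlRun.exec(text)) !== null) {
    ranges.push([match.index, match.index + match[0].length]);
  }
  return ranges;
}

// Hypothetical usage with an existing xterm.js Terminal instance `term`:
//   const joinerId = term.registerCharacterJoiner(rtlJoiner);
```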

@babakks

babakks commented Jan 17, 2019

For what it's worth, the debug console works well with RTL texts. This is what I've tried:
code
And this is the output on the debug console:
debug
But the terminal is still the same:
terminal

I'm using VS Code - Insiders v1.31.0.

@elieobeid7

@babakks As far as I know, only two terminals on Linux can output RTL correctly, konsole and mlterm; they are available in all the distros' repos.

@MortadaAK

MortadaAK commented Jan 17, 2019

@elieobeid7 @babakks The macOS Terminal outputs RTL correctly

Tyriar added a commit to Tyriar/xterm.js that referenced this issue Jan 17, 2019
Tyriar added a commit to Tyriar/xterm.js that referenced this issue Jan 17, 2019
@Tyriar
Member Author

Tyriar commented Jan 17, 2019

Put out a PR to fix this: #1899. If anyone wants to test out the branch, that would be useful, as I don't speak these languages.

To test:

git clone https://github.com/Tyriar/xterm.js
cd xterm.js
git checkout 701_rtl_support
yarn
yarn watch

# in another terminal
yarn start

You may need some dependencies to be installed: https://github.com/Microsoft/node-pty#dependencies

@egmontkob

Please hold off for a little bit :)

I've recently been studying and evaluating existing docs and implementations of RTL in terminals, and have come up with a (draft) recommendation. I'll release it real soon now.

It's way more complicated than one would first think. A bit of spoiler: If you start shuffling the characters around according to the BiDi algorithm, it becomes literally, mathematically provably impossible to have a proper BiDi-aware text editing/viewing experience (e.g. vim, emacs...) on top of that platform. (And to respond to the previous few comments: no, konsole, mlterm and macOS Terminal don't get it right either.)

@Tyriar
Member Author

Tyriar commented Jan 17, 2019

@egmontkob does this take into account the fact that we get to leverage the browser's bidi support? All my change does is force related unicode sequences to be drawn together not as separate characters. This is probably wrong when the cursor is over the character but it seems to work other than that.

@babakks

babakks commented Jan 17, 2019

@Tyriar Sorry Tyriar, but it's still wrong. I commented under the pull request.
#1899 (comment)

@egmontkob

The spec defines what the canvas needs to look like after receiving some data. The spec doesn't care what the backend of the terminal emulator is (e.g. a graphical canvas, or a browser (HTML DOM), or another terminal emulator (tmux)); it's the terminal emulator's task to implement the specified behavior by whatever means.

And one aspect of the specified behavior is that in some circumstances the character cells need to be shuffled according to the BiDi algorithm (for display purposes only, not affecting the actual storage), because that's the only reasonable way to get simple utilities like "cat" produce the desired output; and in some other circumstances the cells mustn't be rearranged, because that's the only way vim/emacs/whoever can do their own BiDi. There are escape sequences controlling this behavior. And there's much-much more to the story than this.

@egmontkob

Please see the published draft BiDi specification at https://terminal-wg.pages.freedesktop.org/bidi/ . Comments, improvement ideas etc. are welcome over there in its issue tracker.

@roseMix

roseMix commented Feb 18, 2021

I just had this issue in the VS Code terminal. Is there still no fix for this?

@munael

munael commented Sep 22, 2021

Not sure what the current state of this issue is? Some old comments mention PRs fixing it, but it's still active in the latest vsc insiders. :'(

@amir-nejad

I have the same issue.

This issue is like the one in Adobe Photoshop. In Photoshop, we can go to the language settings and enable Middle Eastern features to fix it. But we do not have any such solution in VS Code.

Check this link:
https://graphicdesign.stackexchange.com/questions/18005/how-can-i-get-farsi-arabic-text-to-render-correctly-in-photoshop

I have Windows 11 and I have no problem running code manually in CMD or PowerShell. But VS Code is not working correctly.

Please fix this.

@eyaler

eyaler commented Feb 27, 2022

Hi. I wanted to migrate from PyCharm to VS Code, but this is a blocker. In PyCharm the terminal works fine with RTL. Was anyone able to get the VS Code terminal to work for Hebrew?

@par3ae

par3ae commented Oct 10, 2022

@Tyriar
Hi dear Daniel,
Excuse me, I'm using the latest version of VS Code and I still have this issue with RTL languages.
I read all the messages in this issue, but I don't understand the approach to solving it in the VS Code integrated terminal.

Would you please guide me to solve this?

@Tyriar
Member Author

Tyriar commented Oct 10, 2022

@par3ae this needs a bunch of research to figure out how to solve it properly which I haven't had time to do.

@starball5

Related question on Stack Overflow: Why doesn't the VS Code integrated terminal support RTL (right-to-left) text?

@andjc

andjc commented Apr 18, 2023

Arabic and Hebrew script have been mentioned, but there are many more scripts that require bidi support. But it is also not just of a question of bidirectional text, all writing systems requiring complex rendering seem to be affected. Every South Asian and South East Asian script I tried was broken, as were quite a few African scripts.

@jerch
Member

jerch commented Apr 18, 2023

@andjc Yes - to put it bluntly, Unicode in terminal emulators is broken when it comes to script systems outside of the Latin/Greek-derived systems. Mainly 3 things are missing in xterm.js:

  • cluster/grapheme segmentation: While that's quite easy to get done at the codepoint/data level, it raises serious questions about cursor mechanics and how to address/edit perceivable characters later on. There are several IME helpers to get things initially expressed, but nothing specifies how the terminal cursor should move across that or make things editable afterwards.
  • proper width handling on a grid system: Terminals still stick to the now wonky half vs. full width separation (1 cell vs 2 cells in terminals) based on the East Asian Width property. But Unicode made it pretty clear that it's not the right way to lay out glyphs; instead it depends on more complex rules from clustering and combining, even leading to fractions of cells (+ depending on the font devs' choices). And they don't answer the question of how a strictly monospaced environment that doesn't know the glyphs' width in advance should treat that. You will see the side effects of this underspecification in any monospaced GUI editor, where it suddenly breaks out of the grid system on multiple combined/clustered chars. But a terminal cannot do this the same way.
  • bidi: @egmontkob did a great job speccing a proposal for how to solve that for terminals (see above); still, there are some surprising side effects when it comes to line progression/cursor advance. All older attempts (even DEC already had an RTL setting) are useless these days, as Unicode made line progression a codepoint property (the old systems worked as strictly LTR or RTL).

Regarding bidi and xterm.js - since we have no devs with an RTL background, it is unlikely to be adopted soon. Speaking for myself - I have literally no clue about bidi mechanics, and would just end up messing with a system I don't know/understand.
Of course PRs in this regard are more than welcome, but to anyone up for this: you'd better have a strong affinity for and dedication to script system mechanics, or things will get really frustrating.
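
To illustrate the width point above, here is a deliberately tiny TypeScript sketch of the "1 cell vs 2 cells" model. It is an assumption-laden approximation, not xterm.js's width code: `cellWidth` and `textWidth` are hypothetical names, only a handful of wide and combining ranges are listed, and everything else counts as one cell.

```typescript
// Per-code-point width in terminal cells under the classic grid model.
// Only a few representative ranges are covered in this sketch.
function cellWidth(cp: number): 0 | 1 | 2 {
  if (cp >= 0x0300 && cp <= 0x036f) return 0; // combining diacritical marks
  const wide =
    (cp >= 0x1100 && cp <= 0x115f) || // Hangul Jamo leading consonants
    (cp >= 0x3040 && cp <= 0x30ff) || // Hiragana and Katakana
    (cp >= 0x4e00 && cp <= 0x9fff) || // CJK Unified Ideographs
    (cp >= 0xac00 && cp <= 0xd7a3) || // Hangul Syllables
    (cp >= 0xff01 && cp <= 0xff60);   // fullwidth forms
  return wide ? 2 : 1;
}

// Sums the widths of all code points; each code point is measured on its
// own, with no knowledge of clusters or of the font.
function textWidth(text: string): number {
  let width = 0;
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i) as number;
    width += cellWidth(cp);
    i += cp > 0xffff ? 2 : 1; // skip the second half of a surrogate pair
  }
  return width;
}
```

Because every code point is measured in isolation, a cluster made of several spacing code points is charged one cell per code point even when it renders as a single perceived character, which is the mismatch described in the list above.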

@andjc

andjc commented Apr 18, 2023

@jerch, assuming a solution to the extended grapheme clusters is found, it will then be necessary to rethink the grid system. East Asian Width property and two cell sizes may not be adequate, some grapheme clusters become quite complex, and if the base character is wide to start with ...

Bidi is one issue, but visual versus logical ordering is another issue:

Take grapheme "ကြွေ" (U+1000 U+103C U+103D U+1031), U+1031 is rendered at the beginning of the cluster but is the fourth character in the string.
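
The stored (logical) order of that cluster can be inspected with a few lines of TypeScript. `codePoints` is a hypothetical helper written for this sketch; the string literal uses escapes so that the order in the source is unambiguous.

```typescript
// Lists the code points of a string in stored (logical) order as "U+XXXX".
function codePoints(text: string): string[] {
  const out: string[] = [];
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i) as number;
    const hex = cp.toString(16).toUpperCase();
    out.push("U+" + (hex.length >= 4 ? hex : ("0000" + hex).slice(-4)));
    i += cp > 0xffff ? 2 : 1; // skip the second half of a surrogate pair
  }
  return out;
}

const cluster = "\u1000\u103C\u103D\u1031";
const logical = codePoints(cluster);
// Position of U+1031 in storage; on screen it is drawn at the start of the cluster.
const storedIndex = logical.indexOf("U+1031");
```

Here `storedIndex` is 3, so a renderer that places one code point per cell in stored order puts U+1031 in the last cell instead of the first.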

Honestly, it quickly becomes extremely complex to implement. Most terminals work best for LCG, and even for LCG they don't necessarily play nice with all input frameworks either.

@socketpair

socketpair commented Jun 11, 2023

one more example:

echo '"qwe \u05e9\u05dc\u05d5\u05dd 123 \u043f\u0440\u0438\u0432\u0435\u0442"' | jq -r .

should give:

image

Note that 123, which comes after the Hebrew word in byte order, should be rendered to the left of the Hebrew word.

gnome-terminal supports this correctly.

@tabarra

tabarra commented Feb 14, 2024

I wonder if using a library like bidi-js as a pre-processor can mitigate the issue for now.
Has anyone managed to come up with a patch for this?
