Bad letter boundary detection for complex scripts #2115

Open
santhoshtr opened this issue Jan 7, 2014 · 22 comments

Comments

@santhoshtr

Paste the following text into Brackets and see where the cursor is placed:

സന്തോഷ്

The cursor is supposed to be placed at the end of the word, but in Brackets it ends up 4 or 5 character widths after it.

Happens with all non-Latin complex scripts

It works fine in Firefox, but the issue exists in Chrome.

brackets

(duplicated from adobe/brackets#6301)

@marijnh
Member

marijnh commented Jan 7, 2014

This is a case of CodeMirror's simplistic grapheme cluster algorithm not handling the language. Unfortunately, JavaScript does not provide the primitives needed to do sane cluster-boundary detection (finding character properties, etc).

> Happens with all non-Latin complex scripts

Not all. Some, like Arabic, should work.

@santhoshtr
Author

> This is a case of CodeMirror's simplistic grapheme cluster algorithm not handling the language. Unfortunately, JavaScript does not provide the primitives needed to do sane cluster-boundary detection (finding character properties, etc).

I would like to understand this a bit better. What exact algorithm do you need to place the cursor at a logically correct position? If we want to support a lot of languages, we should leave this kind of primitive functionality to browsers. Trying to imitate such behavior will reach nowhere.

Also, why does Chrome give different output than Firefox in this case?

@marijnh
Member

marijnh commented Jan 7, 2014

To know how to move the cursor through a text, and which ranges of codepoints to use when measuring character positions, CodeMirror needs to know where clusters start and end.

The browser knows this, but doesn't expose this information to JavaScript. Telling me that what I'm doing "will reach no where" without actually understanding the problem isn't really the right tone to take here.
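For readers arriving later: the ECMAScript Internationalization API has since gained `Intl.Segmenter`, which was not available when these comments were written. The sketch below shows how cluster boundaries could be read from it. The helper name `clusterBoundaries` is an illustration only, and the boundaries reported for Indic conjuncts depend on the Unicode data shipped with the engine, so they may still differ from where a given browser and font allow the cursor to stop.

```javascript
// Returns the offsets (in UTF-16 code units) at which grapheme clusters start,
// followed by the end of the string, as reported by Intl.Segmenter.
// `clusterBoundaries` is a hypothetical helper name, not part of CodeMirror.
function clusterBoundaries(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const boundaries = [];
  for (const { index } of segmenter.segment(text)) {
    boundaries.push(index);
  }
  boundaries.push(text.length);
  return boundaries;
}

// The word from the original report. The result depends on the engine's
// Unicode version, so no expected output is given here.
console.log(clusterBoundaries("സന്തോഷ്"));
```

This only answers the segmentation question. It does not account for font-dependent ligatures or stacking.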

@santhoshtr
Author

I have faced cursor movement and logical cluster issues in the development of Visual Editor for Wikimedia. I wanted to understand the problem in detail so that I might be able to help. I will check later; I don't have time to dig into the details now. Thanks.

@marijnh
Member

marijnh commented Jan 9, 2014

Attached patch fixes some known problems with handling of extending code points, and appears to help with #2125 (Hindi), but does not fix your example.

I will need some input from someone who is familiar with this language's Unicode encoding, because the behavior of this string baffles me. Characters "ന്തോ" act as a single unit, as far as cursor movement is concerned, but only the second code point in that string is an extending character. If I read the document at http://www.unicode.org/reports/tr29/ correctly, this should count as three grapheme clusters, not one. What is going on?
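To make the question concrete, the sketch below lists the code points of that string with their mark categories. It assumes the string is encoded as the four code points U+0D28, U+0D4D, U+0D24, U+0D4B (the precomposed vowel sign), and it uses Unicode property escapes, which were added to JavaScript after this comment was written. `describeCodePoints` is a hypothetical helper name.

```javascript
// Describes each code point of `text`: its hex value and whether it is a
// nonspacing mark (general category Mn) or a spacing combining mark (Mc).
function describeCodePoints(text) {
  return Array.from(text, (ch) => ({
    hex: ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    nonspacingMark: /^\p{Mn}$/u.test(ch),
    spacingMark: /^\p{Mc}$/u.test(ch),
  }));
}

// Assumed encoding of the quoted string: NA, VIRAMA, TA, VOWEL SIGN OO.
const sample = "\u0D28\u0D4D\u0D24\u0D4B";
console.log(describeCodePoints(sample));
```

Under this assumption only the virama (U+0D4D) is a nonspacing mark, which matches the observation that only the second code point is extending. The final vowel sign is a spacing mark, which TR29 attaches to the preceding base only in extended grapheme clusters.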

@peterflynn
Contributor

CC'ing @pauldhunt and @miguelsousa, who have worked on some of Adobe's open-source typography efforts -- just in case they have any quick insights to share :-)

@Jaygiri

Jaygiri commented Jan 9, 2014

I have removed my previous comment.

This language is Malayalam. The fix for #2125 does not fix cursor positioning for this language.

@santhoshtr
Author

> Characters "ന്തോ" act as a single unit, as far as cursor movement is concerned, but only the second code point in that string is an extending character. If I read the document at http://www.unicode.org/reports/tr29/ correctly, this should count as three grapheme clusters, not one. What is going on?

You cannot rely on TR29 alone to get grapheme clusters for the purpose of counting or cursor movement. TR29 clearly explains this: you have to use tailored logic to meet your purpose. Even that is not enough, since in Indic scripts, depending on the font, multiple consonants joined by a character like VIRAMA can form a single ligature. Sometimes stacking of characters happens.

Chrome and Firefox do not agree on how cursor movement is implemented for Indic scripts. Chrome lets you move the cursor according to logical boundaries. Firefox follows the same rule for movement, but it also allows the cursor to be placed elsewhere if you do it from a program. You have to ask the browser whether a cursor can be placed at a given position or not. Iterating that question over a range of text will give you reliable cursor positions. This can also be used to build a stack of edits for undo, redo, etc.

@marijnh
Member

marijnh commented Jan 10, 2014

By 'ask the browser' you mean create a textarea and try to set the cursor in the textarea there? Or is there a more efficient/convenient way to do it on (non-editable) DOM nodes?

Is there an easy/cheap way to determine whether a string might have stacking?

@santhoshtr
Author

> By 'ask the browser' you mean create a textarea and try to set the cursor in the textarea there? Or is there a more efficient/convenient way to do it on (non-editable) DOM nodes?

Yes, create an editable node and keep on trying to place cursor. Of course it is inefficient and hacky.

> Is there an easy/cheap way to determine whether a string might have stacking?

No, that is not possible. It not only depends on the data but also the font used.

@marijnh
Member

marijnh commented Jan 16, 2014

> Is there an easy/cheap way to determine whether a string might have stacking?

> No, that is not possible. It not only depends on the data but also the font used.

Well, I meant a way to weed out strings that obviously don't need the expensive treatment and simply have a cursor position between every code point. /[^\x00-\x7f]/ would work to tell pure-ASCII strings from the rest, but maybe we can do better and enumerate the ranges of the languages in which this occurs (using broad ranges to keep the string size under control; false positives aren't bad).
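A minimal sketch of such a pre-filter follows. The ranges in `COMPLEX_SCRIPT` are illustrative assumptions chosen to be broad, not a vetted or exhaustive list, and both helper names are hypothetical.

```javascript
// Broad, illustrative ranges (an assumption, not an exhaustive list):
//   U+0300-U+036F  combining diacritical marks
//   U+0590-U+08FF  Hebrew, Arabic and neighbouring right-to-left blocks
//   U+0900-U+109F  Indic scripts, Thai, Lao, Tibetan, Myanmar
const COMPLEX_SCRIPT = /[\u0300-\u036F\u0590-\u109F]/;

// True when the string contains nothing outside the ASCII range.
function isAsciiOnly(text) {
  return !/[^\x00-\x7f]/.test(text);
}

// True when the string contains a character from one of the broad ranges
// above and therefore might need the expensive measurement. False positives
// are acceptable; the check only has to be cheap.
function mightNeedClusterMeasurement(text) {
  return COMPLEX_SCRIPT.test(text);
}
```

Compared with the plain ASCII test, the range-based check lets precomposed Latin letters such as "é" (U+00E9) skip the expensive path.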

@marijnh
Member

marijnh commented Jan 27, 2014

@santhoshtr

> Yes, create an editable node and keep on trying to place cursor. Of course it is inefficient and hacky.

On Firefox, it seems that selectionEnd can be set to any value, even one that's not a valid cursor position. Do you have any example of this technique actually being applied?

@marijnh
Member

marijnh commented Jan 27, 2014

(That is, I'm using a textarea now, because there I can play with selectionEnd without actually breaking the existing selection in the document. Using getSelection().addRange() is just too horribly disruptive: it will cause tons of side effects on mobile, and also cause spurious deselects/reselects on desktop.)
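The probing idea discussed in these comments can be sketched roughly as below. This is not code from CodeMirror: both function names are hypothetical, and the sketch rests on the unverified assumption that the browser snaps `selectionEnd` to a valid cursor position when it is set. As noted above, Firefox appears to accept any value, in which case every offset would be reported as valid.

```javascript
// Returns the offsets in `text` that `probe` accepts as cursor positions.
// `probe(offset)` is a hypothetical callback: it tries to place the cursor at
// `offset` and returns the offset where the cursor actually ended up.
function collectCursorPositions(text, probe) {
  const positions = [];
  for (let offset = 0; offset <= text.length; offset++) {
    if (probe(offset) === offset) positions.push(offset);
  }
  return positions;
}

// One possible probe built on a textarea (or any object with the same
// `value` and `selectionEnd` properties). Assumption: setting selectionEnd
// to an invalid position makes it read back as a different, valid one.
function makeTextareaProbe(textarea, text) {
  textarea.value = text;
  return function (offset) {
    textarea.selectionEnd = offset;
    return textarea.selectionEnd;
  };
}
```

In a browser this might be called as `collectCursorPositions(text, makeTextareaProbe(document.createElement("textarea"), text))`. It costs one probe per code unit, which is why a cheap pre-filter matters.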

@ghost

ghost commented Feb 23, 2014

@marijnh Arabic doesn't work correctly either, and the same goes for Thai.

@peterkroon
Contributor

@marijnh
#2115 (comment)

> The browser knows this, but doesn't expose this information to JavaScript.

Have you considered filing a bug for this at https://bugzilla.mozilla.org/ or https://code.google.com/p/chromium/?

@alicoding

Wondering if there is any update or workaround to this bug yet?

@marijnh
Member

marijnh commented Apr 28, 2014

Nope, I still haven't found a hack that works halfway acceptably.

@niftylettuce

I still have the same issue: if you set a custom font, like Inconsolata, the line height or cursor positioning is way off (until you start interacting, typing, or clicking in the textarea rendered inside the .CodeMirror element).

@niftylettuce

screen shot 2016-01-14 at 1 24 09 am
screen shot 2016-01-14 at 1 24 03 am

@sadig41

sadig41 commented Jan 9, 2018

Is it not possible to get RTL working for Arabic?

@adrianheine
Contributor

This is an issue that's difficult, if not impossible, to solve with the fundamental approach currently taken by CodeMirror.

We are working on a rewrite (CodeMirror 6) that might address this issue, and we are currently raising money for this work. See the announcement for more information about the rewrite and a demo.

Note that CodeMirror 6 is by no means stable or usable in production yet. It's highly unlikely that we will pick up this issue for CodeMirror 5, though.

@HTGAzureX1212

Same issue here, the cursor seems to be completely mispositioned... I have used codeMirror.getDoc().setValue() though.
image

Windows 10 1909
Chrome 86.0.4240.111


10 participants