Word segmenter with generic locale #136

Open
my2iu opened this issue Apr 19, 2021 · 10 comments

Comments

@my2iu

my2iu commented Apr 19, 2021

I have a bunch of strings of text of unknown language/locale, and I want to find the word breaks so that I can lay them out in paragraphs in svg. Is there some way to create a word segmenter using some sort of generic or default locale? Would, say, the English segmenter still properly handle Chinese text? Or do I need some sort of language detection to figure out the locale of the segmenter I need to instantiate?
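For reference, a minimal sketch of what a "default locale" word segmenter looks like: passing `undefined` as the locale makes `Intl.Segmenter` fall back to the runtime's default locale. The helper name `wordsOf` is hypothetical. Whether the result is acceptable for text in other scripts depends on the engine's segmentation data, so the Chinese part of the usage line below is shown without an expected output.

```javascript
// Hypothetical helper: collect the word-like segments of a string.
// `locale` may be undefined, in which case the runtime default is used.
function wordsOf(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (isWordLike) words.push(segment);
  }
  return words;
}

// Usage: one segmenter, mixed-script input.
console.log(wordsOf("Hello 世界 world"));
```

Note that this only answers the word-boundary half of the question; it says nothing about where a line may break.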

@FrankYFTang
Contributor

FrankYFTang commented May 1, 2021

What you need is line breaking. Some languages use word boundaries to determine line breaks, but others do not. For example, Chinese and Japanese do not need to break lines at word boundaries, but English, Thai, Arabic, and Khmer do.

Note that JavaScript's Intl.Segmenter does not support line breaking, only word breaking.
An early draft included a line break mode, but engineers from the Apple Safari team opposed it, claiming it would encourage incorrect use of the facility. I tried to convince them of the need for JavaScript line breaking for SVG, but didn't get enough support from others. If you feel strongly that we need a line break mode for this, please speak up.
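A quick way to confirm this in a given runtime: the `granularity` option accepts only `"grapheme"`, `"word"`, and `"sentence"`, and any other value makes the constructor throw a `RangeError`. The helper name `supportsGranularity` below is hypothetical; this is a sketch, assuming a runtime that ships `Intl.Segmenter`.

```javascript
// Hypothetical helper: probe whether the runtime accepts a granularity value.
function supportsGranularity(granularity) {
  try {
    new Intl.Segmenter("en", { granularity });
    return true;
  } catch (err) {
    // Unsupported option values are rejected with a RangeError.
    if (err instanceof RangeError) return false;
    throw err;
  }
}

console.log(supportsGranularity("word"));
console.log(supportsGranularity("line"));
```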

@my2iu
Author

my2iu commented May 1, 2021

Yes, but my understanding was that it would take years for anyone to agree on an API for line breaking. I assumed the compromise was that the word segmenting API was designed to be rich enough to allow programmers to construct a rudimentary line segmenter themselves.
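To make the idea concrete, here is a rough sketch of the kind of do-it-yourself chunking described above: every word-like segment starts a new chunk, and whitespace or punctuation is glued onto the chunk before it. The name `roughChunks` is hypothetical, and this is only an approximation for space-delimited scripts, not an implementation of real line breaking rules. For example, it would wrongly attach an opening parenthesis to the preceding word.

```javascript
// Hypothetical helper: approximate "unbreakable chunks" from word segments.
// Assumption: breaking is only allowed right before a word-like segment.
function roughChunks(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const chunks = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (isWordLike || chunks.length === 0) {
      // A word starts a new chunk (so does the very first segment).
      chunks.push(segment);
    } else {
      // Spaces and punctuation stay attached to the preceding chunk.
      chunks[chunks.length - 1] += segment;
    }
  }
  return chunks;
}

console.log(roughChunks("Hello, world! Bye", "en"));
// → ["Hello, ", "world! ", "Bye"]
```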

The no/generic locale thing is important to multilingual users who are forced to type Chinese/Arabic/etc. into English web apps. Or English expats typing English into local Japanese websites. etc. Not everything can be localized properly for all languages.

@FrankYFTang
Contributor

Yes, but my understanding was that it would take years for anyone to agree on an API for line breaking. I assumed the compromise was that the word segmenting API was designed to be rich enough to allow programmers to construct a rudimentary line segmenter themselves.

No. That is exactly why I think it is a bad idea not to support line break. Apple's engineers believe that if we do not support line break, nobody will attempt to build line breaking in ECMAScript, and people will use HTML and CSS to break lines instead. They claim that if we support line break, people will NOT use CSS + HTML for line breaking, and that if we do not support it, no one will use word break for that purpose. I am afraid that if we do not support line break, people will use word break incorrectly to implement line break. And your reasoning totally fulfills my prediction. Word break is NOT a subset of line break, and line break is not a subset of word break either. They follow two different systems. For SOME languages, line breaks may depend on word breaks, but only for those languages.
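A small illustration of the mismatch, as a sketch: word segmentation reports a boundary between a word and the punctuation that follows it, while the Unicode line breaking rules (UAX #14), as far as I understand them, do not allow a break directly before a comma or an exclamation mark. The helper name `segmentsOf` is hypothetical.

```javascript
// Hypothetical helper: list every word-granularity segment, including
// punctuation and whitespace.
function segmentsOf(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "word" });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

console.log(segmentsOf("Hello, world!"));
// → ["Hello", ",", " ", "world", "!"]
// There is a word boundary between "Hello" and ",", but a line must not
// break there, so word boundaries cannot be used directly as break points.
```

The opposite mismatch matters too: in Chinese or Japanese, a line may generally break between characters inside what a word segmenter reports as a single word.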

It would be nice if you could argue your use case: WHY you need to break lines yourself rather than depending on CSS to break them for you. Apple's engineers believe that everyone who needs line breaking COULD use the CSS line break facility and should not do it themselves in JavaScript, in particular because line breaking also needs glyph boundary information, which is not accessible from JavaScript.

@FrankYFTang
Contributor

would take years for anyone to agree on an API for line breaking

That is not the reason. The reason is that deciding where to break a line requires two things together:

  1. the logical line break points - where the text COULD break
  2. the font metrics

The layout NEEDS BOTH to perform line breaking, but JavaScript has no support for (2) now.
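How the two inputs combine can be sketched as a greedy wrap. This is an illustration under stated assumptions, not how any browser does it: `wrapGreedy` and `fixedWidth` are hypothetical names, `chunks` stands for text already split at its logical break points (1), and `measure` stands for whatever font metrics the application has (2). The fixed-width metric below is a stand-in so the example runs anywhere.

```javascript
// Greedy wrap. `chunks` are unbreakable pieces whose boundaries are the
// logical break opportunities; `measure` returns the rendered width of a string.
function wrapGreedy(chunks, measure, maxWidth) {
  const lines = [];
  let current = "";
  for (const chunk of chunks) {
    const candidate = current + chunk;
    // Trailing spaces are ignored when checking whether the line fits.
    if (current !== "" && measure(candidate.trimEnd()) > maxWidth) {
      lines.push(current.trimEnd());
      current = chunk.trimStart();
    } else {
      current = candidate;
    }
  }
  if (current.trimEnd() !== "") lines.push(current.trimEnd());
  return lines;
}

// Stand-in metric for illustration only: every character is 1 unit wide.
const fixedWidth = (s) => s.length;

console.log(wrapGreedy(["The ", "quick ", "brown ", "fox"], fixedWidth, 10));
// → ["The quick", "brown fox"]
```

Without real break points the chunks are wrong, and without real metrics the widths are wrong, which is why both are needed.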

@my2iu
Author

my2iu commented May 1, 2021

Well, that’s a little silly then. There’s a lot of people trying to standardize custom font rendering that need this. HTML/CSS are too high level for modern HTML5 web apps. Anyone doing HTML5 canvas games, WebGL, WebVR, WebXR, svg charts and graphs, visualization, or anything graphical will have their own low-level text rendering. I, myself, have written my own vector graphics tool in HTML5, and I had to use low-level libraries Typr.js and Harfbuzz to do my text rendering. And I need a low-level internationalization library to do my word breaking and line breaking, but the full ICU with all its dictionaries and stuff is too big to include on a web page. Since all web browsers already include the ICU in some way, I always thought the whole purpose of this standardization effort was to provide an API to let programmers access it instead of forcing them to download it or to use some server solution.

That is not the reason- the reason is in order to decide line breaking, it require two thing together
the logical line break points - where the text COULD break
the font metrics -
and the layout NEED BOTH to perform line break but JavaScript has no support of (2) now.

Yeah, I assumed the hold-up was that you couldn’t agree on an API for this, not that you couldn’t agree on whether line breaking was actually needed.

@FrankYFTang
Contributor

not that you couldn’t agree on whether line breaking was actually needed.

Actually, THAT IS the disagreement. I believe it is needed, but OTHERS believe line break is not needed as a JavaScript API. It is very easy for me, as a V8 implementer, to add line break support, but we have a hard time convincing Apple to agree with us. They believe the problem should be solved only by HTML and CSS, not at the JavaScript level, and that if JavaScript provides such support it will be misused and damage the web. I believe that if we do not support line break, people who need it will misuse word break to implement it anyway, and the damage will be even worse.

If you strongly believe adding line break support is essential, please file another issue here requesting a line break granularity, and state the use case and motivation clearly. I will then try to reopen the issue in the ECMA402 committee and TC39 and ask for reconsideration.

@FrankYFTang
Contributor

Also, I would suggest you file bugs asking for line break support with V8 (https://bugs.chromium.org/p/v8/issues/, assigned to me ( [email protected] ), Components: Internationalization), Mozilla, Microsoft Edge, and JSC. If all browser vendors receive more feature requests and agree with you, it may pressure them to accept the feature in ECMA402.

@my2iu
Author

my2iu commented May 3, 2021

I don't think I have the permissions to assign bugs to users or components in v8, so if I were to create a v8 bug, it would probably get lost in triage. It might be better if you were to create it then.

@FrankYFTang
Contributor

FrankYFTang commented May 6, 2021 via email

@my2iu
Author

my2iu commented May 6, 2021

I think I already ended up filing it incorrectly, but I’ll let you fix it up:

https://bugs.chromium.org/p/v8/issues/detail?id=11744
