Next - Unicode 11 support #24

Open
wants to merge 16 commits into
base: master
Conversation

JLHwung
Contributor

@JLHwung JLHwung commented Dec 8, 2018

This should be considered a breaking change because:

  • An ES module is now exported instead of a CommonJS module. Users have to change
const GraphemeSplitter = require("grapheme-splitter")

to

import GraphemeSplitter from "grapheme-splitter"

or, if they are using a legacy environment,

const GraphemeSplitter = require("grapheme-splitter").default

Other than that, the API is stable.

  • The new implementation now conforms to Unicode 11.
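
For consumers who need to accept both the old CommonJS export and the new ES module export during a transition, a small interop helper is one option. This is only a sketch of how a consumer might cope, not something this PR ships: `resolveDefault` is a hypothetical name, and the fake module objects stand in for the result of `require("grapheme-splitter")`.

```javascript
// Hypothetical helper: returns the constructor whether the module was
// published as CommonJS (module.exports = Class) or as a transpiled
// ES module (exports.default = Class).
function resolveDefault(mod) {
  return mod && mod.default !== undefined ? mod.default : mod;
}

// Stand-ins for the two module shapes; in real code this would be
// resolveDefault(require("grapheme-splitter")).
class FakeSplitter {}
const commonJsShape = FakeSplitter;
const esModuleShape = { __esModule: true, default: FakeSplitter };

console.log(resolveDefault(commonJsShape) === FakeSplitter); // true
console.log(resolveDefault(esModuleShape) === FakeSplitter); // true
```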

Dev Infrastructure Changes:

  • Added scripts to convert GraphemeBreakProperty.txt to JavaScript snippet.
  • Added scripts to convert emoji-data.txt to JavaScript snippet.
  • Documented the usage of these maintenance scripts.
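
To give a rough idea of what converting GraphemeBreakProperty.txt involves, here is a minimal sketch that parses lines in that file's layout into sorted ranges. It is not the script from this PR: `parseBreakProperties` and `sampleData` are hypothetical names, and the three sample lines are a tiny excerpt in the file's format rather than the full data.

```javascript
// Minimal sketch: turn UCD-style property lines into sorted
// [start, end, property] ranges. The real scripts in this PR may differ.
const sampleData = `
# comment lines and blank lines are ignored
000D          ; CR # Cc       <control-000D>
000A          ; LF # Cc       <control-000A>
0600..0605    ; Prepend # Cf   [6] ARABIC NUMBER SIGN..ARABIC NUMBER MARK ABOVE
`;

function parseBreakProperties(text) {
  const ranges = [];
  for (const rawLine of text.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // drop the trailing comment
    if (line === "") continue;
    const [codePoints, property] = line.split(";").map((part) => part.trim());
    // A single code point has no "..", so the end defaults to the start.
    const [startHex, endHex = startHex] = codePoints.split("..");
    ranges.push([parseInt(startHex, 16), parseInt(endHex, 16), property]);
  }
  return ranges.sort((a, b) => a[0] - b[0]);
}

const parsedRanges = parseBreakProperties(sampleData);
console.log(JSON.stringify(parsedRanges));
```

Emitting the result with `JSON.stringify` is one simple way to produce a JavaScript snippet that can be pasted or written into a source file.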

@orling Could you enable Travis on this repository so that I can set up CI? It would be good to prove that the software works as expected.

@JLHwung JLHwung mentioned this pull request Dec 8, 2018
4 tasks
@JLHwung JLHwung changed the title WIP: Next Next - Unicode 11 support Dec 8, 2018
@@ -12,6 +12,10 @@
}
],
"main": "index.js",
"files": [
Contributor Author

@JLHwung JLHwung Dec 8, 2018


As we have constrained the files here, only the built library and LICENSE/README will be published. The src, scripts, and dev infrastructure such as babel.config.js will not be included in our npm package, which keeps the node_modules footprint small.

Here is the result of npm pack:

npm notice
npm notice 📦  grapheme-splitter@1.0.4
npm notice === Tarball Contents ===
npm notice 1.1kB   package.json
npm notice 839B    index.d.ts
npm notice 145.1kB index.js
npm notice 1.1kB   LICENSE
npm notice 5.2kB   README.md
npm notice === Tarball Details ===
npm notice name:          grapheme-splitter
npm notice version:       1.0.4
npm notice filename:      grapheme-splitter-1.0.4.tgz
npm notice package size:  32.7 kB
npm notice unpacked size: 153.3 kB
npm notice shasum:        c3455c5317b8c40b340d7c7035b78edf11c561e7
npm notice integrity:     sha512-p+i2AbQ0PNq/T[...]3C7vTVdWa8gLQ==
npm notice total files:   5

@orling
Owner

orling commented Dec 11, 2018 via email

@JLHwung
Contributor Author

JLHwung commented Dec 11, 2018 via email

@rebirthtobi

Hi @orling,

Hoping this could be approved soon

@orling
Owner

orling commented Oct 24, 2019

Better not break APIs, even for minor things.

Also this pull request contains several unrelated changes, making it very time-consuming to review and risky to merge

@rebornix

rebornix commented Oct 29, 2019

@orling @JLHwung thanks for the good work done by both of you. I'm trying to improve the unicode segmentation for VS Code/Monaco Editor and find this project already doing most of the work. I made some changes in my own fork https://github.com/rebornix/grapheme-splitter/tree/perf, including

My changes are unlikely to be merged upstream, but I'd still like to share them here in case you are interested.

@JLHwung
Contributor Author

JLHwung commented Oct 29, 2019

@orling Oh it was my code almost a year ago. I can split that into different PRs.

@jasonsbarr

@rebornix oh nice, you refactored those incredibly long if conditions. Any issues you've found with your version, or is it as accurate as the original?

@rebornix

@jasonsbarr I didn't run into any weird issue in all my use cases (and it passed the test suites.)

@mattpauldavies

@orling amazing work on creating this library... it's been extremely useful for us.

I was also super impressed with @JLHwung's work (and I personally agree with the recommendation to move to Typescript).

I urgently needed Unicode 13 support, so I forked this pull request and have created Graphemer

It includes the following:

  • new documentation to make the library easier to maintain
  • updated to include Unicode 13
  • refactored in Typescript

If you'd like to discuss consolidating those efforts into this library, that would work for me. Or, if life is getting in the way and grapheme-splitter is a bit much to maintain, I'd really appreciate support on the Graphemer project.

@xorgy

xorgy commented Feb 12, 2021

@mattpauldavies I stupidly went ahead and factored grapheme-splitter into a module without classes; not noticing this PR at all, nor your project. I'm going to look into factoring your Unicode 13+ work into my module.

Given that these classes have no actual state and are completely pointless, I feel like these should just be functions.

@mattpauldavies

Not stupid at all @xorgy. I agree the classes don't provide any real benefit. I kept them as I wanted a direct swap and I was already using grapheme-splitter.

How would you feel about refactoring Graphemer to use functions? We could then either release a v2.0 (with breaking changes) or we could map the functions to a sort of proxy class that would provide backwards compatibility.

I want to think about it a bit, but if the functions were split into separate files it would make maintenance easier. Especially updating to new Unicode versions.

If that doesn't vibe with you feel free to take the Unicode 13 work and crack on!
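
The proxy-class idea mentioned above could be sketched roughly as follows. Everything here is an assumption for illustration: the functions are stand-ins backed by `Intl.Segmenter` (assumed to be available in the runtime) so that the sketch runs, and `GraphemeSplitterCompat` is a hypothetical name, not Graphemer's actual API.

```javascript
// Stand-in functional API. A real module would use its own segmentation
// code instead of Intl.Segmenter.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

function* iterateGraphemes(str) {
  for (const { segment } of graphemeSegmenter.segment(str)) yield segment;
}

function splitGraphemes(str) {
  return Array.from(iterateGraphemes(str));
}

function countGraphemes(str) {
  let count = 0;
  for (const _ of iterateGraphemes(str)) count++;
  return count;
}

// Thin class kept only so that existing callers written as
// `new GraphemeSplitter().splitGraphemes(...)` keep working.
class GraphemeSplitterCompat {
  splitGraphemes(str) { return splitGraphemes(str); }
  iterateGraphemes(str) { return iterateGraphemes(str); }
  countGraphemes(str) { return countGraphemes(str); }
}

const compat = new GraphemeSplitterCompat();
console.log(compat.splitGraphemes("abcd")); // ["a", "b", "c", "d"]
console.log(compat.countGraphemes("abcd")); // 4
```

With this shape the functions stay the real implementation, and the class adds no state of its own.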

@xorgy

xorgy commented Feb 13, 2021

@mattpauldavies For my immediate use case, I ended up writing https://github.com/xorgy/grapheme-iterator from scratch instead. (I don't suggest anyone use it right now, though: I'm not 100% confident in the correctness of my state machine, and the generator is a bit of a mess since I wrote the state machine directly from reading the spec.)

I think the approach is pretty good though, my classify function is much faster than the equivalent in grapheme-splitter, so overall grapheme-iterator is about twice as fast as grapheme-splitter, even when you compare using the iterator just for counting (throwing away the values) and comparing that to the hand-written countGraphemes loop in grapheme-splitter.

classify uses a table computed directly from the Unicode 13.0 GraphemeBreakProperty.txt file, and when 13.1 comes out it should Just Work™ by what I know about that standards process.

The other benefit is that none of the symbols in grapheme-iterator need to be preserved when minifying. Overall it ends up about 3500 bytes gzipped with the table, even with no name mangling.
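
For readers unfamiliar with table-driven classification, one common way to do it is a binary search over sorted code point ranges. The sketch below is only an illustration of that general technique, not the actual `classify` from grapheme-iterator: `classifySketch` and `BREAK_TABLE` are hypothetical names, and the table holds just a few sample ranges rather than the full Unicode data.

```javascript
// Illustrative table only: [start, end, property], sorted by start.
// A real table would be generated from GraphemeBreakProperty.txt.
const BREAK_TABLE = [
  [0x000a, 0x000a, "LF"],
  [0x000d, 0x000d, "CR"],
  [0x0600, 0x0605, "Prepend"],
  [0x1f1e6, 0x1f1ff, "Regional_Indicator"],
];

function classifySketch(codePoint) {
  let low = 0;
  let high = BREAK_TABLE.length - 1;
  while (low <= high) {
    const mid = (low + high) >> 1;
    const [start, end, property] = BREAK_TABLE[mid];
    if (codePoint < start) high = mid - 1;
    else if (codePoint > end) low = mid + 1;
    else return property;
  }
  return "Other"; // code points not listed in the table
}

console.log(classifySketch(0x0603)); // "Prepend"
console.log(classifySketch(0x41));   // "Other"
```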

@xorgy

xorgy commented Feb 13, 2021

@mattpauldavies I think maybe I could make a CommonJS version of it (might need this myself anyway, if I want to use it from a CommonJS-based node app), and write a GraphemeSplitter interface emulator on top of that; then new GraphemeSplitter could just depend on the cleaner module.

Or somebody else could do that, it's only a couple hundred lines of code, and you'd only really need to touch about a dozen of them.

@xorgy

xorgy commented Feb 16, 2021

Also now instead of being just 2x faster than GraphemeSplitter, grapheme-iterator is about 22x faster.

@JLHwung
Contributor Author

JLHwung commented Feb 22, 2021

Note that Intl.Segmenter is a stage 3 ES proposal and has been implemented by Chrome.

The GraphemeSplitter interface

const splitter = new GraphemeSplitter();
splitter.splitGraphemes("abcd"); // returns ["a", "b", "c", "d"]

can be replaced by

const segmenter = new Intl.Segmenter(undefined, {granularity: "grapheme"}); // first argument is locales, second is options
[...segmenter.segment("abcd")] // returns [{segment: "a", index: 0, input: "abcd"} , ... , {segment: "d", index: 3, input: "abcd"}]

@orling Consider leaving a note in the README that suggests transitioning to Intl.Segmenter.
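
Since `segment()` yields objects rather than strings, callers migrating from `splitGraphemes` would probably want a small adapter. A minimal sketch, assuming a runtime that provides `Intl.Segmenter` and passing `undefined` as the locales argument to use the default locale; `splitWithSegmenter` is a hypothetical name:

```javascript
const nativeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Returns plain strings, matching the shape of splitGraphemes().
function splitWithSegmenter(str) {
  return Array.from(nativeSegmenter.segment(str), (part) => part.segment);
}

console.log(splitWithSegmenter("abcd")); // ["a", "b", "c", "d"]
```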

@ljharb

ljharb commented Feb 22, 2021

There'd still need to be a polyfill, otherwise most websites won't be able to rely on it for about 5-10 years.

@xorgy

xorgy commented Feb 26, 2021

It also seems that Intl.Segmenter involves the rest of TR29 as well, not just grapheme breaking, and the proposed API selects which of these segmenters you use through a "granularity" property in an object, which means that it is not trivial to just polyfill the bit that you want. If you wanted to have a polyfill mechanism, you'd want it the other way around: start with a grapheme splitter and try to use Intl.Segmenter.
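
The "other way around" approach could look something like the sketch below: keep a userland splitter as the baseline and switch to `Intl.Segmenter` only when it is detected. `createSplitter` is a hypothetical name, and the default fallback here splits by code point purely as a placeholder; it is not a correct grapheme splitter and would be replaced by a real one such as `splitGraphemes`.

```javascript
// `fallbackSplit` is whatever userland splitter is already in use.
// The default below is a code point split and only a placeholder.
function createSplitter(fallbackSplit = (str) => Array.from(str)) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const native = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return (str) => Array.from(native.segment(str), (part) => part.segment);
  }
  return fallbackSplit;
}

const split = createSplitter();
console.log(split("abcd"));
```

Because the caller only ever sees a function from string to array of strings, none of the other `Intl.Segmenter` granularities need to be emulated.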

P.S. I think grapheme-iterator is working correctly now, need to find a better test suite than the one from the Unicode Consortium (which doesn't even include examples for each GraphemeBreakProperty (!)), but when I find such a test suite I'll put it up as 1.0.

@AlexRMU

AlexRMU commented Nov 12, 2024

🤔

9 participants