Various improvements to Unicode handling #57
Before this commit, words could only contain alphabetic and numeric characters. This is unduly limiting; for example, combining marks (like accents) are not included. This commit expands the set of allowed characters based on the recommendations of UAX 31 <https://www.unicode.org/reports/tr31/>, with a bias toward allowing more characters (though emoji are excluded). In addition, unassigned and private use characters are assumed to be allowed in words (ensuring that case conversion will pass them through unchanged). This change means we can no longer rely only on the Unicode data tables shipped with the standard library. A new `tables` binary crate generates the tables we need, which take up 3600 bytes of static data.
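To make the difference concrete, here is a minimal sketch contrasting the old rule with a broadened one. Both function names are hypothetical, and the two hard-coded ranges (the Combining Diacritical Marks block and the BMP private use area) are only a tiny stand-in for the UAX 31-based tables this commit generates; they are not the crate's actual character set.

```rust
/// The old rule as described above: only alphabetic and numeric characters.
fn old_is_word_char(c: char) -> bool {
    c.is_alphanumeric()
}

/// Hypothetical broadened rule. The real set comes from generated tables;
/// these two ranges are illustrative assumptions only.
fn sketch_is_word_char(c: char) -> bool {
    c.is_alphanumeric()
        // Combining Diacritical Marks block (accents and similar marks).
        || ('\u{0300}'..='\u{036F}').contains(&c)
        // Private use area in the Basic Multilingual Plane.
        || ('\u{E000}'..='\u{F8FF}').contains(&c)
}
```

With a decomposed "café" (`"cafe\u{0301}"`), the old predicate rejects the combining acute accent, so the accent would be treated as a separator and lost; the broadened predicate keeps the whole string as one word.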
Requires 5945 additional bytes of static data. Some existing tests had to be modified, as the old algorithm sometimes inserted word boundaries after digits in cases where the new one does not.
Some Unicode characters consist of a pair (or even triple) of letters; when title-casing, only the first member of the pair should be capitalized. For example, U+01C6 (dž) uppercases to U+01C4 (DŽ) but titlecases to U+01C5 (Dž). This adds 2160 bytes of static data.
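As an illustration of the rule above, the following sketch titlecases a single word using only the standard library plus a hand-written mapping for the four Latin digraph groups (DŽ, LJ, NJ, DZ). The names `titlecase_digraph` and `title_case_word` are hypothetical and are not the crate's API; the fallback to `char::to_uppercase` is a simplification that ignores other characters whose titlecase and uppercase forms differ.

```rust
/// Hypothetical helper: titlecase form of the Latin digraph characters.
/// Each group is (uppercase, titlecase, lowercase) in consecutive code points.
fn titlecase_digraph(c: char) -> Option<char> {
    match c {
        '\u{01C4}'..='\u{01C6}' => Some('\u{01C5}'), // DŽ, Dž, dž -> Dž
        '\u{01C7}'..='\u{01C9}' => Some('\u{01C8}'), // LJ, Lj, lj -> Lj
        '\u{01CA}'..='\u{01CC}' => Some('\u{01CB}'), // NJ, Nj, nj -> Nj
        '\u{01F1}'..='\u{01F3}' => Some('\u{01F2}'), // DZ, Dz, dz -> Dz
        _ => None,
    }
}

/// Titlecases the first character of `word` and lowercases the rest.
fn title_case_word(word: &str) -> String {
    let mut chars = word.chars();
    let mut out = String::with_capacity(word.len());
    if let Some(first) = chars.next() {
        match titlecase_digraph(first) {
            Some(t) => out.push(t),
            // Simplification: assume uppercase equals titlecase elsewhere.
            None => out.extend(first.to_uppercase()),
        }
        out.push_str(&chars.as_str().to_lowercase());
    }
    out
}
```

For the input `"\u{01C6}em"` (džem), plain `str::to_uppercase` yields `"\u{01C4}EM"`, while this sketch yields `"\u{01C5}em"`, capitalizing only the first letter of the pair.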
None => {
    // Nonspacing marks are ignored for the purpose of determining boundaries.
    if !tables::is_nonspacing_mark(c) {
        prev_was_lowercase_or_non_greek_titlecase = false;
Remove this line to restore the old behavior of breaking after the `1` in `a1B`.
Ensures that the new rules are strictly more permissive than the old ones.
Hey, this seems like a lot to take in. I don't think I'll have the time / motivation to review it anytime soon. I'd be open to just making you a maintainer of this crate and letting you merge whenever you think it's good enough, but only if it was entirely my crate, which it isn't. I can't even access repo settings, and I also feel like I shouldn't make decisions like that on my own. If you care to have this new implementation (which on the face of it seems really good) under the name of this crate instead of contributing it to another library like For extra context, this crate used to rely on the
This PR makes several changes to improve the crate's Unicode handling.
Expand the set of characters allowed in words. Currently, only alphabetic and numeric characters are allowed; combining diacritics (accents, for example) are excluded. This PR expands the set of supported characters to a list based on UAX 31 rules, while also allowing private-use and unassigned characters. This ensures that accents won't be stripped upon case-folding, among other benefits. (Notably, Rust uses UAX 31 to determine what strings are valid identifiers, though with somewhat different tailoring.) You can browse the full list of supported characters with this tool.
`ID_Compat_Math_Continue` mathematical symbols, or alternatively include emoji, currency symbols, or other symbols.
Update the word boundary definition to match the UTS 55 definition of an "identifier word boundary". This gives the same results in most cases; however, nonspacing combining marks are now properly ignored, and titlecase letters are handled correctly as well.
Proper title casing. Some Unicode characters consist of a pair (or even triple) of letters; when title-casing, only the first letter of the group should be capitalized. For example, U+01C6 (dž) uppercases to U+01C4 (DŽ) but titlecases to U+01C5 (Dž).
Unfortunately, these changes mean that the Unicode APIs exposed by the standard library no longer suffice. Therefore, we now include 11705 bytes of static data, in tables generated by a new `tables` binary crate. There likely exist cleverer ways of compressing this.
Fixes #55
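For readers unfamiliar with why nonspacing marks matter for boundary detection, here is a deliberately simplified sketch. It splits only at lowercase-to-uppercase transitions and is not the UTS 55 algorithm or the code in this PR; `is_nonspacing_mark_sketch` is a hypothetical stand-in for `tables::is_nonspacing_mark` that covers just the Combining Diacritical Marks block, and `split_camel` is likewise an illustrative name.

```rust
/// Hypothetical stand-in for `tables::is_nonspacing_mark`:
/// recognizes only U+0300..=U+036F (Combining Diacritical Marks).
fn is_nonspacing_mark_sketch(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Splits `s` before each uppercase letter that follows a lowercase letter,
/// ignoring any nonspacing marks that sit between the two.
fn split_camel(s: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut start = 0;
    let mut prev_was_lowercase = false;
    for (i, c) in s.char_indices() {
        if is_nonspacing_mark_sketch(c) {
            // A mark attaches to the preceding letter, so it must not
            // reset the "previous letter was lowercase" state.
            continue;
        }
        if c.is_uppercase() && prev_was_lowercase {
            words.push(&s[start..i]);
            start = i;
        }
        prev_was_lowercase = c.is_lowercase();
    }
    if start < s.len() {
        words.push(&s[start..]);
    }
    words
}
```

Given `"cafe\u{0301}Bar"`, the accent follows the lowercase `e`. Because the mark is skipped, the boundary before `B` is still found and the accent stays attached to the first word; a version that treated the mark as an ordinary non-lowercase character would miss that boundary.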