Various improvements to Unicode handling #57

Jules-Bertholet · 2024-03-15T01:35:15Z

This PR makes several changes to improve the crate's Unicode handling.

- Expand the set of characters allowed in words. Currently, only alphabetic and numeric characters are allowed; combining diacritics (accents, for example) are excluded. This PR expands the set of supported characters to a list based on UAX 31 rules, while also allowing private-use and unassigned characters. This ensures that accents won't be stripped upon case-folding, among other benefits. (Notably, Rust uses UAX 31 to determine what strings are valid identifiers, though with somewhat different tailoring.) You can browse the full list of supported characters with this tool.
  - This is the portion of the PR with the greatest room for bikeshedding. For example, we could exclude the ID_Compat_Math_Continue mathematical symbols, or alternatively include emoji, currency symbols, or other symbols.
- Update the word boundary definition to match the UTS 55 definition of an "identifier word boundary". This gives the same results in most cases; however, nonspacing combining marks are now properly ignored, and titlecase letters are handled correctly as well.
  - In the old implementation, an uppercase character could never come after a lowercase character inside a word. The UTS 55 rules adopted by this PR, however, allow it when there are caseless characters (like numbers) between the two cased characters. If the old behavior is deemed preferable, restoring it is a one-line change.
- Proper title casing. Some Unicode characters consist of a pair (or even triple) of letters; when title-casing, only the first letter of the group should be capitalized. For example, U+01C6 (ǆ) uppercases to U+01C4 (Ǆ) but titlecases to U+01C5 (ǅ).

Unfortunately, these changes mean that the Unicode APIs exposed by the standard library no longer suffice. Therefore, we now include 11705 bytes of static data, in tables generated by a new tables binary crate. There likely exist cleverer ways of compressing this.
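
For readers unfamiliar with static Unicode tables, here is a minimal sketch of one common representation: a sorted list of inclusive code point ranges searched with binary search. This is not the format the `tables` crate actually generates; the names `COMBINING_MARK_RANGES` and `in_table` are hypothetical, and the three ranges are whole Unicode blocks chosen only for illustration.

```rust
use std::cmp::Ordering;

/// Hypothetical table (not the PR's data): sorted, non-overlapping,
/// inclusive code point ranges. Each entry covers a whole Unicode block.
static COMBINING_MARK_RANGES: &[(u32, u32)] = &[
    (0x0300, 0x036F), // Combining Diacritical Marks
    (0x1AB0, 0x1AFF), // Combining Diacritical Marks Extended
    (0x20D0, 0x20FF), // Combining Diacritical Marks for Symbols
];

/// Returns true if `c` falls inside one of the ranges in `table`.
/// The table must be sorted by range start for the binary search to work.
fn in_table(table: &[(u32, u32)], c: char) -> bool {
    let cp = c as u32;
    table
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less // range lies entirely below the code point
            } else if lo > cp {
                Ordering::Greater // range lies entirely above the code point
            } else {
                Ordering::Equal // code point is inside this range
            }
        })
        .is_ok()
}
```

With this layout each range costs 8 bytes, so table size grows with the number of ranges rather than the number of characters, which is why range merging is a typical first step when trying to shrink such data.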

Fixes #55

Before this commit, words could only contain alphabetic and numeric characters. This is unduly limiting; for example, combining marks (like accents) are not included. This commit expands the set of allowed characters based on the recommendations of UAX 31 <https://www.unicode.org/reports/tr31/>, with a bias toward allowing more characters (though emoji are excluded). In addition, unassigned and private use characters are assumed to be allowed in words (ensuring that case conversion will pass them through unchanged). This change means we can no longer rely only on the Unicode data tables shipped with the standard library. A new `tables` binary crate is in charge of generating the tables we need, which take up 3600 bytes of static data.
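
The limitation described above can be reproduced with the standard library alone. The sketch below assumes the old rule was equivalent to `char::is_alphanumeric` (as the description states) and contrasts it with a deliberately tiny, hypothetical wider predicate that only adds the Combining Diacritical Marks block; the real UAX 31-based set in this PR is much larger.

```rust
/// The pre-PR notion of a word character, as described: alphabetic or numeric.
fn is_word_char_old(c: char) -> bool {
    c.is_alphanumeric()
}

/// Hypothetical wider predicate for illustration only: additionally accepts
/// the Combining Diacritical Marks block (U+0300..=U+036F).
fn is_word_char_wider(c: char) -> bool {
    c.is_alphanumeric() || ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Keeps only the characters accepted by `pred`, mimicking how characters
/// outside the word set get dropped during conversion.
fn keep_word_chars(s: &str, pred: fn(char) -> bool) -> String {
    s.chars().filter(|&c| pred(c)).collect()
}
```

For the decomposed string `"cafe\u{0301}"` (the letter e followed by U+0301 COMBINING ACUTE ACCENT), `keep_word_chars` with the old predicate yields `"cafe"`, losing the accent, while the wider predicate keeps the string intact.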

Requires 5945 additional bytes of static data. Some existing tests had to be modified, as the old algorithm sometimes inserted word boundaries after digits in cases where the new one does not.
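
To make the digit case concrete, here is a heavily simplified sketch of lowercase-to-uppercase boundary detection. It is not the crate's implementation: it ignores nonspacing marks, titlecase letters, acronym handling, and separators, and the function name `split_words` and the flag `reset_on_caseless` are hypothetical. The flag switches between the behavior described for the new rules (a caseless character such as a digit clears the "previous was lowercase" state) and the old one (it does not).

```rust
/// Splits `s` before every uppercase character that follows a lowercase one.
/// When `reset_on_caseless` is true, a caseless character (for example a
/// digit) in between cancels the pending boundary.
fn split_words(s: &str, reset_on_caseless: bool) -> Vec<&str> {
    let mut words = Vec::new();
    let mut start = 0;
    let mut prev_was_lowercase = false;

    for (i, c) in s.char_indices() {
        if c.is_uppercase() {
            if prev_was_lowercase {
                // Boundary between a lowercase run and this uppercase char.
                words.push(&s[start..i]);
                start = i;
            }
            prev_was_lowercase = false;
        } else if c.is_lowercase() {
            prev_was_lowercase = true;
        } else if reset_on_caseless {
            // Caseless character: forget the preceding lowercase letter.
            prev_was_lowercase = false;
        }
    }
    if start < s.len() {
        words.push(&s[start..]);
    }
    words
}
```

Under these assumptions, `split_words("a1B", true)` keeps `a1B` as one word, whereas `split_words("a1B", false)` splits it into `a1` and `B`.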

Some Unicode characters consist of a pair (or even triple) of letters; when title-casing, only the first member of the pair should be capitalized. For example, U+01C6 (ǆ) uppercases to U+01C4 (Ǆ) but titlecases to U+01C5 (ǅ). This adds 2160 bytes of static data.
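
The standard library exposes `char::to_uppercase` but no titlecase mapping, which is why extra data is needed. The sketch below hard-codes only the four Latin digraph groups (DŽ, LJ, NJ, DZ) and falls back to uppercasing for everything else; `titlecase_char` is a hypothetical helper, not the function added by this PR, and the fallback is only an approximation for characters outside those groups.

```rust
/// Hypothetical helper: titlecase a single character.
/// Only the Latin digraph groups are special-cased here.
fn titlecase_char(c: char) -> String {
    match c {
        // Ǆ (U+01C4), ǅ (U+01C5), ǆ (U+01C6) all titlecase to ǅ.
        '\u{01C4}'..='\u{01C6}' => "\u{01C5}".to_string(),
        // Ǉ, ǈ, ǉ all titlecase to ǈ.
        '\u{01C7}'..='\u{01C9}' => "\u{01C8}".to_string(),
        // Ǌ, ǋ, ǌ all titlecase to ǋ.
        '\u{01CA}'..='\u{01CC}' => "\u{01CB}".to_string(),
        // Ǳ, ǲ, ǳ all titlecase to ǲ.
        '\u{01F1}'..='\u{01F3}' => "\u{01F2}".to_string(),
        // Approximation: everything else is simply uppercased.
        _ => c.to_uppercase().collect(),
    }
}
```

For U+01C6, `to_uppercase` produces U+01C4 while `titlecase_char` produces U+01C5, matching the example in the commit message.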

Jules-Bertholet · 2024-03-15T02:23:12Z

src/lib.rs

+                None => {
+                    // Nonspacing marks are ignored for the purpose of determining boundaries.
+                    if !tables::is_nonspacing_mark(c) {
+                        prev_was_lowercase_or_non_greek_titlecase = false;


Remove this line to restore the old behavior of breaking after the 1 in a1B.

Ensures that the new rules are strictly more permissive than the old ones.

jplatte · 2024-03-18T19:37:18Z

Hey, this seems like a lot to take in. I don't think I'll have the time / motivation to review it anytime soon. I'd be open to just making you a maintainer of this crate and letting you merge whenever you think it's good enough, but only if it was entirely my crate, which it isn't. I can't even access repo settings, and I also just feel like I shouldn't make decisions like that on my own.

If you care to have this new implementation (which on the face of it seems really good) under the name of this crate instead of contributing it to another library like convert_case, or making your own, I suggest you contact withoutboats about it via email to ask about becoming a maintainer (this is what I did years ago).

For extra context, this crate used to rely on the unicode-segmentation crate, but that's actually a somewhat heavy dependency, and the main users of this crate seem to be proc-macros that only ever run it on Rust type or field names (i.e. ASCII-only input). A while after I took over maintenance, that dependency was first made optional and then removed completely (with some tests being added in the process).

Jules-Bertholet added 4 commits March 14, 2024 20:57

Use UTS 55 rules for determining word boundaries

4153d58

Requires 5945 additional bytes of static data. Some existing tests had to be modified, as the old algorithm sometimes inserted word boundaries after digits in cases where the new one does not.

Add tests and update documentation

30ea379

Allow all alphabetic and numeric characters in words

ce59241

Ensures that the new rules are strictly more permissive than the old ones.

Jules-Bertholet force-pushed the uts-55 branch from babda2f to ce59241 Compare March 15, 2024 05:04
