Various improvements to Unicode handling #57

Open
wants to merge 5 commits into
base: master

@Jules-Bertholet Jules-Bertholet commented Mar 15, 2024

This PR makes several changes to improve the crate's Unicode handling.

  • Expand the set of characters allowed in words. Currently, only alphabetic and numeric characters are allowed; combining diacritics (accents, for example) are excluded. This PR expands the set of supported characters to a list based on UAX 31 rules, while also allowing private-use and unassigned characters. This ensures that accents won't be stripped upon case-folding, among other benefits. (Notably, Rust uses UAX 31 to determine what strings are valid identifiers, though with somewhat different tailoring.) You can browse the full list of supported characters with this tool.

    • This is the portion of the PR with the greatest room for bikeshedding. For example, we could exclude the ID_Compat_Math_Continue mathematical symbols, or alternatively include emoji, currency symbols, or other symbols.
  • Update the word boundary definition to match the UTS 55 definition of an "identifier word boundary". This gives the same results in most cases; however, nonspacing combining marks are now properly ignored, and titlecase letters are handled correctly as well.

    • In the old implementation, an uppercase character could never come after a lowercase character inside a word. The UTS 55 rules adopted by this PR, however, allow it when there are caseless characters (like numbers) between the two cased characters. If the old behavior is deemed preferable, restoring it is a one-line change.
  • Proper title casing. Some Unicode characters consist of a pair (or even triple) of letters; when title-casing, only the first letter of the group should be capitalized. For example, U+01C6 (dž) uppercases to U+01C4 (DŽ) but titlecases to U+01C5 (Dž).
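
To make the title-casing point concrete, here is a minimal sketch. It assumes a tiny hand-written mapping for the four Latin digraph letters; `to_titlecase_char` and `title_case_word` are hypothetical names for illustration, not the functions or tables this PR adds. The standard library has no titlecase mapping, so everything outside the sample table falls back to `char::to_uppercase`.

```rust
/// Hypothetical helper: titlecase mapping for a handful of digraph letters.
/// This is an illustrative subset, not the PR's generated table.
fn to_titlecase_char(c: char) -> Option<char> {
    match c {
        // DŽ / Dž / dž
        '\u{01C4}' | '\u{01C5}' | '\u{01C6}' => Some('\u{01C5}'),
        // LJ / Lj / lj
        '\u{01C7}' | '\u{01C8}' | '\u{01C9}' => Some('\u{01C8}'),
        // NJ / Nj / nj
        '\u{01CA}' | '\u{01CB}' | '\u{01CC}' => Some('\u{01CB}'),
        // DZ / Dz / dz
        '\u{01F1}' | '\u{01F2}' | '\u{01F3}' => Some('\u{01F2}'),
        _ => None,
    }
}

/// Titlecase the first character of `word` and lowercase the rest.
fn title_case_word(word: &str) -> String {
    let mut chars = word.chars();
    let mut out = String::with_capacity(word.len());
    if let Some(first) = chars.next() {
        match to_titlecase_char(first) {
            // A dedicated titlecase form exists in the sample table.
            Some(t) => out.push(t),
            // Otherwise fall back to the standard uppercase mapping.
            None => out.extend(first.to_uppercase()),
        }
    }
    for c in chars {
        out.extend(c.to_lowercase());
    }
    out
}
```

With this sketch, a word starting with U+01C6 (dž) begins with U+01C5 (Dž) after title-casing, whereas plain uppercasing of the first character would produce U+01C4 (DŽ).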

Unfortunately, these changes mean that the Unicode APIs exposed by the standard library no longer suffice. Therefore, we now include 11705 bytes of static data, in tables generated by a new tables binary crate. There likely exist cleverer ways of compressing this.
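
For readers unfamiliar with static Unicode tables, the following sketch shows one common way such a lookup can work: a sorted slice of inclusive code point ranges searched with binary search. The layout, the name `SAMPLE_RANGES`, and the helper `in_table` are assumptions for illustration only; the actual format produced by the `tables` crate is not shown here and may be quite different.

```rust
use std::cmp::Ordering;

/// Hypothetical table: sorted, non-overlapping inclusive code point ranges.
/// These three ranges are a small sample of nonspacing marks.
static SAMPLE_RANGES: &[(u32, u32)] = &[
    (0x0300, 0x036F), // Combining Diacritical Marks block
    (0x0483, 0x0487), // Cyrillic combining marks
    (0x0591, 0x05BD), // Hebrew accents and points
];

/// Returns true if `c` falls inside one of the ranges in `table`.
fn in_table(table: &[(u32, u32)], c: char) -> bool {
    let cp = c as u32;
    table
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less // range lies entirely below the code point
            } else if lo > cp {
                Ordering::Greater // range lies entirely above the code point
            } else {
                Ordering::Equal // code point is inside this range
            }
        })
        .is_ok()
}
```

In this layout each entry costs 8 bytes (two `u32` values), which is why denser encodings are an obvious avenue for shrinking the data.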

Fixes #55

Before this commit, words could only contain alphabetic and numeric characters.
This is unduly limiting; for example, combining marks (like accents) are not included.
This commit expands the set of allowed characters based on the recommendations of
UAX 31 <https://www.unicode.org/reports/tr31/>, with a bias toward allowing more characters
(though emoji are excluded).
In addition, unassigned and private use characters are assumed to be allowed in words
(ensuring that case conversion will pass them through unchanged).
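
The effect of the narrower rule can be seen with the standard library alone. The sketch below compares an alphanumeric-only check with a hypothetical wider check; `is_word_char_wide` uses two hard-coded ranges as a stand-in and is an assumption for illustration, not the UAX 31-based list this commit generates.

```rust
/// Old-style check: alphabetic or numeric characters only.
fn is_word_char_old(c: char) -> bool {
    c.is_alphanumeric()
}

/// Hypothetical wider check, a stand-in for the generated tables:
/// also accepts combining diacritics and BMP private-use characters.
fn is_word_char_wide(c: char) -> bool {
    c.is_alphanumeric()
        || ('\u{0300}'..='\u{036F}').contains(&c) // Combining Diacritical Marks block
        || ('\u{E000}'..='\u{F8FF}').contains(&c) // BMP Private Use Area
}

/// Keep only the characters that `is_word_char` accepts.
fn keep_word_chars(s: &str, is_word_char: fn(char) -> bool) -> String {
    s.chars().filter(|&c| is_word_char(c)).collect()
}
```

Applied to "cafe" followed by U+0301 (combining acute accent), the old check drops the accent because U+0301 is neither alphabetic nor numeric, while the wider check passes it through.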

This change means we can no longer rely only on the Unicode data tables
shipped with the standard library.
A new `tables` binary crate is in charge of generating the tables we need
(which consume 3600 bytes of data).
Requires 5945 additional bytes of static data.
Some existing tests had to be modified, as the old algorithm
sometimes inserted word boundaries after digits
in cases where the new one does not.
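
A heavily simplified sketch of the lowercase-to-uppercase boundary rule follows. It is not the full UTS 55 algorithm (titlecase letters and runs of uppercase letters are not handled), and `is_nonspacing_mark` here is a hypothetical stand-in covering only the Combining Diacritical Marks block. It illustrates two behaviors: nonspacing marks do not affect the boundary decision, and a caseless character such as a digit resets the state.

```rust
/// Hypothetical stand-in for a real nonspacing-mark table.
fn is_nonspacing_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Split before an uppercase letter that follows a lowercase letter.
fn split_camel(s: &str) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    let mut prev_was_lowercase = false;
    for c in s.chars() {
        if c.is_uppercase() && prev_was_lowercase {
            // Boundary: finish the current word and start a new one.
            words.push(std::mem::take(&mut current));
        }
        current.push(c);
        if c.is_lowercase() {
            prev_was_lowercase = true;
        } else if !is_nonspacing_mark(c) {
            // Uppercase and caseless characters reset the state;
            // nonspacing marks leave it untouched.
            prev_was_lowercase = false;
        }
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

Under these simplified rules, "fooBar" splits into "foo" and "Bar", an accented "e" followed by "A" still splits after the accent, and "a1B" stays in one piece because the digit resets the lowercase state.
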
Some Unicode characters consist of a pair (or even triple) of letters; when title-casing, only the first member of the pair should be capitalized. For example, U+01C6 (dž) uppercases to U+01C4 (DŽ) but titlecases to U+01C5 (Dž).

This adds 2160 bytes of static data.
None => {
    // Nonspacing marks are ignored for the purpose of determining boundaries.
    if !tables::is_nonspacing_mark(c) {
        prev_was_lowercase_or_non_greek_titlecase = false;
Remove this line to restore the old behavior of breaking after the 1 in a1B.

Ensures that the new rules are strictly more permissive than the old ones.
@jplatte jplatte (Collaborator) commented Mar 18, 2024

Hey, this seems like a lot to take in. I don't think I'll have the time / motivation to review it anytime soon. I'd be open to just making you a maintainer of this crate and letting you merge whenever you think it's good enough, but only if it was entirely my crate, which it isn't. I can't even access repo settings, and I also just feel like I shouldn't make decisions like that on my own.

If you care to have this new implementation (which on the face of it seems really good) under the name of this crate instead of contributing it to another library like convert_case, or making your own, I suggest you contact withoutboats about it via email to ask about becoming a maintainer (this is what I did years ago).

For extra context, this crate used to rely on the unicode-segmentation crate, but that's actually somewhat of a heavy dependency, and the main users of this crate seem to be proc-macros that only ever run it on Rust type or field names (i.e. ASCII-only input). A while after I took over maintenance, that dependency was first made optional and then removed completely (with some tests being added in the process).

Development

Successfully merging this pull request may close these issues.

UTS 55 conformance
2 participants