Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement String#unicode_normalize and String#unicode_normalized? #11226

Merged

Conversation

HertzDevil
Copy link
Contributor

Closes #11223.

This is a WIP; the code is in a usable state, but every non-ASCII string performs a full normalization on every call to either method. Ways to optimize these methods will be implemented soon.

The spec file directly downloads the test suite from the Unicode Character Database on each invocation. This is probably better than defining all the ~18k test cases in the spec file itself.

@straight-shoota
Copy link
Member

The spec file directly downloads the test suite from the Unicode Character Database on each invocation. This is probably better than defining all the ~18k test cases in the spec file itself.

This is inacceptable. The spec suite must be able to run without any external components.

I think we could either consider incorporating the spec source data into the repository, or make the spec to run manually.

@ysbaddaden
Copy link
Contributor

That would be very nice to have! I just came across the need to remove diacritics from a String, and normalizing to NFD would make it super easy.

I'd just name the methods without the unicode_ prefix (hence #normalize) but that's just a personal preference. I also wouldn't generate a String in #normalized? when the string is maybe normalized, but iterate each char to check if it's normalized or not, but that can be optimized later.

@HertzDevil
Copy link
Contributor Author

I have no idea how to optimize the slow path for #unicode_normalized? yet. Anyone with a better algorithm in mind could take over.

@HertzDevil HertzDevil marked this pull request as ready for review June 4, 2022 22:56
spec/manual/string_normalize_spec.cr Outdated Show resolved Hide resolved
src/string.cr Outdated Show resolved Hide resolved
@straight-shoota straight-shoota added this to the 1.6.0 milestone Aug 18, 2022
@straight-shoota straight-shoota merged commit eb97f34 into crystal-lang:master Aug 29, 2022
@HertzDevil HertzDevil deleted the feature/unicode-normalize branch August 29, 2022 21:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for Unicode normalization forms
3 participants