Implement `String#unicode_normalize` and `String#unicode_normalized?` #11226

HertzDevil · 2021-09-17T16:58:32Z

This is a WIP; the code is in a usable state, but every non-ASCII string performs a full normalization on every call to either method. Ways to optimize these methods will be implemented soon.

The spec file directly downloads the test suite from the Unicode Character Database on each invocation. This is probably better than defining all the ~18k test cases in the spec file itself.
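For readers following along, here is a minimal sketch of the behaviour described above. It assumes the method signatures as they shipped in the standard library (a `Unicode::NormalizationForm` argument defaulting to `:nfc`). `naive_normalized?` is a hypothetical helper written only for illustration, not the code from this PR: it takes an ASCII fast path and otherwise normalizes the whole string and compares.

```crystal
# Hypothetical helper, not the PR's implementation: it mirrors the
# "full normalization on every call" behaviour for non-ASCII strings.
def naive_normalized?(text : String, form : Unicode::NormalizationForm = :nfc) : Bool
  # ASCII-only strings are already normalized in every form.
  return true if text.ascii_only?
  # Otherwise build the normalized string and compare it with the input.
  text.unicode_normalize(form) == text
end

composed = "\u00E9"    # "é" as a single precomposed code point
decomposed = "e\u0301" # "e" followed by COMBINING ACUTE ACCENT

puts composed.unicode_normalize(:nfd) == decomposed # => true
puts decomposed.unicode_normalize(:nfc) == composed # => true
puts naive_normalized?(decomposed, :nfc)            # => false
puts naive_normalized?("plain ascii")               # => true
```

The comparison allocates a new string for every non-ASCII input, which is the cost the planned optimizations are meant to avoid.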

straight-shoota · 2021-09-17T17:25:09Z

> The spec file directly downloads the test suite from the Unicode Character Database on each invocation. This is probably better than defining all the ~18k test cases in the spec file itself.

This is unacceptable. The spec suite must be able to run without any external components.

I think we could either incorporate the spec source data into the repository or make the spec one that is only run manually.

ysbaddaden · 2022-02-24T10:21:23Z

That would be very nice to have! I just came across the need to remove diacritics from a String, and normalizing to NFD would make it super easy.

I'd just name the methods without the `unicode_` prefix (hence `#normalize`), but that's just a personal preference. I also wouldn't generate a String in `#normalized?` when the string is only maybe normalized; I'd iterate over each char to check whether it's normalized instead. That can be optimized later, though.
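The diacritics use case mentioned above could look roughly like this. This is a sketch under assumptions, not code from the PR: `strip_diacritics` is a hypothetical helper name, and it relies on `Char#mark?` to detect the combining marks that NFD splits off from their base letters.

```crystal
# Hypothetical helper for illustration: decompose to NFD, then drop
# the combining marks that were separated from their base characters.
def strip_diacritics(text : String) : String
  String.build do |io|
    text.unicode_normalize(:nfd).each_char do |char|
      # Char#mark? reports Unicode mark characters such as combining accents.
      io << char unless char.mark?
    end
  end
end

puts strip_diacritics("caf\u00E9 cr\u00E8me") # => cafe creme
```

Letters that have no canonical decomposition, such as "ø" or "ł", pass through unchanged, so this only covers characters whose accents are encoded as combining marks after NFD.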

HertzDevil · 2022-06-04T22:56:41Z

I have no idea how to optimize the slow path for #unicode_normalized? yet. Anyone with a better algorithm in mind could take over.

HertzDevil added 2 commits September 18, 2021 00:32

String#unicode_normalize

6bfc89c

String#unicode_normalized?

d1a58de

HertzDevil added kind:feature topic:stdlib:text labels Sep 17, 2021

HertzDevil added 3 commits September 18, 2021 01:29

Make normalize_spec.cr a manual spec

0da6694

Quick check for normalization forms

ce89caa

Merge remote-tracking branch 'upstream/master' into feature/unicode-normalize

036b525

Merge remote-tracking branch 'upstream/master' into feature/unicode-normalize

9f4bc4c

HertzDevil marked this pull request as ready for review June 4, 2022 22:56

straight-shoota approved these changes Jun 8, 2022

spec/manual/string_normalize_spec.cr (outdated)

src/string.cr (outdated)

HertzDevil added 2 commits August 18, 2022 20:54

Merge remote-tracking branch 'upstream/master' into feature/unicode-normalize

6f9f91b

fixup

67c1e71

straight-shoota approved these changes Aug 18, 2022

straight-shoota added this to the 1.6.0 milestone Aug 18, 2022

straight-shoota merged commit eb97f34 into crystal-lang:master Aug 29, 2022

HertzDevil deleted the feature/unicode-normalize branch August 29, 2022 21:32
