Improve Latin normalization

This improves the normalization for Latin characters, mainly to
address the concerns in #51. This adds a very large number of new
normalizations, especially in the 'Latin Extended Additional' block
which for some reason was missing every capital letter.

I did not add normalizations in any new Unicode blocks, but I did
slightly extend the 'Latin 1' block to also capture some of the
subscripts; this is for consistency with the 'Subscripts and
Superscripts' block which was previously handled. I also preserved
the actual implementation of the `normalize` function in terms of
the check order, etc. In particular, the generated code should be
approximately the same. To verify this, I ran some crude
benchmarks on a variety of inputs (all ASCII, sparse Unicode, heavy
Unicode, all outside the normalization ranges) and saw no
observable difference, though the benchmarks were not rigorous.
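
For illustration, a crude benchmark over the four input shapes listed above could look roughly like the sketch below. This is not the harness that was actually used: `normalize_stub`, `sample_inputs`, and `time_normalize` are hypothetical names, and the stub only stands in for the real function so the sketch compiles on its own.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the function under test; swap in the real one.
fn normalize_stub(c: char) -> char {
    match c {
        '\u{00C0}'..='\u{00C5}' => 'A', // À..Å
        '\u{00E0}'..='\u{00E5}' => 'a', // à..å
        _ => c,
    }
}

/// Build one string of `len` chars for each input shape (assumed shapes).
fn sample_inputs(len: usize) -> Vec<(&'static str, String)> {
    // Repeat a short pattern until `len` chars have been produced.
    let repeat = |pattern: &str| -> String { pattern.chars().cycle().take(len).collect() };
    vec![
        ("all ASCII", repeat("abcdefghij")),
        // One non-ASCII char in every ten.
        ("sparse Unicode", repeat("abcdefghi\u{00E0}")),
        // Every char falls in a normalized range.
        ("heavy Unicode", repeat("\u{00C0}\u{00E0}\u{00C5}\u{00E5}")),
        // Non-ASCII, but outside every range the stub handles.
        ("outside ranges", repeat("\u{4E2D}\u{6587}")),
    ]
}

/// Time `rounds` passes over `input`; `black_box` keeps the optimizer
/// from discarding the calls.
fn time_normalize(input: &str, rounds: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..rounds {
        for c in input.chars() {
            black_box(normalize_stub(black_box(c)));
        }
    }
    start.elapsed()
}
```

Calling `time_normalize` on each entry of `sample_inputs` before and after the change, in a release build, gives a rough comparison. Timings from a loop like this are noisy, so they can only rule out large regressions.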

Finally, I inlined all of the char blocks, rather than relying on
the 'sparse table' static generation which was implemented
earlier. In particular, `normalization` is now a `const fn`. At
least in my mind it is a bit easier to read in this form. It also
makes it much clearer when characters are missed.
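
As a minimal sketch of what an inlined, `const fn` style of normalization might look like (this is not the code in this commit): the name `normalize_sketch`, its `char -> char` signature, and the handful of mappings are assumptions chosen for illustration.

```rust
/// Hypothetical sketch: map a few accented Latin characters to an ASCII
/// base letter. Only a tiny, illustrative subset of mappings is shown.
pub const fn normalize_sketch(c: char) -> char {
    // Fast path: ASCII is returned unchanged.
    if c.is_ascii() {
        return c;
    }
    match c {
        // Latin-1 Supplement
        '\u{00C0}'..='\u{00C5}' => 'A', // À..Å
        '\u{00E0}'..='\u{00E5}' => 'a', // à..å
        '\u{00C7}' => 'C',              // Ç
        '\u{00E7}' => 'c',              // ç
        // Latin Extended Additional: with capital and lowercase arms side
        // by side, a missing capital is easy to spot in review.
        '\u{1E00}' => 'A', // Ḁ
        '\u{1E01}' => 'a', // ḁ
        '\u{1E02}' => 'B', // Ḃ
        '\u{1E03}' => 'b', // ḃ
        // Anything not listed is passed through unchanged.
        _ => c,
    }
}

/// Because the function is `const`, it can be evaluated at compile time.
pub const NORMALIZED_AT_COMPILE_TIME: char = normalize_sketch('\u{1E00}');
```

With every mapping written as a `match` arm, a gap in a block shows up as a missing line in the source, with no generated table to inspect.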
alexrutar committed Nov 18, 2024
1 parent ef24853 commit 08d0732
Showing 1 changed file with 945 additions and 510 deletions.