This improves the normalization for Latin characters, mainly to address the concerns in #51.

This adds a very large number of new normalizations, especially in the 'Latin Extended Additional' block, which for some reason was missing every capital letter. I did not add normalizations in any new Unicode blocks, but I did slightly extend the 'Latin 1' block to also capture some of the subscripts; this is for consistency with the 'Subscripts and Superscripts' block, which was already handled.

I also preserved the actual implementation of the `normalize` function in terms of the check order, etc. In particular, the generated code should be approximately the same. To verify this, I ran some crude benchmarks on a variety of inputs (all ASCII, sparse Unicode, heavy Unicode, all outside the normalization ranges) and saw no observable difference, though the benchmarks were definitely not rigorous.

Finally, I inlined all of the char blocks, rather than relying on the 'sparse table' static generation that was implemented earlier. In particular, `normalization` is now a `const fn`. At least in my mind it is a bit easier to read in this form. It also makes it much clearer when characters are missed.
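As a rough illustration of what inlining the char blocks into a `const fn` could look like, here is a minimal sketch. It is not the code from this commit: the name `normalize_char`, the handful of mappings, and the arm order are all assumptions chosen for the example.

```rust
/// Hypothetical sketch, not the actual implementation from this commit.
/// Maps a few Latin characters to an ASCII base character and returns
/// every other character unchanged.
pub const fn normalize_char(c: char) -> char {
    match c {
        // ASCII comes first so plain text takes the cheapest path
        // (assumed check order, for illustration only).
        '\u{0000}'..='\u{007F}' => c,

        // Latin-1 Supplement: A/a with grave, acute, circumflex,
        // tilde, diaeresis, and ring above.
        '\u{00C0}'..='\u{00C5}' => 'A',
        '\u{00E0}'..='\u{00E5}' => 'a',

        // Latin Extended Additional: capital and lowercase forms are
        // listed next to each other, so a missing capital is easy to spot.
        '\u{1E00}' => 'A', // A with ring below
        '\u{1E01}' => 'a', // a with ring below
        '\u{1E02}' => 'B', // B with dot above
        '\u{1E03}' => 'b', // b with dot above

        // Subscript digits zero and one.
        '\u{2080}' => '0',
        '\u{2081}' => '1',

        // Anything not listed passes through untouched.
        _ => c,
    }
}

// Because the function is `const`, it can be evaluated at compile time.
pub const NORMALIZED_EXAMPLE: char = normalize_char('\u{1E02}');
```

With each block written out as explicit match arms, a gap such as a missing capital letter shows up directly in review instead of being hidden inside a generated table.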