# Add a Compatibility Decomposition Normalizer, remove Latin normalizer (#166)
## Pull Request

### Related issue

Fixes #139

### What does this PR do?
- Adds a `CompatibilityDecompositionNormalizer`, as outlined in #139 (Implement a Compatibility Decomposition Normalizer).
- Because the `NonSpacingMark` normalizer removes diacritics once text is in normal form, Hebrew, Thai, and Arabic diacritics are now stripped; some tests that weren't in normal form were updated accordingly.
- Removes the Latin normalizer, which is now redundant with `CompatibilityDecompositionNormalizer` and `NonSpacingMark` (the latter modified to also act on the `Latin` script).
- The `°` symbol is no longer normalized to `deg`.
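
For illustration, here is a minimal sketch of the combined behavior, assuming the `unicode-normalization` crate; it is not this repository's implementation (the real normalizers live in the `normalizer` module and apply per-script rules), and `is_combining_mark` matches all of General_Category=Mark, slightly broader than the Nonspacing_Mark (Mn) category the `NonSpacingMark` normalizer targets:

```rust
// Cargo.toml (assumption): unicode-normalization = "0.1"
use unicode_normalization::char::is_combining_mark;
use unicode_normalization::UnicodeNormalization;

/// NFKD-decompose `text`, then drop combining marks, approximating the
/// CompatibilityDecompositionNormalizer + NonSpacingMark pipeline.
fn normalize(text: &str) -> String {
    text.nfkd().filter(|c| !is_combining_mark(*c)).collect()
}

fn main() {
    // "é" decomposes to "e" + U+0301, and the mark is then stripped.
    assert_eq!(normalize("élégant"), "elegant");
    // Compatibility decomposition also unfolds ligatures like U+FB01.
    assert_eq!(normalize("ﬁle"), "file");
    // U+00B0 DEGREE SIGN has no decomposition mapping, so it passes
    // through unchanged instead of becoming "deg" as with the old
    // Latin normalizer.
    assert_eq!(normalize("25°"), "25°");
}
```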
### Benchmarks

The new normalizer noticeably regresses the performance of some of the normalizer benchmarks (and very slightly improves some of the segmenter benchmarks). Below is the list of benchmarks that changed in a statistically significant way; benchmarks absent from the list are within noise. A sketch for reproducing one of these measurements follows the results.
```
tokenize/132/Cj/Cmn     time:   [12.153 µs 12.188 µs 12.222 µs]
                        change: [+13.230% +13.555% +13.910%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

tokenize/132/Cj/Jpn     time:   [16.485 µs 16.511 µs 16.543 µs]
                        change: [+9.1975% +9.5396% +9.9001%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

tokenize/132/Latin/Eng  time:   [9.8592 µs 9.8760 µs 9.8930 µs]
                        change: [+9.6027% +9.9246% +10.267%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

tokenize/132/Latin/Fra  time:   [10.462 µs 10.467 µs 10.473 µs]
                        change: [+19.536% +19.826% +20.118%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe

tokenize/132/Hebrew/Heb time:   [7.3985 µs 7.4261 µs 7.4545 µs]
                        change: [+27.823% +28.265% +28.720%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  10 (10.00%) high mild
  8 (8.00%) high severe

tokenize/132/Thai/Tha   time:   [6.3937 µs 6.4069 µs 6.4217 µs]
                        change: [+21.622% +21.950% +22.273%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

tokenize/132/Hangul/Kor time:   [38.714 µs 38.738 µs 38.766 µs]
                        change: [+15.225% +16.021% +16.531%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

tokenize/363/Cj/Cmn     time:   [35.227 µs 35.354 µs 35.502 µs]
                        change: [+11.361% +11.692% +12.050%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  8 (8.00%) high mild
  7 (7.00%) high severe

tokenize/364/Cj/Jpn     time:   [46.831 µs 46.884 µs 46.947 µs]
                        change: [+8.8447% +9.2262% +9.5988%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

tokenize/363/Latin/Eng  time:   [24.713 µs 24.736 µs 24.762 µs]
                        change: [+12.024% +12.303% +12.581%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

tokenize/363/Latin/Fra  time:   [27.521 µs 27.572 µs 27.633 µs]
                        change: [+22.345% +22.849% +23.292%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

tokenize/365/Hebrew/Heb time:   [19.976 µs 19.986 µs 19.997 µs]
                        change: [+27.461% +27.901% +28.305%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

tokenize/366/Thai/Tha   time:   [15.799 µs 15.809 µs 15.821 µs]
                        change: [+23.711% +24.888% +25.793%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

tokenize/364/Hangul/Kor time:   [94.655 µs 94.753 µs 94.924 µs]
                        change: [+18.140% +18.457% +18.810%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
```
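
A benchmark entry such as `tokenize/132/Latin/Eng` can be approximated with a Criterion harness along these lines; this is an illustrative sketch using charabia's public `Tokenize` trait, not the repository's actual bench file, and the benchmark name, sample text, and the reading of the numeric component as input length are assumptions:

```rust
use charabia::Tokenize;
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_tokenize(c: &mut Criterion) {
    // Illustrative 44-byte English sample; the benches above presumably use
    // fixed-size corpora (132 and ~363 bytes) per script/language.
    let text = "The quick brown fox jumps over the lazy dog.";
    c.bench_function("tokenize/44/Latin/Eng", |b| {
        // Tokenization is lazy, so consume the iterator to measure the
        // full segmenter + normalizer pipeline.
        b.iter(|| text.tokenize().count())
    });
}

criterion_group!(benches, bench_tokenize);
criterion_main!(benches);
```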