# Add a Compatibility Decomposition Normalizer, remove Latin normalizer (#166)
## Pull Request

### Related issue

Fixes #139

### What does this PR do?
- Adds a `CompatibilityDecompositionNormalizer`, as outlined in #139 (Implement a Compatibility Decomposition Normalizer).
- Because the `NonSpacingMark` normalizer removes diacritics once text is in normal form, Hebrew, Thai, and Arabic diacritics are now stripped; some tests that weren't in normal form were updated accordingly.
- Removes the Latin normalizer, which is now redundant with `CompatibilityDecompositionNormalizer` and `NonSpacingMark` (the latter modified to also act on the `Latin` script).
- The `°` symbol is no longer normalized to `deg`.
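
For illustration, here is a minimal sketch of the combined behavior, assuming the `unicode-normalization` crate; it is not this repository's implementation (the real normalizers live in the `normalizer` module and apply per-script rules), and `is_combining_mark` matches all of General_Category=Mark, slightly broader than the Nonspacing_Mark (Mn) category the `NonSpacingMark` normalizer targets:

```rust
// Cargo.toml (assumption): unicode-normalization = "0.1"
use unicode_normalization::char::is_combining_mark;
use unicode_normalization::UnicodeNormalization;

/// NFKD-decompose `text`, then drop combining marks, approximating the
/// CompatibilityDecompositionNormalizer + NonSpacingMark pipeline.
fn normalize(text: &str) -> String {
    text.nfkd().filter(|c| !is_combining_mark(*c)).collect()
}

fn main() {
    // "é" decomposes to "e" + U+0301, and the mark is then stripped.
    assert_eq!(normalize("élégant"), "elegant");
    // Compatibility decomposition also unfolds ligatures like U+FB01.
    assert_eq!(normalize("ﬁle"), "file");
    // U+00B0 DEGREE SIGN has no decomposition mapping, so it passes
    // through unchanged instead of becoming "deg" as with the old
    // Latin normalizer.
    assert_eq!(normalize("25°"), "25°");
}
```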
### Benchmarks

The new normalizer noticeably regresses the performance of some of the normalizer benchmarks (and very slightly improves some of the segmenter benchmarks). Below is the list of benchmarks that changed in a statistically significant way; benchmarks absent from the list are within noise. A sketch for reproducing one of these measurements follows the results.
```
tokenize/132/Cj/Cmn     time:   [12.153 µs 12.188 µs 12.222 µs]
                        change: [+13.230% +13.555% +13.910%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

tokenize/132/Cj/Jpn     time:   [16.485 µs 16.511 µs 16.543 µs]
                        change: [+9.1975% +9.5396% +9.9001%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

tokenize/132/Latin/Eng  time:   [9.8592 µs 9.8760 µs 9.8930 µs]
                        change: [+9.6027% +9.9246% +10.267%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

tokenize/132/Latin/Fra  time:   [10.462 µs 10.467 µs 10.473 µs]
                        change: [+19.536% +19.826% +20.118%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe

tokenize/132/Hebrew/Heb time:   [7.3985 µs 7.4261 µs 7.4545 µs]
                        change: [+27.823% +28.265% +28.720%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  10 (10.00%) high mild
  8 (8.00%) high severe

tokenize/132/Thai/Tha   time:   [6.3937 µs 6.4069 µs 6.4217 µs]
                        change: [+21.622% +21.950% +22.273%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

tokenize/132/Hangul/Kor time:   [38.714 µs 38.738 µs 38.766 µs]
                        change: [+15.225% +16.021% +16.531%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

tokenize/363/Cj/Cmn     time:   [35.227 µs 35.354 µs 35.502 µs]
                        change: [+11.361% +11.692% +12.050%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  8 (8.00%) high mild
  7 (7.00%) high severe

tokenize/364/Cj/Jpn     time:   [46.831 µs 46.884 µs 46.947 µs]
                        change: [+8.8447% +9.2262% +9.5988%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

tokenize/363/Latin/Eng  time:   [24.713 µs 24.736 µs 24.762 µs]
                        change: [+12.024% +12.303% +12.581%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

tokenize/363/Latin/Fra  time:   [27.521 µs 27.572 µs 27.633 µs]
                        change: [+22.345% +22.849% +23.292%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

tokenize/365/Hebrew/Heb time:   [19.976 µs 19.986 µs 19.997 µs]
                        change: [+27.461% +27.901% +28.305%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

tokenize/366/Thai/Tha   time:   [15.799 µs 15.809 µs 15.821 µs]
                        change: [+23.711% +24.888% +25.793%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

tokenize/364/Hangul/Kor time:   [94.655 µs 94.753 µs 94.924 µs]
                        change: [+18.140% +18.457% +18.810%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
```
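
A benchmark entry such as `tokenize/132/Latin/Eng` can be approximated with a Criterion harness along these lines; this is an illustrative sketch using charabia's public `Tokenize` trait, not the repository's actual bench file, and the benchmark name, sample text, and the reading of the numeric component as input length are assumptions:

```rust
use charabia::Tokenize;
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_tokenize(c: &mut Criterion) {
    // Illustrative 44-byte English sample; the benches above presumably use
    // fixed-size corpora (132 and ~363 bytes) per script/language.
    let text = "The quick brown fox jumps over the lazy dog.";
    c.bench_function("tokenize/44/Latin/Eng", |b| {
        // Tokenization is lazy, so consume the iterator to measure the
        // full segmenter + normalizer pipeline.
        b.iter(|| text.tokenize().count())
    });
}

criterion_group!(benches, bench_tokenize);
criterion_main!(benches);
```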