
Add a Compatibility Decomposition Normalizer, remove Latin normalizer #166

Merged
4 commits merged into main from compatibility_decomposition_normalizer on Nov 28, 2022

Conversation

@dureuill (Contributor) commented Nov 28, 2022

# Pull Request

## Related issue

Fixes #139

## What does this PR do?

  • Add a new CompatibilityDecompositionNormalizer, as outlined in Implement a Compatibility Decomposition Normalizer #139 (a minimal illustrative sketch of what such a pass does is shown right after this list)
  • Fix the tests that fail as a result: once tokens are in normal form, the NonSpacingMark normalizer removes diacritics for Hebrew, Thai, and Arabic, so the tests that weren't using text in normal form were updated
  • Remove the Latin Normalizer, whose functionality is subsumed by CompatibilityDecompositionNormalizer and NonSpacingMark (NonSpacingMark was modified to also act on the Latin script)
    • As a consequence, the ° symbol is no longer normalized to deg
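
For readers unfamiliar with compatibility decomposition, here is a minimal sketch of an NFKD pass that is only applied when the text is not already in normal form. It is illustrative only, not charabia's actual `CompatibilityDecompositionNormalizer`; the function name `nfkd_normalize` is made up, and the example assumes the `unicode-normalization` crate.

```rust
// Illustrative sketch only; not the implementation added by this PR.
use unicode_normalization::{is_nfkd_quick, IsNormalized, UnicodeNormalization};

/// Apply compatibility decomposition (NFKD), but only when the text
/// is not already in normal form (quick check first).
fn nfkd_normalize(text: &str) -> String {
    match is_nfkd_quick(text.chars()) {
        IsNormalized::Yes => text.to_string(),
        _ => text.nfkd().collect(),
    }
}

fn main() {
    // "ﬁ" (U+FB01, ligature) decomposes to "fi"; "é" (U+00E9) to "e" + U+0301.
    let word = "\u{FB01}anc\u{E9}"; // "ﬁancé"
    assert_eq!(nfkd_normalize(word), "fiance\u{0301}");
    println!("{}", nfkd_normalize(word));
}
```
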

## Benchmarks

The addition of the new normalizer noticeably regresses the performance of some of the normalizer benchmarks (and very slightly improves some of the segmenter benchmarks). Below is the list of benchmarks that changed in a statistically significant way (benchmarks absent from the list are within noise):

```
segment/132/Cj/Cmn time: [5.8952 µs 5.9062 µs 5.9162 µs]
change: [+1.1073% +1.4403% +1.7747%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
segment/132/Hangul/Kor time: [30.574 µs 30.595 µs 30.629 µs]
change: [-1.8877% -1.5395% -1.1909%] (p = 0.00 < 0.05)
Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
5 (5.00%) high mild
10 (10.00%) high severe
segment/363/Cj/Cmn time: [18.771 µs 18.782 µs 18.794 µs]
change: [-2.0023% -1.6410% -1.2950%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
2 (2.00%) high mild
4 (4.00%) high severe
segment/364/Cj/Jpn time: [37.490 µs 37.523 µs 37.564 µs]
change: [-2.0402% -1.6633% -1.2663%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) high mild
6 (6.00%) high severe
segment/363/Latin/Eng time: [12.780 µs 12.784 µs 12.788 µs]
change: [-1.9263% -1.6888% -1.4364%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) high mild
4 (4.00%) high severe
segment/364/Hangul/Kor time: [72.141 µs 72.176 µs 72.217 µs]
change: [-1.9258% -1.5513% -1.1986%] (p = 0.00 < 0.05)
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
9 (9.00%) high mild
2 (2.00%) high severe

tokenize/132/Cj/Cmn time: [12.153 µs 12.188 µs 12.222 µs]
change: [+13.230% +13.555% +13.910%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
7 (7.00%) high mild
2 (2.00%) high severe
tokenize/132/Cj/Jpn time: [16.485 µs 16.511 µs 16.543 µs]
change: [+9.1975% +9.5396% +9.9001%] (p = 0.00 < 0.05)
Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
3 (3.00%) high mild
13 (13.00%) high severe
tokenize/132/Latin/Eng time: [9.8592 µs 9.8760 µs 9.8930 µs]
change: [+9.6027% +9.9246% +10.267%] (p = 0.00 < 0.05)
Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
tokenize/132/Latin/Fra time: [10.462 µs 10.467 µs 10.473 µs]
change: [+19.536% +19.826% +20.118%] (p = 0.00 < 0.05)
Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
7 (7.00%) high mild
6 (6.00%) high severe
tokenize/132/Hebrew/Heb time: [7.3985 µs 7.4261 µs 7.4545 µs]
change: [+27.823% +28.265% +28.720%] (p = 0.00 < 0.05)
Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
10 (10.00%) high mild
8 (8.00%) high severe
tokenize/132/Thai/Tha time: [6.3937 µs 6.4069 µs 6.4217 µs]
change: [+21.622% +21.950% +22.273%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
tokenize/132/Hangul/Kor time: [38.714 µs 38.738 µs 38.766 µs]
change: [+15.225% +16.021% +16.531%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) high mild
6 (6.00%) high severe
tokenize/363/Cj/Cmn time: [35.227 µs 35.354 µs 35.502 µs]
change: [+11.361% +11.692% +12.050%] (p = 0.00 < 0.05)
Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
8 (8.00%) high mild
7 (7.00%) high severe
tokenize/364/Cj/Jpn time: [46.831 µs 46.884 µs 46.947 µs]
change: [+8.8447% +9.2262% +9.5988%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) high mild
7 (7.00%) high severe
tokenize/363/Latin/Eng time: [24.713 µs 24.736 µs 24.762 µs]
change: [+12.024% +12.303% +12.581%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
5 (5.00%) high mild
2 (2.00%) high severe
tokenize/363/Latin/Fra time: [27.521 µs 27.572 µs 27.633 µs]
change: [+22.345% +22.849% +23.292%] (p = 0.00 < 0.05)
Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
tokenize/365/Hebrew/Heb time: [19.976 µs 19.986 µs 19.997 µs]
change: [+27.461% +27.901% +28.305%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
5 (5.00%) high mild
6 (6.00%) high severe
tokenize/366/Thai/Tha time: [15.799 µs 15.809 µs 15.821 µs]
change: [+23.711% +24.888% +25.793%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
1 (1.00%) low mild
3 (3.00%) high mild
5 (5.00%) high severe
tokenize/364/Hangul/Kor time: [94.655 µs 94.753 µs 94.924 µs]
change: [+18.140% +18.457% +18.810%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
3 (3.00%) high mild
7 (7.00%) high severe

```

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

Commits:
  • Needed for use in the CompatibilityDecompositionNormalizer
  • Only enabled when the token is not already in normal form
  • Failing tests were due to accents removed by the nonspacing mark normalizer
@dureuill force-pushed the compatibility_decomposition_normalizer branch from 07a7e32 to 9ef573e on November 28, 2022 09:11
@dureuill requested a review from ManyTheFish on November 28, 2022 09:14
@dureuill changed the title from "Add a Compatibility Decomposition Normalizer" to "Add a Compatibility Decomposition Normalizer, remove Latin normalizer" on Nov 28, 2022
@dureuill (Contributor, Author):

  • Remove the Latin Normalizer, whose functionality is subsumed by CompatibilityDecompositionNormalizer and NonSpacingMark (NonSpacingMark was modified to also act on the Latin script); a sketch of how the two combine on Latin text follows below
    • As a consequence, the ° symbol is no longer normalized to deg
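
As an illustration of the comment above (again a hypothetical sketch, not charabia's code), compatibility decomposition followed by dropping combining marks covers the diacritic-removal part of the old Latin normalizer, while symbol rewrites such as ° → deg are not reproduced. The helper name `strip_marks_nfkd` is invented, and the example assumes the `unicode-normalization` crate.

```rust
// Illustrative sketch only; not the normalizers shipped in this PR.
use unicode_normalization::char::is_combining_mark;
use unicode_normalization::UnicodeNormalization;

/// Hypothetical helper: NFKD, then drop combining (nonspacing) marks.
fn strip_marks_nfkd(text: &str) -> String {
    text.nfkd()
        .filter(|c| !is_combining_mark(*c))
        .collect()
}

fn main() {
    // Latin diacritics disappear once decomposed and filtered...
    assert_eq!(strip_marks_nfkd("d\u{E9}j\u{E0} vu"), "deja vu");
    // ...but symbol mappings of the old Latin normalizer are not reproduced:
    // "°" (U+00B0) has no decomposition and stays "°" instead of becoming "deg".
    assert_eq!(strip_marks_nfkd("\u{B0}"), "\u{B0}");
    println!("{}", strip_marks_nfkd("d\u{E9}j\u{E0} vu"));
}
```
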

@dureuill (Contributor, Author) commented Nov 28, 2022

Updated the benchmarks after removing the Latin normalizer (see the PR description for the up-to-date results). Not much change overall, but there are slight improvements in the segmenter benchmarks (not sure whether that is related).

@ManyTheFish (Member) left a comment

Thank you @dureuill!

Bors merge

bors bot commented Nov 28, 2022

Build succeeded:

bors bot merged commit 2d3d0f4 into main on Nov 28, 2022
bors bot deleted the compatibility_decomposition_normalizer branch on November 28, 2022 15:24