Implement a Compatibility Decomposition Normalizer #139
Comments
@ManyTheFish This character normalization seems to be performed after tokenization, but in some cases it is better to perform character normalization before tokenization in Japanese. Half-width characters are an example where there is no problem even after tokenization, but the following case can be problematic: since full-width numbers are registered in the morphological dictionary (IPADIC), each number becomes a single token, so a full-width number ends up tokenized differently from its half-width equivalent. For this reason, it is common for search engines that handle Japanese to perform character normalization before tokenization. Is there a way for Meilisearch to perform character normalization before tokenization?
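To illustrate the ordering concern, here is a minimal sketch, assuming the unicode-normalization crate, of what normalizing the raw text before tokenization does to full-width digits and half-width katakana. The `tokenize` call at the end is only a placeholder for whatever tokenizer runs downstream, not a real Charabia function.

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // Full-width digits (U+FF10..=U+FF19) decompose to ASCII digits under NFKD.
    let full_width = "２０２２年";
    let normalized: String = full_width.nfkd().collect();
    assert_eq!(normalized, "2022年");

    // Half-width katakana plus a half-width voiced sound mark decompose to
    // full-width katakana plus a combining dakuten (U+3099).
    let half_width = "ｶﾞｷﾞｸﾞｹﾞｺﾞ";
    let normalized: String = half_width.nfkd().collect();
    assert_eq!(
        normalized,
        "カ\u{3099}キ\u{3099}ク\u{3099}ケ\u{3099}コ\u{3099}"
    );

    // Hypothetical pre-tokenization step: normalize first, then tokenize.
    // let tokens = tokenize(&normalized);
}
```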
Hello @mosuka, your comment will be really useful for continuing the enhancement of language support, so could you please copy-paste it into the dedicated Japanese discussion so we can keep it in mind for future improvements? 😄 Thank you again! 👍
@ManyTheFish
Update nfkd() composition to use CharNormalizer
I tried to follow the same new standard that I saw was changed recently. It is still a WIP, and it also introduced some breaks in our tests. I'm just posting because maybe you can help me with a doubt about the LatinNormalizer. Fixes meilisearch#139.
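For context, here is a rough, self-contained sketch of what a per-character compatibility-decomposition normalizer could look like. The `CharNormalizer` trait below is a hypothetical stand-in modeled on the names mentioned in this thread; Charabia's actual trait and signatures may differ.

```rust
use unicode_normalization::UnicodeNormalization;

// Hypothetical stand-in for a char-level normalizer trait; Charabia's real
// `CharNormalizer` trait may have a different shape.
trait CharNormalizer {
    /// Normalizes a single character, possibly expanding it into several.
    fn normalize_char(&self, c: char) -> Vec<char>;
}

struct CompatibilityDecompositionNormalizer;

impl CharNormalizer for CompatibilityDecompositionNormalizer {
    fn normalize_char(&self, c: char) -> Vec<char> {
        // nfkd() from the unicode-normalization crate performs compatibility
        // decomposition; one input char may expand to several output chars.
        c.nfkd().collect()
    }
}

fn main() {
    let normalizer = CompatibilityDecompositionNormalizer;
    // FULLWIDTH DIGIT TWO (U+FF12) decomposes to ASCII '2'.
    assert_eq!(normalizer.normalize_char('２'), vec!['2']);
    // Precomposed ガ (U+30AC) decomposes to カ (U+30AB) + combining dakuten (U+3099).
    assert_eq!(normalizer.normalize_char('ガ'), vec!['カ', '\u{3099}']);
}
```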
Hey folks, I started trying to implement this issue back during Hacktoberfest, but I couldn't focus much on it. I never commented here, to avoid "blocking" an issue that someone else could eventually implement faster… but I kept trying to implement it anyway, just to learn more about Rust and the Charabia engine. Now, about the implementation: I started working on it when we still had to implement […]. The first doubt is to understand which […]. The other doubt, maybe more complex, is that the […]. Thanks for your help :-)
Hello @charlesschaefer,
Meilisearch is unable to find Canonical and Compatibility equivalences; for instance, ガギグゲゴ can't be found with a query ガギグゲゴ.

Technical approach
Implement a new Normalizer CompatibilityDecompositionNormalizer using the method nfkd of the unicode-normalization crate.

Files expected to be modified
Misc
related to product#532
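For reference, the equivalence described above can be checked with a short snippet built on the unicode-normalization crate (a sketch for illustration, not part of the issue): after NFKD, the precomposed, decomposed, and half-width spellings of the same katakana compare equal.

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // Three ways to spell katakana "GA":
    let precomposed = "\u{30AC}";        // ガ as a single code point
    let decomposed = "\u{30AB}\u{3099}"; // カ followed by a combining dakuten
    let half_width = "\u{FF76}\u{FF9E}"; // ｶ followed by a half-width dakuten

    let nfkd = |s: &str| s.nfkd().collect::<String>();

    // Under NFKD all three collapse to the same sequence (U+30AB U+3099),
    // so a query in one form can match a document stored in another.
    assert_eq!(nfkd(precomposed), nfkd(decomposed));
    assert_eq!(nfkd(decomposed), nfkd(half_width));
}
```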