Implement a Japanese specialized Normalizer #131
Comments
Update: Copied to the Limitations section of #149
After some experiments and checking the convert options, it seems that wana_kana does not support converting Kanji to Hiragana or Romaji. For example:
- `to_hiragana("ダメ駄目だめ")` will be `"だめ駄目だめ"`
- `to_romaji("ダメ駄目だめ")` will be `"dame駄目dame"`
Is it okay if the normalizer only converts Katakana to Hiragana?
Hey @choznerol, the conversion from Kanji to Hiragana is not straightforward, which is why none of the libraries I found support it.
Hi @ManyTheFish @choznerol, I think that in Japanese, excessive normalization between katakana and hiragana can backfire and create a lot of noise. https://docs.meilisearch.com/learn/configuration/synonyms.html For the normalization of Japanese characters, I think what I wrote in this comment is more important. What do you think?
Thank you @mosuka for your feedback, so let's be cautious. About your other issue, I started a redesign to implement a pre-normalization; however, it's not an easy task, mainly if you want good highlighting in Meilisearch. 😅
@ManyTheFish
Thanks @mosuka @ManyTheFish for the discussion. I am afraid I don't have enough Japanese/tokenization knowledge to offer any input 😅.
I'm not sure exactly how this will be implemented, but please let me know if I should also address the "disable by default" part in #149. And of course, if after reconsideration we think #149 is actually too risky, please don't hesitate to close it or leave it pending.
@choznerol Thanks! 😄
@mosuka @choznerol, I requested this last change on your PR @choznerol, everything else is good and should be merged! Thanks to both of you!
meilisearch#149 (review) meilisearch#131 (comment) Co-authored-by: ManyTheFish <[email protected]>
149: Add Japanese normalizer to cover Katakana to Hiragana r=ManyTheFish a=choznerol

# Pull Request

## Related issue
Fixes #131

## What does this PR do?
- Add a new Normalizer for Japanese, which converts Katakana to Hiragana

<!--
## TODOs before ready for review
- [x] Check the failing test `normalizer::control_char::test::global_normalize`
-->

<div id=limitations ></div>

## [#](#limitations) Limitations

### Converting from Kanji is not supported

From #131:
> ... for instance, `ダメ`, is also spelled `駄目`, or `だめ`

> ... [wana_kana](https://crates.io/crates/wana_kana) seems promising to convert everything in Hiragana

After some experiments and checking [convert options](https://docs.rs/wana_kana/2.1.0/wana_kana/struct.Options.html), it seems like [wana_kana](https://crates.io/crates/wana_kana) does not support converting Kanji to Hiragana or Romaji. For example:

- `to_hiragana("ダメ駄目だめ")` will be `"だめ駄目だめ"`
- `to_romaji("ダメ駄目だめ")` will be `"dame駄目dame"`

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

Co-authored-by: Lawrence Chou <[email protected]>
Co-authored-by: Lawrence Chou <[email protected]>
Today, there is no specialized normalizer for the Japanese language.
Drawback
Meilisearch is unable to find the hiragana version of a word with a katakana query. For instance, `ダメ` is also spelled `駄目` or `だめ`.
Technical approach
Create a new Japanese normalizer that unifies hiragana and katakana equivalences.
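As a rough illustration of what "unifying hiragana and katakana" could mean, here is a minimal sketch that relies only on the Unicode layout: the katakana letters U+30A1 to U+30F6 sit exactly 0x60 code points above their hiragana counterparts U+3041 to U+3096. The function names `katakana_to_hiragana_char` and `normalize_kana` are hypothetical, and this is not the tokenizer's actual `Normalizer` implementation nor the wana_kana API. It is just the underlying mapping, written with the standard library only.

```rust
/// Map one katakana letter to its hiragana counterpart.
/// Hypothetical helper for illustration; not part of Meilisearch or wana_kana.
fn katakana_to_hiragana_char(c: char) -> char {
    match c {
        // Katakana U+30A1..=U+30F6 mirror hiragana U+3041..=U+3096
        // at a fixed offset of 0x60 code points.
        '\u{30A1}'..='\u{30F6}' => char::from_u32(c as u32 - 0x60).unwrap_or(c),
        // Everything else (hiragana, kanji, Latin, the long-vowel mark, ...)
        // is returned unchanged.
        _ => c,
    }
}

/// Convert every katakana letter in `text` to hiragana, leaving other characters as-is.
fn normalize_kana(text: &str) -> String {
    text.chars().map(katakana_to_hiragana_char).collect()
}
```

Under these assumptions, `normalize_kana("ダメ駄目だめ")` yields `"だめ駄目だめ"`: the katakana spelling is folded into hiragana, while the kanji spelling `駄目` passes through untouched because it would need a dictionary-based reading lookup rather than a character mapping.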
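Interesting libraries
- [wana_kana](https://crates.io/crates/wana_kana) seems promising to convert everything in Hiragana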
Files expected to be modified
Misc
Related to product#532