Implement a Japanese specialized Normalizer #131

ManyTheFish · 2022-09-26T16:11:34Z

Today, there is no specialized normalizer for the Japanese language.

Drawback

Meilisearch is unable to find the hiragana version of a word with a katakana query. For instance, ダメ is also spelled 駄目 or だめ.

Technical approach

Create a new Japanese normalizer that unifies hiragana and katakana equivalences.
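
A minimal sketch of what "unifying hiragana and katakana" could look like, using only the Rust standard library. It relies on the Unicode layout, where katakana ァ (U+30A1) through ヶ (U+30F6) sit exactly 0x60 code points above their hiragana counterparts. The function names are hypothetical and this is not the code that ended up in the repository; kanji and any other characters are passed through untouched.

```rust
/// Distance between a katakana code point and its hiragana counterpart.
const KANA_OFFSET: u32 = 0x60;

/// Maps one katakana character to hiragana; any other character is returned as-is.
/// (Hypothetical helper for illustration.)
fn katakana_to_hiragana_char(c: char) -> char {
    match c {
        // ァ (U+30A1) ..= ヶ (U+30F6) have one-to-one hiragana equivalents.
        'ァ'..='ヶ' => char::from_u32(c as u32 - KANA_OFFSET).unwrap_or(c),
        // Kanji, hiragana, Latin letters, the prolonged sound mark ー, etc.
        _ => c,
    }
}

/// Converts every katakana character in `text` to hiragana.
fn katakana_to_hiragana(text: &str) -> String {
    text.chars().map(katakana_to_hiragana_char).collect()
}
```

With this sketch, `katakana_to_hiragana("ダメ")` yields `"だめ"`, so a katakana query and a hiragana document would normalize to the same token, while a kanji spelling such as 駄目 would stay as it is.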

Interesting libraries

wana_kana seems promising for converting everything into Hiragana

Files expected to be modified

create /src/normalizer/japanese.rs
/src/normalizer/mod.rs

Misc

related to product#532

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝


choznerol · 2022-10-08T08:27:42Z

Update: Copied to #149 Limitations

... for instance, ダメ, is also spelled 駄目, or だめ
... wana_kana seems promising to convert everything in Hiragana

After some experiments and checking convert options, it seems like wana_kana does not support converting Kanji to Hiragana or Romaji. For example:

to_hiragana("ダメ駄目だめ") will be "だめ駄目だめ"

to_romaji("ダメ駄目だめ") will be "dame駄目dame"

Is it okay if the normalizer only converts Katakana to Hiragana?


ManyTheFish · 2022-10-10T14:37:32Z

Hey @choznerol, the conversion from Kanji to Hiragana is not straightforward, which is why none of the libraries I found support it.
However, converting katakana to hiragana is a great enhancement!

mosuka · 2022-10-11T07:18:16Z

Hi @ManyTheFish @choznerol ,

I think that in Japanese, excessive normalization of katakana and hiragana can backfire and create a lot of noise.
If you wish to treat these hiragana and katakana tokens identically, it is common practice to register the synonyms required for that business domain.

https://docs.meilisearch.com/learn/configuration/synonyms.html
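
As a rough illustration of the synonym approach: Meilisearch's `synonyms` setting is an object that maps a word to a list of equivalent words, and a mutual association needs one entry per spelling. The sketch below only builds that table in memory; the helper name is hypothetical, and sending the result to the settings endpoint is left out.

```rust
use std::collections::BTreeMap;

/// Builds a mutual synonym table: each spelling maps to all the other spellings.
/// (Hypothetical helper; the output mirrors the shape of a `synonyms` settings object.)
fn mutual_synonyms(spellings: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut table = BTreeMap::new();
    for &word in spellings {
        // Every spelling except the current one, in the order given.
        let others: Vec<String> = spellings
            .iter()
            .copied()
            .filter(|&other| other != word)
            .map(String::from)
            .collect();
        table.insert(word.to_string(), others);
    }
    table
}
```

For example, `mutual_synonyms(&["ダメ", "だめ", "駄目"])` produces three entries, one per spelling, so only the words a given domain actually needs are treated as equivalent.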

For the normalization of Japanese characters, I think what I wrote in this comment is more important.
#139 (comment)

What do you think?

ManyTheFish · 2022-10-11T11:44:20Z

Thank you @mosuka for your feedback. Let's be cautious.
I think we will disable this feature on Meilisearch by default and make a prototype enabling it to gather some feedback. 🤔
On my side, I'll investigate the pros and cons of doing this transliteration further.
But keep in mind that we are in an IR context and not in a translation context: sometimes it is better to lose precision in favor of a higher recall.

About your other issue, I started a redesign to implement a pre-normalization. However, it's not an easy task, especially if you want good highlighting on Meilisearch. 😅

mosuka · 2022-10-11T12:28:12Z

@ManyTheFish
Thank you for your reply.
The above comment is just my personal opinion. And I agree with you.
It would be helpful if users could make a choice. 😄
Thanks!

choznerol · 2022-10-14T07:38:56Z

Thanks @mosuka @ManyTheFish for the discussion. I am afraid I don't have enough Japanese/tokenization knowledge to offer any input 😅.

I think we will disable this feature on Meilisearch by default and make a prototype enabling it to gather some feedback. 🤔

Not sure exactly how this will be implemented, but please let me know if I should also address the "disable by default" part in #149. And of course, if after re-consideration we think #149 is actually too risky, please don't hesitate to close it or leave it pending.

mosuka · 2022-10-14T08:54:07Z

@choznerol
Thank you for your feedback. 😄
It's alright. No worries.
That is my personal opinion and I am sure there are many who would welcome your PR.

Thanks! 😄

ManyTheFish · 2022-10-17T13:55:21Z

@mosuka @choznerol,
we will merge the PR, but we have to add a feature flag allowing us to activate or deactivate this normalizer at compile time. Instead of depending on #[cfg(feature = "japanese")], this normalizer should depend on a new flag, #[cfg(feature = "japanese-transliteration")].
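
A hedged sketch of how such a compile-time switch could be wired, assuming a `japanese-transliteration` feature declared in `Cargo.toml`. The function name and the pass-through fallback are illustrative assumptions, not the code from the PR.

```rust
// Assumed Cargo.toml fragment (illustrative, not copied from the repository):
//
// [features]
// japanese = []
// japanese-transliteration = []   # opt-in, not part of `default`

/// Compiled only when the `japanese-transliteration` feature is enabled:
/// converts katakana (ァ..=ヶ) to the matching hiragana.
#[cfg(feature = "japanese-transliteration")]
fn normalize_japanese(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            // Katakana sits 0x60 code points above hiragana in Unicode.
            'ァ'..='ヶ' => char::from_u32(c as u32 - 0x60).unwrap_or(c),
            _ => c,
        })
        .collect()
}

/// Fallback compiled when the feature is disabled: the text is left unchanged.
#[cfg(not(feature = "japanese-transliteration"))]
fn normalize_japanese(text: &str) -> String {
    text.to_string()
}
```

Because the choice is made with `#[cfg(...)]`, only one of the two functions exists in a given build, so a downstream crate decides at compile time whether katakana is folded into hiragana.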

I requested this last change on your PR @choznerol, everything else is good and should be merged!

Thanks to both of you!


149: Add Japanese normalizer to cover Katakana to Hiragana r=ManyTheFish a=choznerol

# Pull Request

## Related issue
Fixes #131

## What does this PR do?
- Add a new Normalizer for Japanese, which converts Katakana to Hiragana

## Limitations

### Converting from Kanji is not supported

From #131:

> ... for instance, `ダメ`, is also spelled `駄目`, or `だめ`
> ... [wana_kana](https://crates.io/crates/wana_kana) seems promising to convert everything in Hiragana

After some experiments and checking [convert options](https://docs.rs/wana_kana/2.1.0/wana_kana/struct.Options.html), it seems like [wana_kana](https://crates.io/crates/wana_kana) does not support converting Kanji to Hiragana or Romaji. For example:

- `to_hiragana("ダメ駄目だめ")` will be `"だめ駄目だめ"`
- `to_romaji("ダメ駄目だめ")` will be `"dame駄目dame"`

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

Co-authored-by: Lawrence Chou <[email protected]>
Co-authored-by: Lawrence Chou <[email protected]>

ManyTheFish added good first issue Good for newcomers hacktoberfest labels Sep 26, 2022

ManyTheFish changed the title ~~Implement a Japanese Normalizer~~ Implement a Japanese specialized Normalizer Sep 27, 2022

curquiza transferred this issue from meilisearch/engine-team Sep 29, 2022

choznerol added a commit to choznerol/charabia that referenced this issue Oct 8, 2022

Add Japanese normalizer to cover Katakana to Hiragana

3b23493

Close meilisearch#131

choznerol added a commit to choznerol/charabia that referenced this issue Oct 8, 2022

Add Japanese normalizer to convert Katakana to Hiragana

61fabac

Close meilisearch#131

choznerol mentioned this issue Oct 8, 2022

Add Japanese normalizer to cover Katakana to Hiragana #149

Merged

3 tasks

choznerol added a commit to choznerol/charabia that referenced this issue Oct 9, 2022

Add Japanese normalizer to convert Katakana to Hiragana

9eafd31

Close meilisearch#131

choznerol added a commit to choznerol/charabia that referenced this issue Oct 17, 2022

Disable japanese-transliteration by default

dfcaa62

meilisearch#149 (review) meilisearch#131 (comment) Co-authored-by: ManyTheFish <[email protected]>

bors bot closed this as completed in 26fb497 Oct 17, 2022

ManyTheFish mentioned this issue Nov 27, 2024

Update Charabia on Meilisearch v1.12.0 meilisearch/meilisearch#5097

Closed
