Implement a Japanese specialized Normalizer #131

Closed
ManyTheFish opened this issue Sep 26, 2022 · 8 comments · Fixed by #149
Labels
good first issue Good for newcomers

Comments

@ManyTheFish
Member

ManyTheFish commented Sep 26, 2022

Today, there is no specialized normalizer for the Japanese Language.

Drawback

Meilisearch is unable to find the hiragana version of a word with a katakana query. For instance, ダメ is also spelled 駄目 or だめ.

Technical approach

Create a new Japanese normalizer that unifies hiragana and katakana equivalences.

Interesting libraries

  • wana_kana seems promising to convert everything in Hiragana

Files expected to be modified

Misc

related to product#532

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

@ManyTheFish ManyTheFish changed the title Implement a Japanese Normalizer Implement a Japanese specialized Normalizer Sep 27, 2022
@curquiza curquiza transferred this issue from meilisearch/engine-team Sep 29, 2022
@choznerol
Contributor

choznerol commented Oct 8, 2022

Update: Copied to #149 Limitations

... for instance, ダメ, is also spelled 駄目, or だめ
... wana_kana seems promising to convert everything in Hiragana

After some experiments and checking convert options, it seems like wana_kana does not support converting Kanji to Hiragana or Romaji. For example:

  • to_hiragana("ダメ駄目だめ") will be "だめ駄目だめ"
  • to_romaji("ダメ駄目だめ") will be "dame駄目dame"

Is it okay if the normalizer only converts Katakana to Hiragana?
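For reference, the katakana-only conversion asked about above can be sketched with the standard library alone. This is not wana_kana's implementation, and the function names are made up for illustration. It relies on the Unicode layout, where the katakana range U+30A1..=U+30F6 sits exactly 0x60 above the matching hiragana range, and it leaves everything else (kanji included) untouched.

```rust
/// Maps one katakana character to its hiragana counterpart.
/// Anything outside U+30A1..=U+30F6 (kanji, Latin letters, the
/// long-vowel mark, ...) is returned unchanged.
fn katakana_to_hiragana_char(c: char) -> char {
    match c {
        // Hiragana U+3041..=U+3096 is the same range shifted down by 0x60.
        '\u{30A1}'..='\u{30F6}' => char::from_u32(c as u32 - 0x60).unwrap_or(c),
        _ => c,
    }
}

/// Hypothetical helper: converts every katakana character in `text`.
fn katakana_to_hiragana(text: &str) -> String {
    text.chars().map(katakana_to_hiragana_char).collect()
}

fn main() {
    // Kanji passes through, so only the katakana part changes.
    println!("{}", katakana_to_hiragana("ダメ駄目だめ"));
}
```

On the sample string this gives the same result as the `to_hiragana` bullet above: the kanji 駄目 stays as it is.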

choznerol added a commit to choznerol/charabia that referenced this issue Oct 8, 2022
choznerol added a commit to choznerol/charabia that referenced this issue Oct 8, 2022
choznerol added a commit to choznerol/charabia that referenced this issue Oct 9, 2022
@ManyTheFish
Member Author

ManyTheFish commented Oct 10, 2022

Hey @choznerol, the conversion from kanji to Hiragana is not straightforward, which is why none of the libraries I found support it.
However, converting katakana to hiragana is a great enhancement!

@mosuka
Contributor

mosuka commented Oct 11, 2022

Hi @ManyTheFish @choznerol ,

I think that in Japanese, excessive normalization of katakana and hiragana can create a lot of noise in the opposite direction.
If you wish to treat these hiragana and katakana tokens identically, it is common practice to register the required synonyms in that business domain.

https://docs.meilisearch.com/learn/configuration/synonyms.html
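As a rough illustration of the synonym route, here is a sketch that builds a mutual synonym map for one group of spellings, where each spelling lists all the others. The function name is hypothetical, and the sketch only builds the map; sending it to the synonyms setting described at the link above is left out.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: every spelling in the group maps to all the
/// other spellings, so a query for any one of them can reach the rest.
fn mutual_synonyms(spellings: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut map = BTreeMap::new();
    for word in spellings {
        let others: Vec<String> = spellings
            .iter()
            .filter(|other| *other != word)
            .map(|other| other.to_string())
            .collect();
        map.insert(word.to_string(), others);
    }
    map
}

fn main() {
    // The three spellings mentioned in this issue.
    let map = mutual_synonyms(&["ダメ", "だめ", "駄目"]);
    for (word, synonyms) in &map {
        println!("{word} -> {synonyms:?}");
    }
}
```

Unlike a kana normalizer, this also covers the kanji spelling, but only for the words someone has registered.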

For the normalization of Japanese characters, I think what I wrote in this comment is more important:
#139 (comment)

What do you think?

@ManyTheFish
Member Author

ManyTheFish commented Oct 11, 2022

Thank you @mosuka for your feedback. Let's be cautious.
I think we will disable this feature on Meilisearch by default and make a prototype enabling it to gather some feedback. 🤔
On my side, I'll investigate the pros and cons of doing this transliteration further.
But keep in mind that we are in an IR context and not in a translation context: sometimes it is better to lose precision in favor of higher recall.
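To make the recall side of that trade-off concrete, here is a small sketch with a toy exact-match search. Everything in it (`search`, `identity`, `to_hiragana`, the three documents) is made up for illustration and is not how Meilisearch or charabia match queries.

```rust
/// Hypothetical normalizer: katakana U+30A1..=U+30F6 to hiragana,
/// everything else unchanged.
fn to_hiragana(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            '\u{30A1}'..='\u{30F6}' => char::from_u32(c as u32 - 0x60).unwrap_or(c),
            _ => c,
        })
        .collect()
}

/// No normalization at all.
fn identity(text: &str) -> String {
    text.to_string()
}

/// Toy search: a document matches when it equals the query after
/// `normalize` has been applied to both sides.
fn search<'a>(docs: &[&'a str], query: &str, normalize: fn(&str) -> String) -> Vec<&'a str> {
    let q = normalize(query);
    docs.iter().copied().filter(|d| normalize(*d) == q).collect()
}

fn main() {
    let docs = ["だめ", "ダメ", "駄目"];
    // Without normalization only the katakana document matches.
    println!("{:?}", search(&docs, "ダメ", identity));
    // With normalization the hiragana document matches too.
    println!("{:?}", search(&docs, "ダメ", to_hiragana));
}
```

Normalizing both sides raises recall from one hit to two, while the kanji spelling is still missed. Whether the extra hits are wanted or count as noise depends on the corpus, which is the precision concern raised above.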

About your other issue, I started a redesign to implement a pre-normalization. However, it's not an easy task, especially if you want good highlighting in Meilisearch. 😅

@mosuka
Contributor

mosuka commented Oct 11, 2022

@ManyTheFish
Thank you for your reply.
The above comment is just my personal opinion. And I agree with you.
It would be helpful if users could make a choice. 😄
Thanks!

@choznerol
Contributor

Thanks @mosuka and @ManyTheFish for the discussion. I am afraid I don't have enough Japanese/tokenization knowledge to weigh in 😅.

I think we will disable this feature on Meilisearch by default and make a prototype enabling it to gather some feedback. 🤔

Not sure exactly how this will be implemented, but please let me know if I should also address the "disable by default" part in #149. And of course, if after re-consideration we think #149 is actually too risky, please don't hesitate to close it or leave it pending.

@mosuka
Contributor

mosuka commented Oct 14, 2022

@choznerol
Thank you for your feedback. 😄
It's alright. No worries.
That is my personal opinion and I am sure there are many who would welcome your PR.

Thanks! 😄

@ManyTheFish
Member Author

@mosuka @choznerol,
We will merge the PR, but we have to add a feature flag that lets us activate or deactivate this normalizer at compile time. Instead of depending on #[cfg(feature = "japanese")], this normalizer should depend on a new flag, #[cfg(feature = "japanese-transliteration")].
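A minimal sketch of what gating on the new flag could look like. The function `normalize_japanese` is a hypothetical stand-in, not the code from #149, and the feature would also need to be declared under `[features]` in `Cargo.toml`.

```rust
/// Compiled only when the `japanese-transliteration` feature is enabled:
/// converts katakana (U+30A1..=U+30F6) to hiragana.
#[cfg(feature = "japanese-transliteration")]
fn normalize_japanese(text: &str) -> String {
    text.chars()
        .map(|c| match c {
            '\u{30A1}'..='\u{30F6}' => char::from_u32(c as u32 - 0x60).unwrap_or(c),
            _ => c,
        })
        .collect()
}

/// Fallback when the feature is disabled: the text is left as it is.
#[cfg(not(feature = "japanese-transliteration"))]
fn normalize_japanese(text: &str) -> String {
    text.to_string()
}

fn main() {
    // Prints the hiragana form only in builds with the feature enabled.
    println!("{}", normalize_japanese("ダメ"));
}
```

Because the choice is made at compile time, a build without the feature carries no transliteration code at all, which fits the "disabled by default" plan discussed above.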

I requested this last change on your PR @choznerol, everything else is good and should be merged!

Thanks to both of you!

choznerol added a commit to choznerol/charabia that referenced this issue Oct 17, 2022
bors bot added a commit that referenced this issue Oct 17, 2022
149: Add Japanese normalizer to cover Katakana to Hiragana r=ManyTheFish a=choznerol

# Pull Request

## Related issue
Fixes #131

## What does this PR do?
- Add a new Normalizer for Japanese, which converts Katakana to Hiragana

<!--
## TODOs before ready for review
- [x] Check the failing test `normalizer::control_char::test::global_normalize`
-->

<div id=limitations ></div>

## [#](#limitations) Limitations

### Converting from Kanji is not supported

From #131:
> ... for instance, `ダメ`, is also spelled `駄目`, or `だめ`
> ... [wana_kana](https://crates.io/crates/wana_kana) seems promising to convert everything in Hiragana

After some experiments and checking [convert options](https://docs.rs/wana_kana/2.1.0/wana_kana/struct.Options.html), it seems like [wana_kana](https://crates.io/crates/wana_kana) does not support converting Kanji to Hiragana or Romaji. For example:
- `to_hiragana("ダメ駄目だめ")` will be `"だめ駄目だめ"`
- `to_romaji("ダメ駄目だめ")` will be `"dame駄目dame"`


## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Lawrence Chou <[email protected]>
Co-authored-by: Lawrence Chou <[email protected]>
@bors bors bot closed this as completed in 26fb497 Oct 17, 2022
3 participants