
Implement a Compatibility Decomposition Normalizer #139

Closed
ManyTheFish opened this issue Sep 29, 2022 · 5 comments · Fixed by #166
Labels: good first issue (Good for newcomers)
Comments

ManyTheFish (Member) commented Sep 29, 2022

Meilisearch is unable to match Canonical and Compatibility equivalences; for instance, the half-width katakana ｶﾞｷﾞｸﾞｹﾞｺﾞ can't be found with the full-width query ガギグゲゴ.

Technical approach

Implement a new Normalizer, CompatibilityDecompositionNormalizer, using the nfkd method of the unicode-normalization crate.
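
As a minimal sketch of the behavior this relies on (assuming the unicode-normalization crate; the final Normalizer wiring is not shown here):

```rust
// Sketch only: shows the nfkd equivalence the normalizer relies on,
// not the final CompatibilityDecompositionNormalizer implementation.
use unicode_normalization::UnicodeNormalization;

fn main() {
    let halfwidth = "ｶﾞｷﾞｸﾞｹﾞｺﾞ"; // half-width katakana, e.g. in a document
    let fullwidth = "ガギグゲゴ"; // full-width katakana, e.g. in a query

    // The raw strings use different code points...
    assert_ne!(halfwidth, fullwidth);

    // ...but after compatibility decomposition both become the same
    // sequence: base katakana followed by U+3099 COMBINING VOICED MARK.
    let a: String = halfwidth.nfkd().collect();
    let b: String = fullwidth.nfkd().collect();
    assert_eq!(a, b);
}
```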

Files expected to be modified

Misc

related to product#532

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the usual rules, you can find guides there to easily implement a Segmenter or a Normalizer.
Thanks a lot for your contribution! 🤝

mosuka (Contributor) commented Oct 6, 2022

@ManyTheFish
Thank you for creating this issue!

This character normalization seems to be performed after tokenization, but for Japanese it is sometimes better to perform character normalization before tokenization.

For example, this is a case where there is no problem even when normalization is done after tokenization:

$ echo "私はメガネを買いました。" | lindera
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
メガネ	名詞,一般,*,*,*,*,メガネ,メガネ,メガネ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
買い	動詞,自立,*,*,五段・ワ行促音便,連用形,買う,カイ,カイ
まし	助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。	記号,句点,*,*,*,*,。,。,。
EOS
$ echo "私はメガネを買いました。" | lindera
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
ﾒｶﾞﾈ	UNK
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
買い	動詞,自立,*,*,五段・ワ行促音便,連用形,買う,カイ,カイ
まし	助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。	記号,句点,*,*,*,*,。,。,。
EOS

The half-width ﾒｶﾞﾈ is not a problem because it is tokenized as a single unknown word, even though it does not exist in the Japanese morphological dictionary (IPADIC). Of course, if normalization is done in advance, morphological analysis can accurately retrieve the word's part-of-speech information from the dictionary.

But the following cases can be problematic.

$ echo "私は時給1000円です。" | lindera
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
時給	名詞,一般,*,*,*,*,時給,ジキュウ,ジキュー
1000	UNK
円	名詞,接尾,助数詞,*,*,*,円,エン,エン
です	助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。	記号,句点,*,*,*,*,。,。,。
EOS
$ echo "私は時給1000円です。" | lindera
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
時給	名詞,一般,*,*,*,*,時給,ジキュウ,ジキュー
１	名詞,数,*,*,*,*,１,イチ,イチ
０	名詞,数,*,*,*,*,０,ゼロ,ゼロ
０	名詞,数,*,*,*,*,０,ゼロ,ゼロ
０	名詞,数,*,*,*,*,０,ゼロ,ゼロ
円	名詞,接尾,助数詞,*,*,*,円,エン,エン
です	助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。	記号,句点,*,*,*,*,。,。,。
EOS

Since full-width numerals are registered in the morphological dictionary (IPADIC) as individual characters, each digit becomes its own token, so a full-width １０００ cannot be found by searching for 1000.

For this reason, it is common for search engines that handle Japanese to perform character normalization before tokenization. Is there a way for Meilisearch to perform character normalization before tokenization?
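
To illustrate the point with the unicode-normalization crate (a sketch of mine, not something charabia does today): NFKD maps full-width digits to their ASCII equivalents, so normalizing before tokenization would keep １０００ matchable by 1000.

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // Full-width digits compatibility-decompose to ASCII digits,
    // so "１０００" and "1000" normalize to the same token text.
    let normalized: String = "時給１０００円".nfkd().collect();
    assert_eq!(normalized, "時給1000円");
}
```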

ManyTheFish (Member, Author) commented Oct 6, 2022

Hello @mosuka,
Thanks a lot for these details!
Unfortunately, inverting the process would be a lot of work.
For now, we will keep it as is and stick with this issue.
However, I have a similar issue with the Malayalam language, and I'll have to find a way to have both pre- and post-tokenization normalizers. 🤔

Your comment will be really useful for continuing to improve language support, so could you please copy-paste it into the dedicated Japanese discussion to keep it in mind for future improvements? 😄

Thank you again! 👍

mosuka (Contributor) commented Oct 6, 2022

@ManyTheFish
Thanks! I have posted it in meilisearch/product#532. 😃

charlesschaefer added a commit to charlesschaefer/charabia that referenced this issue Nov 13, 2022

I tried to follow the same new standard that I saw was changed recently.
It is still a WIP and it also introduced some breaks in our tests. I'm just posting it because maybe you can help me with a doubt about the LatinNormalizer.
Fixes meilisearch#139.

charlesschaefer added a commit to charlesschaefer/charabia that referenced this issue Nov 13, 2022

Update nfkd() composition to use CharNormalizer
charlesschaefer commented

Hey folks, I started trying to implement this issue back during Hacktoberfest, but I couldn't focus on it much. I never commented here, to avoid "blocking" an issue that someone else might implement faster.

But I kept working on it anyway, just to learn more about Rust and the charabia engine.

Now, about the implementation: I started working on it when we still had to implement normalize(), so after the consolidation the team made a few weeks ago, I had to change some pieces to use the CharNormalizer trait. I think it is ok now, but I'm still seeing some issues in the global normalization pipeline (that's the reason I haven't opened a PR yet - please let me know if you'd prefer that I open it).

The first issue is understanding which Scripts should be normalized with nfkd(). I tried to follow some other implementations, but I'm not 100% sure it is ok - so if someone can give me some guidance, that would be great :-)

The other, maybe more complex, issue is that the LatinNormalizer uses deunicode, which tries to strip accent marks and drops characters it doesn't recognize. So... should we avoid using nfkd() side by side with the LatinNormalizer?
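
A small sketch of that overlap (illustrative only; using the deunicode crate's top-level deunicode function): deunicode transliterates straight to ASCII, while nfkd only splits off the combining marks, which a later step would still have to strip.

```rust
use deunicode::deunicode;
use unicode_normalization::UnicodeNormalization;

fn main() {
    // deunicode goes straight to ASCII, dropping the accent entirely.
    assert_eq!(deunicode("café"), "cafe");

    // nfkd keeps the base letter plus U+0301 COMBINING ACUTE ACCENT,
    // so the accent mark is still present after decomposition.
    let decomposed: String = "café".nfkd().collect();
    assert_eq!(decomposed, "cafe\u{301}");
}
```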

Thanks for your help :-)

ManyTheFish (Member, Author) commented Nov 14, 2022

Hello @charlesschaefer,
first, don't hesitate to create a PR on this repo; that would make it easier for me to guide you through the implementation.
Then, the UnicodeNormalization trait is implemented for any iterator over chars, so Some('a').into_iter().nfkd() should do the job, giving you an iterator over the decomposed chars; then, as in the Latin normalizer, a proper match can efficiently convert your iterator into the expected type.
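
To make that concrete, here is a minimal, self-contained sketch of the suggested approach. CharOrStr and normalize_char are stand-ins for illustration; the actual charabia trait signatures may differ.

```rust
use unicode_normalization::UnicodeNormalization;

// Stand-in for charabia's char-or-string result type.
enum CharOrStr {
    Char(char),
    Str(String),
}

fn normalize_char(c: char) -> CharOrStr {
    // nfkd() is implemented for any iterator over chars, so a single
    // char can be decomposed through a one-element iterator.
    let mut decomposed = Some(c).into_iter().nfkd();
    match (decomposed.next(), decomposed.next()) {
        // The char decomposes to a single char: keep the cheap variant.
        (Some(only), None) => CharOrStr::Char(only),
        // The char expands (e.g. 'é' -> 'e' + U+0301): collect the
        // whole decomposition into a String.
        (Some(first), Some(second)) => {
            let mut s = String::with_capacity(4);
            s.push(first);
            s.push(second);
            s.extend(decomposed);
            CharOrStr::Str(s)
        }
        // nfkd never yields nothing for a char, but be defensive.
        (None, _) => CharOrStr::Char(c),
    }
}

fn main() {
    match normalize_char('é') {
        CharOrStr::Str(s) => assert_eq!(s, "e\u{301}"),
        CharOrStr::Char(_) => unreachable!(),
    }
}
```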
