
Implement a Compatibility Decomposition Normalizer #139

Closed
ManyTheFish opened this issue Sep 29, 2022 · 5 comments · Fixed by #166
Labels: good first issue (Good for newcomers)
Comments

ManyTheFish (Member) commented Sep 29, 2022

Meilisearch is unable to match Canonical and Compatibility equivalences; for instance, the half-width katakana ｶﾞｷﾞｸﾞｹﾞｺﾞ can't be found with the full-width query ガギグゲゴ.

Technical approach

Implement a new Normalizer, CompatibilityDecompositionNormalizer, using the nfkd method of the unicode-normalization crate.
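
As a minimal sketch of the behavior this relies on (assuming the unicode-normalization crate; the final Normalizer wiring is not shown here):

```rust
// Sketch only: shows the nfkd equivalence the normalizer relies on,
// not the final CompatibilityDecompositionNormalizer implementation.
use unicode_normalization::UnicodeNormalization;

fn main() {
    let halfwidth = "ｶﾞｷﾞｸﾞｹﾞｺﾞ"; // half-width katakana, e.g. in a document
    let fullwidth = "ガギグゲゴ"; // full-width katakana, e.g. in a query

    // The raw strings use different code points...
    assert_ne!(halfwidth, fullwidth);

    // ...but after compatibility decomposition both become the same
    // sequence: base katakana followed by U+3099 COMBINING VOICED MARK.
    let a: String = halfwidth.nfkd().collect();
    let b: String = fullwidth.nfkd().collect();
    assert_eq!(a, b);
}
```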

Files expected to be modified

Misc

related to product#532

Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the usual rules, you can find guides there to easily implement a Segmenter or a Normalizer.
Thanks a lot for your contribution! 🤝

mosuka (Contributor) commented Oct 6, 2022

@ManyTheFish
Thank you for creating this issue!

This character normalization seems to be performed after tokenization, but for Japanese it is sometimes better to perform character normalization before tokenization.

For example, this is a case where there is no problem even when normalization is done after tokenization:

$ echo "私はメガネを買いました。" | lindera
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
メガネ	名詞,一般,*,*,*,*,メガネ,メガネ,メガネ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
買い	動詞,自立,*,*,五段・ワ行促音便,連用形,買う,カイ,カイ
まし	助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。	記号,句点,*,*,*,*,。,。,。
EOS
$ echo "私はメガネを買いました。" | lindera
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
ﾒｶﾞﾈ	UNK
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
買い	動詞,自立,*,*,五段・ワ行促音便,連用形,買う,カイ,カイ
まし	助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。	記号,句点,*,*,*,*,。,。,。
EOS

The half-width ﾒｶﾞﾈ is not a problem because it is tokenized as a single unknown word, even though it does not exist in the Japanese morphological dictionary (IPADIC). Of course, if normalization is done in advance, morphological analysis can accurately retrieve the word's part-of-speech information from the dictionary.

But the following cases can be problematic.

$ echo "私は時給1000円です。" | lindera
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
時給	名詞,一般,*,*,*,*,時給,ジキュウ,ジキュー
1000	UNK
円	名詞,接尾,助数詞,*,*,*,円,エン,エン
です	助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。	記号,句点,*,*,*,*,。,。,。
EOS
$ echo "私は時給1000円です。" | lindera
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
時給	名詞,一般,*,*,*,*,時給,ジキュウ,ジキュー
１	名詞,数,*,*,*,*,１,イチ,イチ
０	名詞,数,*,*,*,*,０,ゼロ,ゼロ
０	名詞,数,*,*,*,*,０,ゼロ,ゼロ
０	名詞,数,*,*,*,*,０,ゼロ,ゼロ
円	名詞,接尾,助数詞,*,*,*,円,エン,エン
です	助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
。	記号,句点,*,*,*,*,。,。,。
EOS

Since full-width numerals are registered in the morphological dictionary (IPADIC) as individual characters, each digit becomes its own token, so a full-width １０００ cannot be found by searching for 1000.

For this reason, it is common for search engines that handle Japanese to perform character normalization before tokenization. Is there a way for Meilisearch to perform character normalization before tokenization?
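
To illustrate the point with the unicode-normalization crate (a sketch of mine, not something charabia does today): NFKD maps full-width digits to their ASCII equivalents, so normalizing before tokenization would keep １０００ matchable by 1000.

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    // Full-width digits compatibility-decompose to ASCII digits,
    // so "１０００" and "1000" normalize to the same token text.
    let normalized: String = "時給１０００円".nfkd().collect();
    assert_eq!(normalized, "時給1000円");
}
```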

ManyTheFish (Member, Author) commented Oct 6, 2022

Hello @mosuka,
Thanks a lot for these details!
Unfortunately, inverting the process would be a lot of work.
For now, we will keep it as is and stick with this issue.
However, I have a similar issue with the Malayalam language, and I'll have to find a way to have both pre- and post-tokenization normalizers. 🤔

Your comment will be really useful for continuing to improve language support, so could you please copy-paste it into the dedicated Japanese discussion to keep it in mind for future improvements? 😄

Thank you again! 👍

mosuka (Contributor) commented Oct 6, 2022

@ManyTheFish
Thanks! I have posted it in meilisearch/product#532. 😃

charlesschaefer added a commit to charlesschaefer/charabia that referenced this issue Nov 13, 2022

I tried to follow the same new standard that I saw was changed recently.
It is still a WIP and it also introduced some breaks in our tests. I'm just posting it because maybe you can help me with a doubt about the LatinNormalizer.
Fixes meilisearch#139.

charlesschaefer added a commit to charlesschaefer/charabia that referenced this issue Nov 13, 2022

Update nfkd() composition to use CharNormalizer
charlesschaefer commented

Hey folks, I started trying to implement this issue back during Hacktoberfest, but I couldn't focus on it much. I never commented here, to avoid "blocking" an issue that someone else might implement faster.

But I kept working on it anyway, just to learn more about Rust and the charabia engine.

Now, about the implementation: I started working on it when we still had to implement normalize(), so after the consolidation the team made a few weeks ago, I had to change some pieces to use the CharNormalizer trait. I think it is ok now, but I'm still seeing some issues in the global normalization pipeline (that's the reason I haven't opened a PR yet - please let me know if you'd prefer that I open it).

The first issue is understanding which Scripts should be normalized with nfkd(). I tried to follow some other implementations, but I'm not 100% sure it is ok - so if someone can give me some guidance, that would be great :-)

The other, maybe more complex, issue is that the LatinNormalizer uses deunicode, which tries to strip accent marks and drops characters it doesn't recognize. So... should we avoid using nfkd() side by side with the LatinNormalizer?
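
A small sketch of that overlap (illustrative only; using the deunicode crate's top-level deunicode function): deunicode transliterates straight to ASCII, while nfkd only splits off the combining marks, which a later step would still have to strip.

```rust
use deunicode::deunicode;
use unicode_normalization::UnicodeNormalization;

fn main() {
    // deunicode goes straight to ASCII, dropping the accent entirely.
    assert_eq!(deunicode("café"), "cafe");

    // nfkd keeps the base letter plus U+0301 COMBINING ACUTE ACCENT,
    // so the accent mark is still present after decomposition.
    let decomposed: String = "café".nfkd().collect();
    assert_eq!(decomposed, "cafe\u{301}");
}
```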

Thanks for your help :-)

ManyTheFish (Member, Author) commented Nov 14, 2022

Hello @charlesschaefer,
first, don't hesitate to create a PR on this repo; that would make it easier for me to guide you through the implementation.
Then, the UnicodeNormalization trait is implemented for any iterator over chars, so Some('a').into_iter().nfkd() should do the job, giving you an iterator over the decomposed chars; then, as in the Latin normalizer, a proper match can efficiently convert your iterator into the expected type.
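
To make that concrete, here is a minimal, self-contained sketch of the suggested approach. CharOrStr and normalize_char are stand-ins for illustration; the actual charabia trait signatures may differ.

```rust
use unicode_normalization::UnicodeNormalization;

// Stand-in for charabia's char-or-string result type.
enum CharOrStr {
    Char(char),
    Str(String),
}

fn normalize_char(c: char) -> CharOrStr {
    // nfkd() is implemented for any iterator over chars, so a single
    // char can be decomposed through a one-element iterator.
    let mut decomposed = Some(c).into_iter().nfkd();
    match (decomposed.next(), decomposed.next()) {
        // The char decomposes to a single char: keep the cheap variant.
        (Some(only), None) => CharOrStr::Char(only),
        // The char expands (e.g. 'é' -> 'e' + U+0301): collect the
        // whole decomposition into a String.
        (Some(first), Some(second)) => {
            let mut s = String::with_capacity(4);
            s.push(first);
            s.push(second);
            s.extend(decomposed);
            CharOrStr::Str(s)
        }
        // nfkd never yields nothing for a char, but be defensive.
        (None, _) => CharOrStr::Char(c),
    }
}

fn main() {
    match normalize_char('é') {
        CharOrStr::Str(s) => assert_eq!(s, "e\u{301}"),
        CharOrStr::Char(_) => unreachable!(),
    }
}
```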
