Armenian letters should be lowercased #328

NarHakobyan · 2025-02-07T15:15:44Z

Fixes #325

ManyTheFish · 2025-02-10T12:38:32Z

Your PR fits the need; however, it is not passing CI. I know the failures are not entirely related to your work, but could you fix the errors?

clippy:

error: this `map_or` is redundant
   --> charabia/src/token.rs:116:9
    |
116 |         self.separator_kind().map_or(false, |_| true)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: use is_some_and instead: `self.separator_kind().is_some_and(|_| true)`
    |
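As a side note on this lint: when the closure ignores its argument, `map_or(false, |_| true)` and `is_some_and(|_| true)` both reduce to a plain presence check, so `is_some()` gives the same result. The sketch below uses a hypothetical `SeparatorKind` and `Token` (stand-ins for illustration, not the crate's real definitions) to show the equivalence.

```rust
// Hypothetical stand-ins for illustration; not the real charabia types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SeparatorKind {
    Hard,
    Soft,
}

struct Token {
    separator: Option<SeparatorKind>,
}

impl Token {
    fn separator_kind(&self) -> Option<SeparatorKind> {
        self.separator
    }

    // The value inside the Option is never inspected, so a presence check is enough.
    fn is_separator(&self) -> bool {
        self.separator_kind().is_some()
    }
}

fn main() {
    let hard = Token { separator: Some(SeparatorKind::Hard) };
    let soft = Token { separator: Some(SeparatorKind::Soft) };
    let word = Token { separator: None };
    println!("{} {} {}", hard.is_separator(), soft.is_separator(), word.is_separator());
}
```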

Rust FMT:

Diff in /home/runner/work/charabia/charabia/charabia/src/normalizer/lowercase.rs:27:

    fn should_normalize(&self, token: &Token) -> bool {
        // https://en.wikipedia.org/wiki/Letter_case#Capitalisation
-        matches!(token.script, Script::Latin | Script::Cyrillic | Script::Greek | Script::Georgian | Script::Armenian)
-            && token.lemma.chars().any(char::is_uppercase)
+        matches!(
+            token.script,
+            Script::Latin | Script::Cyrillic | Script::Greek | Script::Georgian | Script::Armenian
+        ) && token.lemma.chars().any(char::is_uppercase)
    }
}
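
For readers who want to try the condition in isolation, here is a minimal sketch of the same check outside the crate. The `Script` enum and the `should_lowercase` function are simplified, hypothetical stand-ins; only the `matches!` expression and the `char::is_uppercase` test mirror the diff above.

```rust
// Simplified stand-in for the crate's Script enum (an assumption, not the real type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Script {
    Latin,
    Cyrillic,
    Greek,
    Georgian,
    Armenian,
    Other,
}

// Same shape as the condition in the diff: a cased script and at least one uppercase char.
fn should_lowercase(script: Script, lemma: &str) -> bool {
    matches!(
        script,
        Script::Latin | Script::Cyrillic | Script::Greek | Script::Georgian | Script::Armenian
    ) && lemma.chars().any(char::is_uppercase)
}

fn main() {
    // Starts with an Armenian capital letter, so it needs normalizing.
    println!("{}", should_lowercase(Script::Armenian, "Չին"));
    // Already lowercase, so there is nothing to do.
    println!("{}", should_lowercase(Script::Armenian, "ֆիզիկոսը"));
    // Uppercase, but the script is not in the list.
    println!("{}", should_lowercase(Script::Other, "ABC"));
}
```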

thank you!

NarHakobyan · 2025-02-10T15:04:59Z

Hi @ManyTheFish, done! I don't know why, but RustRover didn't show any errors on these lines.

ManyTheFish · 2025-02-11T09:20:32Z

Hey @NarHakobyan,
This looks good to me, but don't you want to add a token with Armenian letters to the test below to ensure everything works as expected?

charabia/charabia/src/normalizer/lowercase.rs

Lines 45 to 98 in d929c01

    
    fn tokens() -> Vec<Token<'static>> {
        vec![Token {
            lemma: Owned("PascalCase".to_string()),
            char_end: 10,
            byte_end: 10,
            script: Script::Latin,
            ..Default::default()
        }]
    }

    fn normalizer_result() -> Vec<Token<'static>> {
        vec![Token {
            lemma: Owned("pascalcase".to_string()),
            char_end: 10,
            byte_end: 10,
            script: Script::Latin,
            char_map: Some(vec![
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
            ]),
            ..Default::default()
        }]
    }

    fn normalized_tokens() -> Vec<Token<'static>> {
        vec![Token {
            lemma: Owned("pascalcase".to_string()),
            char_end: 10,
            byte_end: 10,
            script: Script::Latin,
            kind: TokenKind::Word,
            char_map: Some(vec![
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
                (1, 1),
            ]),
            ..Default::default()
        }]
    }

You just have to add a source token in the tokens() list, then fill the normalizer_result() and the normalized_tokens() with the expected output.
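
To fill in those fields for a non-ASCII lemma, a small helper can compute the numbers instead of counting by hand. This is only a sketch under two assumptions: that each test token starts at offset 0 (as the quoted Latin token does), and that each `char_map` pair is (original byte length, lowercased byte length), which is how the `(1, 1)` entries for the ASCII lemma read. The `expected_fields` helper is hypothetical and not part of the crate.

```rust
/// Computes char_end, byte_end and a char_map-style vector for a test lemma.
/// Assumption: each pair is (original byte length, lowercased byte length).
fn expected_fields(lemma: &str) -> (usize, usize, Vec<(u8, u8)>) {
    let char_map = lemma
        .chars()
        .map(|c| {
            // A char may lowercase to more than one char, so sum the lengths.
            let lowered_len: usize = c.to_lowercase().map(char::len_utf8).sum();
            (c.len_utf8() as u8, lowered_len as u8)
        })
        .collect();
    (lemma.chars().count(), lemma.len(), char_map)
}

fn main() {
    for lemma in ["PascalCase", "Ֆիզիկոսը"] {
        let (char_end, byte_end, char_map) = expected_fields(lemma);
        println!("{lemma}: char_end={char_end}, byte_end={byte_end}, char_map={char_map:?}");
    }
}
```

Armenian letters take two bytes each in UTF-8, so an eight-letter Armenian lemma gives `char_end` 8 and `byte_end` 16, with `(2, 2)` pairs where the ASCII lemma has `(1, 1)`.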

NarHakobyan · 2025-02-11T09:22:41Z

@ManyTheFish to be honest, I don't know how to do that :D

Here is an example text that can be used: Չին ֆիզիկոսը օճառաջուր ցողելով բժշկում է հայ գնդապետի փքված ձախ թևը։

ManyTheFish · 2025-02-12T09:24:04Z

@NarHakobyan

Add a token containing Armenian capital letters in the tokens() function:

fn tokens() -> Vec<Token<'static>> { 
     vec![Token { 
         lemma: Owned("PascalCase".to_string()), 
         char_end: 10, 
         byte_end: 10, 
         script: Script::Latin, 
         ..Default::default() 
-     }] 
+     },
+     Token { 
+         lemma: Owned("Ֆիզիկոսը".to_string()), 
+         char_end: 8, 
+         byte_end: 16, 
+         script: Script::Armenian, 
+         ..Default::default() 
+     }]
 }

Then run the tests:
cargo test lowercase

And fix the outputs in the normalized_tokens() and normalizer_result() functions 😄
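
To predict the lemma the updated test should expect, the standard library's `str::to_lowercase` is a reasonable approximation. This is a sketch only: the crate's normalizer may not be implemented this way, and `lowercase_lemma` is a hypothetical helper. It also shows that a lemma with no capital letters comes back unchanged, so the source token needs at least one capital for the test to exercise the new behavior.

```rust
// Hypothetical helper; the crate's own normalizer may differ in its details.
fn lowercase_lemma(lemma: &str) -> String {
    lemma.to_lowercase()
}

fn main() {
    // One capitalized word and one already-lowercase word from the sample sentence.
    for lemma in ["Չին", "ֆիզիկոսը"] {
        let lowered = lowercase_lemma(lemma);
        println!("{lemma} -> {lowered} (changed: {})", lowered != lemma);
    }
}
```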

NarHakobyan · 2025-02-12T15:54:26Z

@ManyTheFish could you please run the tests?

Armenian letters should be lowercased

88aadbd

Fixes meilisearch#325

NarHakobyan added 2 commits February 10, 2025 18:53

add .DS_Store gitignore

99f1841

fix linter errors

d929c01

add test for Armenian

98c1db8
