Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Armenian letters should be lowercased #328

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

NarHakobyan
Copy link

Fixes #325

@ManyTheFish
Copy link
Member

hello @NarHakobyan,

your PR fits the need, however the PR is not passing the CI, I know it's not completely related to your work but could you fix the errors?

clippy:

error: this `map_or` is redundant
   --> charabia/src/token.rs:116:9
    |
116 |         self.separator_kind().map_or(false, |_| true)
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: use is_some_and instead: `self.separator_kind().is_some_and(|_| true)`
    |

Rust FMT:

Diff in /home/runner/work/charabia/charabia/charabia/src/normalizer/lowercase.rs:27:

    fn should_normalize(&self, token: &Token) -> bool {
        // https://en.wikipedia.org/wiki/Letter_case#Capitalisation
-        matches!(token.script, Script::Latin | Script::Cyrillic | Script::Greek | Script::Georgian | Script::Armenian)
-            && token.lemma.chars().any(char::is_uppercase)
+        matches!(
+            token.script,
+            Script::Latin | Script::Cyrillic | Script::Greek | Script::Georgian | Script::Armenian
+        ) && token.lemma.chars().any(char::is_uppercase)
    }
}

thank you!

@NarHakobyan
Copy link
Author

NarHakobyan commented Feb 10, 2025

Hi @ManyTheFish, Done! I Do not know why but RustRover didn't show any error on these lines.

@ManyTheFish
Copy link
Member

Hey @NarHakobyan,
This looks good to me, but, don't you want to add a token with Armenian letters in the below test to ensure everything works as expected?

fn tokens() -> Vec<Token<'static>> {
vec![Token {
lemma: Owned("PascalCase".to_string()),
char_end: 10,
byte_end: 10,
script: Script::Latin,
..Default::default()
}]
}
fn normalizer_result() -> Vec<Token<'static>> {
vec![Token {
lemma: Owned("pascalcase".to_string()),
char_end: 10,
byte_end: 10,
script: Script::Latin,
char_map: Some(vec![
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
]),
..Default::default()
}]
}
fn normalized_tokens() -> Vec<Token<'static>> {
vec![Token {
lemma: Owned("pascalcase".to_string()),
char_end: 10,
byte_end: 10,
script: Script::Latin,
kind: TokenKind::Word,
char_map: Some(vec![
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
(1, 1),
]),
..Default::default()
}]
}

You just have to add a source token in the tokens() list, then fill the normalizer_result() and the normalized_tokens() with the expected output.

@NarHakobyan
Copy link
Author

NarHakobyan commented Feb 11, 2025

@ManyTheFish to be honest, I don't know how to do that :D

here is an example text to which can be used: Չին ֆիզիկոսը օճառաջուր ցողելով բժշկում է հայ գնդապետի փքված ձախ թևը։

@ManyTheFish
Copy link
Member

@NarHakobyan

Add a token containing Armenian capital letters in the token() function:

fn tokens() -> Vec<Token<'static>> { 
     vec![Token { 
         lemma: Owned("PascalCase".to_string()), 
         char_end: 10, 
         byte_end: 10, 
         script: Script::Latin, 
         ..Default::default() 
-     }] 
+     },
+     Token { 
+         lemma: Owned("ֆիզիկոսը".to_string()), 
+         char_end: 8, 
+         byte_end: 16, 
+         script: Script::Armenian, 
+         ..Default::default() 
+     }]
 } 

Then run the tests:
cargo test lowercase

And fix the outputs in the normalized_tokens() and normalizer_result() functions 😄

@NarHakobyan
Copy link
Author

@ManyTheFish could you please run a tests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Armenian letters should be lowercased
2 participants