Different suggested fix wrt capitalization #2470

Open
dforsi opened this issue Aug 20, 2022 · 6 comments

Comments

@dforsi
Contributor

dforsi commented Aug 20, 2022

I'm getting two different suggested fixes for what seems to be the same misspelling with codespell 2.1.0:

$ codespell lesstif.txt
lesstif.txt:1: Lesstiff ==> Lesstif
lesstif.txt:2: LessTiff ==> lesstif

See attachment lesstif.txt

BTW, according to Wikipedia, the correct capitalization is LessTif.

@peternewman added the enhancement and dictionary (Changes to the dictionary) labels Aug 20, 2022
@peternewman
Collaborator

So there are two things driving this. Firstly, do you want to fix the dictionary entry so that the correction is in the correct case:

lesstiff->lesstif

Secondly, there's the algorithm that tries to predict what case the correction should be in:

def fix_case(word, fixword):
    if word == word.capitalize():
        return fixword.capitalize()
    elif word == word.upper():
        return fixword.upper()
    # they are both lower case
    # or we don't have any idea
    return fixword

Obviously this can't currently deal with camel case. On your first line the word is capitalised, so it offers a capitalised correction; on the second it gets confused and gives up, returning the correction as stored.
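
To make the two outputs from the report concrete, here is a minimal sketch that runs the function above on the variants of the typo. It assumes the correction has already been lower-cased to "lesstif" when the dictionary was loaded; the `results` name is only for illustration.

```python
def fix_case(word, fixword):
    # Same logic as the function quoted above.
    if word == word.capitalize():
        return fixword.capitalize()
    elif word == word.upper():
        return fixword.upper()
    # they are both lower case, or we don't have any idea
    return fixword


# Assumption: the dictionary correction was lower-cased on load.
fix = "lesstif"

results = {
    word: fix_case(word, fix)
    for word in ("lesstiff", "Lesstiff", "LessTiff", "LESSTIFF")
}

for word, suggestion in results.items():
    print(f"{word} ==> {suggestion}")
```

"LessTiff" is neither equal to its capitalize() form ("Lesstiff") nor to its upper() form, so it falls through to the final return and gets the lower-case correction.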

I think we've generally gone for the typo being in lower case, but if we stored this typo in camel case, made sure it was lowered throughout the codebase, and added an elif word == word.lower(): check here, I think that would sort out this behaviour (unless I've missed something or forgotten some reasoning around storing the typos in lower case)...
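
One possible reading of that idea, as a rough sketch rather than the actual codespell implementation: the dictionary entry keeps its real case, lookups use a lowered key, and fix_case() gains a lower-case branch. The names `MISSPELLINGS`, `fix_case_sketch` and `suggest` are hypothetical, and the entry `LessTiff->LessTif` is an assumption.

```python
# Hypothetical dictionary: the entry is written in its real case,
# but the key is lowered once at load time for lookups.
RAW_ENTRIES = ["LessTiff->LessTif"]
MISSPELLINGS = {}
for line in RAW_ENTRIES:
    key, data = line.split("->")
    MISSPELLINGS[key.lower()] = data.strip()


def fix_case_sketch(word, fixword):
    if word == word.capitalize():
        return fixword.capitalize()
    elif word == word.upper():
        return fixword.upper()
    elif word == word.lower():
        # New branch: an all-lower typo gets an all-lower fix.
        return fixword.lower()
    # Mixed case: fall back to the case stored in the dictionary.
    return fixword


def suggest(word):
    # Look the word up by its lowered form, then adjust the case.
    return fix_case_sketch(word, MISSPELLINGS[word.lower()])


for word in ("lesstiff", "Lesstiff", "LessTiff", "LESSTIFF"):
    print(f"{word} ==> {suggest(word)}")
```

Under this reading a camel-case typo now gets the stored "LessTif", but a capitalised typo still gets "Lesstif", because str.capitalize() lowers everything after the first letter.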

@vikivivi
Contributor

@peternewman @luzpaz
Please find my patch for handling the capitalization issue.

  1. Do not change suggested words in the dictionary to lower case during loading.
  2. The capitalization decision is made in fix_case().
  3. Hopefully this handles proper nouns, camelCase and capitalised words.

diff --git a/codespell_lib/_codespell.py b/codespell_lib/_codespell.py
index 1ed70e89..f0607c1e 100644
--- a/codespell_lib/_codespell.py
+++ b/codespell_lib/_codespell.py
@@ -454,10 +454,10 @@ def build_dict(filename, misspellings, ignore_words):
     with codecs.open(filename, mode='r', encoding='utf-8') as f:
         for line in f:
             [key, data] = line.split('->')
-            # TODO for now, convert both to lower. Someday we can maybe add
-            # support for fixing caps.
+            # Convert key to lower case.
+            # Do not modify data to lower case. Leave it as per dictionary.
             key = key.lower()
-            data = data.lower()
+            # data = data.lower()
             if key in ignore_words:
                 continue
             data = data.strip()
@@ -494,12 +494,16 @@ def is_text_file(filename):
 
 
 def fix_case(word, fixword):
-    if word == word.capitalize():
+    if fixword == fixword.upper():
+        # fixword is in all upper case as per dictionary. Eg. ASCII
+        return fixword
+    elif word == word.capitalize() and fixword == fixword.lower():
+        # word is capitalized and fixword in lower. Capitalize fixword. Eg. Pineapple
         return fixword.capitalize()
     elif word == word.upper():
+        # word is in all upper case, change fixword to upper. Eg. MONDAY
         return fixword.upper()
-    # they are both lower case
-    # or we don't have any idea
+    # word is in lower, capitalize, CamelCase or whatever. Use fixword as per dictionary
     return fixword

$ cat test.sh
#!/bin/sh

# Suggested word in all upper case in dictionary
echo "asscii" | codespell -
echo "Asscii" | codespell -
echo "ASSCII" | codespell -

# Misspelling coded in dictionary as lower
echo "tusday" | codespell -
echo "Tusday" | codespell -
echo "TUSDAY" | codespell -

# Misspelling coded in dictionary as Capitalize
echo "micosoft" | codespell -
echo "Micosoft" | codespell -
echo "MICOSOFT" | codespell -

# Misspelling and suggested both in lower case in dictionary
echo "pinapple" | codespell -
echo "Pinapple" | codespell -

# Suggested word in CamelCase in dictionary
echo "lesstiff" | codespell -
echo "lessTiff" | codespell -
echo "Lesstiff" | codespell -
echo "LessTiff" | codespell -
echo "LESSTIFF" | codespell -
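
The behaviour the script exercises can also be checked without a dictionary file. This is a minimal sketch, assuming the patched fix_case() from the diff above; the dictionary entries in `ASSUMED_FIXES` are hypothetical stand-ins and the real codespell entries may differ.

```python
def fix_case(word, fixword):
    # Patched version from the diff above.
    if fixword == fixword.upper():
        # fixword is all upper case in the dictionary, e.g. ASCII
        return fixword
    elif word == word.capitalize() and fixword == fixword.lower():
        # word is capitalized and fixword is lower case, e.g. Pineapple
        return fixword.capitalize()
    elif word == word.upper():
        # word is all upper case, e.g. MONDAY
        return fixword.upper()
    # anything else: use fixword exactly as stored in the dictionary
    return fixword


# Hypothetical entries (lowered key -> fix as written in the dictionary).
ASSUMED_FIXES = {
    "asscii": "ASCII",
    "tusday": "Tuesday",
    "micosoft": "Microsoft",
    "pinapple": "pineapple",
    "lesstiff": "LessTif",
}


def suggest(word):
    return fix_case(word, ASSUMED_FIXES[word.lower()])


for word in ("asscii", "Asscii", "ASSCII", "tusday", "Tusday", "TUSDAY",
             "pinapple", "Pinapple", "lesstiff", "lessTiff", "Lesstiff",
             "LessTiff", "LESSTIFF"):
    print(f"{word} ==> {suggest(word)}")
```

With these entries, every variant of the typo except the all-upper one maps to "LessTif", so the two lines from the original report would receive the same suggestion.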

@peternewman
Collaborator

> @peternewman @luzpaz Please find my patch for handling the capitalization issue.
>
>   1. Do not change suggested words in the dictionary to lower case during loading.
>   2. The capitalization decision is made in fix_case().
>   3. Hopefully this handles proper nouns, camelCase and capitalised words.

Thanks very much @vikivivi. However, would you mind submitting it as a Pull Request, please? You'll get credit, it's easier to comment on or improve, and the tests will be run automatically. We can also work on getting your test cases added to the code.

@vikivivi
Contributor

@peternewman I will try to work on a pull request with my additional test cases.

@peternewman
Collaborator

> @peternewman I will try to work on a pull request with my additional test cases.

Great, thanks @vikivivi. Feel free, though, to open the PR with the code as it is above; others could then help you with the test cases too.

@vikivivi
Contributor

@peternewman Please see #2478 for latest patch changes.
