[de] no spelling suggestion for 'Postleidzahl' #725
Comments
The issue here is that Post, Leid, and Zahl are each correct on their own, so the algorithm doesn't come up with suggestions. In addition, "Leit" is one of those compound parts that appear only in compounds and not on their own. Finally, "Postleitzahl" as a whole word is not part of our binary dictionary, which builds on a Hunspell export, and Hunspell also recognizes Postleitzahl only by accepting it as a compound. Possible solutions:
|
Thanks for your comment, @danielnaber. My two cents: Both approaches look viable to me. (1) I tested Hunspell's suggestions (in LibreOffice 5.1.6.2, latest frami dictionary) on three recent examples for which we don't have suggestions right now:
(All three are examples that show, IMO, that at the moment we're deserting the users who need our help the most when we can't come up with suggestions. That's probably why we are receiving so many user suggestions of disturbingly poor quality.) Hunspell suggests:
Based on this small random sample, I daresay that implementing your first idea would greatly improve the experience for users with poor spelling abilities compared to what we have now. (2) As a long-term user and huge fan of Hunspell (ever since it was first included in OpenOffice), I know its weaknesses. Basically, Hunspell accepts and suggests too many odd compounds, as do most other spell checkers that support compounding. Examples include (iirc) Mistreiter, Parklatz, Bestelllungen, Absatzahlen, Nationalsoziallisten, Weltrum, Tonbäder, Schutzschilder, Sitzbanken, Linienbusen, Sozialfond. Those are relatively subtle misspellings that are hard for humans to spot, which is why I consider it dangerous to include them in the suggestions. Blacklisting those would be a huge effort. If we merge my list (or any other large, orthographically clean list of words) into the binary dictionary, prioritize its words over the compound words generated on the fly (i.e., show them first in the context menu), and add the automatically generated suggestions in case a compound is not in the binary dictionary, I'm pretty confident that would do the job way better than any existing solution. |
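The ordering proposed above could be sketched roughly as follows. This is only an illustration, not LanguageTool code: `order_suggestions`, `curated_hits`, and `generated_hits` are hypothetical names, and the two input lists are assumed to have been produced elsewhere (one from the clean word list, one from on-the-fly compounding).

```python
def order_suggestions(curated_hits, generated_hits):
    """Return curated suggestions first, then generated ones not already listed.

    Both arguments are lists of candidate words; how they are found is
    outside the scope of this sketch (assumption).
    """
    seen = set()
    ordered = []
    for word in list(curated_hits) + list(generated_hits):
        if word not in seen:  # drop duplicates, keep first position
            seen.add(word)
            ordered.append(word)
    return ordered


# A clean word is shown before an odd on-the-fly compound.
result = order_suggestions(["Bestellungen"], ["Bestelllungen", "Bestellungen"])
print(result)
```

With this ordering, a subtle misspelling such as "Bestelllungen" can still appear, but only below the word from the clean list.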
Another odd thing I found today: The checker accepts "Ratscafé" and rejects "Ratscafe", which is fine, but the correct word is not in the list of suggestions. |
Even worse: Suggestions don't work for 'Jezt' with an uppercase J. They do work for lowercase j. |
Another case I found today: Umzugsvorber_ie_tungen. No suggestions. |
I have a local fix for "Postleidzahl". Unfortunately there's a side effect of adding more and more words to the dictionary: morfologik's |
I have a list (3 MB) of the words in my wordlist that are not accepted by Hunspell. If I understand the problem correctly, it would help if we either remove those before merging the list into the binary suggestions dictionary or add them to spelling.txt. |
That's correct. I tried that locally and it helps. There are still strange issues left, e.g. "Henrik" now has better suggestions (like "Hendrik"), but "Flucke" doesn't have the obvious "Flocke" suggestion. This is tricky to debug; it's probably related to "flocke" (lowercase) being one of the suggestions... |
If case sensitivity is the problem, it might help if you merge the file uppercase_candidates.txt into the binary dictionary. It is a (very incomplete) list of words that can be both uppercase and lowercase. The uppercase variants are removed from german.dic because Aspell doesn't work properly if they are present. |
A few more real-life examples from the users' suggestions (the misspelling is followed by the intended suggestion):
Except for the first example, LanguageTool currently makes either no suggestions or misleading ones. I think it might help if we tweak the calculation of the edit distance in some cases. The idea is that some characters (or character groups) are so similar that going from one to the other should count as less than one edit, maybe 0.5. In the third example, I see three ways to get 'mogligkeit' closer to 'Möglichkeit' in terms of Levenshtein distance:
Applying that basic idea to 'mogligkeit', I end up with a distance of, say, 1.5 instead of 4. Applying the same logic, 'Molligkeit' is still closer and should be higher in the list of suggestions unless statistics suggest otherwise. But at least in this case, even a weak speller should be able to figure out which one of the two is the intended word. |
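A minimal sketch of such a weighted edit distance is shown below. It is not how LanguageTool or morfologik compute distances. The table `CHEAP`, the 0.5 costs, and the choice of pairs (o/ö, g/ch, and a discounted case change) are assumptions made for illustration; the replacements are directional, from the misspelling to the candidate.

```python
# Hypothetical table of "similar" replacements and their reduced cost.
CHEAP = {("o", "ö"): 0.5, ("g", "ch"): 0.5}


def weighted_distance(src, dst, cheap=CHEAP, case_cost=0.5):
    """Levenshtein distance with discounted costs for similar characters.

    d[i][j] is the cheapest way to turn src[:i] into dst[:j].
    """
    n, m = len(src), len(dst)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = src[i - 1], dst[j - 1]
            if a == b:
                sub = 0.0
            elif a.lower() == b.lower():
                sub = case_cost  # same letter, different case
            else:
                sub = 1.0
            best = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
            # Discounted replacement of a character group by another group.
            for (x, y), cost in cheap.items():
                if src[:i].endswith(x) and dst[:j].endswith(y):
                    best = min(best, d[i - len(x)][j - len(y)] + cost)
            d[i][j] = best
    return d[n][m]


print(weighted_distance("mogligkeit", "Möglichkeit"))
print(weighted_distance("mogligkeit", "Möglichkeit", cheap={}, case_cost=1.0))
```

Under these assumed weights the first call yields 1.5 and the second, with all discounts switched off, yields the plain Levenshtein distance of 4.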
This already exists, but as with almost everything related to spell checking in LT, the situation is a bit complex:
|
On a related note, it might help if we relax the maximum edit distance for long misspelled words, but only when searching in the morfologik dictionary. Otherwise we will probably suggest too many nonsensical words, and it would be computationally costly and slow.
|
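To make the idea of a length-dependent limit concrete, here is a rough sketch. All names and numbers (`max_edit_distance`, a base limit of 2, a length threshold of 12, one extra edit) are assumptions for illustration and are not taken from LanguageTool or morfologik.

```python
def max_edit_distance(word, in_morfologik_dictionary,
                      base=2, long_word_length=12, long_word_bonus=1):
    """Allow more edits for long words, but only for the dictionary lookup.

    All thresholds are hypothetical defaults chosen for this sketch.
    """
    if in_morfologik_dictionary and len(word) >= long_word_length:
        return base + long_word_bonus
    return base


print(max_edit_distance("Jezt", True))
print(max_edit_distance("Umzugsvorberietungen", True))
print(max_edit_distance("Umzugsvorberietungen", False))
```

Keeping the relaxed limit behind the `in_morfologik_dictionary` flag mirrors the concern above that a wider search elsewhere would be slow and produce nonsensical candidates.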
Maybe @jaumeortola can help us with the case issue? |
…ds to the morfologik dictionary (#725, de-DE only for now due to the size of the dictionary)
"Postleidzahl" should provide a good suggestion now, as Jan's list is now used (minus the words hunspell wouldn't accept, as discussed above). |
It does! Based on my tests so far, my impression is that the improvement is huge! In most of the cases I get exactly the correction that I would expect from a human proofreader as the first or second suggestion in the list. I'm very impressed! |
Feel free to add misspelled words to |
Weird behavior: I entered "In aller stile." and "stile" is correctly identified as a typo, but LT does not suggest the uppercase variant "Stile". (Remark: I have modified an existing rule that suggests "Stille" for "Stile".) |
schlaganfal > Schlaganfall is now fixed. I am a bit confused. How was it done? With the new speller dictionary? So "Schlaganfall" was missing in the old one? |
Yes. The old dictionary was a Hunspell export, and Hunspell doesn't need simple compounds like this, because it generates them on the fly. |
I think this is caused by this `if` in morfologik. @jaumeortola any opinion on whether this could be changed, maybe optionally? |
Yes. That is the reason. You have the replacement i > y, and once "style" is found, the search is stopped. I think we should change this condition and always accept suggestions with a case change. So we will suggest Stile and style for stile. |
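The proposed behavior could look roughly like this. It is a simplified sketch and not morfologik's actual search: `suggest`, `i_to_y`, and the tiny `dictionary` set are hypothetical, and the only point shown is that case variants are always collected instead of being skipped once a replacement-based hit is found.

```python
def i_to_y(word):
    """Hypothetical stand-in for the i > y replacement rule."""
    return [word.replace("i", "y", 1)]


def suggest(word, dictionary, replacement_candidates=i_to_y):
    """Always collect case variants, then add replacement-based hits."""
    hits = []
    for variant in (word.capitalize(), word.lower(), word.upper()):
        if variant != word and variant in dictionary and variant not in hits:
            hits.append(variant)
    for candidate in replacement_candidates(word):
        if candidate in dictionary and candidate not in hits:
            hits.append(candidate)
    return hits


dictionary = {"Stile", "style", "Stille"}  # toy dictionary for illustration
print(suggest("stile", dictionary))
```

With this toy dictionary, both "Stile" and "style" are returned for "stile".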
Closing this for now. |
The German spell checker does not come up with any suggestions for 'Postleidzahl'. Looks like a bug IMO.