[de] no spelling suggestion for 'Postleidzahl' #725

janschreiber · 2017-06-21T17:09:05Z

The German spell checker does not come up with any suggestions for 'Postleidzahl'. Looks like a bug IMO.

danielnaber · 2017-06-22T21:52:20Z

The issue here is that Post, Leid, and Zahl are correct on their own, thus the algorithm doesn't come up with suggestions. Plus "Leit" is one of those compound parts that only appear in compounds and not on their own. Plus "Postleitzahl" as a whole word is not part of our binary dictionary, which builds on a hunspell export, and hunspell also recognizes Postleitzahl only by accepting it as a compound. Possible solutions:

1. If no suggestions are found, use hunspell for suggestions again. Slow, but probably okay as it doesn't happen too often. Would help for this case but probably not for others where we already have (bad) suggestions.
2. Extend the binary dictionary with more compound words, e.g. with Jan's list.
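
The first option could look roughly like the sketch below. It is only an illustration: `FallbackSuggester` and the two `Function` parameters are hypothetical stand-ins for the fast binary-dictionary lookup and the slower hunspell lookup, not LanguageTool classes.

```java
import java.util.List;
import java.util.function.Function;

final class FallbackSuggester {
    // "fast" and "slow" are hypothetical stand-ins for the two suggestion sources.
    static List<String> suggest(String word,
                                Function<String, List<String>> fast,
                                Function<String, List<String>> slow) {
        List<String> result = fast.apply(word);
        // Only pay for the slow lookup when the fast one found nothing.
        return result.isEmpty() ? slow.apply(word) : result;
    }

    public static void main(String[] args) {
        Function<String, List<String>> fast = w -> List.of();               // finds nothing
        Function<String, List<String>> slow = w -> List.of("Postleitzahl"); // slow fallback
        System.out.println(suggest("Postleidzahl", fast, slow));
    }
}
```

As noted in the first option, this only helps when the fast lookup returns nothing at all. Words that already get (bad) suggestions never reach the fallback.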

janschreiber · 2017-06-23T20:12:16Z

Thanks for your comment, @danielnaber. My two cents:

Both approaches look viable to me.

(1) I tested Hunspell's suggestions (in LibreOffice 5.1.6.2, latest frami dictionary) on three recent examples for which we don't have suggestions right now:

schlaganfal (BTW: again, it seems weird that there is no suggestion here)
Postleidzahl
analpherbet

(All three are examples that show IMO that at the moment, we're deserting the users that need our help the most if we can't come up with suggestions. That's probably why we are receiving so many user suggestions of disturbingly poor quality.)

Hunspell suggests:

For 'schlaganfal':
Schlaganfallrisiko (good word, but a bit weird, misses the most natural suggestion)

For 'Postleidzahl':
Postleitzahl (correct)
Postleihzahl (a bit meh but acceptable)
Plastidenzahl (apparently an actual word, so fine)

For 'analpherbet':
Analphabetenrate (far off, but a good word)
Analphabeten (expected suggestion)
heranarbeiteten (okay, why not?)
Analphabet (fine)

Based on this small random sample, I daresay that the user experience for users with poor spelling abilities would be greatly enhanced over what we have now by implementing your first idea.

(2) As a long-term user and huge fan of Hunspell (ever since it was first included in OpenOffice), I know its weaknesses. Basically, Hunspell accepts and suggests too many odd compounds, as do most other spell checkers that support compounding. Examples include (iirc) Mistreiter, Parklatz, Bestelllungen, Absatzahlen, Nationalsoziallisten, Weltrum, Tonbäder, Schutzschilder, Sitzbanken, Linienbusen, Sozialfond. Those are relatively subtle misspellings that are hard for humans to spot, which is why I consider it dangerous to include them in the suggestions. Blacklisting those would be a huge effort.

If we merge my list (or any other large, orthographically clean list of words) into the binary dictionary and prioritize them over the compound words generated on the fly (i. e., show them first in the context menu) and add the automatically generated suggestions in case a compound is not in the binary dictionary, I'm pretty confident that would do the job way better than any existing solution.
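
A minimal sketch of that ordering, assuming both suggestion sources already return lists. `SuggestionMerger` is a made-up name for illustration, not existing LanguageTool code.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SuggestionMerger {
    // Words from the clean list come first; compounds generated on the fly follow.
    // LinkedHashSet keeps insertion order and drops duplicates.
    static List<String> merge(List<String> fromDictionary, List<String> generated) {
        Set<String> ordered = new LinkedHashSet<>(fromDictionary);
        ordered.addAll(generated);
        return new ArrayList<>(ordered);
    }

    public static void main(String[] args) {
        System.out.println(merge(List.of("Postleitzahl"),
                                 List.of("Postleihzahl", "Postleitzahl")));
    }
}
```

With this ordering a generated compound can never push a word from the clean list further down the context menu.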

janschreiber · 2017-06-24T14:13:06Z

Another odd thing I found today: The checker accepts "Ratscafé" and rejects "Ratscafe", which is fine, but the correct word is not in the list of suggestions.

janschreiber · 2017-06-24T18:09:48Z

Even worse: Suggestions don't work for 'Jezt' with an uppercase J. Works for lowercase j.

janschreiber · 2017-06-27T14:59:35Z

Another case I found today: Umzugsvorber_ie_tungen. No suggestions.

danielnaber · 2017-07-04T22:17:40Z

I have a local fix for "Postleidzahl". Unfortunately there's a side effect of adding more and more words to the dictionary: morfologik's Speller.findReplacements() will only work on words that are "misspelled", i.e. not in its dictionary. Thus, we don't get any suggestions for e.g. Henrik because it's now in the morfologik dictionary, but not in the hunspell one. I guess we need to modify morfologik to get this working for our (rather special) use case.

janschreiber · 2017-07-05T17:56:35Z

I have a list (3 MB) of the words in my wordlist that are not accepted by Hunspell. If I understand the problem correctly, it would help if we either remove those before merging the list into the binary suggestions dictionary or add them to spelling.txt.
All those words were programmatically checked with Duden Korrektor or MS Word, but most of them not manually. Perhaps it would be the cleanest solution to remove them.
In any case, it would be quite irritating for the users if the spell checker suggests words that it then considers misspelled, or marks a word as misspelled and then suggests that exact same word.
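
The "remove them before merging" variant boils down to a filter like the one below. This is a sketch only: `WordListFilter` and the `isAccepted` predicate are hypothetical, and the predicate stands in for whatever check decides whether the spell checker accepts a word.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class WordListFilter {
    // Keep only words the checker itself accepts, so that a suggestion
    // is never flagged as misspelled afterwards.
    static List<String> keepAccepted(List<String> words, Predicate<String> isAccepted) {
        return words.stream().filter(isAccepted).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Toy predicate: pretend the checker rejects "Henrik".
        Predicate<String> isAccepted = w -> !w.equals("Henrik");
        System.out.println(keepAccepted(List.of("Postleitzahl", "Henrik"), isAccepted));
    }
}
```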

danielnaber · 2017-07-05T21:18:00Z

> If I understand the problem correctly, it would help if we either remove those before merging the list into the binary suggestions dictionary

That's correct, I tried that locally and it helps. There are still strange issues left, e.g. "Henrik" now has better suggestions (like "Hendrik"), but "Flucke" doesn't have the obvious "Flocke" suggestion. This is tricky to debug, it's probably related to "flocke" (lowercase) being one of the suggestions...

janschreiber · 2017-07-06T10:47:12Z

> This is tricky to debug, it's probably related to "flocke" (lowercase) being one of the suggestions...

If case sensitivity is the problem, it might help if you merge the file uppercase_candidates.txt into the binary dictionary. It is a (very incomplete) list of words that can be both uppercase and lowercase. The uppercase variants are removed from german.dic because Aspell doesn't work properly if they are present.

janschreiber · 2017-07-11T19:55:51Z

A few more real-life examples from the users' suggestions (the misspelling is followed by the intended suggestion):

Gutschaine → Gutscheine (works already)
Komunkationsinstrumente → Kommunikationsinstrumente
mogligkeit → Möglichkeit
Gelbensäcke → gelben Säcke
WIFI → Wi-Fi
Aussenvisualisierung → Außenvisualisierung

Except for the first example, the current status is that LanguageTool makes no or misleading suggestions.
Aspell with my word list gets the first three right, because 'Außenvisualisierung' is not in the dictionary yet. 'Fi' and 'Wi' will not be added.
Hunspell gets the first and the last one right.
Google Docs doesn't consider 'Aussenvisualisierung' an error and suggests 'moglichkeit' in the third example, but gets the others right.
The Duden online checker only has the proper suggestion for 'Aussenvisualisierung', no suggestions at all for the other cases. It just says "Check the spelling," which is quite unhelpful for many users.

I think it might help if we tweak the calculation of the edit distance in some cases. The idea is that some characters (or character groups) are so similar that it should be counted as less than one edit to go from one to the other, maybe 0.5.

In the third example, I see three ways to get 'mogligkeit' closer to 'Möglichkeit' in terms of Levenshtein distance:

1. If the correction converts the first letter to uppercase, this might even be considered a distance of zero, at least for German. (It's the same letter in some sense, after all.)
2. Converting 'g' to 'ch' could be considered a single edit instead of two. Possibly the same for 'er' and 'a', at least at the word end: 'Bölla' is very likely an attempt to write 'Böller'. Certainly the step from 'ss' to 'ß' and vice versa should be counted as one edit, maybe even less than one. The same for 'oe' and 'ö' etc. Another candidate: 'x' and 'chs' (Fux/Fuchs).
3. The step from a vowel to its umlaut counterpart is less than one full edit IMO.

Applying that basic idea to 'mogligkeit', I end up with a distance of, say, 1.5 instead of 4. Applying the same logic, 'Molligkeit' is still closer and should be higher in the list of suggestions unless statistics suggest otherwise. But at least in this case, even a weak speller should be able to figure out which one of the two is the intended word.
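
To make the arithmetic concrete, here is a small sketch of such a weighted distance. It is not what morfologik does internally. The class name, the weights (0 for a case change of the first letter, 0.5 for vowel/umlaut, 1 for a character group) and the choice of groups are assumptions taken from the proposal above. Non-ASCII letters are written as Unicode escapes so the file compiles under any source encoding.

```java
final class WeightedEditDistance {
    // Character groups counted as one edit in either direction:
    // g/ch, ss/sharp s (U+00DF), oe/o-umlaut (U+00F6), x/chs. Assumed subset.
    private static final String[][] GROUPS = {
        {"g", "ch"}, {"ss", "\u00df"}, {"oe", "\u00f6"}, {"x", "chs"}
    };
    private static final double GROUP_COST = 1.0;  // assumed weight
    private static final double UMLAUT_COST = 0.5; // assumed weight

    static double distance(String from, String to) {
        int n = from.length(), m = to.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) d[i][0] = i; // delete everything
        for (int j = 1; j <= m; j++) d[0][j] = j; // insert everything
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                boolean firstLetter = i == 1 && j == 1;
                double best = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1);
                best = Math.min(best, d[i - 1][j - 1]
                        + substitutionCost(from.charAt(i - 1), to.charAt(j - 1), firstLetter));
                for (String[] g : GROUPS) {
                    best = Math.min(best, groupCost(d, from, i, to, j, g[0], g[1]));
                    best = Math.min(best, groupCost(d, from, i, to, j, g[1], g[0]));
                }
                d[i][j] = best;
            }
        }
        return d[n][m];
    }

    // Cost of replacing the group x (ending at i in "from") by y (ending at j in "to").
    private static double groupCost(double[][] d, String from, int i,
                                    String to, int j, String x, String y) {
        if (i >= x.length() && j >= y.length()
                && from.startsWith(x, i - x.length()) && to.startsWith(y, j - y.length())) {
            return d[i - x.length()][j - y.length()] + GROUP_COST;
        }
        return Double.POSITIVE_INFINITY;
    }

    private static double substitutionCost(char from, char to, boolean firstLetter) {
        if (firstLetter) { // a case change of the first letter is free
            from = Character.toLowerCase(from);
            to = Character.toLowerCase(to);
        }
        if (from == to) return 0.0;
        if (isUmlautPair(from, to) || isUmlautPair(to, from)) return UMLAUT_COST;
        return 1.0;
    }

    // Lowercase vowels and their umlauts: U+00E4, U+00F6, U+00FC.
    private static boolean isUmlautPair(char plain, char umlaut) {
        return (plain == 'a' && umlaut == '\u00e4')
            || (plain == 'o' && umlaut == '\u00f6')
            || (plain == 'u' && umlaut == '\u00fc');
    }

    public static void main(String[] args) {
        System.out.println(distance("mogligkeit", "M\u00f6glichkeit")); // 1.5
        System.out.println(distance("mogligkeit", "Molligkeit"));       // 1.0
    }
}
```

The two results match the numbers above: 'Möglichkeit' ends up at 1.5 (free case change, half an edit for o/ö, one edit for g/ch), while 'Molligkeit' is still closer at 1.0.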

danielnaber · 2017-07-11T21:38:51Z

> In the third example, I see three ways to get 'mogligkeit' closer to 'Möglichkeit' in terms of Levenshtein distance:

This already exists, but as with almost everything related to spell checking in LT, the situation is a bit complex:

https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/de_DE.info#L6 - used by morfologik, the list should probably be extended, I'm not sure why it's less complete than the following list
https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell/de_DE.aff#L504 (used for hunspell to create suggestions, so this is probably not used at all as long as we don't use hunspell suggestions)
https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanSpellerRule.java#L49 - some kind of post-processing, maybe this can be removed already as morfologik should already do it, we need to check this

janschreiber · 2017-07-12T10:12:56Z

On a related note, it might help if we relax the maximum edit distance for long misspelled words, but only when searching in the morfologik dictionary, otherwise we will probably suggest too many nonsensical words, and it would be computationally costly/slow.
A maximum distance of 2 is often too low.
Something like this:

// if searching for suggestions in the morfologik dictionary
int maxEdDist;
if (wrongWord.length() > 8) {
    maxEdDist = 4; // long words: allow more edits
} else if (wrongWord.length() == 2) {
    maxEdDist = 1; // avoid suggesting 'an' for 'cu' etc.
} else {
    maxEdDist = MAX_EDIT_DISTANCE;
}

danielnaber · 2017-07-12T15:12:17Z

Maybe @jaumeortola can help us with the case issue? schlaganfal has two typos, the correct word is Schlaganfall. Still, we don't get a suggestion. Can we ignore upper/lowercase for suggestions? With fsa.dict.speller.equivalent-chars=s S,S s I didn't see an improvement. But even without that, schlaganfal should have a distance of 2 and should thus be suggested. Jaume, if you think you can help but need a minimal test case, I could create one.

danielnaber · 2017-07-13T19:28:22Z

"Postleidzahl" should provide a good suggestion now, as Jan's list is now used (minus the words hunspell wouldn't accept, as discussed above).

janschreiber · 2017-07-14T10:22:39Z

> "Postleidzahl" should provide a good suggestion now

It does! Based on my tests so far, my impression is that the improvement is huge! In most of the cases I get exactly the correction that I would expect from a human proofreader as the first or second suggestion in the list. I'm very impressed!

danielnaber · 2017-07-14T10:34:36Z

Feel free to add misspelled words to languagetool-language-modules/de/src/test/resources/suggestions.txt in the format word => - this file is used by a test case, but one that doesn't run automatically. I'll run it from time to time to see how suggestions improve/change.

f-knorr · 2017-07-15T09:13:38Z

Weird behavior: I have entered "In aller stile." and "stile" is correctly identified as typo, but LT does not suggest the uppercase variant "Stile". (Remark: I have modified an existing rule that suggests "Stille" for "Stile")

jaumeortola · 2017-07-15T13:46:17Z

schlaganfal > Schlaganfall is now fixed. I am a bit confused. How was it done? With the new speller dictionary? So "Schlaganfall" was missing in the old one?

janschreiber · 2017-07-15T13:56:01Z

> So "Schlaganfall" was missing in the old one?

Yes. The old dic was a Hunspell export, and Hunspell doesn't need simple compounds like this, because it generates them on-the-fly.

danielnaber · 2017-07-15T14:40:11Z

> Weird behavior: I have entered "In aller stile." and "stile" is correctly identified as typo, but LT does not suggest the uppercase variant "Stile".

I think this is caused by this 'if' condition in morfologik. @jaumeortola any opinion on whether this could be changed, maybe optionally?

jaumeortola · 2017-07-15T15:06:32Z

Yes. That is the reason. You have the replacement i > y, and once "style" is found, the search is stopped. I think we should change this condition and always accept suggestions with a case change. So we will suggest Stile and style for stile.
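
The intended behavior can be sketched outside of morfologik as follows. This is not the morfologik code in question. `CaseVariantSuggestions` and the `inDictionary` predicate are hypothetical, and the real change would have to go into the search condition itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class CaseVariantSuggestions {
    // Always offer the word with its first letter's case flipped (if the
    // dictionary knows it), in addition to the other suggestions.
    static List<String> withCaseVariant(String word, List<String> others,
                                        Predicate<String> inDictionary) {
        List<String> result = new ArrayList<>();
        if (!word.isEmpty()) {
            char first = word.charAt(0);
            String flipped = Character.isUpperCase(first)
                    ? Character.toLowerCase(first) + word.substring(1)
                    : Character.toUpperCase(first) + word.substring(1);
            if (inDictionary.test(flipped)) {
                result.add(flipped);
            }
        }
        for (String s : others) {
            if (!result.contains(s)) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("Stile", "style");
        System.out.println(withCaseVariant("stile", List.of("style"), dictionary::contains));
    }
}
```

For 'stile' this yields both 'Stile' and 'style', which is the outcome described above.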

janschreiber · 2017-07-20T14:29:12Z

Closing this for now.

janschreiber added bug German labels Jun 21, 2017

danielnaber added a commit that referenced this issue Jul 11, 2017

[de] add g/ch and ch/g replacement pairs #725

d891558

danielnaber added a commit that referenced this issue Jul 12, 2017

[de] words from #725 for future tests

8c8bbb8

danielnaber added a commit that referenced this issue Jul 13, 2017

[de] improve many spell check suggestions by adding Jan's list of wor…

fd6c167

…ds to the morfologik dictionary (#725, de-DE only for now due to the size of the dictionary)

janschreiber closed this as completed Jul 20, 2017

janschreiber mentioned this issue Jan 20, 2019

[de] wrong suggestion for "Dampfschiffahrtskapitän" #1369

Closed
