MultiThreadedJLanguageTool with cache gives incomplete results #897

Closed
JFormDesigner opened this issue Feb 13, 2018 · 4 comments

Comments

@JFormDesigner

When MultiThreadedJLanguageTool.check(text) is used with a cache, it returns fewer results than JLanguageTool.check(text) with a cache. Without a cache, MultiThreadedJLanguageTool.check(text) returns the same results as JLanguageTool.

It seems that MultiThreadedJLanguageTool does not check all sentences if the cache is enabled.

Here is a program that demonstrates the problem:

import java.io.IOException;
import java.util.List;
import org.languagetool.JLanguageTool;
import org.languagetool.Language;
import org.languagetool.MultiThreadedJLanguageTool;
import org.languagetool.ResultCache;
import org.languagetool.language.GermanyGerman;
import org.languagetool.rules.RuleMatch;

public class MultiTest {
	private static final String TEXT = "Herzlich Willkommen zu unserem Beispieltext, in den wir absichlich einige Fehler eingebaut haben. Wer zuerst kommt, malt zuerst, heißt es sprichwörtlich. Doch für einen guten Text muss man zunächst so manche Hürden umschiffen. Dem Author wurde Angst und bange, doch nachdem er alle Fehler beseitigte, wollte er den Text schnellstmöglichst veröffentlichen.";

	public static void main(String[] args) throws IOException {
		Language language = new GermanyGerman();

		// single threaded
		JLanguageTool singleLanguageTool = new JLanguageTool(language, null, new ResultCache(1000));
		List<RuleMatch> singleResult = singleLanguageTool.check(TEXT);

		System.out.println("---- single threaded ----");
		System.out.println(singleResult.size());
		System.out.println(singleResult.toString().replace(", ", "\n"));

		// multi threaded
		JLanguageTool multiLanguageTool = new MultiThreadedJLanguageTool(language, null, 2, new ResultCache(1000));
		List<RuleMatch> multiResult = multiLanguageTool.check(TEXT);

		System.out.println("---- multi threaded ----");
		System.out.println(multiResult.size());
		System.out.println(multiResult.toString().replace(", ", "\n"));

		System.out.println();
		System.out.println("Results are " + (singleResult.equals(multiResult) ? "equal" : "NOT equal") );
	}
}

The output is:

---- single threaded ----
8
[WILLKOMMEN_GROSS:9-19:<suggestion>willkommen</suggestion> scheint hier ein Adjektiv zu sein und muss daher kleingeschrieben werden.
GERMAN_SPELLER_RULE:56-66:Möglicher Rechtschreibfehler gefunden
MALT_MAHLT:116-120:Meinten Sie <suggestion>mahlt</suggestion>? (Redewendung)
HUERDEN_UMSCHIFFEN:208-214:Die feste Wendung im Sinne von 'Hindernisse überwinden' lautet korrekt '<suggestion>Klippen</suggestion> umschiffen'.
GERMAN_SPELLER_RULE:231-237:Möglicher Rechtschreibfehler gefunden
ANGST_UND_BANGE:244-259:In der Wendung <suggestion>angst und bange</suggestion> werden/sein werden 'angst' und 'bange' kleingeschrieben.
NACHDEM_PRAETERITUM:289-299:'Nachdem' drückt standardsprachlich Vorzeitigkeit aus und sollte daher nicht mit dem Präteritum verwendet werden. Verwenden Sie das Perfekt (Präsens im Hauptsatz) oder Plusquamperfekt (Präteritum im Hauptsatz) oder 'als' zum Ausdrücken von Gleichzeitigkeit.
GERMAN_SPELLER_RULE:320-338:Möglicher Rechtschreibfehler gefunden]
---- multi threaded ----
5
[WILLKOMMEN_GROSS:9-19:<suggestion>willkommen</suggestion> scheint hier ein Adjektiv zu sein und muss daher kleingeschrieben werden.
GERMAN_SPELLER_RULE:56-66:Möglicher Rechtschreibfehler gefunden
MALT_MAHLT:116-120:Meinten Sie <suggestion>mahlt</suggestion>? (Redewendung)
ANGST_UND_BANGE:244-259:In der Wendung <suggestion>angst und bange</suggestion> werden/sein werden 'angst' und 'bange' kleingeschrieben.
NACHDEM_PRAETERITUM:289-299:'Nachdem' drückt standardsprachlich Vorzeitigkeit aus und sollte daher nicht mit dem Präteritum verwendet werden. Verwenden Sie das Perfekt (Präsens im Hauptsatz) oder Plusquamperfekt (Präteritum im Hauptsatz) oder 'als' zum Ausdrücken von Gleichzeitigkeit.]

Results are NOT equal
@danielnaber
Member

Thanks for this excellent bug report. This will probably be "fixed" by just removing the cache option for MultiThreadedJLanguageTool. I don't think there's a lot of need for cache with the multi-threaded LT. The most common use case is server-side code, and in those cases it mostly makes sense to use single-thread code, as you have multiple users and get good CPU utilization just by the fact that several users are using the service at the same time.

Background: ResultCache only works if the set of rules doesn't change (that's even documented). But for multi-threading, we don't split the text, but the rule set. So we test the same text several times with different sets of rules. And that doesn't work properly with a ResultCache.
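
To make the failure mode concrete, here is a minimal toy model in plain Java. It is not LanguageTool code: every name in it (`SharedCacheDemo`, `worker`, `runRules`) is made up for illustration, and it assumes that each worker looks up a cache keyed by the sentence alone and that merged results are de-duplicated. The two workers run one after the other so the outcome is deterministic; with real threads, which matches get lost would depend on timing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these names are LanguageTool classes or methods.
class SharedCacheDemo {

    static final List<String> SENTENCES = Arrays.asList(
            "I recieve teh mail.",
            "My adress is definately wrong.");

    // The full rule set, split into one share per worker.
    static final List<String> RULES_A = Arrays.asList("teh", "recieve");
    static final List<String> RULES_B = Arrays.asList("adress", "definately");

    // A "rule" is just a misspelling to search for; a match is "rule@sentenceIndex".
    static List<String> runRules(List<String> rules, String sentence, int index) {
        List<String> matches = new ArrayList<>();
        for (String rule : rules) {
            if (sentence.contains(rule)) {
                matches.add(rule + "@" + index);
            }
        }
        return matches;
    }

    // One worker visits every sentence but applies only its own share of the rules.
    // The cache (if any) is keyed by sentence text alone, so it cannot tell the shares apart.
    static List<String> worker(List<String> rules, Map<String, List<String>> cache) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < SENTENCES.size(); i++) {
            String sentence = SENTENCES.get(i);
            List<String> matches = (cache == null) ? null : cache.get(sentence);
            if (matches == null) {
                // Cache miss (or no cache): actually run this worker's rules.
                matches = runRules(rules, sentence, i);
                if (cache != null) {
                    cache.put(sentence, matches);
                }
            }
            // On a cache hit, this worker's own rules never ran for the sentence.
            result.addAll(matches);
        }
        return result;
    }

    // Runs both workers sequentially and merges their matches, dropping duplicates.
    static Set<String> check(boolean useCache) {
        Map<String, List<String>> cache = useCache ? new HashMap<String, List<String>>() : null;
        Set<String> merged = new LinkedHashSet<>();
        merged.addAll(worker(RULES_A, cache));
        merged.addAll(worker(RULES_B, cache));
        return merged;
    }

    public static void main(String[] args) {
        System.out.println("without cache: " + check(false).size()); // prints 4
        System.out.println("shared cache:  " + check(true).size());  // prints 2
    }
}
```

In this sketch the first worker fills the cache for every sentence, so the second worker only ever sees cache hits and its share of the rules is silently skipped.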

I'll think more about whether there's a better solution.

@JFormDesigner
Author

Thanks for looking into this issue.

My use case is a GUI markdown editor (https://github.com/JFormDesigner/markdown-writer-fx/tree/feature/spellchecker) and I tried MultiThreadedJLanguageTool because it is simply faster.

> Background: ResultCache only works if the set of rules doesn't change (that's even documented)

Hmm, I've read this in the ResultCache Javadoc, but InputSentence (used as the key in ResultCache) has fields for enabled/disabled rules. So enabling or disabling rules should be no problem for the cache. In fact, I'm doing this in my app without problems (single-threaded).

> But for multi-threading, we don't split the text, but the rule set. So we test the same text several times with different sets of rules.

Understood, but if I run the above test program with the language AmericanEnglish (instead of German), I get the result below. Only a single rule appears in the results, yet the results differ. So there must be some kind of text splitting, right?

---- single threaded ----
36
[MORFOLOGIK_RULE_EN_US:0-8:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:9-19:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:20-22:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:23-30:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:31-43:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:52-55:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:56-66:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:67-73:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:74-80:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:81-90:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:91-96:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:98-101:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:102-108:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:109-114:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:121-127:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:129-134:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:138-152:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:154-158:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:159-162:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:163-168:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:169-174:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:189-197:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:201-207:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:208-214:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:215-225:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:238-243:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:250-253:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:254-259:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:261-265:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:266-273:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:277-281:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:282-288:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:289-299:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:301-307:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:320-338:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:339-354:Possible spelling mistake found]
---- multi threaded ----
8
[MORFOLOGIK_RULE_EN_US:154-158:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:159-162:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:163-168:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:169-174:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:189-197:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:201-207:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:208-214:Possible spelling mistake found
MORFOLOGIK_RULE_EN_US:215-225:Possible spelling mistake found]

Results are NOT equal

@danielnaber
Member

> So there must be some kind of text splitting, right?

The text is split into sentences and all sentences get iterated over several times in parallel, with different sets of rules. That's different to having the rules deactivated, so having the list of enabled/disabled rules in InputSentence doesn't help in this case.
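
A small sketch of why the enabled/disabled lists in the key do not help, again as a toy model rather than LanguageTool code. `SentenceKey` is a hypothetical stand-in for a cache key that holds the sentence text plus the enabled and disabled rule IDs; the real InputSentence may hold other fields. The `workerRules` field is purely hypothetical and is only there to show what would have to differ for two workers to get separate cache entries; it is not a proposed patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Toy model only: SentenceKey is a stand-in, not LanguageTool's InputSentence.
class CacheKeyDemo {

    static final class SentenceKey {
        final String text;
        final Set<String> enabledRules;
        final Set<String> disabledRules;
        final Set<String> workerRules; // hypothetical; null means "not part of the key"

        SentenceKey(String text, Set<String> enabledRules,
                    Set<String> disabledRules, Set<String> workerRules) {
            this.text = text;
            this.enabledRules = enabledRules;
            this.disabledRules = disabledRules;
            this.workerRules = workerRules;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof SentenceKey)) {
                return false;
            }
            SentenceKey other = (SentenceKey) o;
            return text.equals(other.text)
                    && enabledRules.equals(other.enabledRules)
                    && disabledRules.equals(other.disabledRules)
                    && Objects.equals(workerRules, other.workerRules);
        }

        @Override
        public int hashCode() {
            return Objects.hash(text, enabledRules, disabledRules, workerRules);
        }
    }

    // Builds the keys two workers would use for the same sentence and reports
    // whether the cache would treat them as the same entry.
    static boolean keysCollide(boolean includeWorkerRules) {
        // Both workers share one user configuration, so these sets are identical.
        Set<String> enabled = new HashSet<>();
        Set<String> disabled = new HashSet<>(Arrays.asList("SOME_DISABLED_RULE"));

        // What actually differs is the share of active rules each worker runs.
        Set<String> shareA = new HashSet<>(Arrays.asList("RULE_1", "RULE_2"));
        Set<String> shareB = new HashSet<>(Arrays.asList("RULE_3", "RULE_4"));

        SentenceKey a = new SentenceKey("Same sentence.", enabled, disabled,
                includeWorkerRules ? shareA : null);
        SentenceKey b = new SentenceKey("Same sentence.", enabled, disabled,
                includeWorkerRules ? shareB : null);
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(keysCollide(false)); // prints true: one cache entry for both workers
        System.out.println(keysCollide(true));  // prints false: separate entries per share
    }
}
```

Disabling a rule changes the key, so the single-threaded case stays consistent. Splitting the active rules across workers changes nothing in the key, so both workers read and write the same entry.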

@JFormDesigner
Author

@danielnaber many thanks for the explanation

JFormDesigner pushed a commit to JFormDesigner/markdown-writer-fx that referenced this issue Feb 15, 2018