Danish stemming - some words are wrong #203
Thanks for raising this. Note that we aren't aiming to reduce words to their linguistic root (if that's what you're after, snowball probably isn't the solution to your problem). Snowball aims to improve recall in text search. The stems produced aren't necessarily words. Conceptually, the stemmer maps a word to a stem, and the only thing that matters about the stems is that words that are "the same" tend to map to the same stem, and words that aren't don't. The latter part is actually more important: conflating unrelated words (like most of your cases above) is problematic because it results in false matches to searches, whereas failing to conflate is at least no worse than not using a stemmer at all. In practice it's easiest to design an algorithm where the stems either are, or look a lot like, the linguistic root in most cases, but that's explicitly not a design goal.
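To make the distinction concrete, here is a minimal sketch of how a stemmer partitions a vocabulary into groups. Everything in it is an assumption for illustration: `conservative` and `aggressive` are toy English stemmers, not anything from snowball, and the word list is made up.

```python
from collections import defaultdict


def group_by_stem(vocabulary, stem):
    """Group words by the stem that the given function assigns to them."""
    groups = defaultdict(set)
    for word in vocabulary:
        groups[stem(word)].add(word)
    return dict(groups)


def conservative(word):
    # Toy stemmer (hypothetical): only strips a trailing "s" from longer words.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def aggressive(word):
    # Toy stemmer (hypothetical): truncates every word to five characters.
    return word[:5]


vocab = ["universe", "universes", "university", "universities"]

# Under-conflation: "university" and "universities" stay apart, which is
# no worse than searching without a stemmer.
print(group_by_stem(vocab, conservative))

# Over-conflation: all four words share one stem, so a search for
# "universe" would also match documents about universities.
print(group_by_stem(vocab, aggressive))
```

The stems themselves ("universitie", "unive") are not words, and that does not matter; only the grouping does.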
So these are actually OK (I don't know Danish, but other forms of the first case seem to include [...]

I mention this not to split hairs, but because it helps understand what the problem here actually is.

Anyway, I wonder if the best way to solve this might be to change handling of [...]

@KeLeKaPe Are you a Danish speaker? If so, could you suggest which of these suffixes that change should probably be applied to:
The longest matching suffix from this list gets removed (but only if it occurs in region R1, which starts after the first consonant that follows a vowel, or at least 3 characters from the start, so you don't need to worry about the effect on very short words). We want to change this for all the cases like [...]

I can then amend the stemmer code and run a comparison of the output with and without the change (we have a script which analyses changes in how the stemmer groups words from a test vocabulary).
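For anyone following along, the R1 rule described above can be sketched roughly as follows. This is an illustrative approximation, not the actual stemmer code: the function names are hypothetical, the vowel set is the one the Danish algorithm description uses (a e i o u y æ å ø), and the suffixes in the usage lines are examples only, not the list under discussion.

```python
VOWELS = set("aeiouyæåø")


def r1_start(word, min_prefix=3):
    """Return the index where R1 begins.

    R1 starts after the first non-vowel that follows a vowel, moved
    right if necessary so at least `min_prefix` characters precede it.
    """
    start = len(word)  # R1 is empty if no vowel/non-vowel pair exists
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            start = i + 1
            break
    return min(max(start, min_prefix), len(word))


def strip_longest_suffix(word, suffixes):
    """Remove the longest suffix from `suffixes` that lies entirely in R1."""
    start = r1_start(word)
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= start:
            return word[: -len(suffix)]
    return word


# Illustrative suffixes only (an assumption, not the list from this thread).
print(strip_longest_suffix("bilerne", ["erne", "er", "e"]))  # bil
print(strip_longest_suffix("aben", ["en"]))  # aben ("en" starts before R1)
```

The second call shows why very short words are unaffected: R1 for "aben" is just "n", so "en" does not fit inside it and nothing is removed.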
I am encountering a lot of issues with the Danish stemming.
Here are some examples of incorrect stem results:
These are some of the examples I could find; there might be more.