Danish stemming - some words are wrong #203

KeLeKaPe · 2024-11-27T11:49:51Z

I am encountering a lot of issues with the Danish stemming.
Here are some examples of wrong stem results:

"bøger" -> "bøg": Incorrect, as "bøger" means "books," while "bøg" means "beech tree."
"talen" -> "tal": Incorrect, as "talen" means "the speech," while "tal" means "number."
"moden" -> "mod": Incorrect, as "moden" means "mature," while "mod" means "courage."
"modner" -> "modn": Incorrect, as "modner" means "ripens," while "modn" is not a valid word.
"slangen" -> "slang": Incorrect, as "slangen" means "the snake," while "slang" means "casual language."
"slanger" -> "slang": Incorrect, as "slanger" means "snakes," while "slang" means "casual language."
"vinder" -> "vind": Incorrect, as "vinder" means "winner" or "wins," while "vind" means "wind."
"ruten" -> "rut": Incorrect, as "ruten" means "the route," while "rut" is not a valid word.
"ruter" -> "rut": Incorrect, as "ruter" means "routes," while "rut" is not a valid word.
"haven" -> "hav": Incorrect, as "haven" means "the garden," while "hav" means "ocean."
"haver" -> "hav": Incorrect, as "haver" means "gardens," while "hav" means "ocean."

These are some of the examples I could find; there might be more.

ojwb · 2024-12-18T02:42:42Z

Thanks for raising this.

Note that we aren't aiming to reduce words to their linguistic root (if that's what you're after, Snowball probably isn't the solution to your problem).

Snowball aims to improve recall in text search. The stems produced aren't necessarily words. Conceptually, the stemmer maps a word to a stem, and the only thing that matters about the stems is that words that are "the same" tend to map to the same stem, and words which aren't don't. The latter part is actually more important: conflating unrelated words (like most of your cases above) is problematic because it results in false matches to searches, whereas failing to conflate is at least no worse than not using a stemmer at all. In practice it's easiest to design an algorithm where the stems either are or look a lot like the linguistic root in most cases, but that's explicitly not a design goal.

"modner" -> "modn": Incorrect, as "modner" means "ripens," while "modn" is not a valid word.
"ruten" -> "rut": Incorrect, as "ruten" means "the route," while "rut" is not a valid word.

So these are actually OK (I don't know Danish, but other forms of the first case seem to include modne and modnet which also both stem to modn, while other forms of the second case seem to include rute and ruterne which also both stem to rut).

I mention this not to split hairs, but because it helps understand what the problem here actually is.

Anyway, I wonder if the best way to solve this might be to change the handling of -er, -en, and probably many of the other suffixes that start with e: we currently remove them completely, but they could instead be replaced with -e. So e.g. bøger would then stem to bøge instead.

@KeLeKaPe Are you a Danish speaker? If so, could you suggest which of these suffixes that change should probably be applied to:

            'hed' 'ethed' 'ered' 'e' 'erede' 'ende' 'erende' 'ene' 'erne' 'ere'
            'en' 'heden' 'eren' 'er' 'heder' 'erer' 'heds' 'es' 'endes'
            'erendes' 'enes' 'ernes' 'eres' 'ens' 'hedens' 'erens' 'ers' 'ets'
            'erets' 'et' 'eret'

The longest matching suffix from this list gets removed, but only if it lies within region R1. R1 starts after the first consonant which follows a vowel, adjusted so that it begins at least 3 characters from the start of the word, so you don't need to worry about the effect on very short words.
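
To make that rule concrete, here is a rough Python sketch of just this one step (not the whole Danish stemmer, which has further steps). The function names `r1_start` and `strip_suffix` are made up for illustration; the suffix list is the one quoted above, and the vowel set is assumed to be the Danish one (a e i o u y æ å ø).

```python
VOWELS = set("aeiouyæåø")  # assumed Danish vowel set

SUFFIXES = [
    "hed", "ethed", "ered", "e", "erede", "ende", "erende", "ene", "erne", "ere",
    "en", "heden", "eren", "er", "heder", "erer", "heds", "es", "endes",
    "erendes", "enes", "ernes", "eres", "ens", "hedens", "erens", "ers", "ets",
    "erets", "et", "eret",
]


def r1_start(word):
    """Index where R1 begins: just after the first non-vowel that follows
    a vowel, but never earlier than index 3."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)  # no such position: R1 is empty


def strip_suffix(word):
    """Remove the longest listed suffix that lies entirely inside R1."""
    start = r1_start(word)
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= start:
            return word[: -len(suffix)]
    return word


for w in ["bøger", "ruten", "ruterne", "modnet"]:
    print(w, "->", strip_suffix(w))
```

On the words from this thread the sketch gives the same stems as reported, e.g. bøger -> bøg and ruterne -> rut.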

We want to change this for all the cases like modn-er, modn-e, modn-et, rut-en, rut-er, rut-e, rut-erne, etc., but ideally without affecting any suffixes which should be completely removed. If there's a suffix on the list where the correct answer depends on the word, we may need to come up with a rule to decide which way to handle it, but it's probably best to just note any such issues you see rather than worry about how to solve them at this point.
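
As a sketch of what the proposed change could look like, the Python below replaces some suffixes with -e and removes others. The split between `REPLACE_WITH_E` and `REMOVE` is purely hypothetical (deciding it is the open question here), both sets are deliberately incomplete, and `stem_variant` is an invented name rather than anything in Snowball.

```python
VOWELS = set("aeiouyæåø")  # assumed Danish vowel set

# Hypothetical grouping, for illustration only.
REPLACE_WITH_E = {"e", "en", "er", "et", "ene", "erne"}
REMOVE = {"hed", "heden", "heder"}


def r1_start(word):
    """R1 begins after the first non-vowel following a vowel, at index 3 or later."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)


def stem_variant(word):
    """Longest suffix in R1 is either replaced with 'e' or removed outright."""
    start = r1_start(word)
    for suffix in sorted(REPLACE_WITH_E | REMOVE, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= start:
            base = word[: -len(suffix)]
            return base + "e" if suffix in REPLACE_WITH_E else base
    return word


for w in ["bøg", "bøger", "hav", "haven", "rute", "ruten", "ruterne"]:
    print(w, "->", stem_variant(w))
```

Under these assumptions bøger and bøg no longer share a stem, while rute, ruten and ruterne still do. Whether that holds up across a real vocabulary is what the comparison run would need to show.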

I can then amend the stemmer code and run a comparison of the output with and without the change (we have a script which analyses changes in how the stemmer groups words from a test vocabulary).
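
The actual comparison script isn't shown here, but the idea can be sketched roughly as follows: group a vocabulary by stem under each version, then report the words whose group changed. Everything in this block is an assumption for illustration, including the function names and the two toy stem tables (hard-coded from examples in this thread, with the "new" stems following the -e proposal).

```python
from collections import defaultdict


def group_by_stem(stem, vocabulary):
    """Map each stem to the set of vocabulary words that produce it."""
    groups = defaultdict(set)
    for word in vocabulary:
        groups[stem(word)].add(word)
    return groups


def changed_groups(old_stem, new_stem, vocabulary):
    """Return {word: (old_group, new_group)} for words whose group differs."""
    old = group_by_stem(old_stem, vocabulary)
    new = group_by_stem(new_stem, vocabulary)
    changes = {}
    for word in vocabulary:
        before = frozenset(old[old_stem(word)])
        after = frozenset(new[new_stem(word)])
        if before != after:
            changes[word] = (before, after)
    return changes


# Toy stand-ins for the two stemmer versions (hypothetical data).
OLD = {"bøg": "bøg", "bøger": "bøg", "rute": "rut", "ruten": "rut"}
NEW = {"bøg": "bøg", "bøger": "bøge", "rute": "rute", "ruten": "rute"}

changes = changed_groups(OLD.get, NEW.get, list(OLD))
for word, (before, after) in sorted(changes.items()):
    print(word, sorted(before), "->", sorted(after))
```

Comparing groups rather than stem strings means a change that merely renames a stem (rut to rute) is not reported, while a change that splits or merges groups (bøg and bøger) is.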
