Danish stemming - some words are wrong #203

Open
KeLeKaPe opened this issue Nov 27, 2024 · 1 comment

Comments

@KeLeKaPe

I am encountering a lot of issues with the Danish stemming.
Here are some examples of wrong stem results:

  • "bøger" -> "bøg": Incorrect, as "bøger" means "books," while "bøg" means "beech tree."
  • "talen" -> "tal": Incorrect, as "talen" means "the speech," while "tal" means "number."
  • "moden" -> "mod": Incorrect, as "moden" means "mature," while "mod" means "courage."
  • "modner" -> "modn": Incorrect, as "modner" means "ripens," while "modn" is not a valid word.
  • "slangen" -> "slang": Incorrect, as "slangen" means "the snake," while "slang" means "casual language."
  • "slanger" -> "slang": Incorrect, as "slanger" means "snakes," while "slang" means "casual language."
  • "vinder" -> "vind": Incorrect, as "vinder" means "winner" or "wins," while "vind" means "wind."
  • "ruten" -> "rut": Incorrect, as "ruten" means "the route," while "rut" is not a valid word.
  • "ruter" -> "rut": Incorrect, as "ruter" means "routes," while "rut" is not a valid word.
  • "haven" -> "hav": Incorrect, as "haven" means "the garden," while "hav" means "ocean."
  • "haver" -> "hav": Incorrect, as "haver" means "gardens," while "hav" means "ocean."

These are the examples I could find; there may be more.

@ojwb
Member

ojwb commented Dec 18, 2024

Thanks for raising this.

Note that we aren't aiming to reduce words to their linguistic root (if that's what you're after, Snowball probably isn't the solution to your problem).

Snowball aims to improve recall in text search. The stems produced aren't necessarily words: conceptually the stemmer maps a word to a stem, and the only thing that matters about the stems is that words that are "the same" tend to map to the same stem, and words that aren't don't. The latter part is actually more important: conflating unrelated words (like most of your cases above) is problematic because it results in false matches for searches, whereas failing to conflate is at least no worse than not using a stemmer at all. In practice it's easiest to design an algorithm where the stems either are or look a lot like the linguistic root in most cases, but that's explicitly not a design goal.

"modner" -> "modn": Incorrect, as "modner" means "ripens," while "modn" is not a valid word.
"ruten" -> "rut": Incorrect, as "ruten" means "the route," while "rut" is not a valid word.
"ruter" -> "rut": Incorrect, as "ruter" means "routes," while "rut" is not a valid word.

So these are actually OK (I don't know Danish, but other forms of the first case seem to include modne and modnet which also both stem to modn, while other forms of the second case seem to include rute and ruterne which also both stem to rut).

I mention this not to split hairs, but because it helps understand what the problem here actually is.
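
To make the distinction concrete, here is a minimal sketch that groups words by stem. The word-to-stem pairs are the ones reported in this thread; the identity entries for "hav" and "bøg" are my assumption (a three-letter word has nothing in R1 to strip), and `group_by_stem` is just an illustrative helper, not part of Snowball.

```python
from collections import defaultdict

# Stems as reported in this thread. The entries for "hav" and "bøg"
# are an assumption: a three-letter word has an empty R1, so nothing
# can be removed from it.
REPORTED_STEMS = {
    "modner": "modn", "modne": "modn", "modnet": "modn",
    "ruten": "rut", "ruter": "rut", "rute": "rut", "ruterne": "rut",
    "haven": "hav", "haver": "hav", "hav": "hav",
    "bøger": "bøg", "bøg": "bøg",
}

def group_by_stem(stems):
    """Collect the words that share each stem."""
    groups = defaultdict(set)
    for word, stem in stems.items():
        groups[stem].add(word)
    return dict(groups)

groups = group_by_stem(REPORTED_STEMS)

# Harmless: "modn" is not a word, but the group only holds related forms.
print(sorted(groups["modn"]))
# Problematic: "garden" forms share a group with "hav" (ocean).
print(sorted(groups["hav"]))
```

A search for "hav" would match documents containing "haven", which is the false-match case; the "modn" and "rut" groups cause no such trouble even though the stems aren't words.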

Anyway, I wonder if the best way to solve this might be to change the handling of -er, -en, and probably many of the other suffixes that start with e, so that they are replaced with -e instead of being removed completely. So e.g. bøger would then stem to bøge instead.

@KeLeKaPe Are you a Danish speaker? If so, could you suggest which of these suffixes that change should probably be applied to:

            'hed' 'ethed' 'ered' 'e' 'erede' 'ende' 'erende' 'ene' 'erne' 'ere'
            'en' 'heden' 'eren' 'er' 'heder' 'erer' 'heds' 'es' 'endes'
            'erendes' 'enes' 'ernes' 'eres' 'ens' 'hedens' 'erens' 'ers' 'ets'
            'erets' 'et' 'eret'

The longest matching suffix from this list gets removed, but only if it lies within region R1. R1 starts after the first consonant that follows a vowel, and never earlier than 3 characters from the start of the word, so you don't need to worry about the effect on very short words.
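
For anyone who wants to experiment, the rule can be sketched in Python roughly as follows. This is only an approximation of this one step, not the stemmer's actual code: it leaves out the separate handling of a final -s and all later steps. The `keep_e` parameter and the `HYPOTHETICAL_KEEP_E` set are my own illustration of the replace-with-e idea; which suffixes belong in that set is exactly the open question.

```python
# Vowels of the Danish alphabet as treated by the stemmer.
VOWELS = set("aeiouyæåø")

# The suffix list quoted above.
SUFFIXES = (
    "hed", "ethed", "ered", "e", "erede", "ende", "erende", "ene", "erne",
    "ere", "en", "heden", "eren", "er", "heder", "erer", "heds", "es",
    "endes", "erendes", "enes", "ernes", "eres", "ens", "hedens", "erens",
    "ers", "ets", "erets", "et", "eret",
)

def r1_start(word):
    """Index where R1 begins: just after the first non-vowel that
    follows a vowel, but never earlier than index 3."""
    start = len(word)  # R1 is empty if no vowel + non-vowel pair exists
    for i in range(1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            start = i + 1
            break
    return max(start, 3)

def strip_suffix(word, keep_e=frozenset()):
    """Remove the longest listed suffix that lies inside R1.

    Suffixes named in keep_e (hypothetical) are replaced with "e"
    instead of being removed outright.
    """
    start = r1_start(word)
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= start:
            base = word[: len(word) - len(suffix)]
            return base + "e" if suffix in keep_e else base
    return word

# Hypothetical choice of suffixes to replace with "e".
HYPOTHETICAL_KEEP_E = frozenset({"e", "en", "er", "et", "erne"})

print(strip_suffix("bøger"))                       # current behaviour
print(strip_suffix("bøger", HYPOTHETICAL_KEEP_E))  # with the proposed change
```

With the default arguments this reproduces the stems reported above (bøger to bøg, haven to hav, modner to modn); with the hypothetical set, rute, ruten, ruter and ruterne all become rute.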

We want to change this for all the cases like modn-er, modn-e, modn-et, rut-en, rut-er, rut-e, rut-erne, etc., but ideally without affecting any suffixes which should be completely removed. If there's a suffix on the list where the correct answer depends on the word, we may need to come up with a rule to decide which way to handle it, but it's probably best to just note any such issues you see rather than worry about how to solve them at this point.

I can then amend the stemmer code and run a comparison of the output with and without the change (we have a script which analyses changes in how the stemmer groups words from a test vocabulary).
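
The idea behind such a comparison can be sketched like this. It is not the project's actual script, just an illustration of comparing groupings rather than stems; `partition` and `diff_groupings` are hypothetical names, and the "after" stems are what the replace-with-e idea would give if it were applied to -er and -en.

```python
from collections import defaultdict

def partition(stems):
    """Turn a word -> stem mapping into a set of word groups."""
    groups = defaultdict(set)
    for word, stem in stems.items():
        groups[stem].add(word)
    return {frozenset(group) for group in groups.values()}

def diff_groupings(before, after):
    """Return (groups that disappeared, groups that appeared).

    Only the grouping matters: if every stem changes but the same
    words still end up together, nothing is reported.
    """
    old, new = partition(before), partition(after)
    removed = sorted(sorted(group) for group in old - new)
    added = sorted(sorted(group) for group in new - old)
    return removed, added

# Tiny hypothetical vocabulary, stemmed before and after the change.
before = {"bøger": "bøg", "bøg": "bøg", "ruten": "rut", "ruter": "rut"}
after = {"bøger": "bøge", "bøg": "bøg", "ruten": "rute", "ruter": "rute"}

removed, added = diff_groupings(before, after)
print("removed:", removed)
print("added:", added)
```

Here the ruten/ruter group is not reported even though its stem changed from rut to rute, because the words are still grouped together; only the split of bøger from bøg shows up.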
