Handle x plural forms for French #91

merwok · 2019-01-07T04:51:57Z

Hello! I hope this is the right place to report a problem I found with an app that uses PostgreSQL 10's full-text search.

There is a class of French nouns that form their plural in x: jeux, hiboux, choux, aulx, baux, etc.

Testing with PG and reading the doc at https://snowballstem.org/algorithms/french/stemmer.html make me think that these are not handled.


dscorbett · 2019-01-07T14:32:35Z

It’s hard to stem such words properly without overstemming words like roux and faux. Do you have any suggestion for how to do it?

merwok · 2019-01-07T14:56:55Z

Maybe this needs a dictionary solution rather than an algorithmic one!

Only seven -ou words have plurals with x: chou bijou joujou genou caillou hibou pou

For aulx / yeux / baux, the singular is ail / œil / bail, so these can’t be derived automatically either and should be in a list or dictionary.

For words ending in -eu, the plural takes x except for a couple of them.
There are a ton of adjectives ending in -eux (singular and plural) though, so I don’t know whether stemming -eux to -eu would be OK.
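A minimal sketch of the dictionary idea for the seven -oux plurals, in Python rather than in Snowball itself. The names `OUX_PLURALS`, `stem_with_exceptions`, and `identity_stem` are made up for illustration, and `identity_stem` only stands in for the real French stemmer:

```python
# Hypothetical exception table: the seven -ou nouns whose plural takes -x.
OUX_PLURALS = {
    "choux": "chou",
    "bijoux": "bijou",
    "joujoux": "joujou",
    "genoux": "genou",
    "cailloux": "caillou",
    "hiboux": "hibou",
    "poux": "pou",
}


def stem_with_exceptions(word, fallback_stem):
    """Return the listed singular for a known -oux plural,
    otherwise defer to the regular stemmer passed in as fallback_stem."""
    lowered = word.lower()
    if lowered in OUX_PLURALS:
        return OUX_PLURALS[lowered]
    return fallback_stem(lowered)


def identity_stem(word):
    # Stand-in for the real French stemmer, only here to keep the sketch runnable.
    return word
```

Because only listed words are rewritten, words like roux and faux pass through to the regular stemmer untouched.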

merwok · 2019-03-25T03:25:41Z

Hello! How can we move this forward?

ririsoft · 2019-08-29T15:59:25Z

Hello,

I have an issue with the adjective 'français': the masculine form does not get the same stem as the feminine forms. The stemming result using PostgreSQL 11 is the following:

'français' -> 'franc'
'française' -> 'français'
'françaises' -> 'français'

I would expect them all to be either 'franc' or 'français'.

I guess that a dictionary approach could help for cases like this one, as in the issue started here. Or would you prefer that I open a separate issue to track mine?

Note: on PostgreSQL I can solve the issue by creating my own synonym dictionary and changing the text search config to use this dictionary before the Snowball stemmer, but I believe it makes more sense to have this fixed on the Snowball side.

merwok · 2019-08-29T16:10:14Z

> I would expect them all to be either 'franc' or 'français'.

I think français is the stem.

> on PostgreSQL I can solve the issue by creating my own synonym dictionary and changing the text search config to use this dictionary before the Snowball stemmer

Oh so this means I could solve the issue with jeux, hiboux, etc?

ririsoft · 2019-08-29T18:45:09Z

> Oh so this means I could solve the issue with jeux, hiboux, etc?

Yes, you could create a synonym dictionary and have it used in the pipeline before french_stem. I recommend reading the PostgreSQL documentation on text search dictionaries for how to do that. The documentation is also available in French ;)

But since your issue and mine are so basic as far as French is concerned, I would much prefer to have them solved here.
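To illustrate the "synonym dictionary before the stemmer" idea, here is a rough Python model of a dictionary chain in which the first dictionary that recognizes a token wins. This is not PostgreSQL configuration (the real setup is done in SQL, as described in the PostgreSQL documentation), and `make_synonym_dictionary`, `fake_french_stem`, and `normalize` are hypothetical names:

```python
def make_synonym_dictionary(synonyms):
    """Build a lookup that returns a replacement, or None for unknown tokens."""
    def lookup(token):
        return synonyms.get(token)
    return lookup


def fake_french_stem(token):
    # Stand-in for the Snowball French stemmer: it only strips a trailing "s".
    return token[:-1] if token.endswith("s") else token


def normalize(token, dictionaries):
    """Try each dictionary in order; the first one that recognizes the token wins."""
    for lookup in dictionaries:
        result = lookup(token)
        if result is not None:
            return result
    return token


french_synonyms = make_synonym_dictionary({"jeux": "jeu", "hiboux": "hibou"})
pipeline = [french_synonyms, fake_french_stem]
```

With this chain, `normalize("hiboux", pipeline)` is resolved by the synonym table, while `normalize("ballons", pipeline)` falls through to the stand-in stemmer.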

ojwb · 2019-09-04T03:27:55Z

Please don't hijack a ticket with unrelated issues - open your own ticket if you have something to report.

merwok · 2020-06-30T17:13:07Z

ping! how can I help solve this?

ojwb · 2020-07-07T21:50:08Z

The purpose of these stemmers is for use with Information Retrieval ("text search" in less formal terms) - stemming words essentially allows conflating different forms of the "same" word, which usually improves recall (as in https://en.wikipedia.org/wiki/Precision_and_recall). The risk is it tends to reduce precision - that's somewhat inherent as different forms of a word can convey subtle differences in meaning, but the more problematic cases are where forms of different words get conflated.

For example, the original English Porter stemmer ("porter" in snowball) stemmed both "skies" and "skis" to "ski", so a search for "ski" would find a document which wasn't connected to skiing at all but actually about "skies" - that's much worse than if it didn't stem "skies" at all. (The snowball "english" stemmer addresses this and stems "skies" to "sky".)
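A toy illustration of that trade-off in Python. The three-document corpus and the two lookup tables are invented for this sketch; they only mimic the "skies"/"skis" behaviour described above and are not the real stemmers:

```python
def precision_and_recall(retrieved, relevant):
    """Compute precision and recall for one query from sets of document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy corpus: document id -> words it contains (made up for illustration).
documents = {1: ["ski"], 2: ["skis"], 3: ["skies"]}
relevant_to_skiing = {1, 2}


def no_stemming(word):
    return word


def porter_like(word):
    # Mimics the conflation described above: "skies" and "skis" both become "ski".
    return {"skis": "ski", "skies": "ski"}.get(word, word)


def english_like(word):
    # Mimics the fixed behaviour: "skies" becomes "sky" instead.
    return {"skis": "ski", "skies": "sky"}.get(word, word)


def search(query_stem, stem):
    """Return ids of documents containing a word whose stem equals query_stem."""
    return {doc_id for doc_id, words in documents.items()
            if any(stem(w) == query_stem for w in words)}
```

In this toy setup, searching for "ski" without stemming misses document 2 (lower recall), the Porter-like table also returns document 3 (lower precision), and the English-like table returns exactly the two relevant documents.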

A stemmer for a human language is probably inevitably going to be imperfect, but it's good to keep in mind the end goal is improving retrieval rather than reducing every word to its linguistic root.

So to "solve" this we need a concrete plan for how to stem plural nouns which end with "x" without adversely affecting other words which end with "x". It may be there just isn't a sensible way to do that, but then not stemming such words is a decent status quo - using a stemmer which doesn't stem such words is no worse than not using a stemmer at all.

merwok · 2020-07-08T01:21:55Z

The oux plurals and funky forms (yeux etc) are a small, fixed list. Is there a way in the current codebase to add that somewhere to do fixed replacements?

ojwb · 2020-07-09T03:41:37Z

There isn't currently such an exception list, though you obviously could add one.

But the first thing we need is to come up with a proposed list of such replacements, and carefully check it for potential problems - for example, "baux" -> "bail" seems problematic, as "baux" is also the plural of "bau" (https://en.wiktionary.org/wiki/baux#French).

Then we need a patch that implements those.

And also such exceptions should all be added to the test vocabulary if they aren't already in it, so that we have good test coverage for the change.

And finally the algorithm description on the website needs updating to match.
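The first and third steps above (vetting a candidate list, then checking test coverage) could be prototyped along these lines. This is only a sketch: the `candidates` table is a hypothetical starting point rather than an agreed list, and the vocabulary is passed in as a plain list of words instead of being read from the real test data:

```python
# Hypothetical candidate exceptions: plural -> set of possible singulars.
candidates = {
    "choux": {"chou"},
    "hiboux": {"hibou"},
    "yeux": {"œil"},
    "baux": {"bail", "bau"},  # ambiguous, as noted above
}


def split_candidates(candidates):
    """Separate unambiguous replacements from ones that need discussion."""
    safe = {plural: next(iter(singulars))
            for plural, singulars in candidates.items()
            if len(singulars) == 1}
    ambiguous = sorted(plural for plural, singulars in candidates.items()
                       if len(singulars) > 1)
    return safe, ambiguous


def missing_from_vocabulary(words, vocabulary):
    """List exception words that the test vocabulary does not cover yet."""
    return sorted(set(words) - set(vocabulary))
```

Running `split_candidates(candidates)` flags "baux" for discussion, and `missing_from_vocabulary` shows which of the remaining plurals would still need to be added to the test vocabulary.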

bkazez · 2020-11-02T06:55:21Z

mal/maux is also not handled. How can I help solve this?

ojwb · 2020-11-03T02:38:25Z

@bkazez I already outlined what's needed in a comment just above yours.

merwok · 2020-11-03T03:27:59Z

«We need a patch» doesn’t provide much guidance for people who don’t know the codebase 🙂

ojwb · 2020-11-04T00:11:21Z

I didn't just say "we need a patch" though.

The first thing we need is a workable plan for the change we're going to make. This needs to consider all the effects of the proposed change, to make sure we aren't making things worse for other cases (or if we are, that such unwanted consequences are definitely outweighed by the benefits).

The patch itself is literally an implementation detail.

But I should note that these algorithms are intended for use in text search systems, where stemming is a common way to improve recall. For the intended use, over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. (If you want to always reduce words to a root form then Snowball's stemming algorithms likely aren't the right answer as that isn't the purpose they are designed for.)

bkazez · 2020-11-04T00:29:11Z

I have never contributed to Snowball. I'm not a native French speaker, but I know some and am happy to start the exception list. Where do you store such a list for remote collaboration?

Also, should I start such a list for #139 too or is that one different?

merwok · 2020-11-04T16:31:26Z

Apologies, you did provide a list of steps to be taken.

PostgreSQL's full-text search system relies on this stemmer, and defining a custom dictionary or lexer is not possible in environments where you don’t have full control of the database server, so I still think this would be best fixed here. Again, I would limit the scope to a very small list of well-defined irregular plurals, to avoid overreaching and adding regressions. But I believe it is not controversial to want choux → chou just like we have ballons → ballon!

> There isn't currently such an exception list, though you obviously could add one.

I am not a C programmer, so I could try, but without any guarantee of success.

> But the first thing we need is to come up with a proposed list of such replacements, and carefully check it for potential problems - for example, "baux" -> "bail" seems problematic, as "baux" is also the plural of "bau" (en.wiktionary.org/wiki/baux#French).

This is such a rare word that a false positive does not worry me (but my use cases tend to want more results in preference to fewer but more exact matches). I will consult some dictionaries and propose a list!

> And also such exceptions should all be added to the test vocabulary if they aren't already in it, so that we have good test coverage for the change.

I wouldn’t want to change code without having tests!

> And finally the algorithm description on the website needs updating to match.

Sure. I found the file to be edited at https://github.com/snowballstem/snowball-website/blob/master/algorithms/french/stemmer.tt ; I’ll just need someone to validate the results, as I don’t have Java installed and the repo doesn’t seem to have CI building a preview.

ojwb · 2022-11-16T05:04:27Z

> I will consult some dictionaries and propose a list!

@merwok Did you get anywhere with this?

> This is such a rare word that a false positive does not worry me (but my use cases tend to want more results in preference to fewer but more exact matches).

Indeed - it may be fine for your case, but we do need to consider that a change will affect everyone using the stemmer, so we need to think about whether it's problematic for someone's situation.

My thought on this case is that it's better to just leave "baux" alone. That follows the general principle that overstemming is much more harmful than understemming, and would leave things as they are with the existing stemmer code (and as they would be if you didn't use stemming at all). It seems "baux" is going to be a fairly rare word whether it's the plural of "bau" or "bail".

merwok · 2022-11-16T15:50:41Z

I haven’t made a list yet, but am still interested in helping to improve this!

Agreed on leaving baux as is; people can search for bail then baux if they need it.

ojwb mentioned this issue Oct 15, 2024

German stemmer doesn't match schlummert/schlummern or grüßend/gegrüßt/grüßen #139

Open
