Handle x plural forms for French #91

Open
merwok opened this issue Jan 7, 2019 · 19 comments

@merwok

merwok commented Jan 7, 2019

Hello! I hope this is the right place to report a problem I found with an app that uses PostgreSQL 10's full-text search.

There is a class of French nouns that form their plural in x: jeux, hiboux, choux, aulx, baux, etc.

Testing with PG and reading the doc at https://snowballstem.org/algorithms/french/stemmer.html make me think that these are not handled.

@dscorbett
Contributor

It’s hard to stem such words properly without overstemming words like roux and faux. Do you have any suggestion for how to do it?

@merwok
Author

merwok commented Jan 7, 2019

Maybe this needs a dictionary solution rather than an algorithmic one!

Only seven -ou words have plurals with x: chou bijou joujou genou caillou hibou pou

For aulx / yeux / baux, the singular is ail / œil / bail, so these can’t be derived automatically either but should be in a list or dictionary.

For words ending in -eu, the plural is with x except for a couple of them.
There are a ton of adjectives ending in -eux (singular and plural) though, so I don’t know if stemming -eux to -eu would be ok or not.
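
A minimal sketch of the dictionary idea for the seven -oux plurals listed above, written in Python rather than Snowball. The names `OU_NOUNS_WITH_X_PLURAL`, `OUX_PLURALS` and `singularize_oux` are hypothetical, and this is not how the Snowball French stemmer currently behaves. It only shows that a fixed lookup leaves words like roux and faux untouched.

```python
# Hypothetical exception list: the seven -ou nouns whose plural takes an x.
OU_NOUNS_WITH_X_PLURAL = (
    "chou", "bijou", "joujou", "genou", "caillou", "hibou", "pou",
)

# Map each plural to its singular, e.g. "choux" -> "chou".
OUX_PLURALS = {noun + "x": noun for noun in OU_NOUNS_WITH_X_PLURAL}


def singularize_oux(word: str) -> str:
    """Return the singular of a listed -oux plural, else the word unchanged."""
    return OUX_PLURALS.get(word.lower(), word)
```

Because only listed words are rewritten, `singularize_oux("roux")` returns its input unchanged, while `singularize_oux("hiboux")` gives `"hibou"`.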

@merwok
Author

merwok commented Mar 25, 2019

Hello! How can we move this forward?

@ririsoft

ririsoft commented Aug 29, 2019

Hello,

I have an issue with the adjective 'français': the masculine singular form is not stemmed to the same result as the feminine forms. The stemming result using PostgreSQL 11 is the following:

'français' -> 'franc'
'française' -> 'français'
'françaises' -> 'français'

I would expect they are all 'franc' or 'français'.

I guess a dictionary approach could help in a case like this, as with the issue that started this thread. Or would you prefer that I open a separate issue to track mine?

Note: on PostgreSQL I can solve the issue by creating my own synonym dictionary and changing the text search config to use this dictionary before the Snowball stemmer, but I believe it makes more sense to have this fixed on the Snowball side.

@merwok
Author

merwok commented Aug 29, 2019

I would expect they are all 'franc' or 'français'.

I think français is the stem.

on PostgreSQL I can solve the issue by creating my own synonym dictionary and changing the text search config to use this dictionary before the Snowball stemmer

Oh so this means I could solve the issue with jeux, hiboux, etc?

@ririsoft

Oh so this means I could solve the issue with jeux, hiboux, etc?

Yes, you could create a synonym dictionary and have it used in the pipeline before french_stem. I recommend reading the PostgreSQL documentation on text search dictionaries for how to do that. The documentation is also available in French ;)

But since your issue and mine are so basic as far as French is concerned, I would much prefer to have them solved here.
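
The "synonym dictionary before the stemmer" workaround can be sketched in Python as a two-stage lookup. This is only an illustration of the ordering, not PostgreSQL configuration: `make_pipeline` and `toy_stem` are hypothetical names, and `toy_stem` is a crude stand-in that strips a final "s", not the Snowball French stemmer.

```python
from typing import Callable, Dict


def toy_stem(word: str) -> str:
    """Stand-in stemmer (assumption): drops a trailing "s" from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def make_pipeline(
    synonyms: Dict[str, str], stem: Callable[[str], str]
) -> Callable[[str], str]:
    """Build a normalizer that consults the synonym list before the stemmer."""

    def normalize(token: str) -> str:
        key = token.lower()
        # The first stage that recognises the token wins; the stemmer is
        # only reached when the synonym list has no entry for it.
        if key in synonyms:
            return synonyms[key]
        return stem(key)

    return normalize


# Hypothetical synonym entries for two of the plurals discussed in this thread.
normalize = make_pipeline({"jeux": "jeu", "hiboux": "hibou"}, toy_stem)
```

Since a synonym hit bypasses the stemmer in this sketch, each replacement should be whatever the stemmer would produce for the singular, so that singular and plural end up as the same term.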

@ojwb
Member

ojwb commented Sep 4, 2019

Please don't hijack a ticket with unrelated issues - open your own ticket if you have something to report.

@merwok
Author

merwok commented Jun 30, 2020

ping! how can I help solve this?

@ojwb
Member

ojwb commented Jul 7, 2020

The purpose of these stemmers is for use with Information Retrieval ("text search" in less formal terms) - stemming words essentially allows conflating different forms of the "same" word, which usually improves recall (as in https://en.wikipedia.org/wiki/Precision_and_recall). The risk is it tends to reduce precision - that's somewhat inherent as different forms of a word can convey subtle differences in meaning, but the more problematic cases are where forms of different words get conflated.

For example, the original English Porter stemmer ("porter" in snowball) stemmed both "skies" and "skis" to "ski", so a search for "ski" would find a document which wasn't connected to skiing at all but actually about "skies" - that's much worse than if it didn't stem "skies" at all. (The snowball "english" stemmer addresses this and stems "skies" to "sky".)

A stemmer for a human language is probably inevitably going to be imperfect, but it's good to keep in mind the end goal is improving retrieval rather than reducing every word to its linguistic root.

So to "solve" this we need a concrete plan for how to stem plural nouns which end with "x" without adversely affecting other words which end with "x". It may be there just isn't a sensible way to do that, but then not stemming such words is a decent status quo - using a stemmer which doesn't stem such words is no worse than not using a stemmer at all.
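
The "skies"/"skis" example can be made concrete with a toy search in Python. The two stemmers below are hypothetical lookup tables that mimic only the behaviour described above (the "ski" result for "skis" under the newer stemmer is an assumption); they are not the real "porter" or "english" algorithms.

```python
def porter_like(word: str) -> str:
    """Toy stand-in: conflates "skies" and "skis" into "ski"."""
    return {"skies": "ski", "skis": "ski"}.get(word, word)


def english_like(word: str) -> str:
    """Toy stand-in: keeps "skies" (-> "sky") apart from "skis" (-> "ski")."""
    return {"skies": "sky", "skis": "ski"}.get(word, word)


def search(query, docs, stem):
    """Return names of documents containing a word with the query's stem."""
    wanted = stem(query)
    return [
        name
        for name, text in docs.items()
        if wanted in {stem(word) for word in text.split()}
    ]


DOCS = {
    "weather": "clear skies tonight",
    "sport": "new skis for winter",
}
```

With `porter_like`, a search for "ski" also returns the "weather" document, which is the loss of precision described above. With `english_like`, only "sport" matches.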

@merwok
Author

merwok commented Jul 8, 2020

The oux plurals and funky forms (yeux etc) are a small, fixed list. Is there a way in the current codebase to add that somewhere to do fixed replacements?

@ojwb
Member

ojwb commented Jul 9, 2020

There isn't currently such an exception list, though you obviously could add one.

But the first thing we need is to come up with a proposed list of such replacements, and carefully check it for potential problems - for example, "baux" -> "bail" seems problematic, as "baux" is also the plural of "bau" (https://en.wiktionary.org/wiki/baux#French).

Then we need a patch that implements those.

And also such exceptions should all be added to the test vocabulary if they aren't already in it, so that we have good test coverage for the change.

And finally the algorithm description on the website needs updating to match.
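
As a rough illustration of the checking step, a proposed list could be screened in Python before any stemmer code is written. The candidate pairs are taken from this thread; `CANDIDATES`, `ambiguous_plurals` and `missing_from_vocabulary` are hypothetical names, and the vocabulary argument stands in for the project's test vocabulary, whose format is not assumed here.

```python
from collections import defaultdict

# Hypothetical candidate list of (plural, singular) pairs from this thread.
CANDIDATES = [
    ("choux", "chou"),
    ("hiboux", "hibou"),
    ("yeux", "œil"),
    ("aulx", "ail"),
    ("baux", "bail"),
    ("baux", "bau"),
]


def ambiguous_plurals(pairs):
    """Return plurals that map to more than one singular, e.g. "baux"."""
    by_plural = defaultdict(set)
    for plural, singular in pairs:
        by_plural[plural].add(singular)
    return {p: sorted(s) for p, s in by_plural.items() if len(s) > 1}


def missing_from_vocabulary(pairs, vocabulary):
    """Return candidate words that are not yet in the given word list."""
    words = {word for pair in pairs for word in pair}
    return sorted(words - set(vocabulary))
```

Here `ambiguous_plurals(CANDIDATES)` flags "baux" because it has two possible singulars, which is the kind of entry that needs a decision before it goes into an exception list.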

@bkazez

bkazez commented Nov 2, 2020

mal/maux is also not handled. How can I help solve this?

@ojwb
Member

ojwb commented Nov 3, 2020

@bkazez I already outlined what's needed in a comment just above yours.

@merwok
Author

merwok commented Nov 3, 2020

«We need a patch» doesn’t provide much guidance for people who don’t know the codebase 🙂

@ojwb
Member

ojwb commented Nov 4, 2020

I didn't just say "we need a patch" though.

The first thing we need is a workable plan for the change we're going to make. This needs to consider all the effects of the proposed change, to make sure we aren't making things worse for other cases (or if we are, that such unwanted consequences are definitely outweighed by the benefits).

The patch itself is literally an implementation detail.

But I should note that these algorithms are intended for use in text search systems, where stemming is a common way to improve recall. For the intended use, over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. (If you want to always reduce words to a root form then Snowball's stemming algorithms likely aren't the right answer as that isn't the purpose they are designed for.)

@bkazez

bkazez commented Nov 4, 2020

I have never contributed to Snowball. I'm not a native French speaker, but I know some and am happy to start the exception list. Where do you store such a list for remote collaboration?

Also, should I start such a list for #139 too or is that one different?

@merwok
Author

merwok commented Nov 4, 2020

Apologies, you did provide a list of steps to be taken.

PostgreSQL's full-text search system relies on this stemmer, and defining a custom dictionary or lexer is not possible in environments where you don’t have full control of the database server, so I still think this would be best fixed here. Again, I would limit the scope to a very small list of well-defined irregular plurals, to avoid overreaching and adding regressions. But I believe it is not controversial to want choux → chou just like we have ballons → ballon!

There isn't currently such an exception list, though you obviously could add one.

I am not a C programmer so I could try but without guarantee of success.

But the first thing we need is to come up with a proposed list of such replacements, and carefully check it for potential problems - for example, "baux" -> "bail" seems problematic, as "baux" is also the plural of "bau" (en.wiktionary.org/wiki/baux#French).

This is such a rare word that a false positive does not worry me (but my use cases tend to want more results in preference to fewer but more exact matches). I will consult some dictionaries and propose a list!

And also such exceptions should all be added to the test vocabulary if they aren't already in it, so that we have good test coverage for the change.

I wouldn’t want to change code without having tests!

And finally the algorithm description on the website needs updating to match.

Sure. I found the file to be edited at https://github.com/snowballstem/snowball-website/blob/master/algorithms/french/stemmer.tt ; I’ll just need someone to validate results, as I don’t have java installed and the repo doesn’t seem to have CI building a preview.

@ojwb
Member

ojwb commented Nov 16, 2022

I will consult some dictionaries and propose a list!

@merwok Did you get anywhere with this?

This is such a rare word that a false positive does not worry me (but my use cases tend to want more results in preference to fewer but more exact matches).

Indeed - it may be fine for your case, but we do need to consider that a change will affect everyone using the stemmer, so we need to think about whether it's problematic for someone's situation.

My thinking on this case is that it's better to just leave "baux" alone. That follows the general principle that overstemming is much more harmful than understemming, and would leave things as they are with the existing stemmer code (and as they would be if you didn't use stemming at all). It seems "baux" is going to be a fairly rare word whether it's the plural of "bau" or "bail".

@merwok
Author

merwok commented Nov 16, 2022

I haven’t made a list yet, but am still interested in helping to improve this!

Agreed on leaving baux as is; people can search for bail then baux if they need it.
