Search user does not work with some specific Vietnamese letters #13655
Comments
Can you please confirm what database you're using? If postgres, what is the synapse database's locale and encoding?
Hi, we reported the problem against matrix.org because the behaviour there is the same, but we originally encountered it on our own server, which uses a postgres DB.
It's worth noting that the
synapse/synapse/storage/databases/main/user_directory.py Lines 910 to 924 in 888a29f
>>> from synapse.storage.databases.main.user_directory import _parse_query_postgres
>>> _parse_query_postgres("Gá")
('(Ga:* | Ga)', 'Ga', 'Ga:*')
>>> _parse_query_postgres("Gáo")
('(Ga:* | Ga) & (o:* | o)', 'Ga & o', 'Ga:* & o:*')
Line 918 probably needs to accept There are probably other places in the code where
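The split into `Ga` and `o` shown above is what Python's `\w` produces when the accent is a separate combining character, because combining marks are not word characters. The following is a minimal sketch of that behaviour, not the Synapse code itself. It assumes the query arrives in decomposed (NFD) form, which the thread does not confirm, and it uses the `[\w\-]+` pattern that the comparison script in this thread tests. `split_words` is a hypothetical helper name.

```python
import re
import unicodedata
from typing import List

# Assumption: the same character class the comparison script in this thread tests.
WORD_RE = re.compile(r"[\w\-]+")


def split_words(text: str) -> List[str]:
    """Return the runs of word characters and hyphens found in text."""
    return WORD_RE.findall(text)


# "á" as "a" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
decomposed = unicodedata.normalize("NFD", "Gáo")
# The same text with "á" as the single code point U+00E1 (composed form).
composed = unicodedata.normalize("NFC", decomposed)

print(split_words(decomposed))  # ['Ga', 'o']
print(split_words(composed))    # ['Gáo']
```

Composing the input before splitting keeps this particular word intact, but it only helps for letters that have a precomposed code point, so it is not a general fix for word boundary detection.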
We have a similar problem with all Russian characters too: case-insensitive search does not work. Here is the issue about this: #3116. I guess the solution could be the same.
To expand on this: the quick and dirty proposal is to use something like This will fix exact matches not working, but will not resolve #1523, where Note that this still performs poorly. There are languages whose words consist of a variable number of
Test code to compare re, regex and icu
#!/usr/bin/env python3
import re

import regex  # third-party `regex` package
import icu  # third-party PyICU bindings

test_cases = [
    "It's a nice day outside.",
    "Received foo.png!",
    "Gáo",
    "C++20",
    "3.14159. 3.",
    "あなたはそれを行うべきではありません",
]

for text in test_cases:
    # Runs of word characters and hyphens.
    re1_output = re.findall(r"([\w\-]+)", text, re.UNICODE)
    # Lazy match between word boundaries, using the standard library.
    re2_output = re.findall(r"\b\w.*?\b", text, re.UNICODE)
    # Same pattern, using the regex module's WORD flag for word boundaries.
    regex_output = regex.findall(r"\b\w.*?\b", text, regex.WORD)

    # ICU word segmentation: walk the boundaries and slice out each segment.
    icu_output = []
    breaker = icu.BreakIterator.createWordInstance(icu.Locale.getDefault())
    breaker.setText(text)
    i = 0
    while True:
        j = breaker.nextBoundary()
        if j < 0:
            # A negative value means there are no more boundaries.
            break
        icu_output.append(text[i:j])
        i = j

    print(f"Text: {text!r}")
    print(f" re.findall(r\"([\\w\\-]+)\"): {re1_output!r}")
    print(f" re.findall(r\"\\b\\w.*?\\b\"): {re2_output!r}")
    print(f" regex.findall(r\"\\b\\w.*?\\b\"): {regex_output!r}")
    print(f" icu: {icu_output!r}")
This issue only concerns word boundaries, and not any sort of normalization, stemming or case/accent folding for searching. And we will have to ensure that the words in the postgres index are the same as the words we search for if/when we change the logic. @reivilibre's view is that it would be best if we can find a way to have postgres or some library handle all this for us.
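The Japanese test case in the script above illustrates why a `\w`-based pattern cannot find word boundaries in text written without spaces. The sketch below uses only the standard library and the same `[\w\-]+` pattern. It is an illustration of the limitation, not a proposed fix.

```python
import re

# Same character class as the first pattern in the comparison script.
WORD_RE = re.compile(r"[\w\-]+")

text = "あなたはそれを行うべきではありません"
tokens = WORD_RE.findall(text)

# Every character is a word character and there are no spaces,
# so the whole sentence comes back as a single "word".
print(len(tokens))  # 1
```

A search for a word from the middle of that sentence would therefore have nothing to match against, which is the kind of case a dictionary-based segmenter such as ICU's word break iterator is meant to handle.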
Or even some external full-text search database. Lucene or something that uses it?
We have tried this approach, but unfortunately it didn't help.
Hey, just wondering, what would the timeline be for integrating such a library or some external full-text search database?
A customer tried the fix proposed by @squahtx, but without success.
Fixes #13655. This change uses ICU (International Components for Unicode) to improve word boundary detection in user search. It also adds new dependencies on libicu-dev and pkg-config for the Debian packages; both are available in all supported distros.
Description
If you search for a user whose name contains accented letters (like "á"), the suggestions become empty as soon as you type the letter that follows the accented character.
See also the attached video demonstrating the problem.
Steps to reproduce
Please refer to the attached video.
Homeserver
matrix.org
Synapse Version
{"server_version":"1.66.0rc1 (b=matrix-org-hotfixes,ce8f7d118c)","python_version":"3.8.12"}
Installation Method
No response
Platform
app.element.io as webclient, matrix.org as homeserver.
2022-08-29.13-42-55.mp4
Relevant log output
Anything else that would be useful to know?
No response