[Enhancement]: Diacritic-insensitive search #2678
Related to (currently closed) #2187 |
In #2187 I mention a sqlite extension we could look at using to support case insensitive queries for non-ASCII |
OK, so here's roughly the solution I was thinking of:
@advplyr does this seem like a reasonable approach? I don't think there are many normalized_x columns we'll need to create. This approach will of course work with any kind of normalization which we don't currently support (e.g. lowercasing non-ASCII capitals, removing Hebrew and Arabic diacritic marks, etc...) |
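As an illustration, a minimal sketch of the extra-column approach (books, title, and normalizedTitle are hypothetical names, not the actual ABS schema):

```sql
-- The application writes a lowercased, accent-stripped copy alongside the original value
ALTER TABLE books ADD COLUMN normalizedTitle TEXT;

-- 'Černá lilie' would be stored as 'cerna lilie', so an unaccented query matches
SELECT * FROM books WHERE normalizedTitle LIKE '%cerna%';
```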
I think this approach would be difficult to scale if we wanted to expand what is currently searchable, since we would need a normalized_x column for each searchable field. Genres and tags are currently JSON arrays on each book, so normalized_x would need to be a JSON array for those. I think we should exhaust the possibility of doing this with sqlite first. If we have to do it this way we might be able to get away with only 1 column per table with a |
According to my research, it seems like the only sane way to do this without additional columns is by adding a UDF that implements remove_diacritics(), and calling that UDF at runtime, where the condition looks like:
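A condition along these lines would do it (a sketch; remove_diacritics() stands for the hypothetical UDF, and books/title are placeholder names):

```sql
-- The UDF is applied to both sides of the comparison at query time
SELECT * FROM books
WHERE remove_diacritics(title) LIKE remove_diacritics('%černá%');
```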
I'm worried about the performance implications of this approach, but we can try this. |
Unfortunately, sqlite3 on node.js does not seem to support UDFs :( |
It looks like the only option is to build the sqlite3 binaries ourselves. TryGhost/node-sqlite3#70 (comment) |
OK, so after a few hours of struggle, I was able to load the sqlean unicode extension into the underlying sqlite3 db. So it looks like we won't need to build sqlite3 binaries, although we would need to deploy the unicode extension binaries (which are very small, ~100-150kB). Now I'm progressing with modifying all the relevant queries. Stay tuned... |
Reverted in v2.12.3 |
I wonder at this point if it wouldn't be easier to just use the official ICU extension. I don't have all the context, but it seems to me that sqlite3 officially supports icu, so it looks safer to me to rely on this one. Additionally, the README.md of the unicode extension of sqlean mentions the following, which isn't particularly reassuring:
What do you think? Edit: It looks like building ICU degrades performance in general:
|
I haven't tried, but looking at it, it doesn't seem to come with accent-normalization functions (unaccent), just unicode aware case functions (lower, upper). We need both.
Yeah. Amusingly enough, it was deprecated a couple of weeks ago, right after I integrated it in my PR... |
I'm not sure how worthwhile it is but just supporting unicode lowercase would be helpful to some. A bug report was put in just for that #2187 |
I'm not an expert in any way regarding these topics, but apparently MariaDB has out-of-the-box support for this. About SQLite ICU, I don't know which collation algorithm it implements exactly, any ideas? I also can't find what UCA the Sqlean unicode plugin implements... I don't know if this makes it worth it to migrate from SQLite to MariaDB. Among other things I guess it depends on whether we use SQLite-specific features? Edit: nunicode looks like a very interesting alternative to Sqlean's unicode plugin. |
Replacing the underlying db just to support unaccent seems like a bit of overkill to me.
Yes, for the record (after some digging), it looks like using SQLite ICU + the ICU extension can work as well (at least in theory). This does seem to require both building our own icu-enabled version of SQLite and building the icu extension, for all supported platforms and architectures, plus deploying them. I am quite reluctant to go that way.
I haven't looked into this in detail, but yes, it does seem to provide unaccent. However, I must say I'm not sure why the unicode extension deprecation seems to give you so much grief. I'm sure that at some point its functionality will be fully supported by the text extension, and in the meantime, it provides exactly the functionality we need at almost no cost (performance, size, loading). The issues caused by its introduction were due to my own failings, not the extension itself, and those can (hopefully) be fixed. |
Update: I explicitly asked for unaccent support in sqlean/text, and the maintainer, for some reason, says that he does not plan to support it. So, at this point I'm dropping my plans to re-introduce sqlean/unicode as is. I'm going to either look into the nunicode alternative @devnoname120 suggested, or fork the unicode extension and maintain it myself. This is all going to take a while, so unfortunately the fix is not likely to happen in the very near future. |
I haven't looked into this yet but I know that most of the other media servers use Sqlite so I wonder how they are handling this. |
Are there any open source ones where I can look at the code?
Jellyfin is open source and uses sqlite. They use 2 db files, and I just looked at one of them; there's a table in it that looks like it stores everything that would be searchable. The movies and TV series I scanned in created about 125k rows in that table. |
So it seems they use your first suggestion of an additional column but they only need 1 column because all the values are in 1 table. |
OK, I'll look into this direction as well. I must admit, though, that extensions seem like a better idea generally, due to performance considerations, and also because everything is done at the database level. I think we'll also need additional extensions in the future (e.g. to support natural sorting), so the idea of depending on db extensions has merit IMO, and is superior to implementing stuff with application logic. If you look at the unicode extension code, it is actually quite simple. It is a single file containing big tables for case and accent folding (which we should never touch), and a quite thin layer of logic that uses those tables to extend/replace existing functionality. There's little chance that any of this would ever need to change (unless there are breaking changes in SQLite itself), barring bug fixes. |
Thanks, I agree. It would be much better if we can handle this with sqlite |
I looked at the plex sqlite db and they are doing something different. I can't open the db directly. A quick search led me to https://forums.plex.tv/t/can-no-longer-update-library-database-with-sqlite3/701405/3
The discussion indicates that Plex uses the fts4 extension for implementing full text search. |
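For reference, a sketch of how FTS4's unicode61 tokenizer can strip diacritics at indexing time (the table and columns below are made up, not Plex's actual schema):

```sql
-- remove_diacritics=2 requires SQLite 3.27+; older versions only support =1
CREATE VIRTUAL TABLE search_index USING fts4(
  title, author,
  tokenize=unicode61 "remove_diacritics=2"
);

INSERT INTO search_index (title, author) VALUES ('Černá lilie', 'Some Author');

-- Matches even though the query has no diacritics
SELECT * FROM search_index WHERE search_index MATCH 'cerna';
```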
I don't think so. It is useful to find out how everyone else is solving this with sqlite. I'm sure there are others I can look into later. |
As far as I understand it's an either, not an and: you can either compile SQLite with ICU enabled, or build the ICU extension as a loadable module.
For other operating systems, you can use https://www.sqlite.org/loadext.html#compiling_a_loadable_extension as an inspiration (just need to add the icu lib in the arguments).
Handling diacritics is exquisitely tricky to get right. There is a staggering myriad of exceptions and edge cases that need to be taken into account. Reading the official Unicode Collation Algorithm (UCA) is enough to convince oneself of that. Although it handles ordering rather than just equality, it's 66 pages long and it keeps evolving. This official algorithm also requires the Default Unicode Collation Element Table (DUCET), which is even bigger. IMHO we would be better off using the reference implementation (https://github.com/unicode-org/icu) from the Unicode group. What do you think? |
Yeah, you're probably right. I probably misread some instructions on stackoverflow. You still need to also deploy/install icu libs for this to work.
I'm guessing there are good reasons why people are providing these alternate implementations. As you say, icu collations are extremely complex, which likely leads to performance issues, and many are looking for good-enough implementations that are standalone, less complex, and probably don't handle all the small details and all the many locales that the reference implementation has to deal with. I think that for ABS, it would be ok to use one of these "good-enough" implementations. I've checked the unicode extension against many accented books and authors in my library, and they all work fine. I'll try it though, and if I see it's not complex to use and deploy, and doesn't have glaring performance issues, I'll use that. |
@mikiher It looks like Node.js is shipped with libicu by default, and starting from Node.js 13 it comes with the full ICU data as well (prior to that it was only English by default). I'm not sure that it would be a great idea but if you can't find libicu binaries for all platforms and you don't want to build it you may be able to piggyback on the one bundled with Node.js (or the one installed by the package manager if it's listed as a dependency of Node.js). I don't know the possible implications of updating libicu or the ICU data on the collation so implicitly relying on the not-pinned version of Node.js may not be a wise idea. For example, do the SQLite tables have to be reindexed after every ICU update? Either way, if a package manager is available on the system it's more likely than not to support installing libicu — it's needed everywhere. For example our Docker containers currently rely on For SQLite's Note that |
At least on Windows (and I think on other platforms by default), I believe icu is statically linked to the node executable, so I don't think it can be used by the icu extension.
Probably not, but if you use indices that depend on icu collations (as we painfully learned in our case), you may need to reindex those.
Agreed, I don't see installing icu dependencies as a big issue (except maybe for Windows, but there we have an installer which can take care of that).
I'm not sure how exactly to use node-gyp for our purposes. IIUC, its purpose is building native addon modules for node. I don't need to build a native addon - I need to build a shared library, which sqlite3 loads from inside its C implementation. In any case, I think it's important to remember that this is a node project. I don't want to require developers to have C/C++ toolchains on their dev machines. If dependencies need to be built, we need to do it on github workflows or a similar infrastructure, and make them available as release assets, or just check in the pre-built extension binaries. |
OK, so here's an update on my experiments.
The other thing I noted about icu is that the full data for icu (libicudata) is ~30MB, and the size of the required icu shared objects (libicuuc and libicui18n) is an additional ~6MB. Even if we were able to make it work with icu dependencies, we would pay a hefty price in terms of size, compared to the ~100KB that the unicode extension weighs. And I haven't tested query performance at all yet. |
@mikiher Thanks for the tests. With regard to the
Note: I was able to reproduce your issue, and here is what I did for my future reference:
brew install sqlite3 icu4c
export PKG_CONFIG_PATH="/opt/homebrew/opt/icu4c/lib/pkgconfig:/opt/homebrew/opt/sqlite/lib/pkgconfig"
# Note: I first tried -o libSqliteIcu.dylib but it broke .load with this error:
# Error: dlsym(0x83836050, sqlite3_sqliteicu_init): symbol not found
gcc -fPIC -dynamiclib icu.c `pkg-config --libs --cflags icu-uc icu-io sqlite3` -o icu.dylib
export PATH="/opt/homebrew/opt/sqlite/bin:$PATH"
❯ sqlite3
SQLite version 3.46.1 2024-08-13 09:16:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load icu.dylib
sqlite> SELECT icu_load_collation('root', 'aici', 'PRIMARY');
sqlite> SELECT 'Árbol' = 'arbol' collate aici AS result;
1
sqlite> SELECT 'Árbol' LIKE '%arb%' collate aici AS result;
0
I'm not sure if we are doing this right in order to make LIKE use the collation. |
We are doing it right. The LIKE implementation is just hard-coded to do what it's written to do (Unicode-aware case-insensitive matching), and does not respect the specified collation. You can look at the code - it's relatively straightforward. The SQL syntax is correct, otherwise there would be a syntax error; the COLLATE part is just ignored. If you find another suitable SQL expression to make the collation work with substring matching, please let me know. |
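For completeness, the kind of expression that does work for substring matching normalizes both operands instead of relying on COLLATE; a sketch, assuming an unaccent()-style scalar function has been loaded (it is not built into SQLite):

```sql
SELECT 'Árbol' LIKE '%arb%';                     -- 0: built-in LIKE ignores the collation
SELECT unaccent('Árbol') LIKE unaccent('%arb%'); -- 1, given a suitable unaccent() function
```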
@mikiher Ah indeed. I took a look at the current Docker image. I'm curious to see how either fares in terms of performance. I would guess not too far apart if we handle memory (re)allocations properly. An accent-/case-insensitive INDEX is what we would need to have actually good performance IMO. But since the tables usually have (I guess?) just a few thousand rows, it shouldn't be that bad. As an exercise I made a quick implementation of an ICU-based unaccent() SQLite extension.
The code (this is an old version, I'm almost done cleaning it up):
#include <sqlite3ext.h>
#include <unicode/ustring.h>
#include <unicode/unorm2.h>
#include <unicode/uchar.h>
#include <unicode/utypes.h>
#include <stdlib.h>
#include <string.h>
SQLITE_EXTENSION_INIT1
void unaccent(const UChar *input, UChar **output, UErrorCode *status) {
const UNormalizer2 *normalizer = unorm2_getNFDInstance(status);
if (U_FAILURE(*status)) {
return;
}
int32_t inputLength = u_strlen(input);
/* Maximum output of NFD transformation is a factor of 4.
** See: https://unicode.org/faq/normalization.html#12
*/
int32_t maxNormalizedLength = 4 * inputLength;
UChar *normalized = (UChar *)malloc(maxNormalizedLength * sizeof(UChar));
if (!normalized) {
*status = U_MEMORY_ALLOCATION_ERROR;
return;
}
int32_t normalizedLength = unorm2_normalize(normalizer, input, -1, normalized, maxNormalizedLength, status);
if (U_FAILURE(*status)) {
free(normalized);
return;
}
*output = (UChar *)malloc((normalizedLength + 1) * sizeof(UChar));
if (!(*output)) {
*status = U_MEMORY_ALLOCATION_ERROR;
free(normalized);
return;
}
int32_t resultIndex = 0;
for (int32_t i = 0; i < normalizedLength;) {
UChar32 c;
U16_NEXT(normalized, i, normalizedLength, c);
/* See https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a6a2dbc531efce8d77fdb4c314e7fc25e */
if (u_charType(c) != U_COMBINING_SPACING_MARK && u_charType(c) != U_NON_SPACING_MARK && u_charType(c) != U_ENCLOSING_MARK) {
U16_APPEND(*output, resultIndex, normalizedLength, c, *status);
}
}
(*output)[resultIndex] = 0;
free(normalized);
}
static void unaccent_sqlite(sqlite3_context *context, int argc, sqlite3_value **argv) {
if (argc != 1) {
sqlite3_result_error(context, "Invalid number of arguments", -1);
return;
}
const char *input = (const char *)sqlite3_value_text(argv[0]);
if (input == NULL) {
sqlite3_result_null(context);
return;
}
/* FIXME: do a dynamic allocation like the for the output */
UChar u_input[256];
UErrorCode status = U_ZERO_ERROR;
u_strFromUTF8(u_input, sizeof(u_input)/sizeof(u_input[0]), NULL, input, -1, &status);
if (U_FAILURE(status)) {
sqlite3_result_error(context, u_errorName(status), -1);
return;
}
UChar *u_output = NULL;
unaccent(u_input, &u_output, &status);
if (U_FAILURE(status)) {
sqlite3_result_error(context, u_errorName(status), -1);
if (u_output) free(u_output);
return;
}
char *output = (char *)malloc((u_strlen(u_output) * 4 + 1) * sizeof(char));
if (!output) {
sqlite3_result_error(context, "Memory allocation error", -1);
free(u_output);
return;
}
u_strToUTF8(output, u_strlen(u_output) * 4 + 1, NULL, u_output, -1, &status);
if (U_FAILURE(status)) {
sqlite3_result_error(context, u_errorName(status), -1);
free(output);
free(u_output);
return;
}
sqlite3_result_text(context, output, -1, SQLITE_TRANSIENT);
free(output);
free(u_output);
}
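/*
** Note: when loading with ".load" / load_extension() without an explicit
** entry point, SQLite derives the expected symbol from the file name
** (sqlite3_<basename>_init). So this extension should be shipped as
** unaccent.so / unaccent.dylib / unaccent.dll, or loaded with the entry
** point name passed explicitly.
*/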
#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_unaccent_init(
sqlite3 *db,
char **pzErrMsg,
const sqlite3_api_routines *pApi
){
SQLITE_EXTENSION_INIT2(pApi);
return sqlite3_create_function(db, "unaccent", 1, SQLITE_UTF8, NULL, unaccent_sqlite, NULL, NULL);
}
I could definitely integrate it in |
This is very nice, and (at a glance) seems to do what's needed (NFD normalization and removal of all accent characters). At this point, since you're now implementing your own full-fledged extension, if you wish to continue with this effort, I'm happy to leave the rest to you. From my perspective, I'd be happy to use this code if it's well tested, and if it doesn't degrade search performance significantly. I think it would also be a valuable extension for the SQLite user community in general. I think this should be put in a separate project, and preferably provide pre-built binaries for the top platforms, like SQLean does - as I said above, I'd prefer not to require ABS developers to maintain C toolchains. @advplyr, since you will need to eventually approve integrating this, it would be good to get early feedback on this.
I think we have some users with tens of thousands of books in their library. Regarding an accent and case-insensitive INDEX - yes, I was intending to replace the existing NOCASE index with an aici index. As for how much it can improve performance (especially with LIKE comparisons), I really don't know - it needs to be tested. |
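To make that concrete, a sketch of the index swap (assuming an accent- and case-insensitive collation named aici has been registered on the connection, e.g. via icu_load_collation as shown earlier; table and index names are illustrative):

```sql
DROP INDEX IF EXISTS books_title_nocase;
CREATE INDEX books_title_aici ON books(title COLLATE aici);

-- Equality and ORDER BY comparisons using the same collation can use this index...
SELECT * FROM books WHERE title = 'cerna lilie' COLLATE aici;
-- ...but LIKE still won't, as discussed above.
```

One caveat: the collation has to be registered on every connection before such an index can be created or used, and if the collation's behavior ever changes, the affected indexes need to be rebuilt (REINDEX).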
@mikiher Thank you for your thoughtful comment. After digging into alekseyt/nunicode I'm convinced that it's an amazing option and I'd advocate for going in that direction. I think it's much better/more performant/more resilient than anything I could do building my own SQLite extension using ICU. Upsides:
Downsides:
Now for the tests:
@mikiher Let me know your thoughts! |
Hi, sorry for the long time it took - I had many other things on my plate. I wanted to give an update on my experiments with nunicode. The first thing I needed to do was to build it. The project currently only provides pre-built binaries for Windows x86 32-bit and Linux x86 32-bit and 64-bit. So:
The experiments are at https://github.com/mikiher/nunicode-sqlite. I tested the built binaries:
I ran the same rudimentary tests as you did above with the sqlite3 CLI. The Windows and Ubuntu tests seem to work nicely, but the arm64 test has some issues, which I still need to figure out: it loads the library successfully and runs some of the test queries successfully, but fails on the following:
I have no idea why it fails - you're welcome to take a look at the workflow and if you have any suggestions, I'll be glad to hear. |
@mikiher Thank you for your tests. Weirdly enough the tests that I performed were done on arm64… Specifically on macOS where both M1 and M2 worked fine. I didn't (need to) cross-compile though. Which version of sqlite3 do you use? Which options are enabled? I just tested your example and here is what I see on my end:
I did this in order to compile the sqlite extension:
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make |
It doesn't? So it doesn't matter which compiler I used for cross-compilation?
This is roughly what I did as well. You can see all of the actions I ran in the github workflow. As I said, there were a few changes I had to make:
I don't think any of these changes really matter, with (maybe) the exception of the choice of cross compiler - I used |
@mikiher Hmm, it seems that the issue could be related to musl vs glibc. Tangentially, I think that it would make sense at some point to reach out to @alekseyt directly ([email protected]). The README of nunicode provides this email address and explicitly encourages people to contact him. |
Thanks, I will also try to cross compile for musl and see if that helps. The cross compilation is kind of important since I need a way to automatically build the required libraries for each architecture, and there are no github-hosted Alpine aarch64 runners. |
So it turned out all of the issues were due to the crappy QEMU console, which seems to mess up non-ASCII input :( Once I connected to the QEMU emulator using ssh (which has proper UTF-8 support), all the issues went away.
They all pass the simple tests above. I think I'll stick with the original version I built because it's easiest to set up the cross-compiler to build it. Next I'm going to work on integrating the extensions in ABS and testing. |
@mikiher For Docker I'd suggest compiling nunicode directly for each architecture from an Alpine container as part of a multi-stage build, and not relying on the pre-built release binaries. See here for an example of a multi-platform build; Docker makes it surprisingly easy, and you don't have to bother with QEMU yourself. And the doc for multi-stage builds, in case you are not familiar with them. I can cobble up a PR if it helps! |
But I need to build binaries and make them available anyway for non-docker servers. Why not use those artifacts in the Audiobookshelf Dockerfile, since they're already available? In the future we can consider providing nunicode as an apk (I admit the additional code in the Dockerfile is somewhat ugly). |
As far as I know Alpine is the only non-marginal Linux distribution that uses musl by default. It does so because it's designed to minimize the size to the strict minimum for use in containers and embedded systems. Virtually all the other Linux distributions use glibc. IMO the library that is deployed in the Alpine containers should be built from a musl distro (e.g. in an Alpine container) to avoid any issues that may arise while using the gcompat compatibility layer. I'm worried that bugs that end up being actually caused by the compat layer will be very difficult to diagnose. For all the other distributions the libraries that are distributed should be compiled against glibc. So you could either release two flavors of the nunicode library (musl and glibc), or only release the glibc flavor and build the musl flavor in a stage of the audiobookshelf Docker image. |
I'll add the musl flavor to the nunicode-binaries release, and use it to build the arm64 Docker image. |
@mikiher Hmm why only the arm64 flavor? |
Sorry, was a bit distracted. I meant both arm64 and amd64. |
Submitted in PR #3488 |
Describe the issue
A lot of non-English languages have accented letters/diacritics (https://en.wikipedia.org/wiki/Diacritic). It is quite common not to include the accents when writing text online, especially on mobile phones (which make it difficult and slow to write accented letters). Therefore, it is common for search engines to drop the accents before performing the search (pretty much every single search engine does this). Unfortunately, this is not the case for ABS.
For example, all the following searches should find the book "Černá lilie", but they don't:
Also, note that the last example of a search, černá, is the same as Černá except for the case of the first letter. That also proves that the search is not case-insensitive when accented letters are present.
I never wrote a line of code in nodejs, so unfortunately I don't feel confident to propose a fix myself. But I tested that this nodejs code can be used to drop all accents in my language (Czech), so perhaps that could be used somehow to fix this issue.
Probably a better way to fix this would be directly in the SQLite query; maybe something like this could work. However, there could be some issues with that too (https://www.sqlite.org/faq.html#q18).
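A pure-SQL workaround along those lines would be chaining replace() calls for each accented letter (a sketch with placeholder table/column names), which quickly becomes unwieldy for a whole alphabet:

```sql
-- Naive in-query workaround: strip each accented letter by hand
-- (both cases are needed, since the built-in lower() only folds ASCII)
SELECT * FROM books
WHERE lower(replace(replace(replace(title, 'Č', 'c'), 'č', 'c'), 'á', 'a')) LIKE '%cerna%';
```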
Steps to reproduce the issue
Search for the book "Černá lilie" using cerna, černa, cerná, černá, or Cerna.
Audiobookshelf version
2.7.2
How are you running audiobookshelf?
Docker