
[Enhancement]: Diacritic-insensitive search #2678

Closed
tululum opened this issue Feb 27, 2024 · 47 comments · Fixed by #3199 or #3468
Labels
enhancement (New feature or request), sorting/filtering/searching (Issues relating to sorting, filtering and searching)

Comments

@tululum

tululum commented Feb 27, 2024

Describe the issue

A lot of non-English languages have accented letters/diacritics (https://en.wikipedia.org/wiki/Diacritic). It is quite common to not include the accent when writing text online, especially on mobile phones (which make it difficult and slow to type accented letters). Therefore, it is common for search engines to drop accents before performing the search (pretty much every search engine does this). Unfortunately, this is not the case for ABS.

For example all the following searches should find the book "Černá lilie", but they don't:
[five screenshots of searches that return no results]

Also, note that the last example, černá, is the same as Černá except for the case of the first letter. That also shows that the search is not case-insensitive when accented letters are present.

I have never written a line of code in Node.js, so unfortunately I don't feel confident proposing a fix myself. But I tested that this Node.js code can be used to drop all accents in my language (Czech), so perhaps that could be used somehow to fix this issue.
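
For reference, a common way to do this in Node.js is Unicode normalization followed by stripping combining marks - a minimal sketch of the idea (not necessarily the exact snippet linked above):

    // Decompose accented characters (NFD), then drop the combining marks
    function removeDiacritics(str) {
      return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
    }

    removeDiacritics('Černá') // -> 'Cerna'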

Probably a better way to fix this would be directly in the SQLite query; maybe something like this could work. However, there could be some issues with that too (https://www.sqlite.org/faq.html#q18).

Steps to reproduce the issue

  1. Create a book named Černá
  2. Try to search for cerna, černa, cerná, černá or Cerna.
  3. Note that the search result is empty.

Audiobookshelf version

2.7.2

How are you running audiobookshelf?

Docker

@tululum added the bug (Something isn't working) label Feb 27, 2024
@nichwall
Contributor

Related to (currently closed) #2187

@advplyr
Owner

advplyr commented Feb 27, 2024

In #2187 I mention a sqlite extension we could look at using to support case insensitive queries for non-ASCII

@advplyr added the sorting/filtering/searching (Issues relating to sorting, filtering and searching) and enhancement (New feature or request) labels and removed the bug (Something isn't working) label May 8, 2024
@advplyr changed the title from "[Bug]: Diacritic-insensitive search" to "[Enhancement]: Diacritic-insensitive search" May 8, 2024
@mikiher
Contributor

mikiher commented Jul 22, 2024

OK, so here's roughly the solution I was thinking of:

  • Assume we have a function remove_diacritics(x), which returns a diacritics-free version of x
  • For each table t with a column x for which we want to support diacritic-insensitive search, we add a column normalized_x
  • We perform a db migration, where for each row in table t:
    • normalized_x = remove_diacritics(x) != x ? remove_diacritics(x) : null
  • After migrating, for every update or create of a row in t, we set normalized_x as above (we can use hooks for this).
  • When searching for query, we perform an OR of two conditions (pseudo-code below)
    • x like %${query}%
    • normalized_x like %${remove_diacritics(query)}%
  • If most x values in t are unchanged by remove_diacritics(), then both the performance and db size overheads are probably reasonable.

@advplyr does this seem like a reasonable approach? I don't think there are many normalized_x columns we'll need to create. This approach will of course work with any kind of normalization which we don't currently support (e.g. lowercasing non-ASCII capitals, removing Hebrew and Arabic diacritic marks, etc...)
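
In Sequelize terms, a rough sketch of the hook part might look like this (the Book model and title/normalizedTitle column names are just illustrative, not the actual ABS schema, and removeDiacritics() is the assumed helper from above):

    // Keep normalizedTitle in sync on every create/update,
    // storing null when normalization doesn't change the value (as proposed above)
    Book.addHook('beforeSave', (book) => {
      const normalized = removeDiacritics(book.title)
      book.normalizedTitle = normalized !== book.title ? normalized : null
    })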

@advplyr
Owner

advplyr commented Jul 22, 2024

I think this approach would be difficult to scale if we wanted to expand what is currently searchable. Currently we would need a normalized_x for:
Books: author, narrator, series, title, subtitle, tags, genres.
Podcasts: title, author, tags (should add genres and episode title)

Genres and tags are currently JSON arrays on each book, so normalized_x would need to be a JSON array for those.
If we wanted to expand to allow global search to search on anything else (like playlists, collections, descriptions, etc) we would have to write a script that backfills the new normalized_x column.

I think we should exhaust the possibility of doing this with sqlite first. If we have to do it this way we might be able to get away with only 1 column per table with a normalizedSearchKeys column or something like that.
Using a hook, we would normalize all the searchable values into a single column. Scaling that would still require backfilling, but it would at least be easier.

@mikiher
Contributor

mikiher commented Jul 23, 2024

According to my research, it seems like the only sane way to do this without additional columns is by adding a UDF that implements remove_diacritics(), and calling that UDF at runtime, where the condition looks like:

    // diacritic-insensitive condition on column x (pseudo-code)
    sequelize.where(
      sequelize.fn('remove_diacritics', sequelize.col('x')),
      {
        [Op.like]: `%${remove_diacritics(query)}%`
      }
    )

I'm worried about the performance implications of this approach, but we can try this.

@mikiher
Contributor

mikiher commented Jul 23, 2024

Unfortunately, sqlite3 on node.js does not seem to support UDFs :(

@advplyr
Owner

advplyr commented Jul 23, 2024

It looks like the only option is to build the sqlite3 binaries ourselves.

TryGhost/node-sqlite3#70 (comment)
https://github.com/nalgeon/sqlean/blob/main/docs/unicode.md

@mikiher
Contributor

mikiher commented Jul 24, 2024

OK, so after a few hours of struggle, I was able to load the sqlean unicode extension into the underlying sqlite3 db, so the unaccent and other functions there became available to use in queries. It was quite hard to figure out how to do this, because the prescribed way of doing it (hooking into afterConnect) doesn't work for SQLite due to a bug in Sequelize.

So it looks like we won't need to build sqlite3 binaries, although we would need to deploy the unicode extension binaries (which are very small, ~100-150kB).

Now I'm progressing with modifying all the relevant queries. Stay tuned...
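
For reference, loading the extension into a bare node-sqlite3 connection is the easy part and looks roughly like this (the database path and extension filename are illustrative); the hard part was getting hold of the connection that Sequelize manages:

    const sqlite3 = require('sqlite3')

    const db = new sqlite3.Database('/config/absdatabase.sqlite', (err) => {
      if (err) throw err
      // Load the sqlean unicode extension so unaccent()/lower()/upper() become available
      db.loadExtension('./unicode.so', (err) => {
        if (err) throw err
        db.get("SELECT unaccent('Černá') AS result", (err, row) => {
          if (err) throw err
          console.log(row.result) // expected: 'Cerna'
        })
      })
    })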

@advplyr
Owner

advplyr commented Aug 9, 2024

Reverted in v2.12.3

@devnoname120
Contributor

devnoname120 commented Aug 10, 2024

It looks like the only option is to build the sqlite3 binaries ourselves.

I wonder at this point if it wouldn't be easier to just do that. I don't have all the context, but it seems to me that sqlite3 officially supports ICU, so it looks safer to me to rely on that one.

Additionally, the README.md of the unicode extension of sqlean mentions the following which isn't particularly reassuring:

⛔️ This extension is deprecated. Use text instead.

What do you think?


Edit: It looks like building ICU degrades performance in general:

The icu extension doesn't really help, because it replaces the SQL like() function which automatically disables the use of any index for the LIKE operator, regardless of the collation, so this will slow down with large data sets. As it affects every LIKE operator, not just specific ones, it might have unwanted side effects.

https://sqlite.org/forum/forumpost/524c146fbf

@mikiher
Contributor

mikiher commented Aug 10, 2024

It looks like the only option is to build the sqlite3 binaries ourselves.

I wonder at this point if it wouldn't be easier to just do that. I don't have all the context, but it seems to me that sqlite3 officially supports ICU, so it looks safer to me to rely on that one.

I haven't tried it, but looking at it, it doesn't seem to come with accent-normalization functions (unaccent), just Unicode-aware case functions (lower, upper). We need both.

Additionally, the README.md of the unicode extension of sqlean mentions the following which isn't particularly reassuring:

⛔️ This extension is deprecated. Use text instead.

What do you think?

Yeah. Amusingly enough, it was deprecated a couple of weeks ago, right after I integrated it in my PR...
The problem is that the suggested replacement extension (text) does not seem to support unaccent() yet. I will be able to move from unicode to text when it supports that.

@advplyr
Owner

advplyr commented Aug 10, 2024

I'm not sure how worthwhile it is but just supporting unicode lowercase would be helpful to some. A bug report was put in just for that #2187

@devnoname120
Contributor

devnoname120 commented Aug 10, 2024

I'm not an expert in any way regarding these topics, but apparently MariaDB has out-of-the-box support for the utf8mb4 charset with the modern uca1400_ai_ci collation (Unicode Collation Algorithm 14.0.0, accent insensitive, case insensitive; see the naming convention). See also the documentation of supported character sets.

About SQLite ICU I don't know which collation algorithm it implements exactly, any ideas? I can't find either what UCA the unicode Sqlean plugin implements...

I don't know if this makes it worth it to migrate from SQLite to MariaDB. Among other things I guess it depends on whether we use SQLite-specific features?

Edit: nunicode looks like a very interesting alternative to Sqlean's unicode plugin.

@mikiher
Contributor

mikiher commented Aug 11, 2024

I'm not an expert in any way regarding these topics, but apparently MariaDB has out-of-the-box support for the utf8mb4 charset with the modern uca1400_ai_ci collation (Unicode Collation Algorithm 14.0.0, accent insensitive, case insensitive; see the naming convention). See also the documentation of supported character sets.

Replacing the underlying db for supporting unaccent seems like a bit of an overkill to me.

About SQLite ICU I don't know which collation algorithm it implements exactly, any ideas? I can't find either what UCA the unicode Sqlean plugin implements...

Yes, for the record (after some digging), it looks like using SQLite with the ICU extension can work as well (at least in theory).
When you have that extension loaded, you can run:
SELECT icu_load_collation('root', 'aici', 'PRIMARY');
This creates a custom collation called aici which works on root (i.e. applies to all locales) and uses UCOL_PRIMARY collation strength (which means case and accent insensitive).
You can then (theoretically) use this collation for comparisons and indexing, and an unaccent() function is not required.

This does seem to require both building our own ICU-enabled SQLite version and building the icu extension, for all supported platforms and architectures, plus deploying them.

I am quite reluctant to go that way.

I don't know if this makes it worth it to migrate from SQLite to MariaDB. Among other things I guess it depends on whether we use SQLite-specific features?

Edit: nunicode looks like a very interesting alternative to Sqlean's unicode plugin.

I haven't looked into this in detail, but yes, it does seem to provide unaccent.

However, I must say I'm not sure why the unicode extension deprecation seems to give you so much grief. I'm sure that at some point its functionality will be fully supported by the text extension, and in the meantime, it provides exactly the functionality we need at almost no cost (performance, size, loading). The issues caused by its introduction were due to my own failings, not the extension itself, and those can (hopefully) be fixed.

@mikiher
Contributor

mikiher commented Aug 14, 2024

Update: I explicitly asked for unaccent support in sqlean/text, and the maintainer, for some reason, says that he does not plan to support it.

So, at this point I'm dropping my plans to re-introduce sqlean/unicode as is. I'm going to either look into the nunicode alternative @devnoname120 suggested, or fork the unicode extension and maintain it myself.

This is all going to take a while, so unfortunately the fix is not likely to happen in the very near future.

@advplyr
Owner

advplyr commented Aug 14, 2024

I haven't looked into this yet but I know that most of the other media servers use Sqlite so I wonder how they are handling this.

@mikiher
Contributor

mikiher commented Aug 14, 2024 via email

@advplyr
Owner

advplyr commented Aug 14, 2024

Jellyfin is open source and uses sqlite. They use 2 db files; I just looked at the one named library.db, and in the ItemValues table they store "Value" and "CleanValue", where CleanValue is unaccented.

It looks like that table stores everything that would be searchable. My movies and tv series I scanned in created about 125k rows in that table.

@advplyr
Owner

advplyr commented Aug 14, 2024

So it seems they use your first suggestion of an additional column but they only need 1 column because all the values are in 1 table.

@mikiher
Contributor

mikiher commented Aug 14, 2024

OK, I'll look into this direction as well.

I must admit, though, that extensions seem like a better idea generally, due to performance considerations, and also because everything is done at the database level. I think we'll also need additional extensions in the future (e.g. to support natural sorting), so the idea of depending on db extensions has merit IMO, and is superior to implementing stuff with application logic.

If you look at the unicode extension code, it is actually quite simple. It is a single file containing big tables for case and accent folding (which we should never touch), and a quite thin layer of logic that uses those tables to extend/replace existing functionality. There's little chance that any of this would ever need to change (unless there's breaking changes in SQLite itself), barring bug fixes.

@advplyr
Owner

advplyr commented Aug 14, 2024

Thanks, I agree. It would be much better if we can handle this with sqlite

@advplyr
Owner

advplyr commented Aug 14, 2024

I looked at the plex sqlite db and they are doing something different. I can't open the icu tables in DBeaver.
[screenshot of the Plex database tables in DBeaver]

A quick search led me to https://forums.plex.tv/t/can-no-longer-update-library-database-with-sqlite3/701405/3

@mikiher
Contributor

mikiher commented Aug 14, 2024

The discussion indicates that Plex uses the fts4 extension for implementing full text search.
I think FTS is likely overkill for our purposes, though it may include the functionality we need. Do you think Audiobookshelf requires full-text-search capabilities?

@advplyr
Owner

advplyr commented Aug 14, 2024

I don't think so. It is useful to find out how everyone else is solving this with sqlite. I'm sure there are others I can look into later.

@devnoname120
Contributor

devnoname120 commented Aug 14, 2024

This does seem to require both building our own SQLite icu-enabled version, and building the the icu extension, for all supported platforms and architectures, plus deploying them.

As far as I understand it's an either, not an and. You can compile icu.c standalone with a one-liner, see this excerpt from ext/icu/README.txt:

The easiest way to compile and use the ICU extension is to build
and use it as a dynamically loadable SQLite extension. To do this
using gcc on *nix:

gcc -fPIC -shared icu.c `pkg-config --libs --cflags icu-uc icu-io` \
-o libSqliteIcu.so

You may need to add "-I" flags so that gcc can find sqlite3ext.h
and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be
loaded into sqlite in the same way as any other dynamically loadable
extension.

For other operating systems, you can use https://www.sqlite.org/loadext.html#compiling_a_loadable_extension as an inspiration (just need to add the icu lib in the arguments).


If you look at the unicode extension code, it is actually quite simple. It is a single file containing big tables for case and accent folding

Handling diacritics is exquisitely tricky to get right. There is a staggering myriad of exceptions and edge cases that need to be taken into account. Reading the official Unicode Collation Algorithm (UCA) is enough to convince oneself of that.

Although it handles ordering rather than just equality, it's 66 pages long and it keeps evolving. This official algorithm also requires the Default Unicode Collation Element Table (DUCET), which is even bigger.

IMHO we would be better off using the reference implementation (https://github.com/unicode-org/icu) from the Unicode group, which is what ext/icu/icu.c relies on under the hood. Maintaining a custom solution may require a lot of maintenance down the road in order to account for all the diacritic-related search issues that users may report.

What do you think?

@mikiher
Contributor

mikiher commented Aug 15, 2024

As far as I understand it's an either, not an and. You can compile icu.c standalone with a one-liner

Yeah, you're probably right. I probably misread some instructions on stackoverflow. You still need to also deploy/install icu libs for this to work.

What do you think?

I'm guessing there are good reasons why people are providing these alternate implementations. As you say, icu collations are extremely complex, which likely leads to performance issues, and many are looking for good-enough implementations that are standalone, less complex, and probably don't handle all the small details and all the many locales that the reference implementation has to deal with. I think that for ABS, it would be ok to use one of these "good-enough" implementations. I've checked the unicode extension against many accented books and authors in my library, and they all work fine.

I'll try it though, and if I see it's not complex to use and deploy, and doesn't have glaring performance issues, I'll use that.

@devnoname120
Contributor

@mikiher It looks like Node.js is shipped with libicu by default, and starting from Node.js 13 it comes with the full ICU data as well (prior to that it was only English by default).

I'm not sure that it would be a great idea but if you can't find libicu binaries for all platforms and you don't want to build it you may be able to piggyback on the one bundled with Node.js (or the one installed by the package manager if it's listed as a dependency of Node.js).

I don't know the possible implications of updating libicu or the ICU data on the collation so implicitly relying on the not-pinned version of Node.js may not be a wise idea. For example, do the SQLite tables have to be reindexed after every ICU update?

Either way, if a package manager is available on the system it's more likely than not to support installing libicu — it's needed everywhere.

For example our Docker containers currently rely on alpine v3.20 (base image of alpine-node20) whose apk package manager provides an icu-libs package that supports all architectures: https://pkgs.alpinelinux.org/packages?name=icu-libs&branch=v3.20&repo=main&arch=&maintainer=


For SQLite's icu.c we could create a lightweight Node.js package that uses node-gyp to compile it (we already use node-gyp?).

Note that icu.c doesn't have any dependencies, so compiling it is very low-risk in terms of flakiness. As long as there is a bare C compiler/linker around it should compile without issues.

@mikiher
Contributor

mikiher commented Aug 18, 2024

I'm not sure that it would be a great idea but if you can't find libicu binaries for all platforms and you don't want to build it you may be able to piggyback on the one bundled with Node.js (or the one installed by the package manager if it's listed as a dependency of Node.js).

At least on Windows (and I think on other platforms by default), I believe icu is statically linked to the node executable, so I don't think it can be used by the icu extension.

I don't know the possible implications of updating libicu or the ICU data on the collation so implicitly relying on the not-pinned version of Node.js may not be a wise idea. For example, do the SQLite tables have to be reindexed after every ICU update?

Probably not, but if you use indices that depend on icu collations (as we painfully learned in our case), you may need to reindex those.

Either way, if a package manager is available on the system it's more likely than not to support installing libicu — it's needed everywhere.

Agreed, I don't see installing icu dependencies as a big issue (except maybe for Windows, but there we have an installer which can take care of that).

For SQLite's icu.c we could create a lightweight Node.js package that uses node-gyp to compile it (we already use node-gyp?).

I'm not sure how exactly to use node-gyp for our purposes. IIUC, its purpose is building native addon modules for node. I don't need to build a native addon - I need to build a shared library, which sqlite3 loads from inside its C implementation.

In any case, I think it's important to remember that this is a node project. I don't want to require developers to have C/C++ toolchains on their dev machines. If dependencies need to be built, we need to do it on GitHub workflows or similar infrastructure and make them available as release assets, or just check in the pre-built extension binaries.

@mikiher
Contributor

mikiher commented Aug 19, 2024

OK, so here's some update on my experiments.

  • I compiled the icu extension and loaded it successfully. I was then also able to create a custom collation with:
    SELECT icu_load_collation('root', 'aici', 'PRIMARY')
  • I was also able to test that comparison based on this collation was indeed accent- and case-insensitive with:
    SELECT 'Árbol' = 'arbol' collate aici AS result
  • However, I was not successful testing that the following also returns 1:
    SELECT 'Árbol' LIKE '%arb%' collate aici AS result
    It turns out that LIKE in SQLite ignores collations (unlike other databases, which seem to support this syntax)

Now since we need LIKE for searching, we're back to square one, at least for this specific solution.

The other thing I noted about icu is that the full data for icu (libicudata) is ~30MB, and the size of the required icu shared objects (libicuuc and libicui18n) is an additional ~6MB. Even if we were able to make it work with icu dependencies, we would pay a hefty price in terms of size, compared to the ~100KB that the unicode extension weighs. And I haven't tested query performance at all yet.

@devnoname120
Contributor

devnoname120 commented Aug 20, 2024

@mikiher Thanks for the tests. With regard to the LIKE something must be wrong because ext/icu/icu.c explicitly mentions the following (emphasis mine):

This file implements an integration between the ICU library
("International Components for Unicode", an open-source library
for handling unicode data) and SQLite. The integration uses
ICU to provide the following to SQLite:

* An implementation of the SQL regexp() function (and hence REGEXP
operator) using the ICU uregex_XX() APIs.

* Implementations of the SQL scalar upper() and lower() functions
for case mapping.

* Integration of ICU and SQLite collation sequences.

* An implementation of the LIKE operator that uses ICU to
provide case-independent matching.

See icuLikeFunc() in the same file for the implementation.


Note: I was able to reproduce your issue, and here is what I did for my future reference:

brew install sqlite3 icu4c

export PKG_CONFIG_PATH="/opt/homebrew/opt/icu4c/lib/pkgconfig:/opt/homebrew/opt/sqlite/lib/pkgconfig"

# Note: I first tried -o libSqliteIcu.dylib but it broke .load with this error:
# Error: dlsym(0x83836050, sqlite3_sqliteicu_init): symbol not found
gcc -fPIC -dynamiclib icu.c `pkg-config --libs --cflags icu-uc icu-io sqlite3` -o icu.dylib
export PATH="/opt/homebrew/opt/sqlite/bin:$PATH"

❯ sqlite3
SQLite version 3.46.1 2024-08-13 09:16:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load icu.dylib
sqlite> SELECT icu_load_collation('root', 'aici', 'PRIMARY');

sqlite> SELECT 'Árbol' = 'arbol' collate aici AS result;
1
sqlite> SELECT 'Árbol' LIKE '%arb%' collate aici AS result;
0

I'm not sure if we are doing this right in order to make LIKE work with the collation.

@mikiher
Contributor

mikiher commented Aug 20, 2024

We are doing it right.

The LIKE implementation is just hard-coded to do what it's written to do (Unicode-aware case-insensitive matching), and does not respect the specified collation. You can look at the code - it's relatively straightforward.

The SQL syntax is correct (otherwise there would be a syntax error); the COLLATE part is just ignored.

If you find another suitable SQL expression to make the collation work with substring matching, please let me know.

@devnoname120
Contributor

devnoname120 commented Aug 20, 2024

The other thing I noted about icu is that the full data for icu (libicudata) is ~30MB, and the size of the required icu shared objects (libicuuc and libicui18n) is an additional ~6MB. Even if we were able to make it work with icu dependencies, we would pay a hefty price in terms of size, compared to the ~100KB that the unicode extension weighs

@mikiher Ah indeed. I took a look at the current Docker image and it's currently 564 MB. According to your estimates we would add 36 MB, which is a 6% increase in image size. Definitely not negligible, but not a deal-breaker IMO, especially if it means very little maintenance and all the edge cases we haven't thought about just working (so we won't have to come back to it and debug possible issues).

I'm curious to see how either fares in terms of performance. I would guess not too far apart if we handle memory (re)allocations properly. An accent-/case-insensitive INDEX is what we would need to have actually good performance IMO. But since the tables usually have (I guess?) just a few thousand rows it shouldn't be that bad.


As an exercise I made a quick implementation of an ICU-based unaccent() function for SQLite and here are the results:

❯ sqlite3
SQLite version 3.46.1 2024-08-13 09:16:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load unaccent.dylib
sqlite> SELECT 'Árbol' = 'arbol' COLLATE NOCASE;
0
sqlite> SELECT unaccent('Árbol') = 'arbol' COLLATE NOCASE;
1
sqlite> SELECT 'Árbol' LIKE '%arb%' COLLATE NOCASE;
0
sqlite> SELECT unaccent('Árbol') LIKE '%arb%' COLLATE NOCASE;
1
sqlite> .load icu.dylib
SELECT icu_load_collation('root', 'aici', 'PRIMARY');

sqlite> SELECT 'Árbol' = 'arbol' collate aici;
1
sqlite> SELECT 'Árbol' LIKE '%arb%' collate aici;
0
sqlite> SELECT unaccent('Árbol') LIKE '%arb%' collate aici;
1

The code (this is an old version - I'm almost done cleaning it up: using sqlite3_malloc() instead of malloc(), removing the unnecessary dynamic allocations, etc.):

#include <sqlite3ext.h>

#include <unicode/ustring.h>
#include <unicode/unorm2.h>
#include <unicode/uchar.h>
#include <unicode/utypes.h>
#include <stdlib.h>
#include <string.h>

SQLITE_EXTENSION_INIT1

void unaccent(const UChar *input, UChar **output, UErrorCode *status) {
    const UNormalizer2 *normalizer = unorm2_getNFDInstance(status);
    if (U_FAILURE(*status)) {
        return;
    }

    int32_t inputLength = u_strlen(input);

    /* Maximum output of NFD transformation is a factor of 4.
    ** See: https://unicode.org/faq/normalization.html#12
    */
    int32_t maxNormalizedLength = 4 * inputLength;
    UChar *normalized = (UChar *)malloc(maxNormalizedLength * sizeof(UChar));
    if (!normalized) {
        *status = U_MEMORY_ALLOCATION_ERROR;
        return;
    }

    int32_t normalizedLength = unorm2_normalize(normalizer, input, -1, normalized, maxNormalizedLength, status);
    if (U_FAILURE(*status)) {
        free(normalized);
        return;
    }

    *output = (UChar *)malloc((normalizedLength + 1) * sizeof(UChar));
    if (!(*output)) {
        *status = U_MEMORY_ALLOCATION_ERROR;
        free(normalized);
        return;
    }

    int32_t resultIndex = 0;
    for (int32_t i = 0; i < normalizedLength;) {
        UChar32 c;
        U16_NEXT(normalized, i, normalizedLength, c);
        /* See https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a6a2dbc531efce8d77fdb4c314e7fc25e */
        if (u_charType(c) != U_COMBINING_SPACING_MARK && u_charType(c) != U_NON_SPACING_MARK && u_charType(c) != U_ENCLOSING_MARK) {
            U16_APPEND(*output, resultIndex, normalizedLength, c, *status);
        }
    }
    (*output)[resultIndex] = 0;

    free(normalized);
}

static void unaccent_sqlite(sqlite3_context *context, int argc, sqlite3_value **argv) {
    if (argc != 1) {
        sqlite3_result_error(context, "Invalid number of arguments", -1);
        return;
    }

    const char *input = (const char *)sqlite3_value_text(argv[0]);
    if (input == NULL) {
        sqlite3_result_null(context);
        return;
    }

    /* FIXME: do a dynamic allocation like the for the output */
    UChar u_input[256];
    UErrorCode status = U_ZERO_ERROR;

    u_strFromUTF8(u_input, sizeof(u_input)/sizeof(u_input[0]), NULL, input, -1, &status);
    if (U_FAILURE(status)) {
        sqlite3_result_error(context, u_errorName(status), -1);
        return;
    }

    UChar *u_output = NULL;
    unaccent(u_input, &u_output, &status);
    if (U_FAILURE(status)) {
        sqlite3_result_error(context, u_errorName(status), -1);
        if (u_output) free(u_output);
        return;
    }

    char *output = (char *)malloc((u_strlen(u_output) * 4 + 1) * sizeof(char));
    if (!output) {
        sqlite3_result_error(context, "Memory allocation error", -1);
        free(u_output);
        return;
    }

    u_strToUTF8(output, u_strlen(u_output) * 4 + 1, NULL, u_output, -1, &status);
    if (U_FAILURE(status)) {
        sqlite3_result_error(context, u_errorName(status), -1);
        free(output);
        free(u_output);
        return;
    }

    sqlite3_result_text(context, output, -1, SQLITE_TRANSIENT);

    free(output);
    free(u_output);
}

#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_unaccent_init(
  sqlite3 *db,
  char **pzErrMsg,
  const sqlite3_api_routines *pApi
){
    SQLITE_EXTENSION_INIT2(pApi);
    return sqlite3_create_function(db, "unaccent", 1, SQLITE_UTF8, NULL, unaccent_sqlite, NULL, NULL);
}

I could definitely integrate it in icu.c, and if wanted apply unaccent() in the LIKE operator.

@mikiher
Contributor

mikiher commented Aug 21, 2024

This is very nice, and (at a glance) seems to do what's needed (NFD normalization and removal of all accent characters).

At this point, since you're now implementing your own full-fledged extension, if you wish to continue with this effort, I'm happy to leave the rest to you.

From my perspective, I'd be happy to use this code if it's well tested, and if it doesn't degrade search performance significantly. I think it would also be a valuable extension for the SQLite user community in general.

I think this should be put in a separate project, and preferably provide pre-built binaries for the top platforms, like SQLean does - as I said above, I'd prefer not to require ABS developers to maintain C toolchains.

@advplyr, since you will need to eventually approve integrating this, it would be good to get early feedback on this.

I'm curious to see how either fares in terms of performance. I would guess not too far apart if we handle memory (re)allocations properly. An accent-/case-insensitive INDEX is what we would need to have actually good performance IMO. But since the tables usually have (I guess?) just a few thousand rows it shouldn't be that bad.

I think we have some users with tens of thousands of books in their library.

Regarding an accent and case-insensitive INDEX - yes, I was intending to replace the existing NOCASE index with an aici index. As for how much it can improve performance (especially with LIKE comparisons), I really don't know - it needs to be tested.
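
For concreteness, a hypothetical sketch of that swap using raw queries (index and column names are illustrative, and this assumes the ICU extension is already loaded on the connection); note that per the experiments above this would only help equality and ORDER BY comparisons, not LIKE:

    // Register the custom accent- and case-insensitive collation, then rebuild the index on it
    await sequelize.query("SELECT icu_load_collation('root', 'aici', 'PRIMARY')")
    await sequelize.query('DROP INDEX IF EXISTS books_title_nocase')
    await sequelize.query('CREATE INDEX books_title_aici ON books (title COLLATE aici)')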

@devnoname120
Contributor

devnoname120 commented Aug 23, 2024

@mikiher Thank you for your thoughtful comment. After digging into alekseyt/nunicode I'm convinced that it's an amazing option and I'd advocate for going in that direction. I think it's much better, more performant, and more resilient than anything I could do building my own SQLite extension using ICU.

Upsides:

  • Great performance.
  • Clear/clean code.
  • No dependencies.
  • 300 KB.

Downsides:

  • No accent-insensitive COLLATION, but it has an unaccent() function.
    • Should not be too complicated though to create a new collation that is the combination of NU1300_NOCASE + unaccent() if we feel the need to later on.
  • Not necessarily a downside but the LIKE operator is (surprisingly — at least for me) overridden to perform case-folding aka expansion-insensitive comparisons (see example SELECT 'masse' = 'maße'; below).
    • Quote from the README:

      The LIKE operation in nunicode extension support cases when strings might grow in size during case transformation. On demand this extension may be modified to be compatible with ICU extension.

  • No commits for 4 years.
    • But it looks mature/feature-complete.
    • We might want to shoot an email to the maintainer to gauge his interest in maintaining it if audiobookshelf uses it.

Now for the tests:

❯ sqlite3
SQLite version 3.46.1 2024-08-13 09:16:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

sqlite> SELECT 'masse' = 'Masse';
0
sqlite> SELECT 'masse' LIKE 'Masse';
1
sqlite> SELECT 'masse' = 'maße';
0
sqlite> SELECT 'masse' LIKE 'maße';
0

sqlite> .load libnusqlite3.dylib

sqlite> SELECT 'masse' = 'Masse';
0
sqlite> SELECT 'masse' LIKE 'Masse';
1
sqlite> SELECT 'masse' = 'maße';
0
sqlite> SELECT 'masse' LIKE 'maße';
1
sqlite> SELECT 'masse' = 'maße' COLLATE NU1300;
0
sqlite> SELECT 'masse' LIKE 'maße' COLLATE NU1300;
1
sqlite> SELECT 'masse' = 'maße' COLLATE NU1300_NOCASE;
1

sqlite> SELECT 'árbol' = 'arbol';
0
sqlite> SELECT unaccent('árbol') = 'arbol';
1
sqlite> SELECT unaccent('Árbol') = 'arbol';
0
sqlite> SELECT unaccent('Árbol') = 'arbol' COLLATE NU1300;
0
sqlite> SELECT unaccent('Árbol') = 'arbol' COLLATE NU1300_NOCASE;
1
sqlite> SELECT lower(unaccent('Árbol')) = 'arbol';
1

sqlite> SELECT 'árbol' LIKE 'arbol';
0
sqlite> SELECT unaccent('árbol') LIKE 'arbol';
1
sqlite> SELECT unaccent('Árbol') LIKE 'arbol';
1
sqlite> SELECT unaccent('Árbol') LIKE 'arbol' COLLATE NU1300;
1
sqlite> SELECT unaccent('Árbol') LIKE 'arbol' COLLATE NU1300_NOCASE;
1
sqlite> SELECT lower(unaccent('Árbol')) LIKE 'arbol';
1
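
If we went this route, I imagine the search conditions in ABS could end up looking something like this (a rough sketch; the title column and the JS-side query normalization are illustrative, not the actual ABS code):

    const { Op } = require('sequelize')

    // Normalize the user's query on the JS side: strip combining marks, then lowercase
    const normalizedQuery = query.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase()

    // Mirrors the lower(unaccent(...)) pattern from the sqlite3 session above
    const titleMatches = sequelize.where(
      sequelize.fn('lower', sequelize.fn('unaccent', sequelize.col('title'))),
      { [Op.like]: `%${normalizedQuery}%` }
    )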

@mikiher Let me know your thoughts!

@mikiher
Contributor

mikiher commented Sep 22, 2024

Hi, sorry for the long time it took - had many other things on my plate.

Wanted to give an update on my experiments with nunicode.

The first thing I needed to do was to build it. The project as it currently stands only provides pre-built binaries for Windows x86 32-bit and Linux x86 32-bit and 64-bit. So:

  • I wrote toolchain files for cross compiling to win64 and arm64.
  • I had to slightly modify the main cmake configuration of the project
  • Wrote a github workflow to run the build on a linux machine and upload the generated libraries

The experiments are at https://github.com/mikiher/nunicode-sqlite

I tested the built binaries:

  • On my Windows machine
  • On WSL on my Windows (which has Ubuntu installed)
  • On a QEMU arm64 emulator with Alpine Linux (with gcompat)

I ran the same rudimentary tests as you did above with the sqlite3 CLI. The Windows and Ubuntu tests seem to work nicely, but the arm64 test has some issues, which I still need to figure out: it loads the library successfully and runs some of the test queries successfully, but fails on the following:

sqlite> SELECT 'masse' LIKE 'maße';
0
sqlite> SELECT upper('maße');
MAßÑא≈אאא

I have no idea why it fails - you're welcome to take a look at the workflow and if you have any suggestions, I'll be glad to hear.

@devnoname120
Contributor

devnoname120 commented Sep 22, 2024

@mikiher Thank you for your tests. Weirdly enough the tests that I performed were done on arm64… Specifically on macOS where both M1 and M2 worked fine. I didn't (need to) cross-compile though.

Which version of sqlite3 do you use? Which options are enabled?
As far as I know you don't need gcompat for this one because it doesn't rely on glibc.

I just tested your example and here is what I see on my end:

❯ sqlite3
SQLite version 3.46.1 2024-08-13 09:16:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load libnusqlite3.dylib
sqlite> SELECT 'masse' LIKE 'maße';
1
sqlite> SELECT upper('maße');
MASSE

I did this in order to compile the sqlite extension:

mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make

@mikiher
Contributor

mikiher commented Sep 22, 2024

Which version of sqlite3 do you use?

SQLite version 3.45.3 2024-04-15 13:34:05
It's the one installed by apk add sqlite. I haven't enabled or disabled any particular options.

As far as I know you don't need gcompat for this one because it doesn't rely on glibc.

It doesn't? so it doesn't matter which compiler I used for cross compilation?

I did this in order to compile the sqlite extension

This is roughly what I did as well. You can see all of the actions I ran in the github workflow

As I said there were a few changes I had to make:

  • Added toolchain files for cross compiling to win64 and arm64 in nunicode/cmake
  • Slightly modified the main cmake configuration of the project
    • I removed the find_package(Sqlite3) directive (which sets SQLITE3_INCLUDE_DIR to /usr/include), and the SQLITE3_FOUND condition. Instead, I copied the sqlite3 headers to nunicode/include, and set SQLITE3_INCLUDE_DIR to that. I had to do this because otherwise the cross-compilers would be trying to use headers from /usr/include when compiling the extension, which is not desired.

I don't think any of these changes really matter, with (maybe) the exception of the choice of cross compiler - I used aarch64-linux-gnu-gcc. I mentioned gcompat because I know gcc uses glibc by default.

@devnoname120
Contributor

@mikiher Hmm it seems that the .so is indeed linked against glibc which is not optimal. I'd suggest trying again to compile it but from Alpine this time around. I cannot dig further into it right now unfortunately — the next two weeks will be quite busy on my side.

Tangentially I think that it would make sense at some point to reach out to @alekseyt directly ([email protected]). The README of nunicode provides this email address and explicitly encourages people to contact him.

@mikiher
Contributor

mikiher commented Sep 23, 2024

Thanks, I will also try to cross compile for musl and see if that helps.

The cross compilation is kind of important since I need a way to automatically build the required libraries for each architecture, and there are no github-hosted Alpine aarch64 runners.

@mikiher
Contributor

mikiher commented Sep 24, 2024

So it turned out all of the issues were due to the crappy QEMU console, which seems to mess up non-ASCII input :(

Once I connected to the QEMU emulator using ssh (which has proper UTF-8 support), all the issues went away.
I checked all 3 linux aarch64 libraries I built:

  • cross-compiled with glibc
  • cross-compiled with musl
  • native-compiled (on QEMU aarch64 emulator running Alpine) with musl

They all pass the simple tests above.

I think I'll stick with the original version I built because it's easiest to set up the cross-compiler to build it.

Next I'm going to work on integrating the extensions in ABS and testing.

@devnoname120
Contributor

devnoname120 commented Sep 29, 2024

@mikiher For Docker I'd suggest compiling nunicode directly for each architecture from an Alpine container as part of a multi-stage build and not rely on gcompat or cross-compilation at all for those.

See here for an example of a multi-platform build. Docker makes it surprisingly easy, and you don't have to bother with QEMU by yourself:
https://github.com/code-inflation/cfspeedtest/pull/94/files

And the doc for multi-stage builds in case you are not familiar with them:
https://docs.docker.com/build/building/multi-stage/#use-multi-stage-builds

I can cobble up a PR if it helps!

@mikiher
Contributor

mikiher commented Sep 29, 2024

But I need to build binaries and make them available anyway for non-docker servers. Why not use those artifacts in Audiobookshelf DockerFile since they're already available?

In the future we can consider providing Nunicode as an apk (I admit the additional code in the dockerfile is somewhat ugly).

@devnoname120
Contributor

devnoname120 commented Oct 5, 2024

But I need to build binaries and make them available anyway for non-docker servers. Why not use those artifacts in Audiobookshelf DockerFile since they're already available?

As far as I know Alpine is the only non-marginal Linux distribution that uses musl by default. It does so because it's designed to minimize the size to the strict minimum for use in containers and embedded systems. Virtually all the other Linux distributions use glibc.

IMO the library that is deployed in the Alpine containers should be built from a musl distro (e.g. in an Alpine container) to avoid any issues that may arise while using the gcompat compatibility layer. I'm worried that bugs that end up being actually caused by the compat layer will be very difficult to diagnose.

For all the other distributions the libraries that are distributed should be compiled against glibc. So you could either release two flavors of the nunicode library (musl and glibc), or only release the glibc flavor and build the musl flavor in a stage of the audiobookshelf Docker image.

@mikiher
Contributor

mikiher commented Oct 6, 2024

I'll add the musl flavor to the nunicode-binaries release, and use it to build the arm64 Docker image.

@devnoname120
Contributor

@mikiher Hmm why only the arm64 flavor?

@mikiher
Contributor

mikiher commented Oct 7, 2024

Sorry, was a bit distracted. I meant both arm64 and amd64.

@mikiher
Contributor

mikiher commented Oct 7, 2024

Submitted in PR #3488
