
[Enhancement]: Diacritic-insensitive search #2678

Closed
tululum opened this issue Feb 27, 2024 · 47 comments · Fixed by #3199 or #3468
Labels
enhancement (New feature or request), sorting/filtering/searching (Issues relating to sorting, filtering and searching)

Comments

@tululum

tululum commented Feb 27, 2024

Describe the issue

A lot of non-English languages have accented letters/diacritics (https://en.wikipedia.org/wiki/Diacritic). It is quite common to not include the accent when writing text online, especially on mobile phones (which make it difficult and slow to type accented letters). Therefore, it is common for search engines to drop accents before performing the search (pretty much every search engine does this). Unfortunately, this is not the case for ABS.

For example all the following searches should find the book "Černá lilie", but they don't:
[five screenshots of searches that return no results]

Also, note that the last example, černá, is the same as Černá except for the case of the first letter. That also shows that the search is not case-insensitive when accented letters are present.

I have never written a line of code in Node.js, so unfortunately I don't feel confident proposing a fix myself. But I tested that this Node.js code can be used to drop all accents in my language (Czech), so perhaps that could be used somehow to fix this issue.
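
For reference, a common way to do this in Node.js is Unicode normalization followed by stripping combining marks - a minimal sketch of the idea (not necessarily the exact snippet linked above):

    // Decompose accented characters (NFD), then drop the combining marks
    function removeDiacritics(str) {
      return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
    }

    removeDiacritics('Černá') // -> 'Cerna'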

Probably a better way to fix this would be directly in the SQLite query; maybe something like this could work. However, there could be some issues with that too (https://www.sqlite.org/faq.html#q18).

Steps to reproduce the issue

  1. Create a book named Černá
  2. Try to search for cerna, černa, cerná, černá or Cerna.
  3. Note that the search result is empty.

Audiobookshelf version

2.7.2

How are you running audiobookshelf?

Docker

@tululum added the bug (Something isn't working) label Feb 27, 2024
@nichwall
Contributor

Related to (currently closed) #2187

@advplyr
Owner

advplyr commented Feb 27, 2024

In #2187 I mention a sqlite extension we could look at using to support case insensitive queries for non-ASCII

@advplyr added the sorting/filtering/searching (Issues relating to sorting, filtering and searching) and enhancement (New feature or request) labels and removed the bug (Something isn't working) label May 8, 2024
@advplyr changed the title from "[Bug]: Diacritic-insensitive search" to "[Enhancement]: Diacritic-insensitive search" May 8, 2024
@mikiher
Contributor

mikiher commented Jul 22, 2024

OK, so here's roughly the solution I was thinking of:

  • Assume we have a function remove_diacritics(x), which returns a diacritics-free version of x
  • For each table t with a column x for which we want to support diacritic-insensitive search, we add a column normalized_x
  • We perform a db migration, where for each row in table t:
    • normalized_x = remove_diacritics(x) != x ? remove_diacritics(x) : null
  • After migrating, for every update or create of a row in t, we set normalized_x as above (we can use hooks for this).
  • When searching for query, we perform an OR of two conditions (pseudo-code below)
    • x like %${query}%
    • normalized_x like %${remove_diacritics(query)}%
  • If most x values in t are unchanged by remove_diacritics(), then both the performance and db size overheads are probably reasonable.

@advplyr does this seem like a reasonable approach? I don't think there are many normalized_x columns we'll need to create. This approach will of course work with any kind of normalization which we don't currently support (e.g. lowercasing non-ASCII capitals, removing Hebrew and Arabic diacritic marks, etc...)
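
In Sequelize terms, a rough sketch of the hook part might look like this (the Book model and title/normalizedTitle column names are just illustrative, not the actual ABS schema, and removeDiacritics() is the assumed helper from above):

    // Keep normalizedTitle in sync on every create/update,
    // storing null when normalization doesn't change the value (as proposed above)
    Book.addHook('beforeSave', (book) => {
      const normalized = removeDiacritics(book.title)
      book.normalizedTitle = normalized !== book.title ? normalized : null
    })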

@advplyr
Owner

advplyr commented Jul 22, 2024

I think this approach would be difficult to scale if we wanted to expand what is currently searchable. Currently we would need a normalized_x for:
Books: author, narrator, series, title, subtitle, tags, genres.
Podcasts: title, author, tags (should add genres and episode title)

Genres and tags are currently JSON arrays on each book, so normalized_x would need to be a JSON array for those.
If we wanted to expand to allow global search to search on anything else (like playlists, collections, descriptions, etc) we would have to write a script that backfills the new normalized_x column.

I think we should exhaust the possibility of doing this with sqlite first. If we have to do it this way we might be able to get away with only 1 column per table with a normalizedSearchKeys column or something like that.
Using a hook, we would normalize all the searchable values into a single column. Scaling that would still require backfilling, but it would at least be easier.

@mikiher
Contributor

mikiher commented Jul 23, 2024

According to my research, it seems like the only sane way to do this without additional columns is by adding a UDF that implements remove_diacritics(), and calling that UDF at runtime, where the condition looks like:

    // diacritic-insensitive condition on column x (pseudo-code)
    sequelize.where(
      sequelize.fn('remove_diacritics', sequelize.col('x')),
      {
        [Op.like]: `%${remove_diacritics(query)}%`
      }
    )

I'm worried about the performance implications of this approach, but we can try this.

@mikiher
Contributor

mikiher commented Jul 23, 2024

Unfortunately, sqlite3 on node.js does not seem to support UDFs :(

@advplyr
Owner

advplyr commented Jul 23, 2024

It looks like the only option is to build the sqlite3 binaries ourselves.

TryGhost/node-sqlite3#70 (comment)
https://github.com/nalgeon/sqlean/blob/main/docs/unicode.md

@mikiher
Contributor

mikiher commented Jul 24, 2024

OK, so after a few hours of struggle, I was able to load the sqlean unicode extension into the underlying sqlite3 db, so the unaccent and other functions there became available to use in queries. It was quite hard to figure out how to do this, because the prescribed way of doing it (hooking into afterConnect) doesn't work for SQLite due to a bug in Sequelize.

So it looks like we won't need to build sqlite3 binaries, although we would need to deploy the unicode extension binaries (which are very small, ~100-150kB).

Now I'm progressing with modifying all the relevant queries. Stay tuned...
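
For reference, loading the extension into a bare node-sqlite3 connection is the easy part and looks roughly like this (the database path and extension filename are illustrative); the hard part was getting hold of the connection that Sequelize manages:

    const sqlite3 = require('sqlite3')

    const db = new sqlite3.Database('/config/absdatabase.sqlite', (err) => {
      if (err) throw err
      // Load the sqlean unicode extension so unaccent()/lower()/upper() become available
      db.loadExtension('./unicode.so', (err) => {
        if (err) throw err
        db.get("SELECT unaccent('Černá') AS result", (err, row) => {
          if (err) throw err
          console.log(row.result) // expected: 'Cerna'
        })
      })
    })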

@advplyr
Owner

advplyr commented Aug 9, 2024

Reverted in v2.12.3

@devnoname120
Contributor

devnoname120 commented Aug 10, 2024

It looks like the only option is to build the sqlite3 binaries ourselves.

I wonder at this point if it wouldn't be easier to just do that. I don't have all the context, but it seems to me that sqlite3 officially supports ICU, so it looks safer to me to rely on that one.

Additionally, the README.md of the unicode extension of sqlean mentions the following which isn't particularly reassuring:

⛔️ This extension is deprecated. Use text instead.

What do you think?


Edit: It looks like building ICU degrades performance in general:

The icu extension doesn't really help, because it replaces the SQL like() function which automatically disables the use of any index for the LIKE operator, regardless of the collation, so this will slow down with large data sets. As it affects every LIKE operator, not just specific ones, it might have unwanted side effects.

https://sqlite.org/forum/forumpost/524c146fbf

@mikiher
Contributor

mikiher commented Aug 10, 2024

It looks like the only option is to build the sqlite3 binaries ourselves.

I wonder at this point if it wouldn't be easier to just do that. I don't have all the context, but it seems to me that sqlite3 officially supports ICU, so it looks safer to me to rely on that one.

I haven't tried it, but looking at it, it doesn't seem to come with accent-normalization functions (unaccent), just Unicode-aware case functions (lower, upper). We need both.

Additionally, the README.md of the unicode extension of sqlean mentions the following which isn't particularly reassuring:

⛔️ This extension is deprecated. Use text instead.

What do you think?

Yeah. Amusingly enough, it was deprecated a couple of weeks ago, right after I integrated it in my PR...
The problem is that the suggested replacement extension (text) does not seem to support unaccent() yet. I will be able to move from unicode to text when it supports that.

@advplyr
Owner

advplyr commented Aug 10, 2024

I'm not sure how worthwhile it is but just supporting unicode lowercase would be helpful to some. A bug report was put in just for that #2187

@devnoname120
Contributor

devnoname120 commented Aug 10, 2024

I'm not an expert in any way regarding these topics, but apparently MariaDB has out-of-the-box support for the utf8mb4 charset with the modern uca1400_ai_ci collation (Unicode Collation Algorithm 14.0.0, accent insensitive, case insensitive; see the naming convention). See also the documentation of supported character sets.

About SQLite ICU I don't know which collation algorithm it implements exactly, any ideas? I can't find either what UCA the unicode Sqlean plugin implements...

I don't know if this makes it worth it to migrate from SQLite to MariaDB. Among other things I guess it depends on whether we use SQLite-specific features?

Edit: nunicode looks like a very interesting alternative to Sqlean's unicode plugin.

@mikiher
Contributor

mikiher commented Aug 11, 2024

I'm not an expert in any way regarding these topics, but apparently MariaDB has out-of-the-box support for the utf8mb4 charset with the modern uca1400_ai_ci collation (Unicode Collation Algorithm 14.0.0, accent insensitive, case insensitive; see the naming convention). See also the documentation of supported character sets.

Replacing the underlying db for supporting unaccent seems like a bit of an overkill to me.

About SQLite ICU I don't know which collation algorithm it implements exactly, any ideas? I can't find either what UCA the unicode Sqlean plugin implements...

Yes, for the record (after some digging), it looks like using SQLite with the ICU extension can work as well (at least in theory).
When you have that extension loaded, you can run:
SELECT icu_load_collation('root', 'aici', 'PRIMARY');
This creates a custom collation called aici which works on root (i.e. applies to all locales) and uses UCOL_PRIMARY collation strength (which means case and accent insensitive).
You can then (theoretically) use this collation for comparisons and indexing, and an unaccent() function is not required.

This does seem to require both building our own ICU-enabled SQLite version and building the icu extension, for all supported platforms and architectures, plus deploying them.

I am quite reluctant to go that way.

I don't know if this makes it worth it to migrate from SQLite to MariaDB. Among other things I guess it depends on whether we use SQLite-specific features?

Edit: nunicode looks like a very interesting alternative to Sqlean's unicode plugin.

I haven't looked into this in detail, but yes, it does seem to provide unaccent.

However, I must say I'm not sure why the unicode extension deprecation seems to give you so much grief. I'm sure that at some point its functionality will be fully supported by the text extension, and in the meantime, it provides exactly the functionality we need at almost no cost (performance, size, loading). The issues caused by its introduction were due to my own failings, not the extension itself, and those can (hopefully) be fixed.

@mikiher
Contributor

mikiher commented Aug 14, 2024

Update: I explicitly asked for unaccent support in sqlean/text, and the maintainer, for some reason, says that he does not plan to support it.

So, at this point I'm dropping my plans to re-introduce sqlean/unicode as is. I'm going to either look into the nunicode alternative @devnoname120 suggested, or fork the unicode extension and maintain it myself.

This is all going to take a while, so unfortunately the fix is not likely to happen in the very near future.

@advplyr
Owner

advplyr commented Aug 14, 2024

I haven't looked into this yet but I know that most of the other media servers use Sqlite so I wonder how they are handling this.

@mikiher
Contributor

mikiher commented Aug 14, 2024 via email

@advplyr
Owner

advplyr commented Aug 14, 2024

Jellyfin is open source and uses sqlite. They use 2 db files; I just looked at the one named library.db, and in the ItemValues table they store "Value" and "CleanValue", where CleanValue is unaccented.

It looks like that table stores everything that would be searchable. My movies and tv series I scanned in created about 125k rows in that table.

@advplyr
Owner

advplyr commented Aug 14, 2024

So it seems they use your first suggestion of an additional column but they only need 1 column because all the values are in 1 table.

@mikiher
Contributor

mikiher commented Aug 14, 2024

OK, I'll look into this direction as well.

I must admit, though, that extensions seem like a better idea generally, due to performance considerations, and also because everything is done at the database level. I think we'll also need additional extensions in the future (e.g. to support natural sorting), so the idea of depending on db extensions has merit IMO, and is superior to implementing stuff with application logic.

If you look at the unicode extension code, it is actually quite simple. It is a single file containing big tables for case and accent folding (which we should never touch), and a quite thin layer of logic that uses those tables to extend/replace existing functionality. There's little chance that any of this would ever need to change (unless there's breaking changes in SQLite itself), barring bug fixes.

@advplyr
Owner

advplyr commented Aug 14, 2024

Thanks, I agree. It would be much better if we can handle this with sqlite

@advplyr
Owner

advplyr commented Aug 14, 2024

I looked at the plex sqlite db and they are doing something different. I can't open the icu tables in DBeaver.
[screenshot of the Plex database tables in DBeaver]

A quick search led me to https://forums.plex.tv/t/can-no-longer-update-library-database-with-sqlite3/701405/3

@mikiher
Contributor

mikiher commented Aug 14, 2024

The discussion indicates that Plex uses the fts4 extension for implementing full text search.
I think FTS is likely overkill for our purposes, though it may include the functionality we need. Do you think Audiobookshelf requires full-text-search capabilities?

@advplyr
Owner

advplyr commented Aug 14, 2024

I don't think so. It is useful to find out how everyone else is solving this with sqlite. I'm sure there are others I can look into later.

@devnoname120
Contributor

devnoname120 commented Aug 14, 2024

This does seem to require both building our own SQLite icu-enabled version, and building the the icu extension, for all supported platforms and architectures, plus deploying them.

As far as I understand it's an either, not an and. You can compile icu.c standalone with a one-liner, see this excerpt from ext/icu/README.txt:

The easiest way to compile and use the ICU extension is to build
and use it as a dynamically loadable SQLite extension. To do this
using gcc on *nix:

gcc -fPIC -shared icu.c `pkg-config --libs --cflags icu-uc icu-io` \
-o libSqliteIcu.so

You may need to add "-I" flags so that gcc can find sqlite3ext.h
and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be
loaded into sqlite in the same way as any other dynamically loadable
extension.

For other operating systems, you can use https://www.sqlite.org/loadext.html#compiling_a_loadable_extension as an inspiration (just need to add the icu lib in the arguments).


If you look at the unicode extension code, it is actually quite simple. It is a single file containing big tables for case and accent folding

Handling diacritics is exquisitely tricky to get right. There is a staggering myriad of exceptions and edge cases that need to be taken into account. Reading the official Unicode Collation Algorithm (UCA) is enough to convince oneself of that.

Although it handles ordering rather than just equality, it's 66 pages long and it keeps evolving. This official algorithm also requires the Default Unicode Collation Element Table (DUCET), which is even bigger.

IMHO we would be better off using the reference implementation (https://github.com/unicode-org/icu) from the Unicode group, which is what ext/icu/icu.c relies on under the hood. Maintaining a custom solution may require a lot of maintenance down the road in order to account for all the diacritic-related search issues that users may report.

What do you think?

@mikiher
Contributor

mikiher commented Aug 15, 2024

As far as I understand it's an either, not an and. You can compile icu.c standalone with a one-liner

Yeah, you're probably right. I probably misread some instructions on stackoverflow. You still need to also deploy/install icu libs for this to work.

What do you think?

I'm guessing there are good reasons why people are providing these alternate implementations. As you say, icu collations are extremely complex, which likely leads to performance issues, and many are looking for good-enough implementations that are standalone, less complex, and probably don't handle all the small details and all the many locales that the reference implementation has to deal with. I think that for ABS, it would be ok to use one of these "good-enough" implementations. I've checked the unicode extension against many accented books and authors in my library, and they all work fine.

I'll try it though, and if I see it's not complex to use and deploy, and doesn't have glaring performance issues, I'll use that.

@devnoname120
Contributor

@mikiher It looks like Node.js is shipped with libicu by default, and starting from Node.js 13 it comes with the full ICU data as well (prior to that it was only English by default).

I'm not sure that it would be a great idea but if you can't find libicu binaries for all platforms and you don't want to build it you may be able to piggyback on the one bundled with Node.js (or the one installed by the package manager if it's listed as a dependency of Node.js).

I don't know the possible implications of updating libicu or the ICU data on the collation so implicitly relying on the not-pinned version of Node.js may not be a wise idea. For example, do the SQLite tables have to be reindexed after every ICU update?

Either way, if a package manager is available on the system it's more likely than not to support installing libicu — it's needed everywhere.

For example our Docker containers currently rely on alpine v3.20 (base image of alpine-node20) whose apk package manager provides an icu-libs package that supports all architectures: https://pkgs.alpinelinux.org/packages?name=icu-libs&branch=v3.20&repo=main&arch=&maintainer=


For SQLite's icu.c we could create a lightweight Node.js package that uses node-gyp to compile it (we already use node-gyp?).

Note that icu.c doesn't have any dependencies, so compiling it is very low-risk in terms of flakiness. As long as there is a bare C compiler/linker around it should compile without issues.

@mikiher
Contributor

mikiher commented Aug 18, 2024

I'm not sure that it would be a great idea but if you can't find libicu binaries for all platforms and you don't want to build it you may be able to piggyback on the one bundled with Node.js (or the one installed by the package manager if it's listed as a dependency of Node.js).

At least on Windows (and I think on other platforms by default), I believe icu is statically linked to the node executable, so I don't think it can be used by the icu extension.

I don't know the possible implications of updating libicu or the ICU data on the collation so implicitly relying on the not-pinned version of Node.js may not be a wise idea. For example, do the SQLite tables have to be reindexed after every ICU update?

Probably not, but if you use indices that depend on icu collations (as we painfully learned in our case), you may need to reindex those.

Either way, if a package manager is available on the system it's more likely than not to support installing libicu — it's needed everywhere.

Agreed, I don't see installing icu dependencies as a big issue (except maybe for Windows, but there we have an installer which can take care of that).

For SQLite's icu.c we could create a lightweight Node.js package that uses node-gyp to compile it (we already use node-gyp?).

I'm not sure how exactly to use node-gyp for our purposes. IIUC, its purpose is building native addon modules for node. I don't need to build a native addon - I need to build a shared library, which sqlite3 loads from inside its C implementation.

In any case, I think it's important to remember that this is a node project. I don't want to require developers to have C/C++ toolchains on their dev machines. If dependencies need to be built, we need to do it on GitHub workflows or similar infrastructure and make them available as release assets, or just check in the pre-built extension binaries.

@mikiher
Contributor

mikiher commented Aug 19, 2024

OK, so here's some update on my experiments.

  • I compiled the icu extension and loaded it successfully. I was then also able to create a custom collation with:
    SELECT icu_load_collation('root', 'aici', 'PRIMARY')
  • I was also able to test that comparison based on this collation was indeed accent- and case-insensitive with:
    SELECT 'Árbol' = 'arbol' collate aici AS result
  • However, I was not successful testing that the following also returns 1:
    SELECT 'Árbol' LIKE '%arb%' collate aici AS result
    It turns out that LIKE in SQLite ignores collations (unlike other databases, which seem to support this syntax)

Now since we need LIKE for searching, we're back to square one, at least for this specific solution.

The other thing I noted about icu is that the full data for icu (libicudata) is ~30MB, and the size of the required icu shared objects (libicuuc and libicui18n) is an additional ~6MB. Even if we were able to make it work with icu dependencies, we would pay a hefty price in terms of size, compared to the ~100KB that the unicode extension weighs. And I haven't tested query performance at all yet.

@devnoname120
Contributor

devnoname120 commented Aug 20, 2024

@mikiher Thanks for the tests. With regard to the LIKE something must be wrong because ext/icu/icu.c explicitly mentions the following (emphasis mine):

This file implements an integration between the ICU library
("International Components for Unicode", an open-source library
for handling unicode data) and SQLite. The integration uses
ICU to provide the following to SQLite:

* An implementation of the SQL regexp() function (and hence REGEXP
operator) using the ICU uregex_XX() APIs.

* Implementations of the SQL scalar upper() and lower() functions
for case mapping.

* Integration of ICU and SQLite collation sequences.

* An implementation of the LIKE operator that uses ICU to
provide case-independent matching.

See icuLikeFunc() in the same file for the implementation.


Note: I was able to reproduce your issue, and here is what I did for my future reference:

brew install sqlite3 icu4c

export PKG_CONFIG_PATH="/opt/homebrew/opt/icu4c/lib/pkgconfig:/opt/homebrew/opt/sqlite/lib/pkgconfig"

# Note: I first tried -o libSqliteIcu.dylib but it broke .load with this error:
# Error: dlsym(0x83836050, sqlite3_sqliteicu_init): symbol not found
gcc -fPIC -dynamiclib icu.c `pkg-config --libs --cflags icu-uc icu-io sqlite3` -o icu.dylib
export PATH="/opt/homebrew/opt/sqlite/bin:$PATH"

❯ sqlite3
SQLite version 3.46.1 2024-08-13 09:16:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load icu.dylib
sqlite> SELECT icu_load_collation('root', 'aici', 'PRIMARY');

sqlite> SELECT 'Árbol' = 'arbol' collate aici AS result;
1
sqlite> SELECT 'Árbol' LIKE '%arb%' collate aici AS result;
0

I'm not sure if we are doing this right in order to make LIKE work with the collation.

@mikiher
Contributor

mikiher commented Aug 20, 2024

We are doing it right.

The LIKE implementation is just hard-coded to do what it's written to do (Unicode-aware case-insensitive matching), and does not respect the specified collation. You can look at the code - it's relatively straightforward.

The SQL syntax is correct (otherwise there would be a syntax error); the COLLATE part is just ignored.

If you find another suitable SQL expression to make the collation work with substring matching, please let me know.

@devnoname120
Contributor

devnoname120 commented Aug 20, 2024

The other thing I noted about icu is that the full data for icu (libicudata) is ~30MB, and the size of the required icu shared objects (libicuuc and libicui18n) is an additional ~6MB. Even if we were able to make it work with icu dependencies, we would pay a hefty price in terms of size, compared to the ~100KB that the unicode extension weighs

@mikiher Ah indeed. I took a look at the current Docker image and it's currently 564 MB. According to your estimates we would add 36 MB, which is a 6% increase in image size. Definitely not negligible, but not a deal-breaker IMO, especially if it means very little maintenance and all the edge cases we haven't thought about just working (so we won't have to come back to it and debug possible issues).

I'm curious to see how either fares in terms of performance. I would guess not too far apart if we handle memory (re)allocations properly. An accent-/case-insensitive INDEX is what we would need to have actually good performance IMO. But since the tables usually have (I guess?) just a few thousand rows it shouldn't be that bad.


As an exercise I made a quick implementation of an ICU-based unaccent() function for SQLite and here are the results:

❯ sqlite3
SQLite version 3.46.1 2024-08-13 09:16:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load unaccent.dylib
sqlite> SELECT 'Árbol' = 'arbol' COLLATE NOCASE;
0
sqlite> SELECT unaccent('Árbol') = 'arbol' COLLATE NOCASE;
1
sqlite> SELECT 'Árbol' LIKE '%arb%' COLLATE NOCASE;
0
sqlite> SELECT unaccent('Árbol') LIKE '%arb%' COLLATE NOCASE;
1
sqlite> .load icu.dylib
SELECT icu_load_collation('root', 'aici', 'PRIMARY');

sqlite> SELECT 'Árbol' = 'arbol' collate aici;
1
sqlite> SELECT 'Árbol' LIKE '%arb%' collate aici;
0
sqlite> SELECT unaccent('Árbol') LIKE '%arb%' collate aici;
1

The code (this is an old version - I'm almost done cleaning it up: using sqlite3_malloc() instead of malloc(), removing the unnecessary dynamic allocations, etc.):

#include <sqlite3ext.h>

#include <unicode/ustring.h>
#include <unicode/unorm2.h>
#include <unicode/uchar.h>
#include <unicode/utypes.h>
#include <stdlib.h>
#include <string.h>

SQLITE_EXTENSION_INIT1

void unaccent(const UChar *input, UChar **output, UErrorCode *status) {
    const UNormalizer2 *normalizer = unorm2_getNFDInstance(status);
    if (U_FAILURE(*status)) {
        return;
    }

    int32_t inputLength = u_strlen(input);

    /* Maximum output of NFD transformation is a factor of 4.
    ** See: https://unicode.org/faq/normalization.html#12
    */
    int32_t maxNormalizedLength = 4 * inputLength;
    UChar *normalized = (UChar *)malloc(maxNormalizedLength * sizeof(UChar));
    if (!normalized) {
        *status = U_MEMORY_ALLOCATION_ERROR;
        return;
    }

    int32_t normalizedLength = unorm2_normalize(normalizer, input, -1, normalized, maxNormalizedLength, status);
    if (U_FAILURE(*status)) {
        free(normalized);
        return;
    }

    *output = (UChar *)malloc((normalizedLength + 1) * sizeof(UChar));
    if (!(*output)) {
        *status = U_MEMORY_ALLOCATION_ERROR;
        free(normalized);
        return;
    }

    int32_t resultIndex = 0;
    for (int32_t i = 0; i < normalizedLength;) {
        UChar32 c;
        U16_NEXT(normalized, i, normalizedLength, c);
        /* See https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a6a2dbc531efce8d77fdb4c314e7fc25e */
        if (u_charType(c) != U_COMBINING_SPACING_MARK && u_charType(c) != U_NON_SPACING_MARK && u_charType(c) != U_ENCLOSING_MARK) {
            U16_APPEND(*output, resultIndex, normalizedLength, c, *status);
        }
    }
    (*output)[resultIndex] = 0;

    free(normalized);
}

static void unaccent_sqlite(sqlite3_context *context, int argc, sqlite3_value **argv) {
    if (argc != 1) {
        sqlite3_result_error(context, "Invalid number of arguments", -1);
        return;
    }

    const char *input = (const char *)sqlite3_value_text(argv[0]);
    if (input == NULL) {
        sqlite3_result_null(context);
        return;
    }

    /* FIXME: do a dynamic allocation like the for the output */
    UChar u_input[256];
    UErrorCode status = U_ZERO_ERROR;

    u_strFromUTF8(u_input, sizeof(u_input)/sizeof(u_input[0]), NULL, input, -1, &status);
    if (U_FAILURE(status)) {
        sqlite3_result_error(context, u_errorName(status), -1);
        return;
    }

    UChar *u_output = NULL;
    unaccent(u_input, &u_output, &status);
    if (U_FAILURE(status)) {
        sqlite3_result_error(context, u_errorName(status), -1);
        if (u_output) free(u_output);
        return;
    }

    char *output = (char *)malloc((u_strlen(u_output) * 4 + 1) * sizeof(char));
    if (!output) {
        sqlite3_result_error(context, "Memory allocation error", -1);
        free(u_output);
        return;
    }

    u_strToUTF8(output, u_strlen(u_output) * 4 + 1, NULL, u_output, -1, &status);
    if (U_FAILURE(status)) {
        sqlite3_result_error(context, u_errorName(status), -1);
        free(output);
        free(u_output);
        return;
    }

    sqlite3_result_text(context, output, -1, SQLITE_TRANSIENT);

    free(output);
    free(u_output);
}

#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_unaccent_init(
  sqlite3 *db,
  char **pzErrMsg,
  const sqlite3_api_routines *pApi
){
    SQLITE_EXTENSION_INIT2(pApi);
    return sqlite3_create_function(db, "unaccent", 1, SQLITE_UTF8, NULL, unaccent_sqlite, NULL, NULL);
}

I could definitely integrate it in icu.c, and if wanted apply unaccent() in the LIKE operator.

@mikiher
Contributor

mikiher commented Aug 21, 2024

This is very nice, and (at a glance) seems to do what's needed (NFD normalization and removal of all accent characters).

At this point, since you're now implementing your own full-fledged extension, if you wish to continue with this effort, I'm happy to leave the rest to you.

From my perspective, I'd be happy to use this code if it's well tested, and if it doesn't degrade search performance significantly. I think it would also be a valuable extension for the SQLite user community in general.

I think this should be put in a separate project, and preferably provide pre-built binaries for the top platforms, like SQLean does - as I said above, I'd prefer not to require ABS developers to maintain C toolchains.

@advplyr, since you will need to eventually approve integrating this, it would be good to get early feedback on this.

I'm curious to see how either fares in terms of performance. I would guess not too far apart if we handle memory (re)allocations properly. An accent-/case-insensitive INDEX is what we would need to have actually good performance IMO. But since the tables usually have (I guess?) just a few thousand rows it shouldn't be that bad.

I think we have some users with tens of thousands of books in their library.

Regarding an accent and case-insensitive INDEX - yes, I was intending to replace the existing NOCASE index with an aici index. As for how much it can improve performance (especially with LIKE comparisons), I really don't know - it needs to be tested.
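
For concreteness, a hypothetical sketch of that swap using raw queries (index and column names are illustrative, and this assumes the ICU extension is already loaded on the connection); note that per the experiments above this would only help equality and ORDER BY comparisons, not LIKE:

    // Register the custom accent- and case-insensitive collation, then rebuild the index on it
    await sequelize.query("SELECT icu_load_collation('root', 'aici', 'PRIMARY')")
    await sequelize.query('DROP INDEX IF EXISTS books_title_nocase')
    await sequelize.query('CREATE INDEX books_title_aici ON books (title COLLATE aici)')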

@devnoname120
Contributor

devnoname120 commented Aug 23, 2024

@mikiher Thank you for your thoughtful comment. After digging into alekseyt/nunicode I'm convinced that it's an amazing option and I'd advocate for going in that direction. I think it's much better, more performant, and more resilient than anything I could do building my own SQLite extension using ICU.

Upsides:

  • Great performance.
  • Clear/clean code.
  • No dependencies.
  • 300 KB.

Downsides:

  • No accent-insensitive COLLATION, but it has an unaccent() function.
    • Should not be too complicated though to create a new collation that is the combination of NU1300_NOCASE + unaccent() if we feel the need to later on.
  • Not necessarily a downside but the LIKE operator is (surprisingly — at least for me) overridden to perform case-folding aka expansion-insensitive comparisons (see example SELECT 'masse' = 'maße'; below).
    • Quote from the README:

      The LIKE operation in nunicode extension support cases when strings might grow in size during case transformation. On demand this extension may be modified to be compatible with ICU extension.

  • No commits for 4 years.
    • But it looks mature/feature-complete.
    • We might want to shoot an email to the maintainer to gauge his interest in maintaining it if audiobookshelf uses it.

Now for the tests:

❯ sqlite3
SQLite version 3.46.1 2024-08-13 09:16:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

sqlite> SELECT 'masse' = 'Masse';
0
sqlite> SELECT 'masse' LIKE 'Masse';
1
sqlite> SELECT 'masse' = 'maße';
0
sqlite> SELECT 'masse' LIKE 'maße';
0

sqlite> .load libnusqlite3.dylib

sqlite> SELECT 'masse' = 'Masse';
0
sqlite> SELECT 'masse' LIKE 'Masse';
1
sqlite> SELECT 'masse' = 'maße';
0
sqlite> SELECT 'masse' LIKE 'maße';
1
sqlite> SELECT 'masse' = 'maße' COLLATE NU1300;
0
sqlite> SELECT 'masse' LIKE 'maße' COLLATE NU1300;
1
sqlite> SELECT 'masse' = 'maße' COLLATE NU1300_NOCASE;
1

sqlite> SELECT 'árbol' = 'arbol';
0
sqlite> SELECT unaccent('árbol') = 'arbol';
1
sqlite> SELECT unaccent('Árbol') = 'arbol';
0
sqlite> SELECT unaccent('Árbol') = 'arbol' COLLATE NU1300;
0
sqlite> SELECT unaccent('Árbol') = 'arbol' COLLATE NU1300_NOCASE;
1
sqlite> SELECT lower(unaccent('Árbol')) = 'arbol';
1

sqlite> SELECT 'árbol' LIKE 'arbol';
0
sqlite> SELECT unaccent('árbol') LIKE 'arbol';
1
sqlite> SELECT unaccent('Árbol') LIKE 'arbol';
1
sqlite> SELECT unaccent('Árbol') LIKE 'arbol' COLLATE NU1300;
1
sqlite> SELECT unaccent('Árbol') LIKE 'arbol' COLLATE NU1300_NOCASE;
1
sqlite> SELECT lower(unaccent('Árbol')) LIKE 'arbol';
1
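
If we went this route, I imagine the search conditions in ABS could end up looking something like this (a rough sketch; the title column and the JS-side query normalization are illustrative, not the actual ABS code):

    const { Op } = require('sequelize')

    // Normalize the user's query on the JS side: strip combining marks, then lowercase
    const normalizedQuery = query.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase()

    // Mirrors the lower(unaccent(...)) pattern from the sqlite3 session above
    const titleMatches = sequelize.where(
      sequelize.fn('lower', sequelize.fn('unaccent', sequelize.col('title'))),
      { [Op.like]: `%${normalizedQuery}%` }
    )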

@mikiher Let me know your thoughts!

@mikiher
Contributor

mikiher commented Sep 22, 2024

Hi, sorry for the long time it took - had many other things on my plate.

Wanted to give an update on my experiments with nunicode.

The first thing I needed to do was to build it. The project as it currently stands only provides pre-built binaries for Windows x86 32-bit and Linux x86 32-bit and 64-bit. So:

  • I wrote toolchain files for cross compiling to win64 and arm64.
  • I had to slightly modify the main cmake configuration of the project
  • Wrote a github workflow to run the build on a linux machine and upload the generated libraries

The experiments are at https://github.com/mikiher/nunicode-sqlite

I tested the built binaries:

  • On my Windows machine
  • On WSL on my Windows (which has Ubuntu installed)
  • On a QEMU arm64 emulator with Alpine Linux (with gcompat)

I ran the same rudimentary tests as you did above with the sqlite3 CLI. The Windows and Ubuntu tests seem to work nicely, but the arm64 test has some issues, which I still need to figure out: it loads the library successfully and runs some of the test queries successfully, but fails on the following:

sqlite> SELECT 'masse' LIKE 'maße';
0
sqlite> SELECT upper('maße');
MAßÑא≈אאא

I have no idea why it fails - you're welcome to take a look at the workflow and if you have any suggestions, I'll be glad to hear.

@devnoname120
Contributor

devnoname120 commented Sep 22, 2024

@mikiher Thank you for your tests. Weirdly enough the tests that I performed were done on arm64… Specifically on macOS where both M1 and M2 worked fine. I didn't (need to) cross-compile though.

Which version of sqlite3 do you use? Which options are enabled?
As far as I know you don't need gcompat for this one because it doesn't rely on glibc.

I just tested your example and here is what I see on my end:

❯ sqlite3
SQLite version 3.46.1 2024-08-13 09:16:08
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .load libnusqlite3.dylib
sqlite> SELECT 'masse' LIKE 'maße';
1
sqlite> SELECT upper('maße');
MASSE

I did this in order to compile the sqlite extension:

mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make

@mikiher
Contributor

mikiher commented Sep 22, 2024

Which version of sqlite3 do you use?

SQLite version 3.45.3 2024-04-15 13:34:05
It's the one installed by apk add sqlite. I haven't enabled or disabled any particular options.

As far as I know you don't need gcompat for this one because it doesn't rely on glibc.

It doesn't? so it doesn't matter which compiler I used for cross compilation?

I did this in order to compile the sqlite extension

This is roughly what I did as well. You can see all of the actions I ran in the github workflow

As I said there were a few changes I had to make:

  • Added toolchain files for cross compiling to win64 and arm64 in nunicode/cmake
  • Slightly modified the main cmake configuration of the project
    • I removed the find_package(Sqlite3) directive (which sets SQLITE3_INCLUDE_DIR to /usr/include), and the SQLITE3_FOUND condition. Instead, I copied the sqlite3 headers to nunicode/include, and set SQLITE3_INCLUDE_DIR to that. I had to do this because otherwise the cross-compilers would be trying to use headers from /usr/include when compiling the extension, which is not desired.

I don't think any of these changes really matter, with (maybe) the exception of the choice of cross compiler - I used aarch64-linux-gnu-gcc. I mentioned gcompat because I know gcc uses glibc by default.

@devnoname120
Contributor

@mikiher Hmm it seems that the .so is indeed linked against glibc which is not optimal. I'd suggest trying again to compile it but from Alpine this time around. I cannot dig further into it right now unfortunately — the next two weeks will be quite busy on my side.

Tangentially I think that it would make sense at some point to reach out to @alekseyt directly ([email protected]). The README of nunicode provides this email address and explicitly encourages people to contact him.

@mikiher
Contributor

mikiher commented Sep 23, 2024

Thanks, I will also try to cross compile for musl and see if that helps.

The cross compilation is kind of important since I need a way to automatically build the required libraries for each architecture, and there are no github-hosted Alpine aarch64 runners.

@mikiher
Contributor

mikiher commented Sep 24, 2024

So it turned out all of the issues were due to the crappy QEMU console, which seems to mess up non-ASCII input :(

Once I connected to the QEMU emulator using ssh (which has proper UTF-8 support), all the issues went away.
I checked all 3 linux aarch64 libraries I built:

  • cross-compiled with glibc
  • cross-compiled with musl
  • native-compiled (on QEMU aarch64 emulator running Alpine) with musl

They all pass the simple tests above.

I think I'll stick with the original version I built because it's easiest to set up the cross-compiler to build it.

Next I'm going to work on integrating the extensions in ABS and testing.

@devnoname120
Contributor

devnoname120 commented Sep 29, 2024

@mikiher For Docker I'd suggest compiling nunicode directly for each architecture from an Alpine container as part of a multi-stage build and not rely on gcompat or cross-compilation at all for those.

See here for an example of a multi-platform build. Docker makes it surprisingly easy, and you don't have to bother with QEMU by yourself:
https://github.com/code-inflation/cfspeedtest/pull/94/files

And the doc for multi-stage builds in case you are not familiar with them:
https://docs.docker.com/build/building/multi-stage/#use-multi-stage-builds

I can cobble up a PR if it helps!

@mikiher
Contributor

mikiher commented Sep 29, 2024

But I need to build binaries and make them available anyway for non-docker servers. Why not use those artifacts in Audiobookshelf DockerFile since they're already available?

In the future we can consider providing Nunicode as an apk (I admit the additional code in the dockerfile is somewhat ugly).

@devnoname120
Contributor

devnoname120 commented Oct 5, 2024

But I need to build binaries and make them available anyway for non-docker servers. Why not use those artifacts in Audiobookshelf DockerFile since they're already available?

As far as I know Alpine is the only non-marginal Linux distribution that uses musl by default. It does so because it's designed to minimize the size to the strict minimum for use in containers and embedded systems. Virtually all the other Linux distributions use glibc.

IMO the library that is deployed in the Alpine containers should be built from a musl distro (e.g. in an Alpine container) to avoid any issues that may arise while using the gcompat compatibility layer. I'm worried that bugs that end up being actually caused by the compat layer will be very difficult to diagnose.

For all the other distributions the libraries that are distributed should be compiled against glibc. So you could either release two flavors of the nunicode library (musl and glibc), or only release the glibc flavor and build the musl flavor in a stage of the audiobookshelf Docker image.

@mikiher
Contributor

mikiher commented Oct 6, 2024

I'll add the musl flavor to the nunicode-binaries release, and use it to build the arm64 Docker image.

@devnoname120
Contributor

@mikiher Hmm why only the arm64 flavor?

@mikiher
Contributor

mikiher commented Oct 7, 2024

Sorry, was a bit distracted. I meant both arm64 and amd64.

@mikiher
Contributor

mikiher commented Oct 7, 2024

Submitted in PR #3488
