Jaro-Winkler comparison in duckdb #551
-
Hi all, Migrating existing Splink 2.x code over to 3.x and testing with a duckdb backend on some smaller data. Unfortunately there is no native Jaro-Winkler function (https://duckdb.org/docs/sql/functions/char). Does anyone have any suggestions for implementing this (i.e. an efficient UDF that can be registered)? The .jar with Scala UDFs for this and other functions is an attractive feature for the Spark backend. -- |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment 1 reply
-
Hi Eric, A few notes. First is that you may find the 2 to 3 converter useful. If you plug in the Second, on the issue of UDFs in DuckDB. The general principle here is that, in our move to multiple SQL backends, we can't offer blanket support for all functions (e.g. Jaro Winkler), and, in general, users will have to use functions that are available in their chosen backend (possibly by registering UDFs if they are available). This is probably the biggest drawback of duckdb right now. The functions available are enumerated under the text similarity heading here, as you say. First, there is partial support for Python UDFs, but there's no ability to register a Python function as a UDF. See here. Unfortunately this means they're currently useless for Splink version 3, because any UDF would have to be executed as part of a SQL statement. As far as I understand it, the only way to get a UDF into Python as it stands is writing a C++ extension and registering it with DuckDB. We'd like to have a go at this, but haven't had time. There's a discussion here with some pointers about how to do this. I believe 'full' Python UDFs are on the roadmap, but there's no set date when they will appear in DuckDB. See below for a discussion with one of the DuckDB maintainers, who suggests the extension route at present: A bit about custom functions in R here for reference Finally, my intuition on this is that a model that uses a combination of dmetaphone (which you can precompute) and levenstein and jaccard (which are availabile already in duckdb) will get pretty close to the accuracy of a model that contains jaro. |
Beta Was this translation helpful? Give feedback.
Hi Eric,
A few notes.
First is that you may find the 2 to 3 converter useful. If you plug in the
.json
of a trained v2 model, it will attempt to convert it into the corresponding Splink3 Spark code.Second, on the issue of UDFs in DuckDB.
The general principle here is that, in our move to multiple SQL backends, we can't offer blanket support for all functions (e.g. Jaro Winkler), and, in general, users will have to use functions that are available in their chosen backend (possibly by registering UDFs if they are available).
This is probably the biggest drawback of duckdb right now. The functions available are enumerated under the text similarity heading here, as you say.
First, there is p…