Jaro-Winkler comparison in duckdb #551

ericmanning · 2022-06-21T17:57:49Z


Hi all,

Migrating existing Splink 2.x code over to 3.x and testing with a duckdb backend on some smaller data.

Unfortunately there is no native Jaro-Winkler function (https://duckdb.org/docs/sql/functions/char).

Does anyone have any suggestions for implementing this (i.e. an efficient UDF that can be registered)? The .jar with Scala UDFs for this and other functions is an attractive feature for the Spark backend.

--
Edit: Presumably something like this C++ function (https://github.com/maxbachmann/jarowinkler-cpp) registered to duckdb (https://duckdb.org/docs/api/cpp).

Answered by RobinL


RobinL · 2022-06-26T13:06:55Z

Maintainer

Hi Eric,

A few notes.

First is that you may find the 2 to 3 converter useful. If you plug in the .json of a trained v2 model, it will attempt to convert it into the corresponding Splink3 Spark code.

Second, on the issue of UDFs in DuckDB.

The general principle here is that, in our move to multiple SQL backends, we can't offer blanket support for all functions (e.g. Jaro Winkler), and, in general, users will have to use functions that are available in their chosen backend (possibly by registering UDFs if they are available).

This is probably the biggest drawback of duckdb right now. The functions available are enumerated under the text similarity heading here, as you say.

On Python specifically: there is partial support for Python UDFs, but there's no ability to register a Python function as a UDF that can be called from SQL. See here. Unfortunately this means they're currently useless for Splink version 3, because any UDF would have to be executed as part of a SQL statement.

As far as I understand it, the only way to get a UDF into DuckDB as it stands is to write a C++ extension and register it with DuckDB. We'd like to have a go at this, but haven't had time. There's a discussion here with some pointers about how to do this.

I believe 'full' Python UDFs are on the roadmap, but there's no set date when they will appear in DuckDB.

See below for a discussion with one of the DuckDB maintainers, who suggests the extension route at present:

A bit about custom functions in R here for reference

Finally, my intuition on this is that a model that uses a combination of dmetaphone (which you can precompute) and levenshtein and jaccard (which are already available in duckdb) will get pretty close to the accuracy of a model that contains jaro.
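To make the two measures mentioned above concrete, here is a minimal pure-Python sketch of what they compute. It is for illustration only and is not DuckDB's implementation: the function names are hypothetical, and the Jaccard version assumes a comparison of the sets of characters in each string, which may differ in detail from what a given backend does.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Edit distance: minimum number of single-character
    insertions, deletions, or substitutions turning s1 into s2."""
    prev = list(range(len(s2) + 1))  # distances from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete c1
                    curr[j - 1] + 1,           # insert c2
                    prev[j - 1] + (c1 != c2),  # substitute (free if equal)
                )
            )
        prev = curr
    return prev[-1]


def jaccard(s1: str, s2: str) -> float:
    """Jaccard similarity of the character sets of s1 and s2
    (assumption: set-of-characters variant)."""
    a, b = set(s1), set(s2)
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return len(a & b) / len(a | b)


print(levenshtein("robin", "robyn"))
print(jaccard("robin", "robyn"))
```

In a model, each measure would typically be thresholded into comparison levels (for example, an edit distance of at most 1 or 2), alongside an exact match on the precomputed dmetaphone value.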

1 reply

RobinL Aug 3, 2022
Maintainer

@ericmanning thanks to the issue you raised, this is now in DuckDB (thanks very much!). Sample code (you need to run pip install --upgrade --pre duckdb to get latest pre release):

```python
import duckdb
import pandas as pd

# One pair of strings to compare
df = pd.DataFrame([{"a": "robin", "b": "robyn"}])

# In-memory database; expose the DataFrame to SQL as the table "df"
con = duckdb.connect(":memory:")
con.register("df", df)

sql = """
select jaro_winkler_similarity(a, b) as jaro_winkler,
       jaro_similarity(a, b) as jaro
from df
"""

df = con.execute(sql).fetch_df()

print(f"{duckdb.__version__=}")
print(df.to_markdown())  # to_markdown() needs the tabulate package
```

which results in

duckdb.__version__='0.4.1-dev1214'

|    |   jaro_winkler |     jaro |
|---:|---------------:|---------:|
|  0 |       0.906667 | 0.866667 |
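The two figures above can be cross-checked by hand. The following is a minimal pure-Python sketch of the textbook Jaro and Jaro-Winkler formulas, assuming the usual defaults (prefix weight 0.1, prefix capped at 4 characters). The function names mirror the SQL ones for readability but are otherwise hypothetical; this is not DuckDB's code and may differ from it in edge cases.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if equal and at most `window` positions apart
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler_similarity(
    s1: str, s2: str, prefix_weight: float = 0.1, max_prefix: int = 4
) -> float:
    jaro = jaro_similarity(s1, s2)
    # Length of the common prefix, capped at max_prefix
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    # Boost the Jaro score for strings that share a prefix
    return jaro + prefix * prefix_weight * (1 - jaro)


print(round(jaro_similarity("robin", "robyn"), 6))          # 0.866667
print(round(jaro_winkler_similarity("robin", "robyn"), 6))  # 0.906667
```

For "robin" and "robyn" there are 4 matching characters and no transpositions, so Jaro is (4/5 + 4/5 + 4/4) / 3 = 0.866667. The shared prefix "rob" has length 3, giving 0.866667 + 3 × 0.1 × (1 − 0.866667) = 0.906667, in agreement with the table.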
