Array comparisons vs Single value comparisons #2550
-
We are trying to match two large datasets (300M rows at full size, sampled down to 1.5M for a POC). I can either run comparisons on the exploded data, which is 25M rows, or run array comparisons on the 1.5M-row grouped dataset. For some reason, `linker.training.estimate_probability_two_random_records_match` finishes in a few minutes on the 25M exploded data, whereas via array comparisons it takes about 6 hours. I converted my deterministic rules to array-based comparisons for this, e.g. `l.email = r.email` becomes `array_length(array_intersect(l.email, r.email)) >= 1`. Am I missing something, or is there a better way to run this on the grouped dataset?
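For reference, a minimal sketch of how I'm invoking this (Splink 4 with the DuckDB backend; the inline data, the placeholder comparison, and the recall value are illustrative, not my real pipeline):

```python
# Minimal repro sketch (Splink 4 + DuckDB backend). The tiny inline data,
# the placeholder comparison, and recall=0.7 are illustrative only.
import pandas as pd
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator

df = pd.DataFrame({
    "unique_id": [1, 2],
    "email": [["a@x.com", "b@y.com"], ["b@y.com"]],
})

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[cl.ExactMatch("email")],  # placeholder comparison
)
linker = Linker(df, settings, db_api=DuckDBAPI())

# Array form of the deterministic rule -- this is the slow variant:
linker.training.estimate_probability_two_random_records_match(
    ["array_length(array_intersect(l.email, r.email)) >= 1"],
    recall=0.7,  # illustrative
)
```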
Replies: 1 comment
-
It's because the array comparison is a filter condition, whereas the `l.email = r.email` comparison is an equi-join condition. You can read more about this here:
https://moj-analytical-services.github.io/splink/topic_guides/blocking/performance.html?h=equi#equi-join-conditions

In your situation I'd recommend sorting your arrays and using something like `l.email[1] = r.email[1]` rather than the array comparison. You could adjust the recall down a bit to account for the fact this won't capture all matches.
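A minimal, untested sketch of that suggestion, assuming Splink 4 on the DuckDB backend (the data, the placeholder comparison, and the recall value are illustrative; `list_sort` and 1-indexed `email[1]` are DuckDB semantics):

```python
# Sketch: sort each record's array once, then join on element 1.
# Assumes Splink 4 + DuckDB; data, comparison, and recall are illustrative.
import duckdb
import pandas as pd
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator

df = pd.DataFrame({
    "unique_id": [1, 2],
    "email": [["b@y.com", "a@x.com"], ["a@x.com"]],
})

con = duckdb.connect()
con.register("input_raw", df)
# list_sort makes element 1 deterministic for every record;
# SELECT * REPLACE swaps the column in place (both are DuckDB features).
df_sorted = con.execute(
    "SELECT * REPLACE (list_sort(email) AS email) FROM input_raw"
).df()

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[cl.ExactMatch("email")],  # placeholder comparison
)
linker = Linker(df_sorted, settings, db_api=DuckDBAPI())

# First-element equality is an equi-join condition, so the backend can
# hash-join on it rather than filtering the full cross product.
linker.training.estimate_probability_two_random_records_match(
    ["l.email[1] = r.email[1]"],
    recall=0.6,  # adjusted down: misses pairs sharing only a later element
)
```

Sorting once up front means element 1 is stable, so any two records whose smallest email address matches will still pair up under the equi join.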