Array comparisons vs Single value comparisons #2550
-
We are trying to match two large datasets (300M rows at full size, sampled down to 1.5M for a POC). I can either run comparisons on the exploded data, which is 25M rows, or run array comparisons on the 1.5M-row grouped dataset. For some reason, `linker.training.estimate_probability_two_random_records_match` finishes in a few minutes on the 25M exploded data, whereas via array comparisons it takes about 6 hours. I converted my deterministic rules to array-based comparisons for this, e.g. `l.email = r.email` becomes `array_length(array_intersect(l.email, r.email)) >= 1`. Am I missing something, or is there a better way to run this on the grouped dataset?
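For reference, a minimal sketch of how I'm invoking this (Splink 4 with the DuckDB backend; the inline data, the placeholder comparison, and the recall value are illustrative, not my real pipeline):

```python
# Minimal repro sketch (Splink 4 + DuckDB backend). The tiny inline data,
# the placeholder comparison, and recall=0.7 are illustrative only.
import pandas as pd
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator

df = pd.DataFrame({
    "unique_id": [1, 2],
    "email": [["a@x.com", "b@y.com"], ["b@y.com"]],
})

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[cl.ExactMatch("email")],  # placeholder comparison
)
linker = Linker(df, settings, db_api=DuckDBAPI())

# Array form of the deterministic rule -- this is the slow variant:
linker.training.estimate_probability_two_random_records_match(
    ["array_length(array_intersect(l.email, r.email)) >= 1"],
    recall=0.7,  # illustrative
)
```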
Replies: 1 comment
-
It's because the array comparison is a filter condition, whereas the `l.email = r.email` comparison is an equi-join condition. You can read more about this here:
https://moj-analytical-services.github.io/splink/topic_guides/blocking/performance.html?h=equi#equi-join-conditions

In your situation I'd recommend sorting your arrays and using something like `l.email[1] = r.email[1]` rather than the array comparison. You could adjust the recall down a bit to account for the fact this won't capture all matches.
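A minimal, untested sketch of that suggestion, assuming Splink 4 on the DuckDB backend (the data, the placeholder comparison, and the recall value are illustrative; `list_sort` and 1-indexed `email[1]` are DuckDB semantics):

```python
# Sketch: sort each record's array once, then join on element 1.
# Assumes Splink 4 + DuckDB; data, comparison, and recall are illustrative.
import duckdb
import pandas as pd
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator

df = pd.DataFrame({
    "unique_id": [1, 2],
    "email": [["b@y.com", "a@x.com"], ["a@x.com"]],
})

con = duckdb.connect()
con.register("input_raw", df)
# list_sort makes element 1 deterministic for every record;
# SELECT * REPLACE swaps the column in place (both are DuckDB features).
df_sorted = con.execute(
    "SELECT * REPLACE (list_sort(email) AS email) FROM input_raw"
).df()

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[cl.ExactMatch("email")],  # placeholder comparison
)
linker = Linker(df_sorted, settings, db_api=DuckDBAPI())

# First-element equality is an equi-join condition, so the backend can
# hash-join on it rather than filtering the full cross product.
linker.training.estimate_probability_two_random_records_match(
    ["l.email[1] = r.email[1]"],
    recall=0.6,  # adjusted down: misses pairs sharing only a later element
)
```

Sorting once up front means element 1 is stable, so any two records whose smallest email address matches will still pair up under the equi join.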