"Guiding" the weights towards the most important comparison #1975

V-Lamp · 2024-02-15T22:54:10Z

I have 3 key comparisons: company name, address, and location (postcode + town).

I have ended up with a system of more than 15 blocking rules and custom comparison levels. Let me explain why, and what I may be doing wrong:

I always want company names to match to some degree. But the naive training gives very high probability to matching address & location, to the point that completely different names still receive a very high probability (99.9%).

So I decided to guard against this with blocking rules that enforce a minimum level of name similarity (hence the 15-20 blocking rules). But then, because matching names become common, matching addresses get large weights, similar to or greater than those for name matching, and the "else" case for address gets very large (much larger than the one for name).

How can I "teach" or "guide" the algorithm to realise that name matching is more important, and that address or location matches are there mainly to support a weak name match?

I attach my weights diagram. Essentially, I would want the "else" cases reversed at least, and the positive matches on name to have a stronger effect. I think the imbalance is caused by the guided blocking rules, which make name matching appear far more common, but what else can I do?

I think this kind of intuition-oriented guidance is a gap in the otherwise amazing documentation.

Kameron-Eck · 2025-01-27T22:34:52Z

Any update?

zmbc Jan 27, 2025

I think the crux of the problem here was "the naive training gives very high probability to matching address & location, to the point that completely different names still receive a very high probability"

and I have a feeling @V-Lamp eventually figured out that this was because postcode and address strongly violate the "conditional independence" assumption of Fellegi-Sunter: when one of them matches by chance (e.g. two companies at the same address), the other is also very likely to match by chance (e.g. those two companies also have the same postcode). See #2413.
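To make the double-counting concrete, here is a rough numeric sketch. All m/u probabilities, the name "else" weight, and the prior weight are made-up illustrative values, not numbers from the model in this thread. It compares the evidence Fellegi-Sunter adds up when address and postcode are treated as independent with the evidence they carry if a shared address almost always implies a shared postcode.

```python
import math

# Hypothetical m/u probabilities for an exact match on each column.
# m = P(agree | records are a true match), u = P(agree | records are not a match).
m_address, u_address = 0.90, 0.001
m_postcode, u_postcode = 0.95, 0.002


def match_weight(m: float, u: float) -> float:
    """Match weight in bits: log2 of the Bayes factor m/u."""
    return math.log2(m / u)


def weight_to_probability(weight: float) -> float:
    """Convert a total match weight (log2 odds) into a probability."""
    odds = 2.0 ** weight
    return odds / (1.0 + odds)


# Fellegi-Sunter sums the per-comparison weights, i.e. it assumes independence.
naive_weight = match_weight(m_address, u_address) + match_weight(m_postcode, u_postcode)

# Assumption: given the same address, the postcode also agrees 99% of the time,
# for matches and non-matches alike. The pair then carries barely more
# evidence than the address alone.
p_postcode_given_address = 0.99
joint_weight = match_weight(
    m_address * p_postcode_given_address,
    u_address * p_postcode_given_address,
)

overcount_bits = naive_weight - joint_weight

# Hypothetical remaining terms: a prior weight and a name "else" weight.
prior_weight = -6.0
name_else_weight = -8.0

p_naive = weight_to_probability(prior_weight + name_else_weight + naive_weight)
p_joint = weight_to_probability(prior_weight + name_else_weight + joint_weight)

print(f"evidence counted twice: {overcount_bits:.2f} bits")
print(f"P(match) under independence: {p_naive:.3f}")
print(f"P(match) with joint evidence: {p_joint:.3f}")
```

With these made-up numbers, the second correlated column contributes several bits of evidence that are not really there, which is enough to outweigh a complete mismatch on name.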

In general the best approach for this right now in Splink is to combine the two highly correlated columns into a single comparison, though this can lead to an extremely complicated comparison.
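As a minimal sketch of what a combined comparison could look like, the dictionary below merges address and postcode into one comparison with levels ordered from strongest to weakest agreement. The column names `address` and `postcode`, the level choices, and the labels are assumptions for illustration; the keys follow Splink's dictionary format for comparisons as I understand it, so check them against the version you are using.

```python
# Hypothetical combined comparison for two correlated columns.
# Column names (address, postcode) and the chosen levels are illustrative only.
address_postcode_comparison = {
    "output_column_name": "address_postcode",
    "comparison_description": "Address and postcode evaluated jointly",
    "comparison_levels": [
        {
            # Treat the pair as null only when both pieces of evidence are missing
            # on at least one side.
            "sql_condition": (
                "(address_l IS NULL OR address_r IS NULL) "
                "AND (postcode_l IS NULL OR postcode_r IS NULL)"
            ),
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": "address_l = address_r AND postcode_l = postcode_r",
            "label_for_charts": "Exact match on address and postcode",
        },
        {
            "sql_condition": "address_l = address_r",
            "label_for_charts": "Exact match on address only",
        },
        {
            "sql_condition": "postcode_l = postcode_r",
            "label_for_charts": "Exact match on postcode only",
        },
        {
            "sql_condition": "ELSE",
            "label_for_charts": "All other comparisons",
        },
    ],
}

# The dictionary would then be listed under "comparisons" in the settings,
# in place of the separate address and postcode comparisons.
level_labels = [
    level["label_for_charts"]
    for level in address_postcode_comparison["comparison_levels"]
]
print(level_labels)
```

Because the model learns one m and one u per joint level, agreement on both columns is scored once rather than twice. The cost is the one mentioned above: adding fuzzy levels for each column multiplies the number of combinations quickly.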
