[EDIT: This content is now available in the main docs here]
Correct!
Thanks Robin, this really helps to explain Splink's approach to parameter estimation. I get that the m values aren't the be-all and end-all of the model as a whole, but would it be worthwhile elaborating further in section 3 on the choices we make for the blocking rules used in training the m values? Their values relative to other comparisons can be impactful, and estimating m is a sticking point in other discussions. At the moment, in the linked page, we have:
In another portion of the documentation on estimation of m, it's stated:
Both of these suggest that the main driver is to keep the number of comparisons low for performance, but should the primary consideration be the proportion of positive matches in that block (with reducing the block size secondary)? Likewise, in the docstring for
This suggests more that the (relative) proportion of matches in the sample should be higher, but doesn't really indicate which end (if any) of that range is preferable. I feel the new page could be expanded to detail the blocking approach a bit more (but it might be the ML side of my brain driving this feeling). Given we are trying to estimate m as a measure of data quality, and more specifically as the probability of a scenario given that the records are a match, are we looking for:
I feel that the choice can have an impact on the model, especially when we view the
Correct!