deduplication vignette without true/official records #6
Comments
In the example I do have the true value, but the example is meant to show how to handle the case where the true value is not available. I do use the true value to evaluate the deduplication and to determine an optimal threshold. When the true values are not available, the basic sequence as shown below should still be valid.
However, determining the optimal threshold (0.95 above) and the quality of the resulting linkage is more difficult. When the number of records or groups is not too large you can always manually inspect the result: look at various records that are put together in the same group and eyeball whether the clustering looks OK. You can do this for various threshold values to select an appropriate one. This of course also depends on the use case: when deduplicating a customer database it might be better to leave some duplicate records in than to merge records that should not be merged; in that case a higher threshold is better.

When manually inspecting the output you could also manually derive the true value for a subset of records. These can be used to select a proper value for the threshold and to evaluate the quality. This is the more 'statistical' way.

I didn't mention this in the vignette (perhaps I should), as it is a bit out of scope for this package, but preprocessing the input can make a huge difference in quality. For example: removing accents from letters, removing (or not removing) punctuation, correcting common spelling errors/variants, de-abbreviating common abbreviations, and stemming.

Hope this helps.
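The preprocessing and the "try several thresholds and eyeball the groups" steps described above can be sketched roughly as follows. This is a minimal, standard-library Python illustration and not this package's API: `normalize` and `group_records` are hypothetical helper names, `difflib.SequenceMatcher` stands in for whatever string comparator you actually use, and the sample names and the two thresholds are made up for the example.

```python
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(kept.lower().split())


def group_records(values, threshold):
    """Link every pair scoring >= threshold and return the connected groups."""
    cleaned = [normalize(v) for v in values]
    parent = list(range(len(values)))  # union-find forest, one node per record

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Compare all pairs (assumption: the data is small enough to skip blocking).
    for i, j in combinations(range(len(values)), 2):
        score = SequenceMatcher(None, cleaned[i], cleaned[j]).ratio()
        if score >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i, value in enumerate(values):
        groups.setdefault(find(i), []).append(value)
    return list(groups.values())


# Made-up sample data for manual inspection.
names = ["José García", "Jose Garcia", "jose garcia.", "Maria Lopez", "Mario Lopez"]
for threshold in (0.80, 0.95):
    print(threshold, group_records(names, threshold))
```

With this sample, the lower threshold also merges "Maria Lopez" and "Mario Lopez", while the higher one keeps them apart; which of the two is right depends on the use case, which is exactly what the manual inspection is meant to reveal.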
This is perfect. I understand much better now. I guess less is more. Maybe separating the vignette into a "with true/official records" and "without true/official records" might help? Thank you for the package, and I really appreciate the explanation!
Here is the kind of thing I was ultimately looking to do. It takes a redundant column, deduplicates, and reassigns the most common similar value as a new column. I'm not sure whether a function like this might be useful for your package ...
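As a rough illustration of the idea described here (and not the function posted in this thread), the "reassign the most common value within each duplicate group" step might look like the sketch below. It is standard-library Python; `most_common_per_group` is a hypothetical name, and the group ids are assumed to come from an earlier deduplication step.

```python
from collections import Counter


def most_common_per_group(values, group_ids):
    """Return, for each record, the most frequent value within its group."""
    counts = {}
    for value, group in zip(values, group_ids):
        counts.setdefault(group, Counter())[value] += 1
    # Ties are resolved in favour of the value seen first in the group.
    representative = {g: c.most_common(1)[0][0] for g, c in counts.items()}
    return [representative[g] for g in group_ids]


# Made-up data: group ids are assumed to be the output of a deduplication step.
cities = ["New York", "new york", "New York", "Boston", "Bostn", "Boston"]
group_ids = [0, 0, 0, 1, 1, 1]
canonical = most_common_per_group(cities, group_ids)
print(canonical)
```

The returned list has one entry per input record, so it can be attached to the original table as a new column next to the redundant one.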
This function may be a little better:
The deduplication vignette you provided seems to be for the case in which you have a set of true/official records. What if I just wanted to deduplicate based on some kind of fuzzy matching criteria because I don't have access to any true/official records? This seems to be more common for most of what I'm doing. Any suggestions or direction is appreciated.