deduplication vignette without true/official records #6

Open
tedmoorman opened this issue Apr 1, 2022 · 4 comments
Labels
question Further information is requested

Comments

@tedmoorman

The deduplication vignette you provided seems to be for the case in which you have a set of true/official records. What if I just want to deduplicate based on some kind of fuzzy-matching criterion because I don't have access to any true/official records? That is the more common situation in most of what I'm doing. Any suggestions or direction would be appreciated.

@djvanderlaan
Owner

In the example I do have the true values, but it is meant to show how to handle the case where the true values are not available; I only use them to evaluate the deduplication and to determine an optimal threshold. When true values are not available, the basic sequence shown below is still valid.

# Generate all candidate pairs within the dataset
pairs <- pair(town_names, deduplication = TRUE)
# Score each pair with the Jaro-Winkler string similarity of the names
compare_pairs(pairs, on = "name", comparators = list(jaro_winkler()), inplace = TRUE)
# Flag pairs whose similarity is at least 0.95 as matches
select_threshold(pairs, "select", "name", threshold = 0.95, inplace = TRUE)
# Transitively close the selected pairs into groups of duplicates
res <- deduplicate_equivalence(pairs, "group", "select")

However, determining the optimal threshold (0.95 above) and the quality of the resulting linkage is more difficult. When the number of records or groups is not too large, you can always inspect the result manually: look at records that were put in the same group and eyeball whether the clustering looks OK. Doing this for several threshold values lets you select an appropriate one.

The right threshold also depends on the use case: when deduplicating a customer database, it might be better to leave some duplicate records in than to merge records that should not be merged; in that case a higher threshold is better.
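
As a minimal sketch (not package functionality), you can rerun the selection step from the snippet above for a few candidate thresholds and compare how many groups each produces before eyeballing the grouped records:

# Reuses `pairs` from the snippet above; sweep a few candidate thresholds
for (th in c(0.90, 0.95, 0.99)) {
  select_threshold(pairs, "select", "name", threshold = th, inplace = TRUE)
  res <- deduplicate_equivalence(pairs, "group", "select")
  cat("threshold", th, "->", length(unique(res$group)), "groups\n")
}

# Eyeball a few groups produced by the last threshold
head(split(as.character(res$name), res$group))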

When manually inspecting the output, you could also manually derive the true values for a subset of records. These can then be used to select a proper threshold and to evaluate the quality. This is the more 'statistical' way.
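
For instance (a sketch with made-up names, not part of the package): if `truth` is a logical vector with your hand labels for a sample of compared pairs, and `selected` flags the pairs that score above the threshold, precision and recall give a rough quality estimate:

# `selected` and `truth` are hypothetical logical vectors over the same sample of pairs
precision <- sum(selected & truth) / sum(selected)  # share of merges that are correct
recall    <- sum(selected & truth) / sum(truth)     # share of true duplicates found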

And I didn't mention this in the vignette (perhaps I should), as it is a bit out of scope for this package, but preprocessing the input can make a huge difference in quality. For example: removing accents from letters, removing (or not removing) punctuation, correcting common spelling errors/variants, de-abbreviating common abbreviations, and stemming.
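
For example, a rough cleaning step in base R might look like the following (a sketch; `clean_names` is not a package function, and iconv's transliteration is platform-dependent):

clean_names <- function(x) {
  x <- tolower(x)
  x <- iconv(x, to = "ASCII//TRANSLIT")  # strip accents
  x <- gsub("[[:punct:]]+", " ", x)      # remove punctuation
  gsub("\\s+", " ", trimws(x))           # collapse whitespace
}
town_names$name_clean <- clean_names(town_names$name)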

Hope this helps.

@djvanderlaan djvanderlaan added the question Further information is requested label Apr 1, 2022
@tedmoorman
Author

This is perfect. I understand much better now. I guess less is more. Maybe separating the vignette into "with true/official records" and "without true/official records" sections might help? Thank you for the package, and I really appreciate the explanation!

@tedmoorman
Author

Here is the kind of thing I was ultimately looking to do. It takes a redundant column, deduplicates it, and adds the most common value among the matched records as a new column.

I'm not sure whether a function like this might be useful for your package ...

library(dplyr)
library(reclin)

data(town_names)

deduped_column <- function(df_column, closeness = 0.94) {
  # Wrap the vector in a data frame so the column is reliably named "x"
  original_column <- data.frame(x = df_column)
  
  # Find all possible pairs in the data
  pairs <- pair(original_column, deduplication = TRUE)
  
  # Compare pairs with the Jaro-Winkler string distance
  compare_pairs(pairs, on = "x", comparators = list(jaro_winkler()), inplace = TRUE)
  
  # Select pairs above the similarity threshold as matches
  select_threshold(pairs, "matching", "x", threshold = closeness, inplace = TRUE)
  
  # Group matching records via transitive closure
  grouped <- deduplicate_equivalence(pairs, "group_id", "matching") %>% 
    rename(original_column = x)
  
  # Within each group, keep the most frequent value as the canonical spelling
  deduped <- grouped %>% 
    group_by(original_column) %>% 
    mutate(value_count = n()) %>% 
    ungroup() %>% 
    arrange(desc(value_count)) %>% 
    distinct(group_id, .keep_all = TRUE) %>% 
    rename(deduped_column = original_column) %>% 
    select(group_id, deduped_column)
  
  # Attach the canonical value to every original record
  merge(grouped, deduped, by = "group_id") %>% 
    select(-group_id)
}

deduped_town_names <- deduped_column(town_names$name)
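
A quick sanity check is to compare the number of distinct spellings before and after (column names as produced by the function above):

length(unique(deduped_town_names$original_column))
length(unique(deduped_town_names$deduped_column))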

@tedmoorman tedmoorman reopened this Apr 1, 2022
@tedmoorman
Author

This function may be a little better:

library(dplyr)
library(reclin)

data(town_names)

deduped_column <- function(df, col, col_deduped, closeness = 0.94) {
  # Column name as a string, for the functions that take column names
  col_char_ref <- deparse(substitute(col))
  
  # Find all possible pairs in the data
  pairs <- pair(df, deduplication = TRUE)
  
  # Compare pairs with the Jaro-Winkler string distance
  compare_pairs(pairs, on = col_char_ref, comparators = list(jaro_winkler()), inplace = TRUE)
  
  # Select pairs above the similarity threshold as matches
  select_threshold(pairs, "matching", col_char_ref, threshold = closeness, inplace = TRUE)
  
  # Group matching records, then standardise the column name internally
  grouped <- deduplicate_equivalence(pairs, "group_id", "matching") %>% 
    rename(original_column = {{ col }})
  
  # Within each group, keep the most frequent value as the canonical spelling
  deduped <- grouped %>% 
    group_by(original_column) %>% 
    mutate(column_count = n()) %>% 
    ungroup() %>% 
    arrange(desc(column_count)) %>% 
    distinct(group_id, .keep_all = TRUE) %>% 
    rename(deduped_column = original_column) %>% 
    select(group_id, deduped_column)
  
  # Attach the canonical value and restore the caller's column names
  merge(grouped, deduped, by = "group_id") %>% 
    select(-group_id) %>% 
    rename({{ col }} := original_column,
           {{ col_deduped }} := deduped_column)
}

deduped_town_names <- deduped_column(town_names, name, name_deduped)
