deduplication vignette without true/official records #6

Open
tedmoorman opened this issue Apr 1, 2022 · 4 comments
Labels
question Further information is requested

Comments

@tedmoorman

The deduplication vignette you provided seems to be for the case in which you have a set of true/official records. What if I just want to deduplicate based on some kind of fuzzy-matching criterion because I don't have access to any true/official records? That is the more common situation in most of what I'm doing. Any suggestions or direction would be appreciated.

@djvanderlaan
Owner

In the example I do have the true values, but it is meant to show how to handle the case where the true values are not available; I only use them to evaluate the deduplication and to determine an optimal threshold. When true values are not available, the basic sequence shown below is still valid.

# Generate all candidate pairs within the dataset
pairs <- pair(town_names, deduplication = TRUE)
# Score each pair with the Jaro-Winkler string similarity of the names
compare_pairs(pairs, on = "name", comparators = list(jaro_winkler()), inplace = TRUE)
# Flag pairs whose similarity is at least 0.95 as matches
select_threshold(pairs, "select", "name", threshold = 0.95, inplace = TRUE)
# Transitively close the selected pairs into groups of duplicates
res <- deduplicate_equivalence(pairs, "group", "select")

However, determining the optimal threshold (0.95 above) and the quality of the resulting linkage is more difficult. When the number of records or groups is not too large, you can always inspect the result manually: look at records that were put in the same group and eyeball whether the clustering looks OK. Doing this for several threshold values lets you select an appropriate one.

The right threshold also depends on the use case: when deduplicating a customer database, it might be better to leave some duplicate records in than to merge records that should not be merged; in that case a higher threshold is better.
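
As a minimal sketch (not package functionality), you can rerun the selection step from the snippet above for a few candidate thresholds and compare how many groups each produces before eyeballing the grouped records:

# Reuses `pairs` from the snippet above; sweep a few candidate thresholds
for (th in c(0.90, 0.95, 0.99)) {
  select_threshold(pairs, "select", "name", threshold = th, inplace = TRUE)
  res <- deduplicate_equivalence(pairs, "group", "select")
  cat("threshold", th, "->", length(unique(res$group)), "groups\n")
}

# Eyeball a few groups produced by the last threshold
head(split(as.character(res$name), res$group))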

When manually inspecting the output, you could also manually derive the true values for a subset of records. These can then be used to select a proper threshold and to evaluate the quality. This is the more 'statistical' way.
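
For instance (a sketch with made-up names, not part of the package): if `truth` is a logical vector with your hand labels for a sample of compared pairs, and `selected` flags the pairs that score above the threshold, precision and recall give a rough quality estimate:

# `selected` and `truth` are hypothetical logical vectors over the same sample of pairs
precision <- sum(selected & truth) / sum(selected)  # share of merges that are correct
recall    <- sum(selected & truth) / sum(truth)     # share of true duplicates found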

And I didn't mention this in the vignette (perhaps I should), as it is a bit out of scope for this package, but preprocessing the input can make a huge difference in quality. For example: removing accents from letters, removing (or not removing) punctuation, correcting common spelling errors/variants, de-abbreviating common abbreviations, and stemming.
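
For example, a rough cleaning step in base R might look like the following (a sketch; `clean_names` is not a package function, and iconv's transliteration is platform-dependent):

clean_names <- function(x) {
  x <- tolower(x)
  x <- iconv(x, to = "ASCII//TRANSLIT")  # strip accents
  x <- gsub("[[:punct:]]+", " ", x)      # remove punctuation
  gsub("\\s+", " ", trimws(x))           # collapse whitespace
}
town_names$name_clean <- clean_names(town_names$name)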

Hope this helps.

@djvanderlaan djvanderlaan added the question Further information is requested label Apr 1, 2022
@tedmoorman
Author

This is perfect. I understand much better now. I guess less is more. Maybe separating the vignette into "with true/official records" and "without true/official records" sections might help? Thank you for the package, and I really appreciate the explanation!

@tedmoorman
Author

Here is the kind of thing I was ultimately looking to do. It takes a redundant column, deduplicates it, and adds the most common value among the matched records as a new column.

I'm not sure whether a function like this might be useful for your package ...

library(dplyr)
library(reclin)

data(town_names)

deduped_column <- function(df_column, closeness = 0.94) {
  # Wrap the vector in a data frame so the column is reliably named "x"
  original_column <- data.frame(x = df_column)
  
  # Find all possible pairs in the data
  pairs <- pair(original_column, deduplication = TRUE)
  
  # Compare pairs with the Jaro-Winkler string distance
  compare_pairs(pairs, on = "x", comparators = list(jaro_winkler()), inplace = TRUE)
  
  # Select pairs above the similarity threshold as matches
  select_threshold(pairs, "matching", "x", threshold = closeness, inplace = TRUE)
  
  # Group matching records via transitive closure
  grouped <- deduplicate_equivalence(pairs, "group_id", "matching") %>% 
    rename(original_column = x)
  
  # Within each group, keep the most frequent value as the canonical spelling
  deduped <- grouped %>% 
    group_by(original_column) %>% 
    mutate(value_count = n()) %>% 
    ungroup() %>% 
    arrange(desc(value_count)) %>% 
    distinct(group_id, .keep_all = TRUE) %>% 
    rename(deduped_column = original_column) %>% 
    select(group_id, deduped_column)
  
  # Attach the canonical value to every original record
  merge(grouped, deduped, by = "group_id") %>% 
    select(-group_id)
}

deduped_town_names <- deduped_column(town_names$name)
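
A quick sanity check is to compare the number of distinct spellings before and after (column names as produced by the function above):

length(unique(deduped_town_names$original_column))
length(unique(deduped_town_names$deduped_column))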

@tedmoorman tedmoorman reopened this Apr 1, 2022
@tedmoorman
Author

This function may be a little better:

library(dplyr)
library(reclin)

data(town_names)

deduped_column <- function(df, col, col_deduped, closeness = 0.94) {
  # Column name as a string, for the functions that take column names
  col_char_ref <- deparse(substitute(col))
  
  # Find all possible pairs in the data
  pairs <- pair(df, deduplication = TRUE)
  
  # Compare pairs with the Jaro-Winkler string distance
  compare_pairs(pairs, on = col_char_ref, comparators = list(jaro_winkler()), inplace = TRUE)
  
  # Select pairs above the similarity threshold as matches
  select_threshold(pairs, "matching", col_char_ref, threshold = closeness, inplace = TRUE)
  
  # Group matching records, then standardise the column name internally
  grouped <- deduplicate_equivalence(pairs, "group_id", "matching") %>% 
    rename(original_column = {{ col }})
  
  # Within each group, keep the most frequent value as the canonical spelling
  deduped <- grouped %>% 
    group_by(original_column) %>% 
    mutate(column_count = n()) %>% 
    ungroup() %>% 
    arrange(desc(column_count)) %>% 
    distinct(group_id, .keep_all = TRUE) %>% 
    rename(deduped_column = original_column) %>% 
    select(group_id, deduped_column)
  
  # Attach the canonical value and restore the caller's column names
  merge(grouped, deduped, by = "group_id") %>% 
    select(-group_id) %>% 
    rename({{ col }} := original_column,
           {{ col_deduped }} := deduped_column)
}

deduped_town_names <- deduped_column(town_names, name, name_deduped)
