The goal of {dfuzz} is to help you clean up a messy column of character strings in your tibble or data.frame.
This package is highly experimental and is not yet ready for real applications.
It is built around two dependencies, which themselves have no dependencies:
- {stringdist}: the full power of the stringdist() function from this excellent package is available.
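To give a feel for what stringdist() measures, here is a minimal sketch that calls {stringdist} directly on a few of the strings used in the toy example further down. It does not show how {dfuzz} calls the function internally; the choice of strings and of the Jaro-Winkler settings is only illustrative.

```r
library(stringdist)

# The default method is "osa" (optimal string alignment), an edit distance.
d_aple   <- stringdist("aple", "apple")     # one insertion
d_bonana <- stringdist("banana", "bonana")  # one substitution

# The comparison is case-sensitive: "ApplE" differs from "apple" by two
# substitutions (A/a and E/e) unless the case is normalised first.
d_case      <- stringdist("ApplE", "apple")
d_case_norm <- stringdist(tolower("ApplE"), "apple")

c(aple = d_aple, bonana = d_bonana, case = d_case, case_norm = d_case_norm)

# Other metrics are selected with the method argument, e.g. Jaro-Winkler.
stringdist("aple", "apple", method = "jw", p = 0.1)
```

The case sensitivity shown here is consistent with "ApplE" being left untouched in the base R outputs of the toy example.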
{dfuzz} aims to be compatible with both the tidyverse and base R dialects.
You can install this package using {remotes} (or {devtools}):
remotes::install_github("courtiol/dfuzz")
library(dfuzz)
## a toy example:
test_df <- data.frame(fruit = c("banana", "blueberry", "limon", "pinapple",
                                "aple", "apple", "ApplE", "bonana"))
test_df
#> fruit
#> 1 banana
#> 2 blueberry
#> 3 limon
#> 4 pinapple
#> 5 aple
#> 6 apple
#> 7 ApplE
#> 8 bonana
## fast and dirty workflow:
clean_df1 <- fuzzy_tidy(test_df, fruit)
clean_df1
#> fruit fruit.clean fruit.cleaned fruit.tidy
#> 1 banana <NA> banana banana
#> 2 blueberry blueberry <NA> blueberry
#> 3 limon limon <NA> limon
#> 4 pinapple pinapple <NA> pinapple
#> 5 aple <NA> aple aple
#> 6 apple <NA> aple aple
#> 7 ApplE ApplE <NA> ApplE
#> 8 bonana <NA> banana banana
## more subtle workflow:
template_fruit <- fuzzy_match(test_df, fruit)
template_fruit
#> selected syn_1 syn_2
#> 1 aple aple apple
#> 2 banana banana bonana
template_fruit$selected[1] <- "apple"
clean_df2 <- fuzzy_tidy(test_df, fruit, template_fruit)
clean_df2
#> fruit fruit.clean fruit.cleaned fruit.tidy
#> 1 banana <NA> banana banana
#> 2 blueberry blueberry <NA> blueberry
#> 3 limon limon <NA> limon
#> 4 pinapple pinapple <NA> pinapple
#> 5 aple <NA> apple apple
#> 6 apple <NA> apple apple
#> 7 ApplE ApplE <NA> ApplE
#> 8 bonana <NA> banana banana
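The template printed above pairs strings that are close to each other. As a rough illustration of how such pairs can be found, the sketch below computes all pairwise distances with stringdist::stringdistmatrix() and keeps the pairs at distance 1. This is an assumption-laden sketch, not the algorithm {dfuzz} uses: the threshold max_dist and the variable names are hypothetical.

```r
library(stringdist)

fruit <- c("banana", "blueberry", "limon", "pinapple",
           "aple", "apple", "ApplE", "bonana")

# Pairwise "osa" distances between all strings (8 x 8 matrix).
d <- stringdistmatrix(fruit, fruit, method = "osa")

# Hypothetical threshold: treat strings at most 1 edit apart as synonyms.
max_dist <- 1

# Keep each pair once (upper triangle) and drop identical strings (d == 0).
close_pairs <- which(d > 0 & d <= max_dist & upper.tri(d), arr.ind = TRUE)

pairs <- data.frame(syn_1 = fruit[close_pairs[, "row"]],
                    syn_2 = fruit[close_pairs[, "col"]])
pairs
```

With this threshold the sketch recovers the same two pairs as the template (aple/apple and banana/bonana), while "ApplE" stays on its own because it is two case substitutions away from "apple".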
## fast and dirty workflow with {tidyverse}:
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
#> ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
#> ✓ tibble 3.0.4 ✓ dplyr 1.0.2
#> ✓ tidyr 1.1.2 ✓ stringr 1.4.0
#> ✓ readr 1.4.0 ✓ forcats 0.5.0
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
test_df %>%
  fuzzy_tidy(fruit) %>%
  mutate(fruit = fruit.tidy) %>%
  select(-contains("fruit."))
#> # A tibble: 8 x 1
#> fruit
#> <chr>
#> 1 banana
#> 2 blueberry
#> 3 limon
#> 4 pinapple
#> 5 aple
#> 6 aple
#> 7 ApplE
#> 8 banana
## more subtle workflow with {tidyverse}:
test_df %>%
  mutate(fruit = str_to_title(fruit)) %>%
  fuzzy_match(fruit) -> template_fruit
template_fruit
#> # A tibble: 2 x 3
#> selected syn_1 syn_2
#> <chr> <chr> <chr>
#> 1 Aple Aple Apple
#> 2 Banana Banana Bonana
template_fruit %>%
  mutate(selected = fct_recode(selected, Apple = "Aple")) -> better_template_fruit
better_template_fruit
#> # A tibble: 2 x 3
#> selected syn_1 syn_2
#> <fct> <chr> <chr>
#> 1 Apple Aple Apple
#> 2 Banana Banana Bonana
test_df %>%
  mutate(fruit = str_to_title(fruit)) %>%
  fuzzy_tidy(fruit, better_template_fruit) -> clean_df3
clean_df3
#> # A tibble: 8 x 4
#> fruit fruit.clean fruit.cleaned fruit.tidy
#> <chr> <chr> <chr> <chr>
#> 1 Banana <NA> Banana Banana
#> 2 Blueberry Blueberry <NA> Blueberry
#> 3 Limon Limon <NA> Limon
#> 4 Pinapple Pinapple <NA> Pinapple
#> 5 Aple <NA> Apple Apple
#> 6 Apple <NA> Apple Apple
#> 7 Apple <NA> Apple Apple
#> 8 Bonana <NA> Banana Banana
clean_df3 %>%
  mutate(fruit = fruit.tidy) %>%
  select(-contains("fruit."))
#> # A tibble: 8 x 1
#> fruit
#> <chr>
#> 1 Banana
#> 2 Blueberry
#> 3 Limon
#> 4 Pinapple
#> 5 Apple
#> 6 Apple
#> 7 Apple
#> 8 Banana
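The last tidyverse step above (overwrite fruit with fruit.tidy, then drop the helper columns) can also be written in base R. The sketch below is one possible way to do it; so that it runs without {dfuzz} installed, clean_df2 is rebuilt by hand from the output printed in the base R workflow rather than produced by fuzzy_tidy().

```r
# clean_df2 rebuilt by hand from the printed output of the base R workflow
clean_df2 <- data.frame(
  fruit         = c("banana", "blueberry", "limon", "pinapple",
                    "aple", "apple", "ApplE", "bonana"),
  fruit.clean   = c(NA, "blueberry", "limon", "pinapple", NA, NA, "ApplE", NA),
  fruit.cleaned = c("banana", NA, NA, NA, "apple", "apple", NA, "banana"),
  fruit.tidy    = c("banana", "blueberry", "limon", "pinapple",
                    "apple", "apple", "ApplE", "banana")
)

# Overwrite the original column with the tidy version...
clean_df2$fruit <- clean_df2$fruit.tidy

# ...and drop the helper columns whose names contain "fruit." (literal dot).
keep <- !grepl("fruit.", names(clean_df2), fixed = TRUE)
clean_df2 <- clean_df2[, keep, drop = FALSE]
clean_df2
```

The drop = FALSE argument keeps the result a one-column data.frame instead of collapsing it to a vector.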
If you think this package is an idea worth pursuing, please let me know. Development is always more fun as a collaborative effort, so please email me (or open an issue) if you want to get involved!