Proposal: keep_levels() with validation #3514

jonocarroll · 2018-04-15T11:14:00Z

An issue I've faced many times is filtering a data.frame/tibble based on "levels" (unique values or factor levels) of a column. A common (?) pattern for this is to create a logical vector of presence/absence of the levels compared to a vector of acceptable levels, which filter then accepts

filter(iris, Species %in% c("setosa"))

The problem with this is that while filter and %in% are both doing their jobs correctly, together they can introduce an unexpected result

filter(iris, Species %in% c("virginica", "samosa"))

One might reasonably expect an error in the attempted filter to levels not present in the data, in the same way that select fails loudly if a column is not present

select(iris, Species, Spaceship)
#> Error in FUN(X[[i]], ...) : object 'Spaceship' not found

The easiest way to get into this situation is to misspell a level, in which case it will (silently) not be in the filtered result

filter(iris, Species %in% c("selosa", "virginica")) %>% {unique(.$Species)}
#> [1] virginica
#> Levels: setosa versicolor virginica

Neither filter nor %in% is at fault here, but the intent of the code is lost because the implementation is not specific enough: filter does not perform "filter to the rows which contain X" but rather "filter to the rows for which some condition is TRUE" and delegates that responsibility of identifying those to %in%. %in% knows nothing of the intent so it responds faithfully with "X is not in this vector".

I propose a new function keep_levels (and its counterpart discard_levels) which takes three main arguments rather than two: the levels to be kept/discarded are explicitly provided rather than providing a logical vector upfront. This alleviates the unexpected result by validating that the levels are indeed present

keep_levels(iris, Species, c("selosa", "virginica"))
#> Error: Level(s) not present in Species: "selosa"

I have a prototype in the following gist (or expand below), though it is unlikely to be constructed the way dplyr would implement it and is not optimised to the C-level

https://gist.github.com/jonocarroll/7f35eefb35f77aaefce527f0b16b034f

It does however perform unquoting with rlang and provides informative error messages on failure.

Would this prototype be welcome as a PR? If so, I will write the tests formally and submit it.

Original motivation:

Source of today's mild heart-attack: I have categories W, X_Y, and Z in some data. Intending to keep only the second two:

data %>% filter(g %in% c("X Y", "Z")

Did you spot that I used a space instead of an underscore? I sure as heck didn't, and filtered excessively to just Z.
— Jonathan Carroll (@carroll_jono) March 6, 2018

Prototype:

#' Keep only certain groups/levels of a column
#'
#' @param .d a data.frame or tibble
#' @param .g column containing groups/levels to be kept or discarded
#' @param .l groups/levels to be kept or discarded as character vector
#'
#' @return a data.frame or tibble with the same or fewer rows after filtering
#' @export
#' @importFrom rlang enquo
#'
#' @seealso discard_levels
#'
#' @examples
#' keep_levels(iris, Species, "setosa")
keep_levels <- function(.d, .g, .l) {
  g <- validate_level_inputs(.d, rlang::enquo(.g), .l)
  .d[.d[[g]] %in% .l, ]
}


#' Discard certain groups/levels of a column
#'
#' @param .d a data.frame or tibble
#' @param .g column containing groups/levels to be kept or discarded
#' @param .l groups/levels to be kept or discarded as character vector
#'
#' @return a data.frame or tibble with the same or fewer rows after filtering
#' @export
#' @importFrom rlang enquo
#'
#' @seealso keep_levels
#'
#' @examples
#' discard_levels(iris, Species, "setosa")
discard_levels <- function(.d, .g, .l) {
  g <- validate_level_inputs(.d, rlang::enquo(.g), .l)
  .d[! .d[[g]] %in% .l, ]
}


#' Validate levels to be kept/discarded
#'
#' Performs the following validations:
#'  - input .d is data.frame-alike
#'  - input .g is a column of .d
#'  - .l is/are present in .d$.g
#'  
#'  Failure of any of these validations results in an error.
#'
#' @inheritParams keep_levels
#' @importFrom rlang quo_name
#'
#' @return column to be filtered as a character string (name)
validate_level_inputs <- function(.d, .g, .l) {
  stopifnot(inherits(.d, "data.frame"))
  g <- rlang::quo_name(.g)
  if(! g %in% names(.d)) {
    stop(paste0("\"", g, "\" column not present in input data"), 
         call. = FALSE)
  }
  levels_present <- .l %in% unique(.d[[g]])
  if (!all(levels_present)) {
    stop(paste0("Level(s) not present in ", g, ": \"", 
                toString(.l[which(!levels_present)]), 
                "\""), 
         call. = FALSE)
  }
  g
}

The text was updated successfully, but these errors were encountered:

krlmlr · 2018-04-18T15:17:18Z

Thanks. I'm sympathetic to the problem you describe, but I think this should live in forcats, perhaps in a new fct_in() function which would raise an error (or at least warn) if you're filtering unknown levels.

hadley · 2018-04-18T20:52:34Z

Yeah, we definitely won't be adding more vector oriented functions to dplyr.

lock · 2018-10-15T21:16:39Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

hadley closed this as completed Apr 18, 2018

This was referenced Apr 20, 2018

Proposal: fct_match with validation tidyverse/forcats#126

Closed

Proposal: fct_match with validation tidyverse/forcats#127

Merged

lock bot locked and limited conversation to collaborators Oct 15, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: keep_levels() with validation #3514

Proposal: keep_levels() with validation #3514

jonocarroll commented Apr 15, 2018

krlmlr commented Apr 18, 2018

hadley commented Apr 18, 2018

lock bot commented Oct 15, 2018

Proposal: keep_levels() with validation #3514

Proposal: keep_levels() with validation #3514

Comments

jonocarroll commented Apr 15, 2018

krlmlr commented Apr 18, 2018

hadley commented Apr 18, 2018

lock bot commented Oct 15, 2018