Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: keep_levels() with validation #3514

Closed
jonocarroll opened this issue Apr 15, 2018 · 3 comments
Closed

Proposal: keep_levels() with validation #3514

jonocarroll opened this issue Apr 15, 2018 · 3 comments

Comments

@jonocarroll
Copy link

An issue I've faced many times is filtering a data.frame/tibble based on "levels" (unique values or factor levels) of a column. A common (?) pattern for this is to create a logical vector of presence/absence of the levels compared to a vector of acceptable levels, which filter then accepts

filter(iris, Species %in% c("setosa"))

The problem with this is that while filter and %in% are both doing their jobs correctly, together they can introduce an unexpected result

filter(iris, Species %in% c("virginica", "samosa"))

One might reasonably expect an error in the attempted filter to levels not present in the data, in the same way that select fails loudly if a column is not present

select(iris, Species, Spaceship)
#> Error in FUN(X[[i]], ...) : object 'Spaceship' not found

The easiest way to get into this situation is to misspell a level, in which case it will (silently) not be in the filtered result

filter(iris, Species %in% c("selosa", "virginica")) %>% {unique(.$Species)}
#> [1] virginica
#> Levels: setosa versicolor virginica

Neither filter nor %in% is at fault here, but the intent of the code is lost because the implementation is not specific enough: filter does not perform "filter to the rows which contain X" but rather "filter to the rows for which some condition is TRUE" and delegates that responsibility of identifying those to %in%. %in% knows nothing of the intent so it responds faithfully with "X is not in this vector".

I propose a new function keep_levels (and its counterpart discard_levels) which takes three main arguments rather than two: the levels to be kept/discarded are explicitly provided rather than providing a logical vector upfront. This alleviates the unexpected result by validating that the levels are indeed present

keep_levels(iris, Species, c("selosa", "virginica"))
#> Error: Level(s) not present in Species: "selosa"

I have a prototype in the following gist (or expand below), though it is unlikely to be constructed the way dplyr would implement it and is not optimised to the C-level

https://gist.github.com/jonocarroll/7f35eefb35f77aaefce527f0b16b034f

It does however perform unquoting with rlang and provides informative error messages on failure.

Would this prototype be welcome as a PR? If so, I will write the tests formally and submit it.

Original motivation:

Source of today's mild heart-attack: I have categories W, X_Y, and Z in some data. Intending to keep only the second two:

data %>% filter(g %in% c("X Y", "Z")

Did you spot that I used a space instead of an underscore? I sure as heck didn't, and filtered excessively to just Z.

— Jonathan Carroll (@carroll_jono) March 6, 2018
Prototype:
#' Keep only certain groups/levels of a column
#'
#' @param .d a data.frame or tibble
#' @param .g column containing groups/levels to be kept or discarded
#' @param .l groups/levels to be kept or discarded as character vector
#'
#' @return a data.frame or tibble with the same or fewer rows after filtering
#' @export
#' @importFrom rlang enquo
#'
#' @seealso discard_levels
#'
#' @examples
#' keep_levels(iris, Species, "setosa")
keep_levels <- function(.d, .g, .l) {
  g <- validate_level_inputs(.d, rlang::enquo(.g), .l)
  .d[.d[[g]] %in% .l, ]
}


#' Discard certain groups/levels of a column
#'
#' @param .d a data.frame or tibble
#' @param .g column containing groups/levels to be kept or discarded
#' @param .l groups/levels to be kept or discarded as character vector
#'
#' @return a data.frame or tibble with the same or fewer rows after filtering
#' @export
#' @importFrom rlang enquo
#'
#' @seealso keep_levels
#'
#' @examples
#' discard_levels(iris, Species, "setosa")
discard_levels <- function(.d, .g, .l) {
  g <- validate_level_inputs(.d, rlang::enquo(.g), .l)
  .d[! .d[[g]] %in% .l, ]
}


#' Validate levels to be kept/discarded
#'
#' Performs the following validations:
#'  - input .d is data.frame-alike
#'  - input .g is a column of .d
#'  - .l is/are present in .d$.g
#'  
#'  Failure of any of these validations results in an error.
#'
#' @inheritParams keep_levels
#' @importFrom rlang quo_name
#'
#' @return column to be filtered as a character string (name)
validate_level_inputs <- function(.d, .g, .l) {
  stopifnot(inherits(.d, "data.frame"))
  g <- rlang::quo_name(.g)
  if(! g %in% names(.d)) {
    stop(paste0("\"", g, "\" column not present in input data"), 
         call. = FALSE)
  }
  levels_present <- .l %in% unique(.d[[g]])
  if (!all(levels_present)) {
    stop(paste0("Level(s) not present in ", g, ": \"", 
                toString(.l[which(!levels_present)]), 
                "\""), 
         call. = FALSE)
  }
  g
}
@krlmlr
Copy link
Member

krlmlr commented Apr 18, 2018

Thanks. I'm sympathetic to the problem you describe, but I think this should live in forcats, perhaps in a new fct_in() function which would raise an error (or at least warn) if you're filtering unknown levels.

@hadley
Copy link
Member

hadley commented Apr 18, 2018

Yeah, we definitely won't be adding more vector oriented functions to dplyr.

@lock
Copy link

lock bot commented Oct 15, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Oct 15, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants