You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
An issue I've faced many times is filtering a data.frame/tibble based on "levels" (unique values or factor levels) of a column. A common (?) pattern for this is to create a logical vector of presence/absence of the levels compared to a vector of acceptable levels, which filter then accepts
filter(iris, Species%in% c("setosa"))
The problem with this is that while filter and %in% are both doing their jobs correctly, together they can introduce an unexpected result
One might reasonably expect an error in the attempted filter to levels not present in the data, in the same way that select fails loudly if a column is not present
select(iris, Species, Spaceship)
#> Error in FUN(X[[i]], ...) : object 'Spaceship' not found
The easiest way to get into this situation is to misspell a level, in which case it will (silently) not be in the filtered result
Neither filter nor %in% is at fault here, but the intent of the code is lost because the implementation is not specific enough: filter does not perform "filter to the rows which contain X" but rather "filter to the rows for which some condition is TRUE" and delegates that responsibility of identifying those to %in%. %in% knows nothing of the intent so it responds faithfully with "X is not in this vector".
I propose a new function keep_levels (and its counterpart discard_levels) which takes three main arguments rather than two: the levels to be kept/discarded are explicitly provided rather than providing a logical vector upfront. This alleviates the unexpected result by validating that the levels are indeed present
keep_levels(iris, Species, c("selosa", "virginica"))
#> Error: Level(s) not present in Species: "selosa"
I have a prototype in the following gist (or expand below), though it is unlikely to be constructed the way dplyr would implement it and is not optimised to the C-level
#' Keep only certain groups/levels of a column#'#' @param .d a data.frame or tibble#' @param .g column containing groups/levels to be kept or discarded#' @param .l groups/levels to be kept or discarded as character vector#'#' @return a data.frame or tibble with the same or fewer rows after filtering#' @export#' @importFrom rlang enquo#'#' @seealso discard_levels#'#' @examples#' keep_levels(iris, Species, "setosa")keep_levels<-function(.d, .g, .l) {
g<- validate_level_inputs(.d, rlang::enquo(.g), .l)
.d[.d[[g]] %in%.l, ]
}
#' Discard certain groups/levels of a column#'#' @param .d a data.frame or tibble#' @param .g column containing groups/levels to be kept or discarded#' @param .l groups/levels to be kept or discarded as character vector#'#' @return a data.frame or tibble with the same or fewer rows after filtering#' @export#' @importFrom rlang enquo#'#' @seealso keep_levels#'#' @examples#' discard_levels(iris, Species, "setosa")discard_levels<-function(.d, .g, .l) {
g<- validate_level_inputs(.d, rlang::enquo(.g), .l)
.d[!.d[[g]] %in%.l, ]
}
#' Validate levels to be kept/discarded#'#' Performs the following validations:#' - input .d is data.frame-alike#' - input .g is a column of .d#' - .l is/are present in .d$.g#' #' Failure of any of these validations results in an error.#'#' @inheritParams keep_levels#' @importFrom rlang quo_name#'#' @return column to be filtered as a character string (name)validate_level_inputs<-function(.d, .g, .l) {
stopifnot(inherits(.d, "data.frame"))
g<-rlang::quo_name(.g)
if(!g%in% names(.d)) {
stop(paste0("\"", g, "\" column not present in input data"),
call.=FALSE)
}
levels_present<-.l%in% unique(.d[[g]])
if (!all(levels_present)) {
stop(paste0("Level(s) not present in ", g, ": \"",
toString(.l[which(!levels_present)]),
"\""),
call.=FALSE)
}
g
}
The text was updated successfully, but these errors were encountered:
Thanks. I'm sympathetic to the problem you describe, but I think this should live in forcats, perhaps in a new fct_in() function which would raise an error (or at least warn) if you're filtering unknown levels.
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
lockbot
locked and limited conversation to collaborators
Oct 15, 2018
Sign up for freeto subscribe to this conversation on GitHub.
Already have an account?
Sign in.
An issue I've faced many times is filtering a
data.frame
/tibble
based on "levels" (unique values or factor levels) of a column. A common (?) pattern for this is to create a logical vector of presence/absence of the levels compared to a vector of acceptable levels, whichfilter
then acceptsThe problem with this is that while
filter
and%in%
are both doing their jobs correctly, together they can introduce an unexpected resultOne might reasonably expect an error in the attempted filter to levels not present in the data, in the same way that
select
fails loudly if a column is not presentThe easiest way to get into this situation is to misspell a level, in which case it will (silently) not be in the filtered result
Neither
filter
nor%in%
is at fault here, but the intent of the code is lost because the implementation is not specific enough:filter
does not perform "filter to the rows which contain X" but rather "filter to the rows for which some condition isTRUE
" and delegates that responsibility of identifying those to%in%
.%in%
knows nothing of the intent so it responds faithfully with "X is not in this vector".I propose a new function
keep_levels
(and its counterpartdiscard_levels
) which takes three main arguments rather than two: the levels to be kept/discarded are explicitly provided rather than providing a logical vector upfront. This alleviates the unexpected result by validating that the levels are indeed presentI have a prototype in the following gist (or expand below), though it is unlikely to be constructed the way
dplyr
would implement it and is not optimised to the C-levelhttps://gist.github.com/jonocarroll/7f35eefb35f77aaefce527f0b16b034f
It does however perform unquoting with
rlang
and provides informative error messages on failure.Would this prototype be welcome as a PR? If so, I will write the tests formally and submit it.
Original motivation:
Prototype:
The text was updated successfully, but these errors were encountered: