An issue I've faced many times is filtering a data.frame/tibble based on "levels" (unique values or factor levels) of a column. A common (?) pattern for this is to build a logical vector marking which elements match a vector of acceptable levels, which filter then accepts:
filter(iris, Species %in% c("setosa"))
The problem with this is that while filter and %in% are both doing their jobs correctly, together they can produce an unexpected result.
One might reasonably expect an error when filtering to levels that are not present in the data, in the same way that select fails loudly if a column is not present:
select(iris, Species, Spaceship)
#> Error in FUN(X[[i]], ...) : object 'Spaceship' not found
The easiest way to get into this situation is to misspell a level, in which case that level is silently absent from the filtered result.
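To make the silent failure concrete, here is a minimal sketch using only base R and the built-in iris data. The variable names (wanted, keep, result) are just for illustration, and the subsetting line is the base-R stand-in for what filter would do with the same logical vector.

```r
# A misspelled level ("selosa" instead of "setosa") next to a valid one.
wanted <- c("selosa", "virginica")

# %in% answers FALSE for every "setosa" row; no warning or error is raised.
keep <- iris$Species %in% wanted

# Base-R stand-in for filter(iris, Species %in% wanted).
result <- iris[keep, ]

nrow(result)                          # 50, not the 100 rows that were intended
unique(as.character(result$Species))  # "virginica" only
```

Nothing in the output hints that one of the two requested levels matched no rows at all.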
Neither filter nor %in% is at fault here, but the intent of the code is lost because the implementation is not specific enough: filter does not perform "filter to the rows which contain X" but rather "filter to the rows for which some condition is TRUE", and it delegates the job of identifying those rows to %in%. %in% knows nothing of the intent, so it responds faithfully with "X is not in this vector".
I propose a new function fct_match (and its counterpart fct_exclude) which validates that the requested levels are actually present in the vector before generating the logical vector marking which elements correspond to those levels:
fct_match(iris$Species, "selosa", "virginica")
#> Error: Level(s) not present in factor: "selosa"
Otherwise it generates the logical vector, e.g. to be passed to filter.
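One way the validate-then-match behaviour could look is sketched below. This is an assumption-laden illustration, not the code of the accompanying PR: the names fct_match_sketch and fct_exclude_sketch are hypothetical, the levels are taken through ... to mirror the call shown above, and treating fct_exclude as the validated negation of the match is my reading of "counterpart".

```r
# Hypothetical sketch of the proposed behaviour; not the PR's implementation.
fct_match_sketch <- function(f, ...) {
  lvls <- as.character(c(...))

  # Validate against factor levels, or against unique values otherwise.
  present <- if (is.factor(f)) levels(f) else unique(as.character(f))
  missing_lvls <- setdiff(lvls, present)

  if (length(missing_lvls) > 0) {
    stop(
      "Level(s) not present in factor: ",
      paste0('"', missing_lvls, '"', collapse = ", "),
      call. = FALSE
    )
  }

  # Only reached when every requested level exists.
  f %in% lvls
}

# Assumed counterpart: same validation, negated result.
fct_exclude_sketch <- function(f, ...) {
  !fct_match_sketch(f, ...)
}

# Valid levels give a logical vector usable for subsetting or filter.
matched <- iris[fct_match_sketch(iris$Species, "setosa", "virginica"), ]

# A misspelled level now fails loudly instead of silently dropping rows.
err <- tryCatch(
  fct_match_sketch(iris$Species, "selosa", "virginica"),
  error = function(e) conditionMessage(e)
)
```

Validating against levels(f) rather than the observed values means an unused but declared factor level would still pass; whether that is the desired behaviour is a design choice left open here.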
(Redesigned from tidyverse/dplyr#3514)
I will submit a PR prototype (with testing) to accompany this Issue.