Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NA in levels vs NA in values #337

Merged
merged 15 commits into from
Jan 9, 2023
Merged
6 changes: 3 additions & 3 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -21,16 +21,16 @@ Imports:
lifecycle,
magrittr,
rlang (>= 1.0.0),
tibble,
withr
tibble
Suggests:
covr,
dplyr,
ggplot2,
knitr,
readr,
rmarkdown,
testthat (>= 3.0.0)
testthat (>= 3.0.0),
withr
VignetteBuilder:
knitr
Config/Needs/website: tidyverse/tidytemplate
Expand Down
2 changes: 2 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,8 @@ export(fct_lump_min)
export(fct_lump_n)
export(fct_lump_prop)
export(fct_match)
export(fct_na_level_to_value)
export(fct_na_value_to_level)
export(fct_other)
export(fct_recode)
export(fct_relabel)
Expand Down
9 changes: 9 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,14 @@
# forcats (development version)

* `fct_explicit_na()` is deprecated in favour of `fct_na_value_to_level()`.

* New `fct_na_value_to_level()` and `fct_na_level_to_value()` to convert
NA values to NA levels and vice versa (#337).

* `fct_collapse()` can now use `other_level = NA` (#291).

* `fct_count()` works with factors that contain `NA`s in levels.

* `fct_lump_prop()` and friends now work correctly if you supply weights
and have empty levels (#292).

Expand Down
20 changes: 7 additions & 13 deletions R/collapse.R
Original file line number Diff line number Diff line change
Expand Up @@ -31,21 +31,15 @@ fct_collapse <- function(.f, ..., other_level = NULL, group_other = "DEPRECATED"
}
}

new <- rlang::list2(...)
levs <- as.list(unlist(new, use.names = FALSE))
dots <- rlang::list2(...)

old <- unlist(dots, use.names = FALSE) %||% character()
new <- rep(names(dots), lengths(dots))
out <- lvls_revalue(f, lvls_rename(f, set_names(old, new)))

if (!is.null(other_level)) {
levels <- levels(f)
new[[other_level]] <- levels[!levels %in% levs]
levs <- c(levs, new[[other_level]])
out <- lvls_other(out, levels(out) %in% new, other_level)
}

names(levs) <- names(new)[rep(seq_along(new), vapply(new, length, integer(1)))]
out <- fct_recode(.f, !!!levs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since fct_recode() disappeared does that mean this weirdness went away? #291 (comment)

I was expecting to see a test for that or something

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added #339 to think about this more. I can't figure out if it deserves a test or not. On the one hand, it was arguably a bug, so it should get a test. On the other hand, thinking about fct_collapse() from first principles, it seems strange to test specific values of other_level.


if (any(levels(out) == other_level)) {
fct_relevel(out, other_level, after = Inf)
} else {
out
}
out
}
8 changes: 5 additions & 3 deletions R/count.R
Original file line number Diff line number Diff line change
Expand Up @@ -13,12 +13,14 @@
#' fct_count(f, sort = TRUE)
#' fct_count(f, sort = TRUE, prop = TRUE)
fct_count <- function(f, sort = FALSE, prop = FALSE) {
f2 <- check_factor(f)
f <- check_factor(f)
n_na <- sum(is.na(f))

n <- c(tabulate(f, nlevels(f)), if (n_na > 0) n_na)

df <- tibble::tibble(
f = fct_inorder(c(levels(f2), if (n_na > 0) NA)),
n = c(tabulate(f2, nlevels(f)), if (n_na > 0) n_na)
f = fct_unique(f),
n = n
)

if (sort) {
Expand Down
23 changes: 21 additions & 2 deletions R/explicit_na.R
Original file line number Diff line number Diff line change
@@ -1,21 +1,40 @@
#' Make missing values explicit
#'
#' @description
#' `r lifecycle::badge("deprecated")`
#'
#' This function is deprecated because the terminology is confusing;
#' please use [fct_na_value_to_level()] instead.
#'
#' This gives missing values an explicit factor level, ensuring that they
#' appear in summaries and on plots.
#'
#' @param f A factor (or character vector).
#' @param na_level Level to use for missing values: this is what `NA`s will be
#' changed to.
#' @export
#' @keywords internal
#' @examples
#' f1 <- factor(c("a", "a", NA, NA, "a", "b", NA, "c", "a", "c", "b"))
#' fct_count(f1)
#' table(is.na(f1))
#' table(f1)
#' sum(is.na(f1))
#'
#' # previously
#' f2 <- fct_explicit_na(f1)
#' # now
#' f2 <- fct_na_value_to_level(f1)
#'
#' fct_count(f2)
#' table(is.na(f2))
#' table(f2)
#' sum(is.na(f2))
fct_explicit_na <- function(f, na_level = "(Missing)") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ability to specify the na_level that you wanted to materialize the NA values as seems like it was probably useful?

Maybe you can default na_value = NA in fct_na_value_to_level(), which makes that and fct_na_level_to_value() reversible by default, while also allowing for alternate level names?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look below.

lifecycle::deprecate_warn(
when = "1.0.0",
what = "fct_explicit_na()",
with = "fct_na_value_to_level()"
)

f <- check_factor(f)

is_missing <- is.na(f)
Expand Down
42 changes: 8 additions & 34 deletions R/lump.R
Original file line number Diff line number Diff line change
Expand Up @@ -95,14 +95,7 @@ fct_lump_min <- function(f, min, w = NULL, other_level = "Other") {
cli::cli_abort("{.arg min} must be a positive number")
}

new_levels <- ifelse(level_w >= min, levels(f), other_level)

if (other_level %in% new_levels) {
f <- lvls_revalue(f, new_levels)
fct_relevel(f, other_level, after = Inf)
} else {
f
}
lvls_other(f, level_w >= min, other_level)
}

#' @export
Expand All @@ -123,20 +116,12 @@ fct_lump_prop <- function(f, prop, w = NULL, other_level = "Other") {
}

if (prop < 0) {
new_levels <- ifelse(prop_n <= -prop, levels(f), other_level)
} else {
new_levels <- ifelse(prop_n > prop, levels(f), other_level)
}

if (other_level %in% new_levels) {
f <- lvls_revalue(f, new_levels)
fct_relevel(f, other_level, after = Inf)
lvls_other(f, prop_n <= -prop, other_level)
} else {
f
lvls_other(f, prop_n > prop, other_level)
}
}


#' @export
#' @rdname fct_lump
fct_lump_n <- function(f, n, w = NULL, other_level = "Other",
Expand All @@ -156,14 +141,7 @@ fct_lump_n <- function(f, n, w = NULL, other_level = "Other",
rank <- rank(-level_w, ties.method = ties.method)
}

new_levels <- ifelse(rank <= n, levels(f), other_level)

if (other_level %in% new_levels) {
f <- lvls_revalue(f, new_levels)
fct_relevel(f, other_level, after = Inf)
} else {
f
}
lvls_other(f, rank <= n, other_level)
}

#' @export
Expand All @@ -172,16 +150,12 @@ fct_lump_lowfreq <- function(f, w = NULL, other_level = "Other") {
f <- check_factor(f)
level_w <- compute_weights(f, w)

new_levels <- ifelse(!in_smallest(level_w), levels(f), other_level)

if (other_level %in% new_levels) {
f <- lvls_revalue(f, new_levels)
fct_relevel(f, other_level, after = Inf)
} else {
f
}
lvls_other(f, !in_smallest(level_w), other_level)
}


# helpers -----------------------------------------------------------------

compute_weights <- function(f, w = NULL, call = caller_env()) {
w <- check_weights(w, length(f), call = call)

Expand Down
72 changes: 72 additions & 0 deletions R/na.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
#' Convert between `NA` values and `NA` levels
#'
#' @description
#' There are two ways to represent missing values in factors: in the values
#' and in the levels. `NA`s in the values are most useful for data analysis
#' (since [is.na()] returns what you expect), but because the `NA` is not
#' explicitly recorded in the levels, there's no way to control its position
#' (it's almost always displayed last or not at all). Putting the `NA`s in the levels allows
#' you to control its display, at the cost of losing accurate `is.na()`
#' reporting.
#'
#' (It is possible to have a factor with missing values in both the values
#' and the levels but it requires some explicit gymnastics and we don't
#' recommend it.)
#'
#' @param f A factor (or character vector).
#' @param level Optionally, instead of converting the `NA` values to an
#' `NA` level, convert it to a level with this value.
#' @export
#' @examples
#' # Most factors store NAs in the values:
#' f1 <- fct(c("a", "b", NA, "c", "b", NA))
#' levels(f1)
#' as.integer(f1)
#' is.na(f1)
#'
#' # But it's also possible to store them in the levels
#' f2 <- fct_na_value_to_level(f1)
#' levels(f2)
#' as.integer(f2)
#' is.na(f2)
#'
#' # If needed, you can convert back to NAs in the values:
#' f3 <- fct_na_level_to_value(f2)
#' levels(f3)
#' as.integer(f3)
#' is.na(f3)
fct_na_value_to_level <- function(f, level = NA) {
if (!identical(level, NA) && !is_string(level)) {
cli::cli_abort(
"{.arg level} must be a string or {.code NA}, not {.obj_type_friendly {level}}."
)
}
f <- check_factor(f)

f <- fct_expand(f, NA)
new_levels <- levels(f)
new_levels[is.na(new_levels)] <- level

lvls_revalue(f, new_levels)
Comment on lines +46 to +50
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if you already have an NA level, this will rename it to level? Seems reasonable

}

#' @export
#' @rdname fct_na_value_to_level
#' @param extra_levels Optionally, a character vector giving additional levels
#' that should also be converted to `NA` values.
fct_na_level_to_value <- function(f, extra_levels = NULL) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this extra_levels because if you don't include NA, it's easy to create factors that have a mix of NAs in values and levels.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I was actually wondering if we even needed extra_levels at all, but there doesn't seem to be any other way to do this in forcats, so it seems useful to have

x <- factor(c("x", "missing"))

forcats::fct_na_level_to_value(x, extra_levels = "missing")
#> [1] x    <NA>
#> Levels: x

It is a nice way of declaring, "no, missing should really be a missing value"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually surprised this has never come up before. It seems pretty useful

f <- check_factor(f)
if (!is.null(extra_levels) && !is.character(extra_levels)) {
cli::cli_abort(
"{.arg extra_levels} must be a string or {.code NULL}, not {.obj_type_friendly {extra_levels}}."
)
}

new_levels <- setdiff(levels(f), union(NA, extra_levels))
idx <- match(levels(f), new_levels)

out <- idx[as.integer(f)]
attributes(out) <- attributes(f)
attr(out, "levels") <- new_levels
out
}
21 changes: 13 additions & 8 deletions R/other.R
Original file line number Diff line number Diff line change
Expand Up @@ -18,17 +18,22 @@ fct_other <- function(f, keep, drop, other_level = "Other") {
f <- check_factor(f)
check_exclusive(keep, drop)

levels <- levels(f)
if (!missing(keep)) {
levels[!levels %in% keep] <- other_level
lvls_other(f, levels(f) %in% keep, other_level)
} else {
levels[levels %in% drop] <- other_level
lvls_other(f, !levels(f) %in% drop, other_level)
}
}

if (!other_level %in% levels) {
return(f)
# Replace specified levels (if any), with other.
# @param keep A logical vector the same length as `levels(f)`
lvls_other <- function(f, keep, other_level = "Other") {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added this new helper function to ensure that the implementation of "othering" is identical for all code paths.

if (all(keep)) {
f
} else {
new_levels <- ifelse(keep, levels(f), other_level)
f <- lvls_revalue(f, new_levels)
fct_relevel(f, other_level, after = Inf)
}

f <- lvls_revalue(f, levels)
fct_relevel(f, other_level, after = Inf)
}

7 changes: 5 additions & 2 deletions R/recode.R
Original file line number Diff line number Diff line change
Expand Up @@ -35,6 +35,10 @@ fct_recode <- function(.f, ...) {
new_levels <- new_levels[!nulls]
}

lvls_revalue(f, lvls_rename(f, new_levels))
}

lvls_rename <- function(f, new_levels) {
# Match old levels with new levels
old_levels <- levels(f)
idx <- match(new_levels, old_levels)
Expand All @@ -49,8 +53,7 @@ fct_recode <- function(.f, ...) {
}

old_levels[idx] <- names(new_levels)

lvls_revalue(f, old_levels)
old_levels
}

check_recode_levels <- function(..., call = caller_env()) {
Expand Down
2 changes: 1 addition & 1 deletion _pkgdown.yml
Original file line number Diff line number Diff line change
Expand Up @@ -46,8 +46,8 @@ reference:
Leave existing data as is, but add or remove levels.
contents:
- fct_expand
- fct_explicit_na
- fct_drop
- fct_na_value_to_level
- fct_unify

- title: Combine multiple factors
Expand Down
16 changes: 14 additions & 2 deletions man/fct_explicit_na.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading