NA handling in slice_min / slice_max #6177

kmillar · 2022-02-08T17:20:10Z

slice_min and slice_max drop rows where the value in the order_by column is NA, returning fewer than the requested number of rows if n is large enough.

It's unclear to me whether this is a bug or just intended behavior that requires documenting.

library(tibble)
library(dplyr)

x <- tibble(a = 1:3, b = c(10, NA, 12))
slice_min(x, b, n = 3)

The text was updated successfully, but these errors were encountered:

hadley · 2022-04-15T23:32:07Z

Reprex:

library(dplyr, warn.conflicts = FALSE)

df <- tibble(a = 1:3, b = c(10, NA, 12))
df |> slice_min(b, n = 3)
#> # A tibble: 2 × 2
#>       a     b
#>   <int> <dbl>
#> 1     1    10
#> 2     3    12

^{Created on 2022-04-15 by the reprex package (v2.0.1)}

I think this is just a documentation issue as an NA value can never be the smallest or largest value.

hadley · 2022-07-21T16:17:25Z

Combining with #6347:

Currently we have an inconsistency depending on the value of with_ties:

library(dplyr, warn.conflicts = FALSE)

df <- tribble(
  ~id, ~x,
  1, NA,
  1, NA,
  2, 1,
  2, NA,
  3, 1,
  3, 1,
  3, NA
)

df |> group_by(id) |> slice_min(x, n = 1, with_ties = TRUE)
#> # A tibble: 3 × 2
#> # Groups:   id [2]
#>      id     x
#>   <dbl> <dbl>
#> 1     2     1
#> 2     3     1
#> 3     3     1
df |> group_by(id) |> slice_min(x, n = 1, with_ties = FALSE)
#> # A tibble: 3 × 2
#> # Groups:   id [3]
#>      id     x
#>   <dbl> <dbl>
#> 1     1    NA
#> 2     2     1
#> 3     3     1

^{Created on 2022-07-21 by the reprex package (v2.0.1)}

What should happen when there are some NAs in a group? What should happen if there are only NAs in a group?

hadley · 2022-07-22T18:13:17Z

At a minimum, it feels like we also need an na.rm argument here.

hadley · 2022-07-31T13:21:14Z

If we think about these functions strictly, if there's even a single NA we can't actually know what row contains the actually the minimum or maximum value, and this would imply that we should always show NAs first, before the known values.

But that behaviour seems pretty annoying/useless, and it would be inconsistent with arrange() which always puts NA values at the end (regardless of whether you're performing an ascending or descending sort). So I think by default we should use missing values to "fill in" the number of requested rows if they're available, and otherwise provide a na.rm option to remove rows containing NAs from the output.

@eutwt @kmillar does this reasoning make sense to you?

eutwt · 2022-07-31T15:15:45Z

That makes sense to me. So the output of df %>% slice_min(x, n = slice_n) would be the same as df %>% arrange(x) %>% slice_head(n = slice_n) except possibly with extra rows at the end when there are ties in x.

The next question is, are rows with a missing value considered ties? i.e., what should this return?

df <- tibble(x = c(1, NA, NA))

df %>% slice_min(x, n = 2)

hadley · 2022-08-01T13:20:49Z

Yeah, I think we'd consider NA to be tied with NA in that case.

I think these are the main cases:

# Sufficient unique values; get n rows in output
df <- tibble(x = c(1, 2, 3, NA, NA))
df %>% slice_min(x, n = 2)

# Insufficient values; get fewer than n rows in output
df <- tibble(x = c(1))
df %>% slice_min(x, n = 2)

# Insufficient non-missing values, get NAs in output
df <- tibble(x = c(1, NA))
df %>% slice_min(x, n = 2, na.rm = TRUE)

# Ties; get more than n rows in output
df <- tibble(x = c(1, 2, 2))
df %>% slice_min(x, n = 2)

df <- tibble(x = c(1, NA, NA))
df %>% slice_min(x, n = 2)

# Ties and insufficient values happen cancel out
df <- tibble(x = c(1, 1))
df %>% slice_min(x, n = 2)

hadley · 2022-08-01T18:58:28Z

Ooops, I think I messed that up. I meant:

# Insufficient values; get fewer than n rows in output
df <- tibble(x = c(1))
df %>% slice_min(x, n = 2)

# Insufficient non-missing values, get fewer than n rows in output
df <- tibble(x = c(1, NA))
df %>% slice_min(x, n = 2, na.rm = TRUE)

# Insufficient non-missing values, get NAs in output
df <- tibble(x = c(1, NA))
df %>% slice_min(x, n = 2)

Fixes #6177

* Refactor slice tests * Remove tests of invariants now handled by vctrs * Reduce duplication by testing central code paths once * Update error testing style * Use `vec_rank()`. Fixes #6176. * Implement na.rm for slice_min()/slice_max(). Fixes #6177.

hadley added documentation rows ↕️ Operations on rows: filter(), slice(), arrange() labels Apr 15, 2022

hadley mentioned this issue Jul 21, 2022

Clarify semantics of slice_min() and friends when NAs are involved #6347

Closed

hadley added feature a feature request or enhancement and removed documentation labels Jul 21, 2022

This comment was marked as resolved.

Sign in to view

eutwt mentioned this issue Aug 2, 2022

Add na.rm argument to slice_min() and slice_max() #6383

Closed

hadley added a commit that referenced this issue Aug 2, 2022

Implement na.rm for slice_min()/slice_max()

cc8b410

Fixes #6177

hadley mentioned this issue Aug 2, 2022

Support na.rm in slice_min()/slice_max() #6384

Merged

hadley added a commit that referenced this issue Aug 3, 2022

Implement na.rm for slice_min()/slice_max()

3228f5f

Fixes #6177

hadley closed this as completed in #6384 Aug 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NA handling in slice_min / slice_max #6177

NA handling in slice_min / slice_max #6177

kmillar commented Feb 8, 2022

hadley commented Apr 15, 2022

hadley commented Jul 21, 2022

hadley commented Jul 22, 2022

hadley commented Jul 31, 2022

eutwt commented Jul 31, 2022 •

edited

Loading

hadley commented Aug 1, 2022 •

edited

Loading

This comment was marked as resolved.

hadley commented Aug 1, 2022

This comment was marked as resolved.

NA handling in slice_min / slice_max #6177

NA handling in slice_min / slice_max #6177

Comments

kmillar commented Feb 8, 2022

hadley commented Apr 15, 2022

hadley commented Jul 21, 2022

hadley commented Jul 22, 2022

hadley commented Jul 31, 2022

eutwt commented Jul 31, 2022 • edited Loading

hadley commented Aug 1, 2022 • edited Loading

This comment was marked as resolved.

hadley commented Aug 1, 2022

This comment was marked as resolved.

eutwt commented Jul 31, 2022 •

edited

Loading

hadley commented Aug 1, 2022 •

edited

Loading