Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NA handling in slice_min / slice_max #6177

Closed
kmillar opened this issue Feb 8, 2022 · 9 comments · Fixed by #6384
Closed

NA handling in slice_min / slice_max #6177

kmillar opened this issue Feb 8, 2022 · 9 comments · Fixed by #6384
Labels
feature a feature request or enhancement rows ↕️ Operations on rows: filter(), slice(), arrange()

Comments

@kmillar
Copy link

kmillar commented Feb 8, 2022

slice_min and slice_max drop rows where the value in the order_by column is NA, returning fewer than the requested number of rows if n is large enough.

It's unclear to me whether this is a bug or just intended behavior that requires documenting.

library(tibble)
library(dplyr)

x <- tibble(a = 1:3, b = c(10, NA, 12))
slice_min(x, b, n = 3)
@hadley
Copy link
Member

hadley commented Apr 15, 2022

Reprex:

library(dplyr, warn.conflicts = FALSE)

df <- tibble(a = 1:3, b = c(10, NA, 12))
df |> slice_min(b, n = 3)
#> # A tibble: 2 × 2
#>       a     b
#>   <int> <dbl>
#> 1     1    10
#> 2     3    12

Created on 2022-04-15 by the reprex package (v2.0.1)

I think this is just a documentation issue as an NA value can never be the smallest or largest value.

@hadley hadley added documentation rows ↕️ Operations on rows: filter(), slice(), arrange() labels Apr 15, 2022
@hadley
Copy link
Member

hadley commented Jul 21, 2022

Combining with #6347:

Currently we have an inconsistency depending on the value of with_ties:

library(dplyr, warn.conflicts = FALSE)

df <- tribble(
  ~id, ~x,
  1, NA,
  1, NA,
  2, 1,
  2, NA,
  3, 1,
  3, 1,
  3, NA
)

df |> group_by(id) |> slice_min(x, n = 1, with_ties = TRUE)
#> # A tibble: 3 × 2
#> # Groups:   id [2]
#>      id     x
#>   <dbl> <dbl>
#> 1     2     1
#> 2     3     1
#> 3     3     1
df |> group_by(id) |> slice_min(x, n = 1, with_ties = FALSE)
#> # A tibble: 3 × 2
#> # Groups:   id [3]
#>      id     x
#>   <dbl> <dbl>
#> 1     1    NA
#> 2     2     1
#> 3     3     1

Created on 2022-07-21 by the reprex package (v2.0.1)

What should happen when there are some NAs in a group? What should happen if there are only NAs in a group?

@hadley
Copy link
Member

hadley commented Jul 22, 2022

At a minimum, it feels like we also need an na.rm argument here.

@hadley
Copy link
Member

hadley commented Jul 31, 2022

If we think about these functions strictly, if there's even a single NA we can't actually know what row contains the actually the minimum or maximum value, and this would imply that we should always show NAs first, before the known values.

But that behaviour seems pretty annoying/useless, and it would be inconsistent with arrange() which always puts NA values at the end (regardless of whether you're performing an ascending or descending sort). So I think by default we should use missing values to "fill in" the number of requested rows if they're available, and otherwise provide a na.rm option to remove rows containing NAs from the output.

@eutwt @kmillar does this reasoning make sense to you?

@eutwt
Copy link
Contributor

eutwt commented Jul 31, 2022

That makes sense to me. So the output of df %>% slice_min(x, n = slice_n) would be the same as df %>% arrange(x) %>% slice_head(n = slice_n) except possibly with extra rows at the end when there are ties in x.

The next question is, are rows with a missing value considered ties? i.e., what should this return?

df <- tibble(x = c(1, NA, NA))

df %>% slice_min(x, n = 2)

@hadley
Copy link
Member

hadley commented Aug 1, 2022

Yeah, I think we'd consider NA to be tied with NA in that case.

I think these are the main cases:

# Sufficient unique values; get n rows in output
df <- tibble(x = c(1, 2, 3, NA, NA))
df %>% slice_min(x, n = 2)

# Insufficient values; get fewer than n rows in output
df <- tibble(x = c(1))
df %>% slice_min(x, n = 2)

# Insufficient non-missing values, get NAs in output
df <- tibble(x = c(1, NA))
df %>% slice_min(x, n = 2, na.rm = TRUE)

# Ties; get more than n rows in output
df <- tibble(x = c(1, 2, 2))
df %>% slice_min(x, n = 2)

df <- tibble(x = c(1, NA, NA))
df %>% slice_min(x, n = 2)

# Ties and insufficient values happen cancel out
df <- tibble(x = c(1, 1))
df %>% slice_min(x, n = 2)

@eutwt

This comment was marked as resolved.

@hadley
Copy link
Member

hadley commented Aug 1, 2022

Ooops, I think I messed that up. I meant:

# Insufficient values; get fewer than n rows in output
df <- tibble(x = c(1))
df %>% slice_min(x, n = 2)

# Insufficient non-missing values, get fewer than n rows in output
df <- tibble(x = c(1, NA))
df %>% slice_min(x, n = 2, na.rm = TRUE)

# Insufficient non-missing values, get NAs in output
df <- tibble(x = c(1, NA))
df %>% slice_min(x, n = 2)

@eutwt

This comment was marked as resolved.

hadley added a commit that referenced this issue Aug 2, 2022
hadley added a commit that referenced this issue Aug 3, 2022
hadley added a commit that referenced this issue Aug 9, 2022
* Refactor slice tests
  * Remove tests of invariants now handled by vctrs
  * Reduce duplication by testing central code paths once
  * Update error testing style

* Use `vec_rank()`. Fixes #6176. 

* Implement na.rm for slice_min()/slice_max(). Fixes #6177.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement rows ↕️ Operations on rows: filter(), slice(), arrange()
Projects
None yet
3 participants