-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-29642: [R] Support for .keep_all = TRUE with distinct() #44652
Changes from all commits
7a93029
f36da90
ac69fba
1c470cc
151db3a
a89c1fe
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,11 +26,8 @@ test_that("distinct()", { | |
compare_dplyr_binding( | ||
.input %>% | ||
distinct(some_grouping, lgl) %>% | ||
collect() %>% | ||
# GH-14947: column output order changed in dplyr 1.1.0, so we need | ||
# to make the column order explicit until dplyr 1.1.0 is on CRAN | ||
select(some_grouping, lgl) %>% | ||
arrange(some_grouping, lgl), | ||
arrange(some_grouping, lgl) %>% | ||
collect(), | ||
tbl | ||
) | ||
}) | ||
|
@@ -60,11 +57,8 @@ test_that("distinct() can retain groups", { | |
.input %>% | ||
group_by(some_grouping, int) %>% | ||
distinct(lgl) %>% | ||
collect() %>% | ||
# GH-14947: column output order changed in dplyr 1.1.0, so we need | ||
# to make the column order explicit until dplyr 1.1.0 is on CRAN | ||
select(some_grouping, int, lgl) %>% | ||
arrange(lgl, int), | ||
arrange(lgl, int) %>% | ||
collect(), | ||
tbl | ||
) | ||
|
||
|
@@ -73,11 +67,8 @@ test_that("distinct() can retain groups", { | |
.input %>% | ||
group_by(y = some_grouping, int) %>% | ||
distinct(x = lgl) %>% | ||
collect() %>% | ||
# GH-14947: column output order changed in dplyr 1.1.0, so we need | ||
# to make the column order explicit until dplyr 1.1.0 is on CRAN | ||
select(y, int, x) %>% | ||
arrange(int), | ||
arrange(int) %>% | ||
collect(), | ||
tbl | ||
) | ||
}) | ||
|
@@ -95,11 +86,8 @@ test_that("distinct() can contain expressions", { | |
.input %>% | ||
group_by(lgl, int) %>% | ||
distinct(x = some_grouping + 1) %>% | ||
collect() %>% | ||
# GH-14947: column output order changed in dplyr 1.1.0, so we need | ||
# to make the column order explicit until dplyr 1.1.0 is on CRAN | ||
select(lgl, int, x) %>% | ||
arrange(int), | ||
arrange(int) %>% | ||
collect(), | ||
tbl | ||
) | ||
}) | ||
|
@@ -115,12 +103,57 @@ test_that("across() works in distinct()", { | |
}) | ||
|
||
test_that("distinct() can return all columns", { | ||
skip("ARROW-14045") | ||
compare_dplyr_binding( | ||
.input %>% | ||
distinct(lgl, .keep_all = TRUE) %>% | ||
collect() %>% | ||
arrange(int), | ||
tbl | ||
) | ||
# hash_one prefers to keep non-null values, which is different from .keep_all in dplyr | ||
# so we can't compare the result directly | ||
expected <- tbl %>% | ||
# Drop factor because of #44661: | ||
# NotImplemented: Function 'hash_one' has no kernel matching input types | ||
# (dictionary<values=string, indices=int8, ordered=0>, uint8) | ||
Comment on lines
+109
to
+111
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is 110-111 the error that someone would get if they tried We might want to make that a bit nicer / more grokable for folks who might not have the dictionary -> factor knowledge top of mind There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah that's the error message. I'd have to think about how/where best to catch that and translate that to R-speak. As it turns out, dictionary isn't the only unsupported type, it's just the only one we have in this test data frame. I think list types and other non-simple types are also not supported, IIRC from RTFS. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
select(-fct) %>% | ||
distinct(lgl, .keep_all = TRUE) %>% | ||
arrange(int) | ||
|
||
with_table <- tbl %>% | ||
arrow_table() %>% | ||
select(-fct) %>% | ||
distinct(lgl, .keep_all = TRUE) %>% | ||
arrange(int) %>% | ||
collect() | ||
|
||
expect_identical(dim(with_table), dim(expected)) | ||
expect_identical(names(with_table), names(expected)) | ||
|
||
# Test with some mutation in there | ||
expected <- tbl %>% | ||
select(-fct) %>% | ||
distinct(lgl, bigger = int * 10L, .keep_all = TRUE) %>% | ||
arrange(int) | ||
|
||
with_table <- tbl %>% | ||
arrow_table() %>% | ||
select(-fct) %>% | ||
distinct(lgl, bigger = int * 10, .keep_all = TRUE) %>% | ||
arrange(int) %>% | ||
collect() | ||
|
||
expect_identical(dim(with_table), dim(expected)) | ||
expect_identical(names(with_table), names(expected)) | ||
expect_identical(with_table$bigger, expected$bigger) | ||
|
||
# Mutation that overwrites | ||
expected <- tbl %>% | ||
select(-fct) %>% | ||
distinct(lgl, int = int * 10L, .keep_all = TRUE) %>% | ||
arrange(int) | ||
|
||
with_table <- tbl %>% | ||
arrow_table() %>% | ||
select(-fct) %>% | ||
distinct(lgl, int = int * 10, .keep_all = TRUE) %>% | ||
arrange(int) %>% | ||
collect() | ||
|
||
expect_identical(dim(with_table), dim(expected)) | ||
expect_identical(names(with_table), names(expected)) | ||
expect_identical(with_table$int, expected$int) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior change is probably either not-impactful, or if folks are relying on it, that is actually a bug in their code. Though it does seem like something we should mention (in docs at least?).
Or maybe with a one-time warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is documented on the acero man page, that's the change to arrow-package.R. I'd rather not one-time warning; that's a slippery slope if we were going to be chatty about every subtle difference between how Acero works from dplyr on data.frames.