GH-29642: [R] Support for .keep_all = TRUE with distinct() #44652

nealrichardson · 2024-11-05T21:39:27Z

Rationale for this change

Support a missing feature. This just wires up some stuff from R to Acero, then adds docs and tests.

This is mostly picking up where #13934 started and finishing it out. Thanks @mopcup for the initial lift.

What changes are included in this PR?

An aggregation binding, some symbol manipulation, and tests. I also cleaned up some dplyr test shims from 2022.

Are these changes tested?

Yes, though if anyone knows of odd corners in distinct() that aren't covered by this, we can add more.

Are there any user-facing changes?

Yes indeed.

GitHub Issue: [R] Support for .keep_all = TRUE with distinct() #29642

github-actions · 2024-11-05T21:39:56Z

⚠️ GitHub issue #29642 has been automatically assigned in GitHub to PR creator.

jonkeane

Thanks for this! Mostly questions about messaging + conveying some of the nuances

jonkeane · 2024-11-10T13:14:35Z

r/tests/testthat/test-dplyr-distinct.R

+    # Drop factor because of #44661:
+    # NotImplemented: Function 'hash_one' has no kernel matching input types
+    #   (dictionary<values=string, indices=int8, ordered=0>, uint8)


Is 110-111 the error that someone would get if they tried distinct(..., .keep_all = TRUE) with a factor in the table/data.frame?

We might want to make that a bit nicer / more grokable for folks who might not have the dictionary -> factor knowledge top of mind

nealrichardson · 2024-11-15

Yeah, that's the error message. I'd have to think about how/where best to catch that and translate it to R-speak. As it turns out, dictionary isn't the only unsupported type; it's just the only one we have in this test data frame. I think list types and other non-simple types are also not supported, IIRC from RTFS.

jonkeane · 2024-11-10T13:18:11Z

r/R/dplyr-distinct.R

+    # Note: in regular dplyr, `.keep_all = TRUE` returns the first row's value.
+    # However, Acero's `hash_one` function prefers returning non-null values.
+    # So, you'll get the same shape of data, but the values may differ.


This behavior change is probably either not impactful or, if folks are relying on it, actually a bug in their code. Though it does seem like something we should mention (in docs at least?).

Or maybe with a one-time warning?

nealrichardson · 2024-11-15

It is documented on the acero man page; that's the change to arrow-package.R. I'd rather not add a one-time warning; that's a slippery slope if we were going to be chatty about every subtle difference between how Acero works and how dplyr works on data.frames.
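To make the difference discussed in this thread concrete, here is a minimal base-R sketch of the two semantics. It does not call arrow or Acero: the "prefer non-null" rule is emulated with a hypothetical helper, `first_non_na`, based only on the code comment quoted above, so treat it as an illustration of the idea rather than of what `hash_one` actually returns.

```r
# Toy data: group "a" has a missing value in its first row.
df <- data.frame(
  x = c("a", "a", "b"),
  y = c(NA, 1, 2)
)

# dplyr semantics on a data.frame: keep the first row of each group as-is.
# In base R this matches dplyr::distinct(df, x, .keep_all = TRUE).
first_row <- df[!duplicated(df$x), ]

# Emulated "prefer non-null" semantics (an assumption, not Acero's code):
# per group, take the first non-NA value if there is one, otherwise NA.
first_non_na <- function(v) {
  non_na <- v[!is.na(v)]
  if (length(non_na) > 0) non_na[1] else NA_real_
}
prefer_non_null <- vapply(split(df$y, df$x), first_non_na, numeric(1))

print(first_row)        # y for group "a" is NA
print(prefer_non_null)  # y for group "a" is 1
```

Both results have one row per distinct `x`, which is the "same shape of data" point from the code comment; only the value picked for group "a" differs.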

rkrug · 2024-12-03T13:05:39Z

Is there any chance to get it merged? I would very much need to use it in my code!

Thanks.

jonkeane

My last review should have been an approve with comments rather than just a comment.

rkrug · 2024-12-03T13:50:55Z

Thanks. So it will be in the next release?

jonkeane · 2024-12-07T15:03:59Z

It missed the latest release, but once it merges it'll be in the following one. Alternatively, you could use nightly builds starting the day after it merges.

conbench-apache-arrow · 2024-12-07T19:51:39Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 1b3caf6.

There were 132 benchmark results with an error:

Commit Run on arm64-t4g-2xlarge-linux at 2024-12-07 16:34:43Z
- tpch (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-07, scale_factor=1
- tpch (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-09, scale_factor=1
and 130 more (see the report linked below)

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

nealrichardson added 4 commits November 5, 2024 16:03

Remove dplyr test shims from 2022

7a93029

Bring in logic from apache#13934 and add a basic test

f36da90

A couple more tests

ac69fba

Update doc note and comments

1c470cc

nealrichardson requested review from jonkeane and thisisnic as code owners November 5, 2024 21:39

github-actions bot added the Component: R and awaiting review labels Nov 5, 2024

💅

151db3a

nealrichardson mentioned this pull request Nov 6, 2024

[C++][Acero] hash_one not implemented for dictionary and other types #44661

Open

Add issue link to factor issue

a89c1fe

jonkeane reviewed Nov 10, 2024

github-actions bot added the awaiting changes label and removed the awaiting review label Nov 10, 2024

nealrichardson mentioned this pull request Nov 15, 2024

[R] Provide helpful hints for NotImplemented kernel errors #44740

Open

jonkeane approved these changes Dec 3, 2024

github-actions bot added the awaiting merge label and removed the awaiting changes label Dec 3, 2024

jonkeane merged commit 1b3caf6 into apache:main Dec 7, 2024
16 checks passed
