GH-29642: [R] Support for .keep_all = TRUE with distinct() #44652

Merged on Dec 7, 2024 (6 commits)


nealrichardson (Member) commented Nov 5, 2024

Rationale for this change

Support a missing feature, `distinct(.keep_all = TRUE)`. This is mostly wiring up the R binding to Acero, then adding docs and tests.

This is mostly picking up where #13934 started and finishing it out. Thanks @mopcup for the initial lift.

What changes are included in this PR?

An aggregation binding, some symbol manipulation, and tests. I also cleaned up some dplyr test shims from 2022.

Are these changes tested?

Yes, though if anyone knows of odd corners in distinct() that aren't covered by this, we can add more tests.

Are there any user-facing changes?

Yes: distinct() on Arrow data now supports `.keep_all = TRUE`.
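
For readers skimming the thread, here is a minimal sketch of the call this PR enables. It assumes an arrow build that includes this change, with dplyr installed; the data frame and its column names are made up for illustration and are not taken from the PR's tests.

```r
library(arrow)
library(dplyr)

# Hypothetical example data: two rows per id.
df <- data.frame(
  id = c(1L, 1L, 2L, 2L),
  value = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Keep one row per distinct `id`, retaining the other columns as well.
# The query is evaluated by Acero and pulled back into R by collect().
result <- arrow_table(df) |>
  distinct(id, .keep_all = TRUE) |>
  collect()
```

The result has one row per `id` and keeps the `value` column; which `value` is kept within each group is decided by Acero, not by row order.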

github-actions bot commented Nov 5, 2024

⚠️ GitHub issue #29642 has been automatically assigned in GitHub to PR creator.

jonkeane (Member) left a comment

Thanks for this! Mostly questions about messaging + conveying some of the nuances

Comment on lines +109 to +111
# Drop factor because of #44661:
# NotImplemented: Function 'hash_one' has no kernel matching input types
# (dictionary<values=string, indices=int8, ordered=0>, uint8)
Member

Is the message on lines 110-111 the error that someone would get if they tried distinct(..., .keep_all = TRUE) with a factor in the table/data.frame?

We might want to make that a bit nicer / more grokable for folks who might not have the dictionary -> factor knowledge top of mind

Member Author

Yeah that's the error message. I'd have to think about how/where best to catch that and translate that to R-speak. As it turns out, dictionary isn't the only unsupported type, it's just the only one we have in this test data frame. I think list types and other non-simple types are also not supported, IIRC from RTFS.
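
Until that error is translated, one possible workaround is to convert the factor to character before calling distinct(). This is a sketch under assumptions, not something the PR prescribes: it assumes the dictionary column is the only unsupported type in the table, that a string column is acceptable downstream, and that the arrow build includes this change. The column names are hypothetical.

```r
library(arrow)
library(dplyr)

# Hypothetical data with a factor column, which Arrow stores as a dictionary.
df <- data.frame(
  id = c(1L, 1L, 2L),
  species = factor(c("cat", "dog", "cat"))
)

# Cast the dictionary column to string first, so the aggregation
# never sees the dictionary type.
result <- arrow_table(df) |>
  mutate(species = as.character(species)) |>
  distinct(id, .keep_all = TRUE) |>
  collect()
```

If the factor levels matter, they can be restored with factor() after collect().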


Comment on lines +31 to +33
# Note: in regular dplyr, `.keep_all = TRUE` returns the first row's value.
# However, Acero's `hash_one` function prefers returning non-null values.
# So, you'll get the same shape of data, but the values may differ.
Member

This behavior change is probably either not-impactful, or if folks are relying on it, that is actually a bug in their code. Though it does seem like something we should mention (in docs at least?).

Or maybe with a one-time warning?

Member Author

It is documented on the acero man page; that's the change to arrow-package.R. I'd rather not add a one-time warning: that's a slippery slope if we were going to be chatty about every subtle difference between how Acero works and how dplyr works on data.frames.
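
To make the difference discussed in this thread concrete, here is a small sketch comparing the two code paths on a group whose first row holds a missing value. It assumes an arrow build that includes this change; the data is made up for illustration.

```r
library(arrow)
library(dplyr)

# Hypothetical data: one group whose first row has a missing value.
df <- data.frame(
  id = c(1L, 1L),
  value = c(NA, "b"),
  stringsAsFactors = FALSE
)

# dplyr on a data.frame keeps the first row of each group,
# so `value` is NA here.
in_memory <- df |>
  distinct(id, .keep_all = TRUE)

# Through Acero, the value is chosen by `hash_one`, which per the code
# comment above prefers non-null values, so it may differ from dplyr's.
via_acero <- arrow_table(df) |>
  distinct(id, .keep_all = TRUE) |>
  collect()
```

Both results have one row for `id` 1. Code that depends on which row's values are retained should not assume the two paths agree.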

rkrug commented Dec 3, 2024

Is there any chance of getting this merged? I would very much like to use it in my code!

Thanks.

jonkeane (Member) left a comment

My last review should have been an approval with comments rather than just a comment.

github-actions bot added the "awaiting merge" label and removed the "awaiting changes" label on Dec 3, 2024

rkrug commented Dec 3, 2024

Thanks. So it will be in the next release?

jonkeane (Member) commented Dec 7, 2024

It missed the latest release, but once it merges it'll be in the following one. Alternatively, you could use nightly builds starting the day after it merges.

jonkeane merged commit 1b3caf6 into apache:main on Dec 7, 2024
16 checks passed

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 1b3caf6.

There were 132 benchmark results with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
