-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-35786: [C++] Add pairwise_diff function #35787
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
|
@lidavidm @pitrou @westonpace Hi guys, would any of you mind taking a look at this PR? It's been sitting around for almost a month. Thanks in advance! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mostly looks good to me. However, in light of the strong relationship between this Function and subtract I'd prefer to see more reuse of that function's existing logic. For example when retrieving the output type for pairwise_diff($in_type), couldn't we just as easily call subtract's DispatchExact with ($in_type, $in_type)? And instead of referencing Subtract's Ops directly, couldn't we use the kernel retrieved from subtract's DispatchExact- just passing the relevant slices of the input as the arguments to the subtract kernel? I think this would have negligible impact on performance and would greatly reduce the future maintenance burden for this function, since any new types added to subtract will then automatically be supported by pairwise_diff.
3f037b6
to
6741ba2
Compare
Hi @bkietz, do you mean something like this: ARROW_ASSIGN_OR_RAISE(auto subtract_func, registry->GetFunction("subtract"));
for (const auto& type : types) {
ARROW_ASSIGN_OR_RAISE(auto kernel,
subtract_func->DispatchExact({type, type}));
// reuse kernel's exec and signature
} |
@js8544 yes, exactly. I think we might be able to go a step further and loop over subtract's kernels directly, without the need to list input types explicitly and go through DispatchExact. (sorry for the spurious close; bumped the wrong button) |
That sounds better indeed! I'll try this now. |
Co-authored-by: Benjamin Kietzman <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
@bkietz I've changed it to looping over subtract's kernels and wrapping their signature and kernel exec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great, thanks for refactoring!
Just a few nits
auto offset = abs(periods); | ||
offset = std::min(offset, input.length); | ||
auto exec_length = input.length - offset; | ||
// prepare bitmap | ||
auto null_start = periods > 0 ? 0 : exec_length; | ||
auto non_null_start = periods > 0 ? offset : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: could we call these two regions "computed" and "margin"? I think that'll be more obvious for the next maintainer, and it'd help if we move all the start/length calculation to this preamble too
auto offset = abs(periods); | |
offset = std::min(offset, input.length); | |
auto exec_length = input.length - offset; | |
// prepare bitmap | |
auto null_start = periods > 0 ? 0 : exec_length; | |
auto non_null_start = periods > 0 ? offset : 0; | |
// We only compute values in the region where the input-with-offset overlaps | |
// the original input. The margin where these do not overlap gets filled with null. | |
auto margin_length = std::min(abs(periods), input.length); | |
auto computed_length = input.length - margin_length; | |
auto margin_start = periods > 0 ? 0 : computed_length; | |
auto left_start = periods > 0 ? margin_length : 0; | |
auto right_start = periods > 0 ? 0 : margin_length; | |
// ... | |
// prepare bitmap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. It's much clearer now. thanks for the suggestion!
Co-authored-by: Benjamin Kietzman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
macOs glib&python failures are network failures while communicating with brew https://github.com/apache/arrow/actions/runs/5408688637/jobs/9835499198?pr=35787 macOS c++ failure is a known issue #36329 ubuntu c++ failure is an s3 flake https://github.com/apache/arrow/actions/runs/5408688635/jobs/9828017525?pr=35787#step:7:4892 R test failure is a known issue #36346 |
Conbench analyzed the 6 benchmark runs on commit There were 4 benchmark results indicating a performance regression:
The full Conbench report has more details. |
Rationale for this change
Add a
pairwise_diff
function similar to pandas' Series.Diff, the function computes the first order difference of an array.What changes are included in this PR?
I followed these instructions. The function is implemented for numerical, temporal and decimal types. Chuck arrays are not yet supported.
Are these changes tested?
Yes. They are tested in vector_pairwise_test.cc and in python/pyarrow/tests/compute.py.
Are there any user-facing changes?
Yes, and docs are also updated in this PR.