GH-35786: [C++] Add pairwise_diff function #35787

Merged (17 commits) on Jun 29, 2023

Conversation

js8544
Collaborator

@js8544 js8544 commented May 26, 2023

Rationale for this change

Add a pairwise_diff function similar to pandas' Series.diff. The function computes the first-order difference of an array.
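
To make the intended semantics concrete, here is a minimal sketch in plain C++ with no Arrow dependency. `NullableInt64`, `PairwiseDiffSketch`, and the demo function are hypothetical names used only for illustration. The sketch assumes, following pandas' `Series.diff`, that each output slot is `in[i] - in[i - periods]`, that slots without a partner element are null, and that a null input yields a null output.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a nullable int64 array: std::nullopt plays the role of null.
using NullableInt64 = std::vector<std::optional<int64_t>>;

// out[i] = in[i] - in[i - periods] when both indices are in range and both
// values are non-null; every other slot is left null.
NullableInt64 PairwiseDiffSketch(const NullableInt64& in, int64_t periods = 1) {
  const int64_t n = static_cast<int64_t>(in.size());
  NullableInt64 out(in.size(), std::nullopt);
  for (int64_t i = 0; i < n; ++i) {
    const int64_t j = i - periods;
    if (j < 0 || j >= n) continue;  // no partner element: slot stays null
    if (in[i] && in[j]) out[i] = *in[i] - *in[j];
  }
  return out;
}

// Usage: differences of consecutive squares with the default direction.
void PairwiseDiffSketchDemo() {
  const NullableInt64 squares = {1, 4, 9, 16};
  const NullableInt64 forward = PairwiseDiffSketch(squares, 1);
  assert(!forward[0].has_value());  // first slot has no predecessor
  assert(*forward[1] == 3 && *forward[2] == 5 && *forward[3] == 7);
}
```

A negative `periods` flips the direction in this sketch, so the null slots move to the end of the output.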

What changes are included in this PR?

I followed these instructions. The function is implemented for numeric, temporal, and decimal types. Chunked arrays are not yet supported.

Are these changes tested?

Yes. They are tested in vector_pairwise_test.cc and in python/pyarrow/tests/compute.py.

Are there any user-facing changes?

Yes, and docs are also updated in this PR.

@js8544 js8544 requested review from AlenkaF and westonpace as code owners May 26, 2023 10:53
@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@js8544 js8544 changed the title from "Add pairwise diff function" to "GH-35786: [C++] Add pairwise_diff function" May 26, 2023
@github-actions github-actions bot added the "awaiting review" label May 26, 2023
@github-actions

⚠️ GitHub issue #35786 has been automatically assigned in GitHub to PR creator.

@js8544
Collaborator Author

js8544 commented Jun 20, 2023

@lidavidm @pitrou @westonpace Hi guys, would any of you mind taking a look at this PR? It's been sitting around for almost a month. Thanks in advance!

Member

@lidavidm lidavidm left a comment

Seems reasonable to me

@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting review" label Jun 22, 2023
@js8544
Collaborator Author

js8544 commented Jun 26, 2023

@pitrou @bkietz Would you mind having some extra look at this PR? Thanks!

Member

@bkietz bkietz left a comment

This mostly looks good to me. However, in light of the strong relationship between this Function and subtract, I'd prefer to see more reuse of that function's existing logic. For example, when retrieving the output type for pairwise_diff($in_type), couldn't we just as easily call subtract's DispatchExact with ($in_type, $in_type)? And instead of referencing Subtract's Ops directly, couldn't we use the kernel retrieved from subtract's DispatchExact, just passing the relevant slices of the input as the arguments to the subtract kernel? I think this would have negligible impact on performance and would greatly reduce the future maintenance burden for this function, since any new types added to subtract will then automatically be supported by pairwise_diff.
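
The slicing idea in this comment can be sketched without Arrow. In the snippet below, `BinaryKernel`, `SubtractKernel`, and `PairwiseViaBinaryKernel` are hypothetical stand-ins, not Arrow APIs: the point is only that a pairwise difference reduces to one call of an existing binary kernel on two overlapping slices of the same input. The sketch assumes `periods` is non-negative and returns only the computed region, leaving out the null margin.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a binary compute kernel: it reads two
// equal-length slices and writes `length` results.
using BinaryKernel = std::function<void(const int64_t* left, const int64_t* right,
                                        std::size_t length, int64_t* out)>;

// Element-wise subtraction, playing the role of the kernel that a
// dispatch on (int64, int64) would hand back.
void SubtractKernel(const int64_t* left, const int64_t* right, std::size_t length,
                    int64_t* out) {
  for (std::size_t i = 0; i < length; ++i) out[i] = left[i] - right[i];
}

// Pairwise operation for periods >= 0: the left operand is the input shifted
// by `periods`, the right operand is the input itself. No per-type logic
// lives here; it all comes from `kernel`.
std::vector<int64_t> PairwiseViaBinaryKernel(const std::vector<int64_t>& input,
                                             std::size_t periods,
                                             const BinaryKernel& kernel) {
  assert(periods <= input.size());
  const std::size_t computed_length = input.size() - periods;
  std::vector<int64_t> out(computed_length);
  kernel(input.data() + periods, input.data(), computed_length, out.data());
  return out;
}
```

Because the wrapper never inspects element values itself, swapping in a kernel for another type would not require touching the pairwise logic, which is the maintenance benefit the comment describes.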

cpp/src/arrow/compute/api_vector.h (outdated)
cpp/src/arrow/compute/kernels/vector_pairwise.cc (outdated)
docs/source/cpp/compute.rst (outdated)
@github-actions github-actions bot removed the "awaiting merge" label Jun 26, 2023
@js8544 js8544 force-pushed the jinshang/diff_kernel branch from 3f037b6 to 6741ba2 Compare June 28, 2023 06:52
@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Jun 28, 2023
@js8544
Collaborator Author

js8544 commented Jun 28, 2023

This mostly looks good to me. However, in light of the strong relationship between this Function and subtract, I'd prefer to see more reuse of that function's existing logic. For example, when retrieving the output type for pairwise_diff($in_type), couldn't we just as easily call subtract's DispatchExact with ($in_type, $in_type)? And instead of referencing Subtract's Ops directly, couldn't we use the kernel retrieved from subtract's DispatchExact, just passing the relevant slices of the input as the arguments to the subtract kernel? I think this would have negligible impact on performance and would greatly reduce the future maintenance burden for this function, since any new types added to subtract will then automatically be supported by pairwise_diff.

Hi @bkietz, do you mean something like this:

  ARROW_ASSIGN_OR_RAISE(auto subtract_func, registry->GetFunction("subtract"));
  for (const auto& type : types) {
    ARROW_ASSIGN_OR_RAISE(auto kernel,
                          subtract_func->DispatchExact({type, type}));
    // reuse kernel's exec and signature
  }

@bkietz
Member

bkietz commented Jun 28, 2023

@js8544 yes, exactly. I think we might be able to go a step further and loop over subtract's kernels directly, without the need to list input types explicitly and go through DispatchExact.

(sorry for the spurious close; bumped the wrong button)

@bkietz bkietz closed this Jun 28, 2023
@bkietz bkietz reopened this Jun 28, 2023
@js8544
Collaborator Author

js8544 commented Jun 28, 2023

I think we might be able to go a step further and loop over subtract's kernels directly, without the need to list input types explicitly and go through DispatchExact.

That sounds better indeed! I'll try this now.

@js8544
Collaborator Author

js8544 commented Jun 28, 2023

@bkietz I've changed it to looping over subtract's kernels and wrapping their signature and kernel exec.
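
As a rough illustration of looping over an existing function's kernels and wrapping each one, consider the sketch below. It does not use Arrow: `BinaryKernelEntry`, `PairwiseKernelEntry`, and `DerivePairwiseKernels` are hypothetical, type names are plain strings, and every payload is a `std::vector<double>` to keep the example short. Skipping kernels whose two input types differ is an assumption made for this sketch, as is the restriction to non-negative `periods`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical record for one kernel of a binary function such as subtract.
struct BinaryKernelEntry {
  std::string left_type, right_type, out_type;
  std::function<std::vector<double>(const std::vector<double>&,
                                    const std::vector<double>&)> exec;
};

// Hypothetical record for one derived pairwise kernel.
struct PairwiseKernelEntry {
  std::string in_type, out_type;
  std::function<std::vector<double>(const std::vector<double>&, std::size_t)> exec;
};

// Derive one pairwise kernel per binary kernel whose inputs share a type.
// The output type is copied from the binary kernel's signature, so it is
// never spelled out here.
std::vector<PairwiseKernelEntry> DerivePairwiseKernels(
    const std::vector<BinaryKernelEntry>& binary_kernels) {
  std::vector<PairwiseKernelEntry> derived;
  for (const auto& kernel : binary_kernels) {
    assert(static_cast<bool>(kernel.exec));  // every entry must be callable
    if (kernel.left_type != kernel.right_type) continue;  // assumption of this sketch
    auto binary_exec = kernel.exec;
    derived.push_back(
        {kernel.left_type, kernel.out_type,
         [binary_exec](const std::vector<double>& in, std::size_t periods) {
           // Clamp the shift, then hand the two overlapping slices to the
           // wrapped binary kernel. Only the computed region is returned.
           const auto shift =
               static_cast<std::ptrdiff_t>(std::min(periods, in.size()));
           std::vector<double> left(in.begin() + shift, in.end());
           std::vector<double> right(in.begin(), in.end() - shift);
           return binary_exec(left, right);
         }});
  }
  return derived;
}
```

With this shape, registering a new same-type kernel on the binary function makes a matching pairwise kernel appear with no further changes to the derivation loop.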

@js8544 js8544 requested a review from bkietz June 28, 2023 14:53
Member

@bkietz bkietz left a comment

This is looking great, thanks for refactoring!
Just a few nits

cpp/src/arrow/compute/kernels/vector_pairwise.cc (outdated)
Comment on lines 63 to 68
auto offset = abs(periods);
offset = std::min(offset, input.length);
auto exec_length = input.length - offset;
// prepare bitmap
auto null_start = periods > 0 ? 0 : exec_length;
auto non_null_start = periods > 0 ? offset : 0;
Member

Nit: could we call these two regions "computed" and "margin"? I think that'll be more obvious for the next maintainer, and it'd help if we move all the start/length calculation to this preamble too

Suggested change
- auto offset = abs(periods);
- offset = std::min(offset, input.length);
- auto exec_length = input.length - offset;
- // prepare bitmap
- auto null_start = periods > 0 ? 0 : exec_length;
- auto non_null_start = periods > 0 ? offset : 0;
+ // We only compute values in the region where the input-with-offset overlaps
+ // the original input. The margin where these do not overlap gets filled with null.
+ auto margin_length = std::min(abs(periods), input.length);
+ auto computed_length = input.length - margin_length;
+ auto margin_start = periods > 0 ? 0 : computed_length;
+ auto left_start = periods > 0 ? margin_length : 0;
+ auto right_start = periods > 0 ? 0 : margin_length;
+ // ...
+ // prepare bitmap
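
The region arithmetic in this suggestion can be checked in isolation. The sketch below copies the five expressions from the suggested change into a standalone function; the `PairwiseRegions` struct and the `ComputeRegions` name are hypothetical wrappers added here so the values can be inspected, and the input is reduced to a plain `length`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical holder for the values named in the suggested change.
struct PairwiseRegions {
  int64_t margin_length;    // number of output slots filled with null
  int64_t computed_length;  // number of output slots that get a real value
  int64_t margin_start;     // where the null margin begins in the output
  int64_t left_start;       // offset of the left operand slice in the input
  int64_t right_start;      // offset of the right operand slice in the input
};

PairwiseRegions ComputeRegions(int64_t periods, int64_t length) {
  assert(length >= 0);
  PairwiseRegions r{};
  // Clamping to `length` keeps computed_length non-negative when
  // |periods| exceeds the input length.
  r.margin_length = std::min<int64_t>(std::abs(periods), length);
  r.computed_length = length - r.margin_length;
  r.margin_start = periods > 0 ? 0 : r.computed_length;
  r.left_start = periods > 0 ? r.margin_length : 0;
  r.right_start = periods > 0 ? 0 : r.margin_length;
  return r;
}
```

For a length-4 input, `periods = 1` puts the one-slot margin at the front of the output, while `periods = -1` puts it at the end and swaps which slice is shifted.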

Collaborator Author

Updated. It's much clearer now. Thanks for the suggestion!

cpp/src/arrow/compute/kernels/vector_pairwise.cc (outdated)
cpp/src/arrow/compute/kernels/vector_pairwise.cc (outdated)
cpp/src/arrow/compute/kernels/vector_pairwise.cc (outdated)
docs/source/cpp/compute.rst (outdated)
@github-actions github-actions bot added the "awaiting changes" and "awaiting change review" labels and removed the "awaiting change review" and "awaiting changes" labels Jun 28, 2023
js8544 and others added 2 commits June 29, 2023 11:19
@js8544 js8544 requested a review from bkietz June 29, 2023 03:21
Member

@bkietz bkietz left a comment

LGTM, thanks!

@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label Jun 29, 2023
@bkietz
Member

bkietz commented Jun 29, 2023

The macOS GLib and Python failures are network failures while communicating with brew: https://github.com/apache/arrow/actions/runs/5408688637/jobs/9835499198?pr=35787

The macOS C++ failure is a known issue: #36329

The Ubuntu C++ failure is an S3 flake: https://github.com/apache/arrow/actions/runs/5408688635/jobs/9828017525?pr=35787#step:7:4892

R test failure is a known issue #36346

@bkietz bkietz merged commit 26c25d1 into apache:main Jun 29, 2023
@bkietz bkietz removed the "awaiting merge" label Jun 29, 2023
@conbench-apache-arrow

Conbench analyzed the 6 benchmark runs on commit 26c25d1b.

There were 4 benchmark results indicating a performance regression:

The full Conbench report has more details.

Development

Successfully merging this pull request may close these issues.

[C++] Add pairwise_diff function
3 participants