Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compute whole column variance using numerically stable approach #16448

Merged
merged 14 commits into from
Oct 8, 2024

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Jul 31, 2024

Description

We use the pairwise approach of Chan, Golub, and LeVeque (1983).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

We use the pairwise approach of Chan, Golub, and LeVeque (1983).

- Closes rapidsai#16444
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 31, 2024
@wence- wence- added bug Something isn't working non-breaking Non-breaking change and removed libcudf Affects libcudf (C++/CUDA) code. labels Jul 31, 2024
We now need to compare to a tolerance, which was probably the case
before, except we were getting lucky.
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 1, 2024
@wence- wence- marked this pull request as ready for review August 1, 2024 11:31
@wence- wence- requested review from a team as code owners August 1, 2024 11:31
@wence- wence- requested review from shrshi and nvdbaranec August 1, 2024 11:31
@github-actions github-actions bot added the Java Affects Java cuDF API. label Aug 1, 2024
@wence- wence- marked this pull request as draft August 1, 2024 16:07
@wence-
Copy link
Contributor Author

wence- commented Aug 1, 2024

Back into draft while I figure out some edges cases around NAs.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 2, 2024
The ddof argument is meaningless for mean reductions, so we should not
consider it when determining whether the result will be valid.
Now that we don't launch a kernel for some std and variance
reductions, we might just get NA rather than nan. Match old behaviour
by producing a nan of the right type.
@github-actions github-actions bot added CMake CMake build issue cudf.pandas Issues specific to cudf.pandas cudf.polars Issues specific to cudf.polars labels Sep 30, 2024
@github-actions github-actions bot added the pylibcudf Issues specific to the pylibcudf package label Sep 30, 2024
@wence- wence- changed the base branch from branch-24.10 to branch-24.12 September 30, 2024 17:38
@wence- wence- removed CMake CMake build issue cudf.pandas Issues specific to cudf.pandas cudf.polars Issues specific to cudf.polars pylibcudf Issues specific to the pylibcudf package labels Sep 30, 2024
@wence- wence- marked this pull request as ready for review September 30, 2024 17:39
@wence- wence- requested a review from a team as a code owner September 30, 2024 17:39
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion, otherwise LGTM. Thank you for the attention to details throughout this PR!

cpp/src/reductions/compound.cuh Outdated Show resolved Hide resolved
wence- added 2 commits October 1, 2024 10:01
The computed result is now more accurate.
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving latest changes.

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the java side of things, these tests are there just to verify that we plumbed thing up correctly and we are calling into the C++/cuda code correctly. Spark uses a totally different method to compute the population variance in a distributed way so we don't actually call into this code.

@wence-
Copy link
Contributor Author

wence- commented Oct 7, 2024

/merge

Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of very small things.

cpp/include/cudf/reduction/detail/reduction_operators.cuh Outdated Show resolved Hide resolved
cpp/tests/reductions/reduction_tests.cpp Show resolved Hide resolved
@rapids-bot rapids-bot bot merged commit bcf9425 into rapidsai:branch-24.12 Oct 8, 2024
104 checks passed
@wence- wence- deleted the wence/fix/16444 branch October 8, 2024 14:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] whole-column variance calculation uses numerically unstable algorithm
7 participants