Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement hypothesis-based tests for linear models #4952

Conversation

csadorf
Copy link
Contributor

@csadorf csadorf commented Oct 26, 2022

Closes #4943.

@github-actions github-actions bot added the Cython / Python Cython or Python issue label Oct 26, 2022
@csadorf csadorf changed the title [WIP] Implement first prototype for hypothesis-based linear model testing. [WIP] Implement hypothesis-based tests for linear models Oct 26, 2022
@csadorf csadorf added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Oct 26, 2022
@csadorf csadorf force-pushed the fea-hypothesis-based-testing-for-cpu-gpu-comparison branch from 80ee4e8 to b4c3155 Compare October 26, 2022 17:01
@csadorf csadorf force-pushed the fea-hypothesis-based-testing-for-cpu-gpu-comparison branch from b4c3155 to fe6d204 Compare October 26, 2022 17:07
@csadorf
Copy link
Contributor Author

csadorf commented Oct 26, 2022

Issues I am currently observing:

  1. The test runtime appears to be highly variable without severely restricting the input size and maximum number of experiments.
  2. The cuml estimator converts dtypes and will sometimes fail with an error regarding lossy transformation (failed to copy the traceback, but should be relatively easy to reproduce...).

I will run some benchmarks, especially to address point one.

@csadorf csadorf requested a review from wphicks October 26, 2022 17:18
Copy link
Contributor

@wphicks wphicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great! Love the general approach and the details of dataset generation.

python/cuml/tests/test_linear_model.py Outdated Show resolved Hide resolved
python/cuml/tests/test_strategies.py Outdated Show resolved Hide resolved
python/cuml/testing/strategies.py Outdated Show resolved Hide resolved
python/cuml/testing/strategies.py Outdated Show resolved Hide resolved
@csadorf
Copy link
Contributor Author

csadorf commented Oct 27, 2022

Dumping some statistics for benchmarking:

cuml/tests/test_linear_model.py::test_linear_regression_model_default:

  - during generate phase (2.26 seconds):
    - Typical runtimes: 2-101 ms, ~ 49% in data generation
    - 36 passing examples, 1 failing examples, 29 invalid examples
    - Found 1 distinct error in this phase

  - during shrink phase (36.36 seconds):
    - Typical runtimes: 2-161 ms, ~ 58% in data generation
    - 130 passing examples, 298 failing examples, 612 invalid examples
    - Tried 1040 shrinks of which 285 were successful

  - Highest target score: 66931.9  (label='')
  - Stopped because nothing left to do

This was run with the array_equal assert disabled to not interfere with the test execution on failures.

@wphicks
Copy link
Contributor

wphicks commented Oct 27, 2022

Those benchmark times don't worry me too much. If the shrink phase takes a little while, that's fine. It means that Hypothesis has found something, and 30 seconds of CI time is a small price to pay to get us a really good reproducer to work with. We can continue to assess the test time as we roll this out more generally and see if we need to add e.g. additional pytest configuration options to control the impact (for both dev and CI).

Comment on lines 42 to 54
def array_difference(a, b, with_sign=True):
"""
Utility function to compute the difference between 2 arrays.
"""
a = to_nparray(a)
b = to_nparray(b)

if len(a) == 0 and len(b) == 0:
return 0

if not with_sign:
a, b = np.abs(a), np.abs(b)
return np.sum(np.abs(a - b))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A common issue we've had is that sometimes it is very hard to diagnose the error or magnitude of errors when tests fail, adding some printing when failing here might be highly benefitial

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function does not directly fail tests, but only computes the difference between arrays a and b similar to array_equal(). I use it as a target function to steer hypothesis towards larger errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in #4973 .

python/cuml/tests/test_linear_model.py Outdated Show resolved Hide resolved
@csadorf csadorf requested review from dantegd and wphicks October 28, 2022 10:42
@csadorf
Copy link
Contributor Author

csadorf commented Oct 28, 2022

Here is an example for the output of an actual failure case.

python/cuml/tests/test_linear_model.py Outdated Show resolved Hide resolved
@csadorf
Copy link
Contributor Author

csadorf commented Oct 28, 2022

@wphicks I guess we can't merge this without also addressing the failures? Should I create a longer-running feature branch?

@csadorf csadorf changed the title [WIP] Implement hypothesis-based tests for linear models Implement hypothesis-based tests for linear models Oct 28, 2022
@csadorf csadorf marked this pull request as ready for review October 28, 2022 17:08
@csadorf csadorf marked this pull request as ready for review November 4, 2022 19:20
Copy link
Contributor

@wphicks wphicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides my one xfail comment, this looks great! Really like the current form of the dataset generation strategy.

At this point, I'd love to get this in ASAP so that we can go ahead and start applying this to the CPU/GPU algorithms that we're pulling in with this release.

python/cuml/tests/test_linear_model.py Outdated Show resolved Hide resolved
@csadorf csadorf requested a review from wphicks November 7, 2022 16:53
Copy link
Contributor

@wphicks wphicks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the update. LGTM!

@wphicks
Copy link
Contributor

wphicks commented Nov 7, 2022

@dantegd Any final thoughts on this or do we feel good about merging?

@csadorf
Copy link
Contributor Author

csadorf commented Nov 8, 2022

@wphicks One of the tests appeared to be flaky with respect to the efficiency of example generation. I've suppressed the healthcheck for that particular test, but we will have to monitor whether those tests generally become flaky and consider to suppress these healthchecks globally to avoid surprises later on.

@wphicks
Copy link
Contributor

wphicks commented Nov 8, 2022

Yep, the healthchecks can become an issue with a lot of assumptions on the dataset generation. I'd say we shouldn't be afraid to suppress them, but we should address them in the generation strategy itself if it begins to impact CI time.

@wphicks
Copy link
Contributor

wphicks commented Nov 9, 2022

@gpucibot merge

@wphicks
Copy link
Contributor

wphicks commented Nov 9, 2022

@dantegd Could you dismiss your review when you get a moment? (Or re-review if you're still thinking this one over)

@cjnolet cjnolet dismissed dantegd’s stale review November 10, 2022 15:30

Review items taken care of.

@cjnolet
Copy link
Member

cjnolet commented Nov 10, 2022

rerun tests

@cjnolet
Copy link
Member

cjnolet commented Nov 10, 2022

Looks like we got a conda timeout error in one of the gpu test builds.

@csadorf
Copy link
Contributor Author

csadorf commented Nov 11, 2022

rerun tests

@codecov-commenter
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.12@0beb45f). Click here to learn what that means.
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.12    #4952   +/-   ##
===============================================
  Coverage                ?   79.38%           
===============================================
  Files                   ?      184           
  Lines                   ?    11698           
  Branches                ?        0           
===============================================
  Hits                    ?     9287           
  Misses                  ?     2411           
  Partials                ?        0           
Flag Coverage Δ
dask 45.93% <0.00%> (?)
non-dask 68.92% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@rapids-bot rapids-bot bot merged commit df2fbbe into rapidsai:branch-22.12 Nov 11, 2022
@csadorf csadorf deleted the fea-hypothesis-based-testing-for-cpu-gpu-comparison branch November 14, 2022 10:11
jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Feb 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add utility for Hypothesis-based comparisons of CPU and GPU algorithm implementations
5 participants