Regression testing #1233

Merged · 91 commits merged into automl:development on Sep 3, 2021
Conversation

eddiebergman (Contributor)

Regression Testing

This PR implements a GitHub workflow to allow for some basic regression testing of auto-sklearn, tracking performance between new PRs that come into the development branch. The results are currently stored as 90-day artifacts, the maximum retention allowed by GitHub's upload-artifact action.

The testing is done on 3 classification and 3 regression datasets, which can be seen in .github/workflows/benchmarking-files/benchmarks/{classification/regression}.yaml. Each benchmark lasts ~5h20m, with 10 folds run on each dataset.
Longer benchmarks would run into the 6 hour limit of GitHub Actions runners.

For comparison, the mean values over the folds are used, with the results stored as artifacts located on the same page as the workflow run that executed the tests. A comment is also generated on the PR to give a graphical overview. A rough sketch of the comparison step is given below.
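As an illustration of the mean-over-folds comparison, here is a minimal sketch; the file and column names are assumptions for the example, not necessarily those used by regressions-util.py:

```python
import pandas as pd

# Illustrative file and column names only; the real result files may differ.
baseline = pd.read_csv("baseline_results.csv")  # columns: task, metric, fold, result
target = pd.read_csv("target_results.csv")

# Average over the folds of each (task, metric) pair before comparing.
baseline_means = baseline.groupby(["task", "metric"])["result"].mean()
target_means = target.groupby(["task", "metric"])["result"].mean()

comparison = pd.DataFrame({"baseline": baseline_means, "target": target_means})
comparison["change"] = comparison["target"] - comparison["baseline"]
print(comparison.round(4))
```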

Workflows

  • generate-baselines.yml - On every push to the development branch, an automatic workflow creates two baseline files, one for regression and one for classification. This means the latest development branch is always used as the performance baseline.
  • regressions.yml - Labeling a PR with 'regression-testing' triggers this workflow, which uses the same datasets as generate-baselines.yml to enable a comparison. It pulls the latest successful results from generate-baselines and then compares the two.
    • If a baseline run fails on development, regressions.yml falls back to the last successfully generated baselines, even if they are not up to date. This behavior can be changed if required so that regressions.yml fails whenever the latest generate-baselines run has failed; see the documentation of download-workflow-artifact. A sketch of this download step is shown after this list.
  • Both of these workflows can also be triggered manually.
  • Both of these workflows rely on the directory .github/workflows/benchmarking-files to pull down the relevant files for benchmarking.
  • regressions-util.py - This utility file is used in regressions.yml to compare the baseline results against those of the targeted branch. It also provides utilities for generating the markdown used in the comment body that gives an overview of the results.
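A minimal sketch of the baseline download step, assuming the GitHub REST API is queried directly (the actual workflow uses the download-workflow-artifact action; the repository, workflow file, and token values here are placeholders):

```python
import io
import zipfile

import requests

# Placeholder values for illustration only.
REPO = "automl/auto-sklearn"
WORKFLOW_FILE = "generate-baselines.yml"
TOKEN = "<token with actions:read scope>"

headers = {"Authorization": f"token {TOKEN}"}

# Ask only for *successful* runs of the baseline workflow, newest first.
runs = requests.get(
    f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW_FILE}/runs",
    params={"status": "success", "per_page": 1},
    headers=headers,
).json()["workflow_runs"]

if not runs:
    raise RuntimeError("No successful generate-baselines run to compare against")

# Download and unpack every artifact attached to that run.
artifacts = requests.get(runs[0]["artifacts_url"], headers=headers).json()["artifacts"]
for artifact in artifacts:
    archive = requests.get(artifact["archive_download_url"], headers=headers)
    zipfile.ZipFile(io.BytesIO(archive.content)).extractall(artifact["name"])
```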

Future Considerations

  • The 90 day retention limit is possibly prohibitive, and results must be downloaded before they expire. The alternatives to the current solution are:

    • Upload and download to a hosted remote storage server. The saved files are small, so most services would be sufficient.
    • Set up an automated job, or a manual script, that crawls GitHub's API to download runs automatically.
    • Manually download runs of interest.

    Seeing as most result runs will not be of significant interest to keep long-term, the 90 day limit is not overly prohibitive, as runs can be repeated if really required in the future.

  • The current benchmark is limited to 3 regression and 3 classification datasets. These are run across 2 GitHub runners, giving a cumulative 12 hours of benchmarking time (2 runners × 6 hours each). This could be parallelized much further, which is probably advisable once we are confident that it provides sufficient utility.

  • The color coding for the comments relies on the % change between the baseline result and the targeted branch result. While it is purely a visual indicator and the numeric values are also provided, it could bias decision making unless the boundaries between color codes are chosen smartly. These boundaries are given by the dict metric_tolerances within regressions-util.py and should be updated so that only "significant" differences are strongly highlighted with color.
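A rough sketch of the color-coding idea, with made-up tolerance values (the real boundaries are whatever is defined in metric_tolerances inside regressions-util.py):

```python
# Illustrative values only; the real boundaries live in regressions-util.py.
metric_tolerances = {
    "acc": 0.01,
    "balacc": 0.01,
    "auc": 0.01,
    "logloss": 0.05,
    "r2": 0.01,
    "rmse": 0.5,
    "mae": 0.5,
}

# Metrics for which a decrease is an improvement.
LOWER_IS_BETTER = {"logloss", "rmse", "mae"}


def color_for(metric: str, change: float) -> str:
    """Map the change in a metric to a hex color used in the comment table."""
    tolerance = metric_tolerances.get(metric, 0.01)
    if abs(change) <= tolerance:
        return "#353536"  # neutral grey: difference within tolerance
    improved = change < 0 if metric in LOWER_IS_BETTER else change > 0
    return "#6fe600" if improved else "#ff0000"  # green = better, red = worse
```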

Sample Comment with 1 Benchmark Dataset


Hello @eddiebergman,

We are doing regression tests for

  • Branch new_branch41
  • Commit 1bd33a7

Progress and Artifacts

A summary of the results will be shown in this comment once complete, but the full results will be available as an artifact at the above link.

Results

Overall, the targeted version's performance across 2 task(s) and 7 metric(s):

  • Equally on 4 comparisons
  • Better on 3 comparisons
  • Worse on 7 comparisons

There were 4 task(s) that could not be compared.

The average change for each metric is:

  • r2: 0.0474 across 1 task(s) (#6fe600)
  • acc: 0.0290 across 1 task(s) (#6fe600)
  • auc: 0.0509 across 1 task(s) (#6fe600)
  • rmse: -6.4844 across 1 task(s) (#353536)
  • balacc: 0.0874 across 1 task(s) (#6fe600)
  • mae: -5.0851 across 1 task(s) (#353536)
  • logloss: -0.0701 across 1 task(s) (#353536)
| task | metric | acc | auc | balacc | logloss | mae | r2 | rmse |
|---|---|---|---|---|---|---|---|---|
| credit-g | auc | B: 0.751, T: 0.780, Δ: 0.029 (#51a800) | B: 0.793, T: 0.844, Δ: 0.051 (#51a800) | B: 0.622, T: 0.710, Δ: 0.087 (#51a800) | B: 0.588, T: 0.517, Δ: -0.070 (#51a800) | nan | nan | nan |
| cnae-9 | logloss | --- | --- | --- | --- | --- | --- | --- |
| kc1 | auc | --- | --- | --- | --- | --- | --- | --- |
| cholesterol | neg_rmse | nan | nan | nan | nan | B: 38.958, T: 33.873, Δ: -5.085 (#51a800) | B: -0.063, T: -0.015, Δ: 0.047 (#ff0000) | B: 50.837, T: 44.353, Δ: -6.484 (#51a800) |
| liver-disorders | neg_rmse | --- | --- | --- | --- | --- | --- | --- |
| house-prices-nominal | neg_rmse | --- | --- | --- | --- | --- | --- | --- |

Color scale (worse → better): #ff0000, #bd0000, #800000, #353536, #306300, #51a800, #6fe600 || B - Baseline || T - Target Version || Bold - Training Metric || / - Missing Value || --- - Missing Task || NaN - #52544f || Neutral - #353536


@eddiebergman eddiebergman changed the base branch from master to development August 30, 2021 05:23
codecov bot commented Aug 30, 2021

Codecov Report

Merging #1233 (ae39295) into development (a38a5c3) will increase coverage by 0.10%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           development    #1233      +/-   ##
===============================================
+ Coverage        88.11%   88.22%   +0.10%     
===============================================
  Files              139      139              
  Lines            10993    11037      +44     
===============================================
+ Hits              9687     9737      +50     
+ Misses            1306     1300       -6     
| Impacted Files | Coverage Δ |
|---|---|
| autosklearn/estimators.py | 93.42% <0.00%> (+0.09%) ⬆️ |
| autosklearn/util/logging_.py | 88.96% <0.00%> (+0.68%) ⬆️ |
| autosklearn/automl.py | 87.06% <0.00%> (+1.08%) ⬆️ |
| ...eline/components/feature_preprocessing/fast_ica.py | 97.82% <0.00%> (+4.34%) ⬆️ |

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a38a5c3...ae39295. Read the comment docs.

mfeurer (Contributor) left a comment
Some minor comments, but I guess this mostly looks good.

eddiebergman (Contributor, Author)

Don't merge yet, I'm testing on my fork to see that it actually works. I will let you know if it's functional

eddiebergman (Contributor, Author)

Okay it seems to work correctly now

@mfeurer mfeurer merged commit 19a9573 into automl:development Sep 3, 2021
github-actions bot pushed a commit that referenced this pull request Sep 3, 2021