Regression testing #1233
Conversation
Codecov Report
|          | development | #1233  | +/-    |
| -------- | ----------- | ------ | ------ |
| Coverage | 88.11%      | 88.22% | +0.10% |
| Files    | 139         | 139    |        |
| Lines    | 10993       | 11037  | +44    |
| Hits     | 9687        | 9737   | +50    |
| Misses   | 1306        | 1300   | -6     |
Continue to review full report at Codecov.
Some minor comments, but I guess this mostly looks good.
Don't merge yet, I'm testing on my fork to see that it actually works. I will let you know if it's functional.
Okay, it seems to work correctly now.
Regression Testing
This PR implements a GitHub workflow to allow for some basic regression testing of auto-sklearn, tracking performance between new PRs that come into the development branch. The results are currently stored as 90-day artifacts, the maximum retention allowed by GitHub's upload-artifact action.
The testing is done on 3 classification and 3 regression datasets, which can be seen in `.github/workflows/benchmarking-files/benchmarks/{classification/regression}.yaml`. The benchmark for each lasts ~5h20m, with 10 folds run on each dataset. Longer benchmarks would run into the 6 hour limit of GitHub Actions runners.
For comparison, the mean values over the folds are used, with the results stored as artifacts located on the same page as the workflow run that executes the tests. A comment is also generated on the PR to give a graphical overview.
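A minimal sketch of the mean-over-folds aggregation described above, assuming each benchmark run writes a CSV with one row per fold (the file layout and column names here are assumptions, not the actual result format used by the workflows):

```python
# Sketch only: assumes a results CSV with a "task" column and one column per
# metric, one row per fold. The real benchmark output format may differ.
import pandas as pd

def aggregate_folds(results_csv: str) -> pd.DataFrame:
    """Collapse per-fold results into a single mean value per task and metric."""
    results = pd.read_csv(results_csv)
    # Mean over the 10 folds for every metric column, grouped by dataset/task
    return results.groupby("task").mean(numeric_only=True)

# e.g. aggregate_folds("classification_results.csv").to_csv("baseline_classification.csv")
```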
Workflows
- `generate-baselines.yml` - On push to the development branch, an automatic workflow creates two baseline files, one for regression and one for classification. This means the latest development branch is always used as the performance benchmark.
- `regressions.yml` - Labeling a PR with 'regression-testing' triggers this workflow, which uses the same datasets as `generate-baselines.yml` to enable a comparison. It pulls the latest successful results from `generate-baselines` and then performs a comparison between them. `regressions` will take the last successful baselines that were generated, even if they are not up to date; this behavior can be changed if required so that `regressions` fails when the latest `generate-baselines` run has failed (see the docs of `download-workflow-artifact`). The relevant files for benchmarking are pulled down from `.github/workflows/benchmarking-files`.
- `regressions-util.py` - This utility file is used in `regressions.yaml` to perform comparisons between the baseline and the targeted branch results. It also provides utilities for generating the markdown used in the comment body that gives an overview of the results (see the sketch after this list).
Future Considerations
The 90 day time limit is possibly prohibitive and the results must be downloaded. The alternatives to the current solution are:
Seeing as most result runs will not be of significant interest to keep long-term, the 90 day limit is not overly prohibitive as they can be rerun if really required in the future.
The current benchmark is limited to 3 regression and 3 classification datasets. These are run across 2 GitHub runners, allowing for a cumulative 12 hours of benchmarking time spread across 2 servers running for 6 hours each. This could be parallelized much further, which would probably be advisable once we are confident it provides sufficient utility.
The color coding for the comments relies on the % change between the baseline result and the targeted branch result. While it is purely a visual indicator and the numeric values are provided, it could bias decision making unless the boundaries between color codes are chosen smartly. These boundaries can be found in the dict `metric_tolerances` within `regressions-util.py` and should be updated so that only "significant" differences are strongly highlighted with color.
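For context, the tolerance-based coloring could be organized along these lines; the tolerance values and the helper function below are illustrative, with only the `metric_tolerances` name taken from the PR:

```python
# Hypothetical sketch: the real boundaries live in the metric_tolerances dict
# inside regressions-util.py and will differ from the values shown here.
metric_tolerances = {
    "accuracy": 1.0,           # % change treated as insignificant (illustrative)
    "balanced_accuracy": 1.0,
    "log_loss": 2.0,
}

def color_for(metric: str, pct_change: float, higher_is_better: bool = True) -> str:
    """Pick a color label for a metric's % change between baseline and target."""
    tolerance = metric_tolerances.get(metric, 1.0)
    if abs(pct_change) <= tolerance:
        return "neutral"
    improved = pct_change > 0 if higher_is_better else pct_change < 0
    return "better" if improved else "worse"
```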
Sample Comment with 1 Benchmark Dataset
Hello @eddiebergman,
We are doing regression tests for
Progress and Artifacts
A summary of the results will be shown in this comment once complete, but the full results will be available as an artifact at the above link.
Results
Overall the targeted version's performance across 2 task(s) and 7 metric(s):
There were 4 task(s) that could not be compared.
The average change for each metric is:
[Per-metric results table omitted: each row showed the target (T) value for one metric (0.780, 0.844, 0.710, 0.517, several nan entries, 33.873, -0.015 and 44.353) alongside a color-coded indicator on a scale from worse to better.]
Legend: B - Baseline || T - Target Version || Bold - Training Metric || / - Missing Value || --- - Missing Task || NaN and Neutral - shown with their own color indicators (images omitted).