Regression testing #1233
Conversation
Codecov Report
|          | development | #1233  | +/-    |
| -------- | ----------- | ------ | ------ |
| Coverage | 88.11%      | 88.22% | +0.10% |
| Files    | 139         | 139    |        |
| Lines    | 10993       | 11037  | +44    |
| Hits     | 9687        | 9737   | +50    |
| Misses   | 1306        | 1300   | -6     |
Continue to review full report at Codecov.
Some minor comments, but I guess this mostly looks good.
Don't merge yet, I'm testing on my fork to see that it actually works. I will let you know if it's functional.
Okay, it seems to work correctly now.
Regression Testing
This PR implements a GitHub workflow to allow for some basic regression testing of auto-sklearn, tracking performance between new PRs that come into the development branch. The results are currently stored as 90-day artifacts, the maximum retention allowed by GitHub's upload-artifact action.
The testing is done on 3 classification and 3 regression datasets, which can be seen in `.github/workflows/benchmarking-files/benchmarks/{classification/regression}.yaml`. The benchmark for each lasts ~5h20m, with 10 folds run on each dataset. Longer benchmarks would run into the 6 hour limit of GitHub Actions runners.
For comparison, the mean values over the folds are used, with the results stored as artifacts located on the same page as the workflow run that executes the tests. A comment is also generated on the PR to give a graphical overview.
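A minimal sketch of the mean-over-folds aggregation described above, assuming each benchmark run writes a CSV with one row per fold (the file layout and column names here are assumptions, not the actual result format used by the workflows):

```python
# Sketch only: assumes a results CSV with a "task" column and one column per
# metric, one row per fold. The real benchmark output format may differ.
import pandas as pd

def aggregate_folds(results_csv: str) -> pd.DataFrame:
    """Collapse per-fold results into a single mean value per task and metric."""
    results = pd.read_csv(results_csv)
    # Mean over the 10 folds for every metric column, grouped by dataset/task
    return results.groupby("task").mean(numeric_only=True)

# e.g. aggregate_folds("classification_results.csv").to_csv("baseline_classification.csv")
```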
Workflows
- `generate-baselines.yml` - On push to the development branch, an automatic workflow creates two baseline files, one for regression and one for classification. This means the latest development branch is always used as the performance benchmark.
- `regressions.yml` - Labeling a PR with 'regression-testing' triggers this workflow, which uses the same datasets as `generate-baselines.yml` to enable a comparison. It pulls the latest successful results from `generate-baselines` and then performs a comparison between them. `regressions` will take the last successful baselines that were generated, even if they are not up to date; this behavior can be changed if required so that `regressions` fails when the latest `generate-baselines` run has failed (see the docs of `download-workflow-artifact`). The relevant files for benchmarking are pulled down from `.github/workflows/benchmarking-files`.
- `regressions-util.py` - This utility file is used in `regressions.yaml` to perform comparisons between the baseline and the targeted branch results. It also provides utilities for generating the markdown used in the comment body that gives an overview of the results (see the sketch after this list).
Future Considerations
The 90 day time limit is possibly prohibitive and the results must be downloaded. The alternatives to the current solution are:
Seeing as most result runs will not be of significant interest to keep long-term, the 90 day limit is not overly prohibitive as they can be rerun if really required in the future.
The current benchmark is limited to 3 regression and 3 classification datasets. These are run across 2 GitHub runners, allowing for a cumulative 12 hours of benchmarking time spread across 2 servers running for 6 hours each. This could be parallelized much further, which would probably be advisable once we are confident it provides sufficient utility.
The color coding for the comments relies on the % change between the baseline result and the targeted branch result. While it is purely a visual indicator and the numeric values are provided, it could bias decision making unless the boundaries between color codes are chosen smartly. These boundaries can be found in the dict `metric_tolerances` within `regressions-util.py` and should be updated so that only "significant" differences are strongly highlighted with color.
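For context, the tolerance-based coloring could be organized along these lines; the tolerance values and the helper function below are illustrative, with only the `metric_tolerances` name taken from the PR:

```python
# Hypothetical sketch: the real boundaries live in the metric_tolerances dict
# inside regressions-util.py and will differ from the values shown here.
metric_tolerances = {
    "accuracy": 1.0,           # % change treated as insignificant (illustrative)
    "balanced_accuracy": 1.0,
    "log_loss": 2.0,
}

def color_for(metric: str, pct_change: float, higher_is_better: bool = True) -> str:
    """Pick a color label for a metric's % change between baseline and target."""
    tolerance = metric_tolerances.get(metric, 1.0)
    if abs(pct_change) <= tolerance:
        return "neutral"
    improved = pct_change > 0 if higher_is_better else pct_change < 0
    return "better" if improved else "worse"
```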
Sample Comment with 1 Benchmark Dataset
Hello @eddiebergman,
We are doing regression tests for
Progress and Artifacts
A summary of the results will be shown in this comment once complete, but the full results will be available as an artifact at the above link.
Results
Overall the targeted version's performance across 2 task(s) and 7 metric(s):
There were 4 task(s) that could not be compared.
The average change for each metric is:
[Per-metric results table omitted: each row showed the target (T) value for one metric (0.780, 0.844, 0.710, 0.517, several nan entries, 33.873, -0.015 and 44.353) alongside a color-coded indicator on a scale from worse to better.]
Legend: B - Baseline || T - Target Version || Bold - Training Metric || / - Missing Value || --- - Missing Task || NaN and Neutral - shown with their own color indicators (images omitted).