
πŸ§ͺπŸš‡ Add integration test result validation #754

Merged
merged 29 commits into from
Sep 5, 2021

Conversation

s-weigand
Member

@s-weigand s-weigand commented Jul 17, 2021

Right now the integration tests only show breaking API changes, but checking validity is still a manual task.
This PR adds a way to compare the results to a "gold standard" (tag, branch, commit) set by this workflow on the pyglotaran-examples repo.

Change summary

  • Adds numerical comparison to "gold standard" results defined by pyglotaran-examples

Checklist

  • βœ”οΈ Passing the tests (mandatory for all PR's)
  • πŸ§ͺ Adds new tests for the feature (mandatory for ✨ feature and 🩹 bug fix PR's)

Closes issues

closes #753

@s-weigand s-weigand requested a review from a team as a code owner July 17, 2021 18:15
@github-actions
Contributor

Binder πŸ‘ˆ Launch a binder notebook on branch s-weigand/pyglotaran/compare-results

@s-weigand s-weigand added the Type: Tooling Tools used for the project (CI, CD, docs etc.) label Jul 17, 2021
@codecov

codecov bot commented Jul 17, 2021

Codecov Report

Merging #754 (dd6ae0c) into staging (2a6b3e6) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           staging    #754   +/-   ##
=======================================
  Coverage     84.5%   84.5%           
=======================================
  Files           75      75           
  Lines         4200    4200           
  Branches       756     756           
=======================================
  Hits          3549    3549           
  Misses         518     518           
  Partials       133     133           


@s-weigand
Member Author

For this workflow to pass we might need to fix how problem.save_parameters_for_history is implemented in optimize and make a fix release 0.4.1, since, according to @joernweissenborn, it is implemented incorrectly πŸ€·β€β™€οΈ.
At least we now have a way to compare the results across versions.

@s-weigand s-weigand marked this pull request as draft July 17, 2021 19:22
@s-weigand s-weigand force-pushed the compare-results branch 2 times, most recently from 12a381c to 17c55bb on July 20, 2021 21:45
Member

@jsnel jsnel left a comment


LGTM

@s-weigand s-weigand marked this pull request as ready for review August 8, 2021 18:00
@jsnel jsnel linked an issue Aug 13, 2021 that may be closed by this pull request
@s-weigand
Member Author

The clp_label swapping for ex_two_datasets in this CI run
might be related to the implementation of involved_compartments

compartments = list(set(compartments))

Since set does not preserve order.

Using frozenset might solve this, or, better, using the user-provided order from the defined model (not sure if we still have access to that at this point).
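An order-preserving deduplication can be done with a plain dict, since dicts keep insertion order in Python 3.7+. A minimal sketch (the helper name is hypothetical, not pyglotaran API):

```python
def unique_preserving_order(items):
    """Deduplicate while keeping the first-seen (user-defined) order."""
    # dict preserves insertion order; duplicate keys collapse to the first occurrence.
    return list(dict.fromkeys(items))


compartments = ["s1", "s3", "s1", "s2", "s3"]
print(unique_preserving_order(compartments))  # ['s1', 's3', 's2']
# list(set(...)) gives no ordering guarantee; for strings the order can even
# vary between interpreter runs due to hash randomization.
```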

@s-weigand s-weigand force-pushed the compare-results branch 2 times, most recently from dc6bfb9 to bb7c55e on August 15, 2021 17:37
@s-weigand s-weigand marked this pull request as draft August 16, 2021 00:52
@s-weigand
Member Author

s-weigand commented Aug 16, 2021

@jsnel and I did a long debugging session today where we found the following:

  • equal area penalties are applied differently (compartment order) between 0.4.1 and staging (failing ex_two_datasets and possibly simultaneous_analysis_6d_disp); because clps and clp_labels are not aligned, the areas of the wrong spectra are calculated (in cases where e.g. s1,s2,s3 -> s1,s3,s2 or similar)
  • the missing coordinate spectral is due to different code paths when auto-determining index dependency (UngroupedProblem.create_index_dependent_result_dataset vs. UngroupedProblem.create_index_independent_result_dataset); this also seems somewhat random, or maybe I messed the tests up (Missing coordinate: 'spectral' in 'dataset1.nc', data_var 'matrix' vs. spectral apparently being present)
  • problems in simultaneous_analysis_6d_disp might be due to missing noise and/or the differences in how weights are applied
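The misalignment described in the first bullet can be sketched as follows: if the clp columns come out in a different label order than the model defines, penalty areas get computed for the wrong spectra unless the columns are reordered first. A minimal illustration (the helper name and array shapes are hypothetical, not pyglotaran API):

```python
import numpy as np


def align_clps(clps, clp_labels, reference_labels):
    """Reorder clp columns to match a reference label order (hypothetical helper)."""
    index = [clp_labels.index(label) for label in reference_labels]
    return clps[:, index]


clps = np.array([[1.0, 2.0, 3.0]])  # columns correspond to clp_labels below
clp_labels = ["s1", "s3", "s2"]     # order as produced by e.g. list(set(...))
reference = ["s1", "s2", "s3"]      # user-defined model order
aligned = align_clps(clps, clp_labels, reference)
print(aligned)  # [[1. 3. 2.]] -- without this step, s2/s3 areas would be swapped
```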

@s-weigand
Member Author

Updated the "gold standard".

Error summary:

  • optimized_parameters (simultaneous_analysis_6d_disp)
  • matrix (simultaneous_analysis_6d_disp)
  • weighted_residual (simultaneous_analysis_3d_weight) ~1e-8
  • weighted_residual (ex_two_datasets) ~1e-8
  • weighted_residual (simultaneous_analysis_3d_nodisp) ~1e-8

s-weigand and others added 19 commits September 4, 2021 02:57
Testing 0.4.1 against 0.4.1: results should actually be identical.
For further quantification of difference use:
```python
import numpy as np

diff = current_data.data - expected_var_value.data
diff_abs_sum = np.sum(np.abs(diff))
if diff_abs_sum > 0:
    print(f"{expected_var_name:<42s}: {diff_abs_sum:.4g}")
```
Previously, the cwd of the subprocess was changed, which might have led to problems when using the debugger; this method should be safer and not mess with the pyglotaran repository itself.
…_dataset'

Currently this example is under review for changes and shows numerical instability.
Ref:
glotaran/pyglotaran-examples PR42
This example needs to be reviewed since it shows numerical instability.
weighted_data were always calculated and now will only be calculated when weights are applied
This makes more sensible use of error tolerances for floating-point inaccuracy and takes the scaling of the SVD vectors with their singular values into account.
This way we can see failing tests for all data_vars, instead of a single failing test aborting the loop and hiding other failing tests.
Using the environment variable COMPARE_RESULTS_LOCAL, which can be set in tox.ini, a local path can be specified in which to look for the results folder to use as a reference, in lieu of the (gold standard) comparison-results branch in the pyglotaran-examples repository.
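The selection logic described above can be sketched roughly as follows (function name and fallback path are hypothetical; only the COMPARE_RESULTS_LOCAL variable name comes from the commit message):

```python
import os
from pathlib import Path


def reference_results_path(fallback_clone_dir="comparison-results"):
    """Pick the reference results folder: a local override or the cloned branch."""
    local = os.environ.get("COMPARE_RESULTS_LOCAL")
    if local is not None:
        path = Path(local)
        if not path.exists():
            raise FileNotFoundError(
                f"COMPARE_RESULTS_LOCAL points to a missing path: {path}"
            )
        return path
    # Fall back to the checkout of the comparison-results branch.
    return Path(fallback_clone_dir)
```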

Added some comments about which tolerances are used and their origin.
The spectral coordinate is (logically) missing from some data variables for the (right) singular vectors if they are transposed, *but* this now throws a warning.
@sonarcloud

sonarcloud bot commented Sep 5, 2021

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information
0.0% Duplication

@github-actions
Contributor

github-actions bot commented Sep 5, 2021

Benchmark is done. Check out the benchmark result page.
Benchmark differences below 5% might be due to CI noise.

Benchmark diff

Parametrized benchmark signatures:

BenchmarkOptimize.time_optimize(index_dependent, grouped, weight)

All benchmarks:

       before           after         ratio
     [dc00e6da]       [dd6ae0c4]
     <v0.4.0>                   
-        68.9Β±1ms       49.2Β±0.7ms     0.71  BenchmarkOptimize.time_optimize(False, False, False)
-         375Β±4ms         62.9Β±2ms     0.17  BenchmarkOptimize.time_optimize(False, False, True)
         97.1Β±3ms         81.5Β±2ms    ~0.84  BenchmarkOptimize.time_optimize(False, True, False)
         97.4Β±2ms         92.1Β±6ms     0.95  BenchmarkOptimize.time_optimize(False, True, True)
         68.1Β±1ms       67.3Β±0.7ms     0.99  BenchmarkOptimize.time_optimize(True, False, False)
-         372Β±3ms        78.7Β±50ms     0.21  BenchmarkOptimize.time_optimize(True, False, True)
         96.9Β±3ms          105Β±2ms     1.08  BenchmarkOptimize.time_optimize(True, True, False)
         99.2Β±2ms         123Β±40ms    ~1.24  BenchmarkOptimize.time_optimize(True, True, True)
             182M             179M     0.98  IntegrationTwoDatasets.peakmem_create_result
             197M             197M     1.00  IntegrationTwoDatasets.peakmem_optimize
-         304Β±5ms         261Β±10ms     0.86  IntegrationTwoDatasets.time_create_result
        6.80Β±0.1s       2.21Β±0.07s    ~0.32  IntegrationTwoDatasets.time_optimize

@jsnel jsnel marked this pull request as ready for review September 5, 2021 15:01
@jsnel jsnel merged commit 5238409 into glotaran:staging Sep 5, 2021
@jsnel jsnel deleted the compare-results branch September 5, 2021 18:59
jsnel added a commit that referenced this pull request Sep 16, 2021
* πŸ§ͺ Added result consistency test script

* πŸš‡ Added result consistency testing to the integration test workflow

* πŸ‘Œ Propagate git interaction errors to inform users

* πŸ”§ Add `pytest-allclose` for more useful error reporting

* πŸ‘Œ Use colored pytest output for better readability

* πŸ‘Œ Show the commits used to create the results

* πŸ‘Œ Added tests for optimized parameters

* 🧹 Renamed variables used for comparing to expected and current

* πŸ‘Œ Print up to 20 different values if "allclose" fails

* Adjust tolerances to pass on CI if reference and current are the same

* πŸ‘Œ Make git interactions safer by passing the folder to use to git

* πŸ‘Œ Implemented 'EXAMPLE_BLOCKLIST' and added 'transient_absorption_two_dataset'

* πŸ‘Œ Only lower absolute tolerance for SVD comparison

* πŸ‘Œ Added 'ex_spectral_guidance' to EXAMPLE_BLOCKLIST

* πŸ“š Added instructions to locally run result consistency test

* πŸ‘Œ Added dataset file name to report on failing test

* πŸ‘Œ Added data_var name to error reporting

* πŸ‘Œ Special cased missing weighted_data

* πŸ‘Œ Added label displaying on test_result_parameter_consistency fail

* 🩹 Fixed line length issue

* πŸ‘Œ Use value difference to check data_vars

* πŸ‘Œ Swap abs_diff and float_resolution, that way rtol has some effect

* πŸ‘Œ Use float32 precision as absolute tolerance

* πŸ‘Œ Improved error reporting on failure by showing mean difference

* ♻️ Refactored data_var tests not to run in a loop but using fixtures

* πŸ‘Œ Made epsilon for residual scale with original data

* πŸ‘Œ Add option to specify path for local comparison

* πŸ‘Œ Allow missing coords in some variables

* 🩹 Reorder dimensions before comparison

The same PR was merged to the v0.4.1 maintenance branch for comparison, see #760 
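The tolerance-related commits above ("Use float32 precision as absolute tolerance", "Swap abs_diff and float_resolution, that way rtol has some effect") can be illustrated with a minimal sketch; the function name and the rtol value are assumptions, and the actual PR uses pytest-allclose rather than this hand-rolled check:

```python
import numpy as np

# float32 machine epsilon as the absolute tolerance floor (per the commit messages);
# the relative term scales with the expected values so large values get
# proportionally more slack.
ATOL = float(np.finfo(np.float32).eps)  # ~1.19e-07


def assert_data_var_close(current, expected, rtol=1e-5):
    """Hypothetical comparison helper: fail if any element exceeds atol + rtol*|expected|."""
    diff = np.abs(current - expected)
    tol = ATOL + rtol * np.abs(expected)
    if not np.all(diff <= tol):
        worst = float(np.max(diff - tol))
        raise AssertionError(f"data_var differs; worst excess over tolerance: {worst:.4g}")


# Differences around 1e-8 (as in the weighted_residual errors above) pass:
assert_data_var_close(np.array([1.0, 2.0]), np.array([1.0, 2.0 + 1e-8]))
```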



Co-authored-by: Joris Snellenburg <[email protected]>
Labels
Type: Tooling Tools used for the project (CI, CD, docs etc.)
Successfully merging this pull request may close these issues.

πŸš‡Add result validation step to CI