
Add Dataset Groups #851

Merged (17 commits) on Oct 18, 2021

Conversation

@joernweissenborn (Member) commented Oct 8, 2021

Gist

This PR adds the feature to have multiple dataset groups in a model. This allows, for example, using non-negative least squares (nnls) on one (group of) dataset(s) and variable projection (varpro) on another (group of) dataset(s) within the same optimization. It is also now possible to only link the CLP of certain datasets instead of having to link all datasets.

Dataset Groups

This is achieved by introducing a new level of indirection called dataset groups. There are two new objects:

glotaran.model.DatasetGroupModel basically represents an entry in the model spec. It specifies whether to use nnls and whether to link the clp. Datasets refer to a group model via the new group property. This defaults to default, a group which is always added if it is not otherwise specified/overridden.

glotaran.model.DatasetGroup contains a DatasetGroupModel and all DatasetModels belonging to that group. These groups are generated by the model via Model.get_dataset_groups and are fed into the optimization; see below for details.
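
To make this concrete, here is a minimal, hypothetical usage sketch. Only DatasetGroupModel, DatasetGroup, the group property, and Model.get_dataset_groups are named in this PR; the attribute names in the sketch are assumptions, not the actual API.

# Hypothetical sketch, not code from this PR.
# `model` is assumed to be a glotaran Model parsed from a spec like
# the YAML example further below.
groups = model.get_dataset_groups()

# One DatasetGroup per entry in the spec's `dataset_groups` section;
# a group named "default" always exists unless overridden.
default_group = groups["default"]

# Each DatasetGroup pairs the group's settings (use nnls? link the clp?)
# with the DatasetModels whose `group` property points at it.
# These attribute names are assumptions:
print(default_group.model)           # the DatasetGroupModel
print(default_group.dataset_models)  # the DatasetModels in this group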

Changes to glotaran.analysis

To adapt the analysis package to work with dataset groups, a major refactoring of the Problem class was necessary.

Summary:

  • Problem has been renamed to OptimizationGroup
  • A new class glotaran.analysis.OptimizationGroupCalculator has been added to separate out the actual calculation of the common structures (matrices, residuals, clp, ...) for the linked/unlinked cases
  • The optimization now works on N optimization groups instead of a single problem, so various things needed to be adapted (e.g., the parameter history is now a variable instead of a member of the optimization group); a conceptual sketch follows below
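
For intuition, here is a highly simplified sketch of the new control flow. Apart from OptimizationGroup and Model.get_dataset_groups, the signatures and method names below are made up for illustration and do not mirror the real code in glotaran.analysis.optimize.

import numpy as np

from glotaran.analysis.optimization_group import OptimizationGroup

def optimize_sketch(scheme):
    # N optimization groups instead of a single Problem:
    optimization_groups = [
        OptimizationGroup(scheme, group)  # hypothetical signature
        for group in scheme.model.get_dataset_groups().values()
    ]

    # The parameter history is now a plain variable in the optimization,
    # no longer a member of the (former) Problem:
    parameter_history = []

    def objective(parameters):
        parameter_history.append(parameters)
        # Each group's calculator (linked or unlinked) computes that
        # group's share of the common structures (matrices, residuals,
        # clp, ...); `calculate_residual` is a hypothetical method name.
        residuals = [
            group.calculate_residual(parameters)
            for group in optimization_groups
        ]
        # All N residual vectors are stacked into the single vector
        # that the least-squares solver minimizes.
        return np.concatenate(residuals)

    return objective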

Example

Imagine a model like this:

...
dataset:
  d1:
    ... 
  d2:
    ...  
  d3:
    ...  
  d4:
    ...

Currently, these 4 datasets are linked in a way that you either use NNLS on all of them or on none. The same goes for linking the CLP: either all get linked or none.

With this PR one can do the following:

...
dataset_groups:
  default:
    residual_function: non_negative_least_squares
  mygroup:
    link_clp: true
dataset:
  d1:
    group: mygroup
    ... 
  d2:
    group: mygroup
    ...  
  d3:
    group: mygroup
    ...  
  d4:
    ...

This means that d1 to d3 will be linked and their residual will be calculated with varpro (the default). d4 doesn't specify a group and is thus part of default, which we override to use nnls instead of varpro.

Deprecation summary

  • Scheme
    -- group -> deprecated; now part of the model's dataset groups (a migration sketch follows after this list)
    -- group_tolerance -> renamed to clp_link_tolerance
  • Problem -> replaced by OptimizationGroup
    -- This was never public API
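
For users of the Scheme API, the migration looks roughly like this. A hedged sketch: only the group_tolerance -> clp_link_tolerance rename and the move of grouping into the model spec come from this PR; the remaining arguments and the tolerance value are illustrative.

from glotaran.project import Scheme

# `model`, `parameters`, and `data` stand in for the usual Scheme inputs.

# Before (deprecated): grouping and tolerance were configured on the Scheme.
# scheme = Scheme(model, parameters, data, group_tolerance=0.1)

# After: the tolerance argument is renamed, and grouping itself moves into
# the model spec (the `dataset_groups` section shown in the example above).
scheme = Scheme(model, parameters, data, clp_link_tolerance=0.1)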

Checklist

  • ✔️ Passing the tests (mandatory for all PRs)
  • 👌 Closes issue (mandatory for ✨ feature and 🩹 bug fix PRs)
  • 🧪 Adds new tests for the feature (mandatory for ✨ feature and 🩹 bug fix PRs)
  • 📚 Adds documentation of the feature

Closes issues

closes #XXXX

@joernweissenborn joernweissenborn requested review from jsnel, s-weigand and a team as code owners October 8, 2021 19:36
github-actions bot (Contributor) commented Oct 8, 2021

Binder 👈 Launch a binder notebook on branch joernweissenborn/pyglotaran/feature/datasetgroups

github-actions bot (Contributor) commented Oct 8, 2021

Benchmark is done. Check out the benchmark result page.
Benchmark differences below 5% might be due to CI noise.

Benchmark diff v0.4.1 vs. main

Parametrized benchmark signatures:

BenchmarkOptimize.time_optimize(index_dependent, grouped, weight)

All benchmarks:

       before           after         ratio
     [21ba272a]       [c24d1438]
     <v0.4.1>                   
         67.7±1ms         73.1±3ms     1.08  BenchmarkOptimize.time_optimize(False, False, False)
-         419±2ms         92.4±7ms     0.22  BenchmarkOptimize.time_optimize(False, False, True)
-        93.3±1ms         76.2±1ms     0.82  BenchmarkOptimize.time_optimize(False, True, False)
         95.5±1ms        94.9±20ms     0.99  BenchmarkOptimize.time_optimize(False, True, True)
+        69.8±2ms         92.6±2ms     1.33  BenchmarkOptimize.time_optimize(True, False, False)
-         421±3ms         99.9±3ms     0.24  BenchmarkOptimize.time_optimize(True, False, True)
       93.9±0.8ms         93.0±3ms     0.99  BenchmarkOptimize.time_optimize(True, True, False)
         92.5±1ms          105±4ms    ~1.14  BenchmarkOptimize.time_optimize(True, True, True)
             195M             192M     0.98  IntegrationTwoDatasets.peakmem_optimize
        5.62±0.2s        2.01±0.1s    ~0.36  IntegrationTwoDatasets.time_optimize

Benchmark diff main vs. PR

Parametrized benchmark signatures:

BenchmarkOptimize.time_optimize(index_dependent, grouped, weight)

All benchmarks:

       before           after         ratio
     [d1e36a93]       [c24d1438]
+      41.3±0.6ms         73.1±3ms     1.77  BenchmarkOptimize.time_optimize(False, False, False)
         53.2±1ms         92.4±7ms    ~1.74  BenchmarkOptimize.time_optimize(False, False, True)
         66.4±2ms         76.2±1ms    ~1.15  BenchmarkOptimize.time_optimize(False, True, False)
         79.5±9ms        94.9±20ms    ~1.19  BenchmarkOptimize.time_optimize(False, True, True)
+        53.9±2ms         92.6±2ms     1.72  BenchmarkOptimize.time_optimize(True, False, False)
       59.6±0.6ms         99.9±3ms    ~1.68  BenchmarkOptimize.time_optimize(True, False, True)
         83.1±4ms         93.0±3ms    ~1.12  BenchmarkOptimize.time_optimize(True, True, False)
        95.2±40ms          105±4ms    ~1.11  BenchmarkOptimize.time_optimize(True, True, True)
             192M             192M     1.00  IntegrationTwoDatasets.peakmem_optimize
       1.86±0.05s        2.01±0.1s     1.08  IntegrationTwoDatasets.time_optimize

codecov bot commented Oct 9, 2021

Codecov Report

Merging #851 (c24d143) into main (d1e36a9) will increase coverage by 0.1%.
The diff coverage is 93.1%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main    #851     +/-   ##
=======================================
+ Coverage   84.5%   84.7%   +0.1%     
=======================================
  Files         79      81      +2     
  Lines       4522    4581     +59     
  Branches     826     848     +22     
=======================================
+ Hits        3824    3882     +58     
- Misses       556     558      +2     
+ Partials     142     141      -1     
Impacted Files Coverage Δ
glotaran/analysis/optimization_group.py 87.6% <78.2%> (ø)
glotaran/model/model.py 84.9% <92.3%> (+1.7%) ⬆️
glotaran/analysis/optimize.py 86.4% <92.8%> (+1.4%) ⬆️
glotaran/project/scheme.py 92.5% <93.1%> (+1.7%) ⬆️
...analysis/optimization_group_calculator_unlinked.py 94.3% <96.5%> (ø)
...n/analysis/optimization_group_calculator_linked.py 96.5% <96.7%> (ø)
glotaran/analysis/optimization_group_calculator.py 100.0% <100.0%> (ø)
glotaran/model/__init__.py 100.0% <100.0%> (ø)
glotaran/model/dataset_group.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d1e36a9...c24d143. Read the comment docs.

@s-weigand (Member) commented:

🗑️ Removed arguments:

  • Scheme.non_negative_least_squares
  • Scheme.group_tolerance

Those need to be properly deprecated!

This is mainly a reminder that we need to properly deprecate the missing attributes.
s-weigand and others added 3 commits October 15, 2021 01:11
This bug led to result creation crashing because of missing labels.

Co-authored-by: Jörn Weißenborn <[email protected]>
It was nice to see how much time and memory the result creation needed compared to the whole optimization, but loading a pickled OptimizeResult wasn't nice from the start.
With the changes in this PR and the overhead of keeping benchmarks working across versions, IMHO the extra information about result creation details isn't worth it.
@s-weigand (Member) commented:

@joernweissenborn please add tests for cc4beb9 yourself (#clean-up-your-own-code-booboo 😛 ).

@s-weigand (Member) left a comment:

Some changes are required for this PR to be merged, others can be postponed to additional cleanup PRs.

Mandatory changes are:

  • The deprecations and their tests need to be fixed.
  • The DatasetGroupIndexModel naming issue
  • More helpful error message for wrong residual_function

Files with review comments:
  • glotaran/analysis/optimization_group.py
  • glotaran/analysis/optimization_group_calculator_linked.py
  • glotaran/analysis/optimize.py
  • glotaran/analysis/test/test_optimization_group.py
  • glotaran/analysis/test/test_penalties.py
  • glotaran/project/scheme.py
  • glotaran/project/test/test_scheme.py
sourcery-ai bot (Contributor) commented Oct 18, 2021

Sourcery Code Quality Report

❌  Merging this PR will decrease code quality in the affected files by 3.43%.

Quality metrics Before After Change
Complexity 3.95 ⭐ 4.15 ⭐ 0.20 👎
Method Length 74.56 🙂 87.99 🙂 13.43 👎
Working memory 9.10 🙂 9.53 🙂 0.43 👎
Quality 64.90% 🙂 61.47% 🙂 -3.43% 👎
Other metrics Before After Change
Lines 4009 2642 -1367
Changed files Quality Before Quality After Quality Change
benchmark/benchmarks/integration/ex_two_datasets/benchmark.py 85.49% ⭐ 89.66% ⭐ 4.17% 👍
glotaran/analysis/optimize.py 46.86% 😞 39.79% 😞 -7.07% 👎
glotaran/analysis/test/test_constraints.py 57.15% 🙂 57.48% 🙂 0.33% 👍
glotaran/analysis/test/test_grouping.py 50.01% 🙂 49.67% 😞 -0.34% 👎
glotaran/analysis/test/test_optimization.py 34.60% 😞 34.53% 😞 -0.07% 👎
glotaran/analysis/test/test_penalties.py 64.87% 🙂 67.27% 🙂 2.40% 👍
glotaran/analysis/test/test_relations.py 57.85% 🙂 58.25% 🙂 0.40% 👍
glotaran/builtin/io/yml/test/test_save_model.py 94.31% ⭐ 94.31% ⭐ 0.00%
glotaran/builtin/io/yml/test/test_save_scheme.py 78.73% ⭐ 78.73% ⭐ 0.00%
glotaran/builtin/megacomplexes/decay/test/test_decay_megacomplex.py 59.16% 🙂 59.54% 🙂 0.38% 👍
glotaran/deprecation/modules/test/__init__.py 80.32% ⭐ 75.41% ⭐ -4.91% 👎
glotaran/deprecation/modules/test/test_project_scheme.py 71.23% 🙂 75.33% ⭐ 4.10% 👍
glotaran/model/__init__.py % % %
glotaran/model/model.py 73.26% 🙂 72.40% 🙂 -0.86% 👎
glotaran/model/test/test_model.py 70.89% 🙂 72.38% 🙂 1.49% 👍
glotaran/project/scheme.py 89.88% ⭐ 72.83% 🙂 -17.05% 👎
glotaran/project/test/test_scheme.py 80.82% ⭐ 80.83% ⭐ 0.01% 👍
glotaran/test/test_spectral_decay.py 51.45% 🙂 53.88% 🙂 2.43% 👍
glotaran/test/test_spectral_decay_full_model.py 55.21% 🙂 55.92% 🙂 0.71% 👍
glotaran/test/test_spectral_penalties.py 46.60% 😞 45.01% 😞 -1.59% 👎

Here are some functions in these files that still need a tune-up:

File Function Complexity Length Working Memory Quality Recommendation
glotaran/analysis/optimize.py _create_result 17 🙂 279 ⛔ 26 ⛔ 23.55% ⛔ Try splitting into smaller methods. Extract out complex expressions
glotaran/analysis/test/test_optimization.py test_optimization 19 😞 514 ⛔ 15 😞 25.11% 😞 Refactor to reduce nesting. Try splitting into smaller methods. Extract out complex expressions
glotaran/test/test_spectral_penalties.py test_equal_area_penalties 5 ⭐ 783 ⛔ 17 ⛔ 34.81% 😞 Try splitting into smaller methods. Extract out complex expressions
glotaran/analysis/optimize.py optimize 5 ⭐ 163 😞 17 ⛔ 46.65% 😞 Try splitting into smaller methods. Extract out complex expressions
glotaran/analysis/test/test_grouping.py test_multi_dataset_overlap 0 ⭐ 485 ⛔ 12 😞 46.67% 😞 Try splitting into smaller methods. Extract out complex expressions

Legend and Explanation

The emojis denote the absolute quality of the code:

  • ⭐ excellent
  • 🙂 good
  • 😞 poor
  • ⛔ very poor

The 👍 and 👎 indicate whether the quality has improved or gotten worse with this pull request.


Please see our documentation here for details on how these metrics are calculated.

We are actively working on this report - lots more documentation and extra metrics to come!

Help us improve this quality report!

sonarcloud bot commented Oct 18, 2021

Kudos, SonarCloud Quality Gate passed!

  • Bugs: 0 (rating A)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 5 (rating A)

No coverage information
1.5% duplication

@s-weigand (Member) left a comment:

I added some fix commits; from my side, this is ok to merge.
@joernweissenborn and @jsnel Please have a look at my changes.

@joernweissenborn (Member, Author) commented:

lgtm, hit that merge button :)

@jsnel (Member) left a comment:

Approved. Some minor issues to be picked up in a subsequent cleanup / re-naming things PR.
