Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Add parameter IO support for more formats supported by pandas #896

Merged
merged 32 commits into from
Jan 30, 2022

Conversation

patrickhaetti
Copy link
Contributor

@patrickhaetti patrickhaetti commented Nov 10, 2021

Change summary

  • added xlsx & ods parameter:
    • added xlsx.py in builtin/io/pandas. Replace inf / -inf with NaN when saving and replace NaN with inf in 'maximum' and -inf in 'minimum 'column of Dataframe ✨ Use ∞ instead of inf in excel parameter files #888
    • setup.cfg: add xlsx = glotaran.builtin.io.pandas.xlsx
    • added test_pandas_parameters.py and add 'data' subfolder which includes parameter.xlsx and parameter.yaml
    • added tests with parameter files *.xlsx and *.ods
    • added openpyxl>=3.0.9 to setup.cfg / install_requires
    • added odfpy>=1.4.1 to setup.cfg / install_requires
    • add openpyxl==3.0.9 to requirements_dev.txt to read/write xlsx
  • Added test files to pandas/tests/data with parameter.yaml and create csv, tsv, xlsx parameter files including one type of each parameter definition.

  • added tsv parameter:

    • add tsv.py in builtin/io/pandas for tab-separated-value parameter
    • setup.cfg: add tsv = glotaran.builtin.io.pandas.tsv
    • added tsv.py to builtin/io/pandas.
    • added parameter.tsv in test/data

Checklist

  • ✔️ Passing the tests (mandatory for all PR's)
  • 👌 Closes issue (mandatory for ✨ feature and 🩹 bug fix PR's)
  • 🧪 Adds new tests for the feature (mandatory for ✨ feature and 🩹 bug fix PR's)
  • 📚 Adds documentation of the feature

@patrickhaetti patrickhaetti requested review from s-weigand and a team as code owners November 10, 2021 18:05
@github-actions
Copy link
Contributor

Binder 👈 Launch a binder notebook on branch patrickhaetti/pyglotaran/create-xlsx-parameter

@s-weigand s-weigand changed the title Create xlsx parameter #844 Add parameter IO support for more formats supported by pandas Nov 10, 2021
@codecov
Copy link

codecov bot commented Nov 11, 2021

Codecov Report

Merging #896 (9602f53) into main (491d5f2) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main    #896     +/-   ##
=======================================
+ Coverage   85.6%   85.8%   +0.1%     
=======================================
  Files         90      92      +2     
  Lines       4909    4959     +50     
  Branches     927     933      +6     
=======================================
+ Hits        4207    4258     +51     
  Misses       552     552             
+ Partials     150     149      -1     
Impacted Files Coverage Δ
glotaran/builtin/io/pandas/csv.py 100.0% <100.0%> (ø)
glotaran/builtin/io/pandas/tsv.py 100.0% <100.0%> (ø)
glotaran/builtin/io/pandas/xlsx.py 100.0% <100.0%> (ø)
glotaran/parameter/parameter_group.py 88.1% <100.0%> (ø)
glotaran/utils/io.py 100.0% <100.0%> (ø)
glotaran/parameter/parameter.py 95.9% <0.0%> (+0.3%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 491d5f2...9602f53. Read the comment docs.

Copy link
Member

@s-weigand s-weigand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution, but there are some changes needed (see review).

In addition:

  • It would also be useful to have a *.csv file for testing that it looks like it should.
  • Having an option for writing to csv e.g. strip_default=True would be useful when csv files are edited in excel (to reduce code duplication best factor the replacing out in a separate function which then can be unit tested).
  • Please also add a plugin for tsv (tab separated values) the only difference to csv should be sep=+\t in the pandas method.

@jsnel @joernweissenborn

  • What do you think about adding openpyxl as a runtime dependency?
  • What do you think about setting all default parameter values to empty strings, this would reduce cognitive load and make parameter files easier to read.

glotaran/builtin/io/pandas/test/data/parameter.yaml Outdated Show resolved Hide resolved
glotaran/builtin/io/pandas/test/test_pandas_parameters.py Outdated Show resolved Hide resolved
glotaran/builtin/io/pandas/test/test_pandas_parameters.py Outdated Show resolved Hide resolved
requirements_dev.txt Show resolved Hide resolved
@joernweissenborn
Copy link
Member

Thanks for the contribution, but there are some changes needed (see review).

In addition:

  • It would also be useful to have a *.csv file for testing that it looks like it should.
  • Having an option for writing to csv e.g. strip_default=True would be useful when csv files are edited in excel (to reduce code duplication best factor the replacing out in a separate function which then can be unit tested).
  • Please also add a plugin for tsv (tab separated values) the only difference to csv should be sep=+\t in the pandas method.

@jsnel @joernweissenborn

  • What do you think about adding openpyxl as a runtime dependency?
  • What do you think about setting all default parameter values to empty strings, this would reduce cognitive load and make parameter files easier to read.

I am ok with both.

@jsnel
Copy link
Member

jsnel commented Nov 17, 2021

  • What do you think about adding openpyxl as a runtime dependency?
  • What do you think about setting all default parameter values to empty strings, this would reduce cognitive load and make parameter files easier to read.

Yes to both.

@joernweissenborn
Copy link
Member

@patrickhaetti Looking at the commit mess I suggest you close this PR and and open a new one, with clean commit history. I think @glotaran/maintainers would also prefer it.

@s-weigand
Copy link
Member

@patrickhaetti Looking at the commit mess I suggest you close this PR and and open a new one, with clean commit history. I think @glotaran/maintainers would also prefer it.

Let's keep the PR but I will remove the mess 😉

@s-weigand s-weigand force-pushed the create-xlsx-parameter branch from f2da31e to 631dca8 Compare November 19, 2021 11:04
@github-actions
Copy link
Contributor

github-actions bot commented Nov 19, 2021

Benchmark is done. Checkout the benchmark result page.
Benchmark differences below 5% might be due to CI noise.

Benchmark diff v0.5.1 vs. main

Parametrized benchmark signatures:

BenchmarkOptimize.time_optimize(index_dependent, grouped, weight)

All benchmarks:

       before           after         ratio
     [96b42630]       [9602f531]
     <v0.5.1>                   
       51.9±0.3ms       52.1±0.4ms     1.01  BenchmarkOptimize.time_optimize(False, False, False)
         56.7±6ms       56.5±0.4ms     1.00  BenchmarkOptimize.time_optimize(False, False, True)
       52.1±0.3ms       52.5±0.3ms     1.01  BenchmarkOptimize.time_optimize(False, True, False)
       55.5±0.3ms        59.0±10ms     1.06  BenchmarkOptimize.time_optimize(False, True, True)
       62.8±0.4ms       63.9±0.7ms     1.02  BenchmarkOptimize.time_optimize(True, False, False)
        66.5±30ms        71.4±10ms     1.07  BenchmarkOptimize.time_optimize(True, False, True)
       63.0±0.4ms       64.1±0.4ms     1.02  BenchmarkOptimize.time_optimize(True, True, False)
        68.3±30ms        67.4±10ms     0.99  BenchmarkOptimize.time_optimize(True, True, True)
             200M             203M     1.01  IntegrationTwoDatasets.peakmem_optimize
       1.75±0.05s       1.65±0.08s     0.94  IntegrationTwoDatasets.time_optimize

Benchmark diff main vs. PR

Parametrized benchmark signatures:

BenchmarkOptimize.time_optimize(index_dependent, grouped, weight)

All benchmarks:

       before           after         ratio
     [491d5f24]       [9602f531]
        54.0±20ms       52.1±0.4ms     0.97  BenchmarkOptimize.time_optimize(False, False, False)
       55.3±0.5ms       56.5±0.4ms     1.02  BenchmarkOptimize.time_optimize(False, False, True)
       51.5±0.3ms       52.5±0.3ms     1.02  BenchmarkOptimize.time_optimize(False, True, False)
         56.0±2ms        59.0±10ms     1.05  BenchmarkOptimize.time_optimize(False, True, True)
       62.7±0.5ms       63.9±0.7ms     1.02  BenchmarkOptimize.time_optimize(True, False, False)
        69.7±40ms        71.4±10ms     1.02  BenchmarkOptimize.time_optimize(True, False, True)
       63.0±0.2ms       64.1±0.4ms     1.02  BenchmarkOptimize.time_optimize(True, True, False)
        67.3±30ms        67.4±10ms     1.00  BenchmarkOptimize.time_optimize(True, True, True)
             200M             203M     1.01  IntegrationTwoDatasets.peakmem_optimize
       1.75±0.08s       1.65±0.08s     0.94  IntegrationTwoDatasets.time_optimize

@s-weigand
Copy link
Member

@patrickhaetti I cleaned up the commit history please make sure to reset your local branch to the remote state.

E.g. if your remote is called origin

git reset --hard orgin/create-xlsx-parameter

If you have uncommitted changes just stash them before (git stash) and pop the stash afterward (git stash pop).
If you have local commits you didn't push yet just create a branch at the current state(git branch save-xlsx-branch) and cherry-pick the commits over after you reset to the remote branch.

@patrickhaetti
Copy link
Contributor Author

@s-weigand thank you for cleaning up, looks way more better now

@s-weigand s-weigand added this to the v0.6.0 milestone Nov 23, 2021
@s-weigand
Copy link
Member

I will implement the strip_default=True functionality on the parameter level, since we will also need this for yaml round trips.

@s-weigand s-weigand force-pushed the create-xlsx-parameter branch 3 times, most recently from 0467af6 to a44cb82 Compare December 1, 2021 08:38
@jsnel jsnel force-pushed the create-xlsx-parameter branch from a44cb82 to 04b5a0e Compare December 1, 2021 23:17
@jsnel
Copy link
Member

jsnel commented Dec 12, 2021

In this PR result comparison was failing on a singular point in the residual. This is not a problem with this PR but reason to further improve the github actions that does the results comparison, see issue #937 which will be fixed in addressed in #937.

FAILED .github/test_result_consistency.py::test_result_data_var_consistency[study_transient_absorption_target_analysis-residual]

E AssertionError: Result data_var data mismatch: 'residual' in 'dataset1.nc'.
E With sum of absolute difference: 2.1183739743033033e-07 and shape: (335, 346)
E Mean difference: 1.827602428007336e-12
E
E assert False
E + where False = <function allclose.._allclose at 0x7efc69aaa280>(<xarray.DataArray 'residual' (time: 335, spectral: 346)>\narray([[-4.030000e-03, 1.100000e-04, -6.014408e-15, ..., 2....95 0.02343 0.04457 ... 869.9 889.9 909.9\n * spectral (spectral) float64 378.0 379.4 380.9 382.3 ... 851.8 853.1 854.4, <xarray.DataArray 'residual' (time: 335, spectral: 346)>\narray([[-4.030000e-03, 1.100000e-04, -6.017322e-15, ..., 2....95 0.02343 0.04457 ... 869.9 889.9 909.9\n * spectral (spectral) float64 378.0 379.4 380.9 382.3 ... 851.8 853.1 854.4, atol=array([[1.934e-10, 1.934e-10, 1.934e-10, ..., 1.934e-10, 1.934e-10,\n 1.934e-10],\n [1.934e-10, 1.934e-10, ...1.934e-10,\n 1.934e-10],\n [1.934e-10, 1.934e-10, 1.934e-10, ..., 1.934e-10, 1.934e-10,\n 1.934e-10]]), rtol=1e-05, print_fail=20)

.github/test_result_consistency.py:221: AssertionError
----------------------------- Captured stdout call -----------------------------
allclose first 1 failures:
(87, 344): 4.579833638334098e-06 4.579591726559538e-06
=========================== short test summary info ============================
FAILED .github/test_result_consistency.py::test_result_data_var_consistency[study_transient_absorption_target_analysis-residual]
======================== 1 failed, 296 passed in 5.59s =========================
Error: Process completed with exit code 1.

jsnel added a commit to jsnel/pyglotaran that referenced this pull request Dec 12, 2021
Refine calculation of epsilon for residual check

Should resolve issue signaled in glotaran#896
jsnel added a commit to jsnel/pyglotaran that referenced this pull request Dec 12, 2021
Refine calculation of epsilon for residual check

Should resolve issue signaled in glotaran#896
jsnel added a commit that referenced this pull request Dec 12, 2021
* 🚇Update test_result_consistency
Update float_resolution for comparison of spectral values.
Should resolve issues with result comparison signaled in #927

* 🚇👌 Further refinement of test_result_consistency
Refine calculation of epsilon for residual check
Should resolve issues with result comparison signaled in #896

* 🚇👌 Failing tests now also show the tolerances used for the comparison
patrickhaetti and others added 14 commits January 30, 2022 19:12
Functions added it utils/io.py and applied in csv.py. Function safe_parameters_fillna for loading and safe_parameters_replace for saving parameters. Now it will be checked if columns exist before replacing inf/-inf and empty strings.
Take out 'parameter_diff_param.yaml' from builtin/io/pandas/test/ which had varied values of 'parameter.yaml'.
Taken out test functions from 'test_pandas_parameters.py' which were based on  'parameter_diff_param.yaml'.
Added ods parameter to xlsx.py

Added 'ods' to /glotaran/builtin/io/pandas/xlsx.py.

tests added to 'test_pandas_parameters.py'

add odfpy>=1.4.1 to setup.cfg

Append 'ods' to docstrings.
…x.py

Use functions from glotaran.utils.io.
Corrected spelling mistake in csv.py
For functions safe_parameters_fillna and safe_parameters_replace
Improved readability (`parameter` -> `parameters`, 'testparameters' -> 'test_parameters')
More distinctive filenames from just 'parameter.xlsx' to 'reference_parameter.xlsx' so it doesn't pollute the parameter* file namespace when searching through the project.
Copy link
Member

@joernweissenborn joernweissenborn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@s-weigand Only one Q for before my lgtm, it is in the review.

Copy link
Member

@joernweissenborn joernweissenborn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

✨ Added 'replace_infinfinity' optional argument to keep inf values in csv and tsv files
@sourcery-ai
Copy link
Contributor

sourcery-ai bot commented Jan 30, 2022

Sourcery Code Quality Report

✅  Merging this PR will increase code quality in the affected files by 0.04%.

Quality metrics Before After Change
Complexity 2.29 ⭐ 2.24 ⭐ -0.05 👍
Method Length 49.87 ⭐ 51.65 ⭐ 1.78 👎
Working memory 6.12 ⭐ 6.09 ⭐ -0.03 👍
Quality 78.38% 78.42% 0.04% 👍
Other metrics Before After Change
Lines 1690 1780 90
Changed files Quality Before Quality After Quality Change
glotaran/parameter/parameter_group.py 75.47% ⭐ 75.47% ⭐ 0.00%
glotaran/parameter/test/test_parameter_group.py 74.28% 🙂 74.28% 🙂 0.00%
glotaran/plugin_system/test/test_project_io_registration.py 90.16% ⭐ 90.16% ⭐ 0.00%
glotaran/utils/io.py 76.31% ⭐ 77.49% ⭐ 1.18% 👍
glotaran/utils/test/test_io.py 77.70% ⭐ 78.03% ⭐ 0.33% 👍

Here are some functions in these files that still need a tune-up:

File Function Complexity Length Working Memory Quality Recommendation
glotaran/utils/io.py _load_datasets 15 🙂 156 😞 12 😞 44.60% 😞 Try splitting into smaller methods. Extract out complex expressions
glotaran/parameter/parameter_group.py ParameterGroup.from_dataframe 14 🙂 134 😞 10 😞 51.47% 🙂 Try splitting into smaller methods. Extract out complex expressions
glotaran/parameter/parameter_group.py ParameterGroup.markdown 3 ⭐ 128 😞 11 😞 60.51% 🙂 Try splitting into smaller methods. Extract out complex expressions
glotaran/parameter/parameter_group.py ParameterGroup.__repr__ 6 ⭐ 72 🙂 12 😞 63.49% 🙂 Extract out complex expressions
glotaran/utils/test/test_io.py test_load_datasets_single_dataset 0 ⭐ 208 ⛔ 7 🙂 64.44% 🙂 Try splitting into smaller methods

Legend and Explanation

The emojis denote the absolute quality of the code:

  • ⭐ excellent
  • 🙂 good
  • 😞 poor
  • ⛔ very poor

The 👍 and 👎 indicate whether the quality has improved or gotten worse with this pull request.


Please see our documentation here for details on how these metrics are calculated.

We are actively working on this report - lots more documentation and extra metrics to come!

Help us improve this quality report!

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
2.0% 2.0% Duplication

@s-weigand s-weigand merged commit f59b92c into glotaran:main Jan 30, 2022
@patrickhaetti
Copy link
Contributor Author

patrickhaetti commented Jan 31, 2022

Hi everybody, I am happy to hear the merge happened!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Enhancement Feature requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants