Feature/unsequa multiprocessing #763

chahank · 2023-08-03T12:16:00Z

Changes proposed in this PR:
This PR performs minor maintenance on the unsequa module.
~~Note: the merge conflict depends on changes in PR #762 and will be resolved after the latter is merged.~~ SOLVED.

Replace Pathos with Multiprocess for parallel computing in unsequa module
Make the computation explicitly chunk the samples and distribute full chunks to the nodes when parallelized.
Fix bugs with pandas 2.0 (iteritems -> items, append -> concat)
Remove the global matplotlib style parameters
Remove the soon-to-be-deprecated impact.tot_value (optional; can be reverted if this should instead happen when the deprecated method is removed)
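
The chunking described in the list above can be sketched roughly as follows. This is a minimal illustration and not the unsequa implementation: `chunk_size`, `split_into_chunks`, `process_chunk`, and `run_chunked` are hypothetical names, a plain list stands in for the samples dataframe, and the standard-library `multiprocessing` module stands in for `multiprocess`, which exposes the same pool interface.

```python
import math
from multiprocessing import Pool


def chunk_size(n_samples, processes):
    """Rows per chunk so that there are at most `processes` chunks (minimum 1)."""
    return max(1, math.ceil(n_samples / processes))


def split_into_chunks(samples, size):
    """Split the samples into consecutive chunks of at most `size` rows."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def process_chunk(chunk):
    """Placeholder for the per-chunk computation: square each sample."""
    return [x * x for x in chunk]


def run_chunked(samples, processes=1):
    """Process whole chunks, using a pool only if more than one process is requested."""
    chunks = split_into_chunks(samples, chunk_size(len(samples), processes))
    if processes > 1:
        with Pool(processes) as pool:
            results = pool.map(process_chunk, chunks)
    else:
        results = [process_chunk(chunk) for chunk in chunks]
    # Flatten the per-chunk results back into one list, preserving order.
    return [value for chunk_result in results for value in chunk_result]
```

In this sketch the caller only passes the number of processes. On platforms that spawn rather than fork worker processes, a call with `processes > 1` should sit under an `if __name__ == "__main__":` guard.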

This PR fixes

upgrade pathos 0.3.0 -> 0.3.1 #761 (for unsequa module)
upgrade pandas 1.5 -> 2.0 #700 (for unsequa module)
calculated chunksizes greater than zero #762 (for unsequa module)
Importing the unsequa module unconditionally adjusts matplotlib style #758
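
For context on the pandas 2.0 item: `Series.iteritems()` and `DataFrame.append()` were removed in pandas 2.0, with `items()` and `pd.concat()` as the replacements. A minimal sketch of the two substitutions, using made-up data rather than anything from the unsequa module:

```python
import pandas as pd

series = pd.Series({"a": 1, "b": 2})

# pandas < 2.0: for key, value in series.iteritems(): ...
pairs = [(key, value) for key, value in series.items()]

df_first = pd.DataFrame({"x": [1, 2]})
df_second = pd.DataFrame({"x": [3]})

# pandas < 2.0: combined = df_first.append(df_second, ignore_index=True)
combined = pd.concat([df_first, df_second], ignore_index=True)
```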


This fixes parallel computations that previously did not work and simplifies the handling of the pool. Users now only need to specify the number of processes.

# Conflicts: # climada/engine/unsequa/calc_cost_benefit.py # climada/engine/unsequa/calc_impact.py # climada/engine/unsequa/test/test_unsequa.py

* Allow to set loglevel
* Add method to sort samples
* Update CHANGELOG.md
* Add advanced examples for unsequa
* Remove logging control
* Update changelog
* Remove tipo
* Clarify docstring
* Correct docstring
* Update t.o.c.
* Remove unecessary output prints
* Remove linter issues
* Update changelog

---------

Co-authored-by: Chahan Kropf <[email protected]>

peanutfun

It looks good to me but I cannot really assess if this is a clear improvement. My main trouble is the amount of doubled code and the lack of documentation regarding the parallelization process. Other than that, I see no major issues 👍

climada/engine/unsequa/calc_impact.py

climada/engine/unsequa/calc_base.py

climada/engine/unsequa/calc_cost_benefit.py

peanutfun · 2023-08-25T10:00:43Z

climada/engine/unsequa/calc_cost_benefit.py

-                                      samples_df.iterrows(),
-                                      chunksize=chunksize)
-
+            p_iterator = self._sample_parallel_iterator(


Suggestion: Make the entire iterator-computation-datamerge part a (local?) function that only takes a dataframe and the process number as arguments. Then you can pass the first DF row with processes=1 to estimate the compute time and afterwards pass the rest of the dataframe with processes=processes. No need to double the code.

Done in 261415c

By "local" I meant a function that is defined only within the scope of uncertainty, so that it would need fewer parameters. It works this way nonetheless.
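
A rough sketch of the suggestion in this thread, under stated assumptions: `evaluate_row`, `compute_samples`, and `estimate_then_compute` are hypothetical names, a list stands in for the samples dataframe, the per-row work is a placeholder, and the standard-library `multiprocessing` pool stands in for the one used in the PR. The same helper is first called on one row with `processes=1` to time it, then on the remaining rows with the requested number of processes.

```python
import time
from multiprocessing import Pool


def evaluate_row(row):
    """Placeholder for the per-sample computation."""
    return row * 2


def compute_samples(rows, processes=1):
    """Single code path used for both the timing run and the full run."""
    if processes > 1:
        with Pool(processes) as pool:
            return pool.map(evaluate_row, rows)
    return [evaluate_row(row) for row in rows]


def estimate_then_compute(rows, processes=1):
    """Time the first row serially, then compute the rest with `processes`."""
    start = time.perf_counter()
    first = compute_samples(rows[:1], processes=1)
    seconds_per_row = time.perf_counter() - start
    # Naive estimate: assumes every row costs as much as the first one
    # and that the work divides evenly among the processes.
    estimated_total = seconds_per_row * len(rows) / processes
    rest = compute_samples(rows[1:], processes=processes)
    return first + rest, estimated_total
```

Because both calls go through `compute_samples`, the iterator, computation, and merge logic exists only once, which is the point of the suggestion.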

Co-authored-by: Lukas Riedel <[email protected]>

climada/engine/unsequa/calc_base.py

climada/engine/unsequa/calc_cost_benefit.py

Co-authored-by: Lukas Riedel <[email protected]>

climada/engine/unsequa/calc_impact.py

Co-authored-by: Lukas Riedel <[email protected]>

climada/engine/unsequa/calc_impact.py

peanutfun · 2023-08-28T10:34:38Z

None of the "private" module functions (e.g. _multiprocess_chunksize) appear in the compiled docs. Either explicitly tell Sphinx to document private functions of the modules or make the functions public.

Edit: Follow-up issue: #774
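
For reference, the first option could look like this in a Sphinx `conf.py`. This is a sketch and not necessarily how the project's documentation is configured; it assumes the docs are built with `sphinx.ext.autodoc`, whose `autodoc_default_options` setting accepts a `private-members` key for names that start with an underscore.

```python
# conf.py (excerpt); assumes the docs are built with sphinx.ext.autodoc
extensions = ["sphinx.ext.autodoc"]

autodoc_default_options = {
    "members": True,
    # Also document module-level names with a leading underscore,
    # such as _multiprocess_chunksize.
    "private-members": True,
}
```

The same effect can be requested for a single module by adding the `:private-members:` option to its `automodule` directive.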

peanutfun

Well done!!

Chahan Kropf added 4 commits August 3, 2023 14:06

Update pandas iteritems to items

8a325ae

Change pd append to concat

b3d2f00

Replace pathos with multiprocess for unsequa

0456873

This fixes parallel computations that previously did not work and simplifies the handling of the pool. Users now only need to specify the number of processes.

Fix minimal chunksize at 1

4ab0972

chahank requested a review from peanutfun August 3, 2023 12:16

Chahan Kropf added 5 commits August 3, 2023 15:50

Remove global matplotlib styles

85c5a28

Remove deprecated tot_value

9bc4589

Adjust tests remove tot_valu

fb0e361

Remove tot_value from unsequa

91c4f1e

Add docstring for processes

c36f0f6

chahank mentioned this pull request Aug 3, 2023

calculated chunksizes greater than zero #762

Merged


Chahan Kropf and others added 2 commits August 4, 2023 11:40

Add changelog details

416e949

Merge branch 'develop' into feature/unsequa_multiprocessing

301c0b0

# Conflicts: # climada/engine/unsequa/calc_cost_benefit.py # climada/engine/unsequa/calc_impact.py # climada/engine/unsequa/test/test_unsequa.py

peanutfun mentioned this pull request Aug 17, 2023

Use GitHub actions for unit testing #745

Merged


Chahan Kropf added 2 commits August 21, 2023 15:47

Add chunked version of parallel computing.

429fe76

Make chunksize more efficient by default

13a4a9f

timschmi95 mentioned this pull request Aug 21, 2023

Feature/order samples unsequa #766

Merged


Chahan Kropf added 2 commits August 22, 2023 15:18

Update docstring

8c45b49

Change cost benefit to use chunks in parallel

4e77b86

chahank requested a review from emanuel-schmid August 22, 2023 13:19

Chahan Kropf and others added 3 commits August 23, 2023 13:05

Merge branch 'develop' into feature/unsequa_multiprocessing

c678791

Remove deprecated numpy vstack of objects

2d6130c

peanutfun requested changes Aug 25, 2023


chahank and others added 6 commits August 25, 2023 12:09

Update climada/engine/unsequa/calc_impact.py

e869f95

Co-authored-by: Lukas Riedel <[email protected]>

Update climada/engine/unsequa/calc_cost_benefit.py

7fb0835

Co-authored-by: Lukas Riedel <[email protected]>

Make sample iterator to single function

6ad26a0

Add comment on copy statement

d67651e

Remove not needed parallel pool chunksize argument

afb8147

Make chunksize base function

9ca7081

Chahan Kropf added 6 commits August 25, 2023 14:20

Make transpose data function

ede1c56

Import pathos instead of multiprocessing

6e7d24f

Make method for uncertainty computation step

261415c

Improve docstrings

f90a0b4

Improve docstring

15b5579

Add description of parallelization logic

abcc548