
Test examples in parallel #417
Merged: 36 commits into main, Jul 26, 2024

Conversation

@JDBetteridge (Member):

Given that most users will probably try to run an example from the examples directory in parallel, enable parallel testing for all examples.

I do not anticipate this adding significant time to CI runs as the parallel tests will benefit from having a hot cache.

Any test in unit-tests or integration-tests can be made to run in parallel by using the Python decorator:

@pytest.mark.parallel(nprocs=4)

The above example would run the test with 4 MPI ranks.
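
For example, a complete parallel test might look like the sketch below. This is a minimal illustration rather than code from this PR: the test name and assertion are hypothetical, but the marker usage matches the decorator above, with nprocs setting the number of MPI ranks.

    import pytest
    from mpi4py import MPI

    @pytest.mark.parallel(nprocs=4)
    def test_runs_on_four_ranks():
        # The test body executes on every rank, so the world
        # communicator should report exactly four processes.
        assert MPI.COMM_WORLD.size == 4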

Fixes #377.

Requires #416 to be addressed.

@JDBetteridge marked this pull request as ready for review on August 2, 2023 at 16:29.
@JDBetteridge (Member, Author):

Tests take ~3 min longer, which seems like a win.

@tommbendall (Contributor) left a comment:

This is great, thanks! It raises the question of which of the other unit tests or integration tests should be in parallel. The nice thing about the examples is that they cover most of the capabilities so this should give us reasonable coverage.

Just one small suggestion...

Review comment on gusto/io.py (outdated, resolved).
@JDBetteridge (Member, Author):

I agree, it's probably worth having a discussion at some point about which of the unit/integration tests should be run in parallel. Using the pytest marker makes it very easy to update existing tests.

A short-term suggestion would be: if you find any parallel bugs, add a parallel test or update an existing test to run in parallel (see the sketch below).
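
To make that concrete, converting an existing test is just a matter of adding the marker above the test function. A hypothetical sketch (the test and its run_example helper are illustrative only, not from this PR):

    import pytest

    # Previously a serial test; the decorator makes pytest launch it
    # under MPI with the requested number of ranks.
    @pytest.mark.parallel(nprocs=2)
    def test_existing_case_in_parallel(tmpdir):
        error = run_example(dirname=str(tmpdir))  # hypothetical helper
        assert error < 1e-8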

@JDBetteridge marked this pull request as draft on August 7, 2023 at 13:57.
@JDBetteridge marked this pull request as ready for review on October 9, 2023 at 10:08.
@JDBetteridge (Member, Author):

Twice in a row! Nice!

I will pull out my debugging code and then this is ready for review.

@tommbendall (Contributor):

I'd really like us to start parallel testing by default, so I thought it would be helpful to update the branch.

But as we saw a few months ago, we can still get failures. On this latest run it was the skamarock_klemp_hydrostatic test, and it looks like the other failures we've seen:

On the first call to the mixed solver, one rank seems to fall behind the others and not begin the ImplicitSolver_solve. Then all the ranks hang. Output from most log files:

2024-03-13 08:36:20,335 INFO     Semi-implicit Quasi-Newton: Mixed solve (0, 1)
2024-03-13 08:36:20,335 INFO     Compressible linear solver: rho average solve
2024-03-13 08:36:20,340 INFO     Compressible linear solver: Exner average solve
2024-03-13 08:36:20,345 INFO     Compressible linear solver: hybridized solve
2024-03-13 08:36:20,359 DEBUG            Residual norms for ImplicitSolver_condensed_field_ solve
2024-03-13 08:36:20,359 DEBUG            0 KSP unpreconditioned resid norm 4.413507124086e+07 true resid norm 4.413507124086e+07 ||r(i)||/||b|| 1.000000000000e+00
2024-03-13 08:36:20,362 DEBUG            1 KSP unpreconditioned resid norm 5.900637824684e+01 true resid norm 5.900637824491e+01 ||r(i)||/||b|| 1.336949881035e-06
2024-03-13 08:36:20,364 DEBUG            2 KSP unpreconditioned resid norm 5.041852127746e-05 true resid norm 5.042033608880e-05 ||r(i)||/||b|| 1.142409758753e-12
2024-03-13 08:36:20,376 DEBUG        Residual norms for ImplicitSolver_ solve
2024-03-13 08:36:20,376 DEBUG        0 KSP no resid norm 4.765315244582e+07 true resid norm 5.046385477193e-05 ||r(i)||/||b|| 1.058982505498e-12

Output from one log file:

2024-03-13 08:36:20,335 INFO     Semi-implicit Quasi-Newton: Mixed solve (0, 1)
2024-03-13 08:36:20,335 INFO     Compressible linear solver: rho average solve
2024-03-13 08:36:20,340 INFO     Compressible linear solver: Exner average solve
2024-03-13 08:36:20,345 INFO     Compressible linear solver: hybridized solve
2024-03-13 08:36:20,359 DEBUG            Residual norms for ImplicitSolver_condensed_field_ solve
2024-03-13 08:36:20,359 DEBUG            0 KSP unpreconditioned resid norm 4.413507124086e+07 true resid norm 4.413507124086e+07 ||r(i)||/||b|| 1.000000000000e+00
2024-03-13 08:36:20,362 DEBUG            1 KSP unpreconditioned resid norm 5.900637824684e+01 true resid norm 5.900637824491e+01 ||r(i)||/||b|| 1.336949881035e-06
2024-03-13 08:36:20,364 DEBUG            2 KSP unpreconditioned resid norm 5.041852127746e-05 true resid norm 5.042033608880e-05 ||r(i)||/||b|| 1.142409758753e-12

@tommbendall (Contributor):

In the failing example that I posted above, the hang comes after the condensed_field solve (the preconditioning step for the hybridized solver) and before the outer ImplicitSolver_ solve, which suggests to me that it could be worth reviewing whether these solver options are valid in parallel for our domains:

    solver_parameters = {'mat_type': 'matfree',
                         'ksp_type': 'preonly',
                         'pc_type': 'python',
                         'pc_python_type': 'firedrake.SCPC',
                         'pc_sc_eliminate_fields': '0, 1',
                         # The reduced operator is not symmetric
                         'condensed_field': {'ksp_type': 'fgmres',
                                             'ksp_rtol': 1.0e-8,
                                             'ksp_atol': 1.0e-8,
                                             'ksp_max_it': 100,
                                             'pc_type': 'gamg',
                                             'pc_gamg_sym_graph': None,
                                             'mg_levels': {'ksp_type': 'gmres',
                                                           'ksp_max_it': 5,
                                                           'pc_type': 'bjacobi',
                                                           'sub_pc_type': 'ilu'}}}

@JDBetteridge (Member, Author):

I wonder if we're caching something we shouldn't on a mesh hierarchy. We recently fixed failing tests in Firedrakeland by ensuring a fresh mesh hierarchy each time. That approach wouldn't work here but might be a good place to start. It would be good to whittle this down to a minimal failing example to work on.

@tommbendall (Contributor):

I don't think we actually use a mesh hierarchy in any tests: we only use algebraic multigrid, not geometric, although moving to geometric multigrid is definitely something we need to do to improve performance.

I agree about the minimal failing example; the hard thing is that sometimes it will pass!

@JDBetteridge (Member, Author):

Some recent changes in Firedrake may have fixed some parallel bugs. I'm running CI again and crossing my fingers 🤞.

If this still doesn't work we should try tackling this at the hackathon.

@tommbendall (Contributor) left a comment:

This has been talked about extensively offline. Thanks Jack, I approve!

@tommbendall merged commit d4622d6 into main on Jul 26, 2024 (4 checks passed).
@tommbendall deleted the JDBetteridge/parallel_tests branch on August 26, 2024 at 10:20.
Linked issue closed by this pull request: there is no test for running in parallel (#377).