Framework for statistical tests #3883

Closed
jngrad opened this issue Sep 2, 2020 · 5 comments · Fixed by #3999
jngrad commented Sep 2, 2020

Currently we have frameworks for unit testing, integration testing and sample testing. They are all deterministic, either because the functionality being tested is deterministic, or because RNG seeds (both in numpy and espresso) have been set such that it behaves deterministically. We do not yet have a suitable framework for testing functionality that is stochastic in nature and for which a statistic (e.g. the mean) is known to converge for long simulation times. This forces us to use workarounds, e.g. disabling tests in CMake for specific build types, increasing the test timeout in CI, or reducing the number of integration loops at the expense of accuracy.
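
For context, the determinism of the existing tests comes from pinning both RNG streams before the check runs. Below is a minimal sketch of that pattern; the espressomd calls (the System constructor and the thermostat seed keyword) follow the 4.x Python API but are written from memory for illustration and are not taken from a specific test file.

```python
# Sketch only: pin the numpy stream (particle setup) and the espresso core RNG
# (thermostat seed) so the observable is reproducible from run to run.
import unittest

import numpy as np
import espressomd


class DeterministicThermostatTest(unittest.TestCase):
    system = espressomd.System(box_l=[10.0, 10.0, 10.0])
    system.time_step = 0.01
    system.cell_system.skin = 0.4

    def test_diffusion(self):
        np.random.seed(42)  # fixed numpy seed for the initial configuration
        self.system.part.add(pos=np.random.random((100, 3)) * 10.0)
        # fixed seed for the thermostat RNG on the core
        self.system.thermostat.set_langevin(kT=1.0, gamma=1.0, seed=42)
        # ... integrate and compare a statistic against its analytical value,
        #     with a tolerance chosen for this exact seed


if __name__ == "__main__":
    unittest.main()
```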

Problem statement
Several Python integration tests are actually not integration tests but stochastic tests. They do not contribute to code coverage and do not benefit from C++ static assertions and sanitizers, since the underlying functionality already has integration tests. Yet they suffer from significant slowdowns (2-5 times longer runtime) in coverage and sanitizer builds. The Clang sanitizer CI job, for example, has crossed the 1 hour runtime threshold on multiple occasions over the last few months, prompting us to refactor slow tests to make them run faster at the expense of larger tolerance values in the checks, and to refactor unrelated Python tests to gain a few minutes and compensate for the slow ones. Statistical tests also tend to show large runtime fluctuations on coyote runners that are under load from multiple CI jobs.

Although we should always prefer deterministic tests over statistical tests, there is a need for statistical tests that will not go away. Here are a few candidates for a statistical test framework (times in seconds; RelWithAssert builds are compiled with -O3 -g):

| Test name | RelWithAssert (s) | Coverage (s) | Sanitizers (s) | Remarks |
| --- | --- | --- | --- | --- |
| langevin_thermostat | 12.47 | 46.44 | 82.87 | |
| brownian_dynamics | 6.57 | 14.24 | 23.92 | |
| dpd | 8.28 | 31.25 | 43.49 | |
| integrator_npt | 15.65 | 77.96 | 122.07 | |
| lb | 10.76 | 21.44 | 47.73 | mass & momentum conservation |
| lb_pressure_tensor_acf | 21.60 | 24.73 | 41.54 | CPU LB is untested (too slow) |
| lb_stokes_sphere | 3.37 | 39.41 | 99.10 | |
| virtual_sites_tracers | 5.41 | 49.39 | 111.55 | |
| mass-and-rinertia_per_particle | 11.45 | 53.08 | 58.30 | |
| rotational-diffusion-aniso | 18.86 | 97.10 | 81.72 | |
| coulomb_tuning | 16.97 | 50.51 | 88.34 | P3M tuning is already tested in p3m_tuning_exceptions |
| wang_landau_reaction_ensemble | 115.69 | >300.0 | 532.04 | removed from Coverage builds |

Prior work
The sample and tutorial tests were stochastic from January 2019 to July 2020, using randomized numpy seeds and espresso seeds (in 4.1). These tests were too unstable and were eventually converted to deterministic tests due to the high failure rate in CI. Generating reliable tolerance intervals was also problematic due to the variety of probability distributions underlying the statistics being tested, which forced us to run every new test in a loop to bootstrap its tolerance interval. This strategy doesn't work for us, so we cannot simply convert statistical tests to pseudo-sample tests to get improved runtimes from Release builds.
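
To make the bootstrapping step concrete, here is a minimal sketch of that procedure, under the assumption that one seeded run can be wrapped in a function; run_once is a hypothetical stand-in, not a helper from the test suite.

```python
# Sketch of the bootstrap procedure described above: run the observable many times
# with different seeds and derive a tolerance interval from the empirical spread.
import numpy as np


def bootstrap_tolerance(run_once, n_runs=100, coverage=0.999):
    """Return an empirical (lower, upper) tolerance interval for the statistic."""
    samples = np.array([run_once(seed) for seed in range(n_runs)])
    lower, upper = np.quantile(samples, [(1 - coverage) / 2, (1 + coverage) / 2])
    return lower, upper


if __name__ == "__main__":
    # toy observable standing in for one seeded simulation returning the statistic
    def run_once(seed):
        rng = np.random.default_rng(seed)
        return rng.normal(loc=1.0, scale=0.05, size=500).mean()

    print(bootstrap_tolerance(run_once))
```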

Proposed solution
Re-purpose the unused (and no longer up-to-date) check_python_skip_long CMake target to skip the tests mentioned above in coverage and sanitizer builds. When a statistical test provides coverage, move the relevant part into a self-contained integration test. This should remove 11 min (= 22 min * cores) from the Clang job and provide a framework for statistical tests.
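
The change proposed here lives in the CMake logic (labelling the affected tests and wiring them to the repurposed check_python_skip_long target). Purely as an illustration of the intended effect, an equivalent guard expressed at the Python level could look like the sketch below; ESPRESSO_TEST_SKIP_SLOW is a hypothetical environment variable, not an existing switch in the test suite.

```python
# Illustration only: skip a statistical test when a (hypothetical) flag marks the
# build as a coverage or sanitizer build.
import os
import unittest


@unittest.skipIf(os.environ.get("ESPRESSO_TEST_SKIP_SLOW") == "1",
                 "statistical test skipped in coverage/sanitizer builds")
class LangevinStatisticsTest(unittest.TestCase):
    def test_diffusion_coefficient(self):
        # long stochastic run whose mean converges to the analytical value
        pass


if __name__ == "__main__":
    unittest.main()
```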

Shortcomings
One drawback is that integration tests and statistical tests will still run together under the umbrella term of Python tests. Moving statistical tests to their own CMake target would offer better separation and prevent statistical tests from running if the underlying features raise e.g. ASAN or UBSAN errors, at the expense of more CMake code and a bottleneck in CTest (right now CTest schedules GPU tests to run in serial and lets CPU tests take the remaining available cores; introducing a synchronization barrier for statistical tests makes resource allocation less efficient).

jngrad commented Sep 30, 2020

  • The wang_landau_reaction_ensemble test is timing out again in CI in sanitizer builds (logfile).
  • The lb_pressure_tensor_acf test on CPU required the introduction of a new workaround in the CMake logic (1c16301).

Both of these tests have introduced unmaintainable CMake workarounds that will have to be removed once we decide on a framework for statistical tests.

jngrad commented Oct 22, 2020

Builds are now timing out on ICP workstations. On giraffe with sanitizers:

 70/159 Test #107: wang_landau_reaction_ensemble .....................***Timeout 300.06 sec
 71/159 Test #150: stokesian_dynamics_cpu ............................***Timeout 300.08 sec

On lama without sanitizers and without coverage:

 52/159 Test #107: wang_landau_reaction_ensemble .....................***Timeout 300.04 sec
 89/159 Test  #67: lb_pressure_tensor ................................***Timeout 300.08 sec

This causes CI to fail in PRs (#3960) and nightly builds (#3951, #3963). I have removed lama from the list of active CI runners.

RudolfWeeber commented:

The following tests are deterministic:

  • lb_stokes_sphere
  • virtual_sites_tracers

The criterion used here: a test counts as statistical (rather than deterministic) if there is a non-zero probability that it fails even when the implementation is correct.

There are quite a few tests using random numbers which nevertheless have to pass every time.
E.g., the inertia tensor has to be correct whatever random particle positions are chosen.
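
To illustrate the distinction, here is a sketch of a test that draws random inputs yet must always pass, because the properties checked hold exactly for every draw. The inertia_tensor helper below is a plain numpy reference written for this example, not the espressomd implementation.

```python
# Random inputs, deterministic outcome: the assertions hold for any particle
# configuration, so the test can only fail if the implementation is wrong.
import unittest

import numpy as np


def inertia_tensor(masses, positions):
    """Inertia tensor of point masses about the origin (reference formula)."""
    I = np.zeros((3, 3))
    for m, r in zip(masses, positions):
        I += m * (np.dot(r, r) * np.eye(3) - np.outer(r, r))
    return I


class InertiaTensorPropertyTest(unittest.TestCase):
    def test_random_positions(self):
        rng = np.random.default_rng()  # deliberately unseeded: every draw must pass
        masses = rng.uniform(0.5, 2.0, size=50)
        positions = rng.uniform(-5.0, 5.0, size=(50, 3))
        I = inertia_tensor(masses, positions)
        # properties that hold exactly for every configuration
        np.testing.assert_allclose(I, I.T, atol=1e-12)  # symmetric
        self.assertTrue(np.all(np.linalg.eigvalsh(I) >= -1e-12))  # positive semi-definite
        expected_trace = 2.0 * np.sum(masses * np.einsum("ij,ij->i", positions, positions))
        np.testing.assert_allclose(np.trace(I), expected_trace, rtol=1e-10)


if __name__ == "__main__":
    unittest.main()
```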

IMO, the non-deterministic parts of the affected tests should be moved into separate files. As suggested, they can then be excluded from slow CI runs.
We should make a list of the individual test functions before beginning the implementation.

RudolfWeeber added this to the Espresso 4.2 milestone Oct 26, 2020
RudolfWeeber commented:

Langevin:

  • test_global_langevin*
  • test_langevin_per_particle*
  • test_diffusion
  • test_noise_correlation

Brownian: fully

dpd:

  • test_single*
  • test_binary

integrator_npt: fully

lb:

  • mass_momentum_thermostat

lb_pressure_tensor_acf: fully

lb_stokes_sphere: fully, but for speed, not because it is statistical

virtual_sites_tracers: fully, but for speed, not because it is statistical

mass-and-rinertia-per-particle: fully

rotational-diffusion-aniso: fully

coulomb_tuning: fully

wang_landau_reaction_ensemble: fully

RudolfWeeber commented:

virtual_sites_tracers:
The various surface restoring forces can be tested without LB; then the runtime is no longer an issue. See the test in the Walberla branch.

kodiakhq bot added a commit that referenced this issue Nov 10, 2020
Separate C++ coverage from Python coverage. This way, statistical tests (#3883) can appear in the coverage report even though they are excluded from C++ coverage builds, and tutorial/sample tests can also appear in the coverage report.
jngrad self-assigned this Nov 17, 2020
kodiakhq bot closed this as completed in #3999 Nov 24, 2020
kodiakhq bot added a commit that referenced this issue Nov 24, 2020
Fixes #3883

Description of changes:
- write more thorough RNG checks for thermostats
- skip slow/statistical tests in coverage and sanitizer builds
- fix broken ifdefs for Brownian Dynamics
- cleanup IBM code