Framework for statistical tests #3883

Closed
jngrad opened this issue Sep 2, 2020 · 5 comments · Fixed by #3999
jngrad commented Sep 2, 2020

Currently we have frameworks for unit testing, integration testing and sample testing. They are all deterministic, either because the functionality being tested is deterministic, or because RNG seeds (both in numpy and espresso) have been set such that it behaves deterministically. We do not yet have a suitable framework for testing functionality that is stochastic in nature and for which a statistic (e.g. the mean) is known to converge for long simulation times. This forces us to use workarounds, e.g. disabling tests in CMake for specific build types, increasing the test timeout in CI, or reducing the number of integration loops at the expense of accuracy.
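
For context, the determinism of the existing tests comes from pinning both RNG streams before the check runs. Below is a minimal sketch of that pattern; the espressomd calls (the System constructor and the thermostat seed keyword) follow the 4.x Python API but are written from memory for illustration and are not taken from a specific test file.

```python
# Sketch only: pin the numpy stream (particle setup) and the espresso core RNG
# (thermostat seed) so the observable is reproducible from run to run.
import unittest

import numpy as np
import espressomd


class DeterministicThermostatTest(unittest.TestCase):
    system = espressomd.System(box_l=[10.0, 10.0, 10.0])
    system.time_step = 0.01
    system.cell_system.skin = 0.4

    def test_diffusion(self):
        np.random.seed(42)  # fixed numpy seed for the initial configuration
        self.system.part.add(pos=np.random.random((100, 3)) * 10.0)
        # fixed seed for the thermostat RNG on the core
        self.system.thermostat.set_langevin(kT=1.0, gamma=1.0, seed=42)
        # ... integrate and compare a statistic against its analytical value,
        #     with a tolerance chosen for this exact seed


if __name__ == "__main__":
    unittest.main()
```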

Problem statement
Several Python integration tests are actually not integration tests but stochastic tests. They do not contribute to code coverage and do not benefit from C++ static assertions and sanitizers, since the underlying functionality already has integration tests. Yet they suffer from significant slowdowns (2-5 times longer runtime) in coverage and sanitizer builds. The Clang sanitizer CI job, for example, has crossed the 1 hour runtime threshold on multiple occasions over the last few months, prompting us to refactor slow tests to make them run faster at the expense of larger tolerance values in the checks, and to refactor unrelated Python tests to gain a few minutes and compensate for the slow ones. Statistical tests also tend to show large runtime fluctuations on coyote runners that are under load from multiple CI jobs.

Although we should always prefer deterministic tests over statistical tests, there is a need for statistical tests that will not go away. Here are a few candidates for a statistical test framework (times in seconds; RelWithAssert builds are compiled with -O3 -g):

| Test name | RelWithAssert (s) | Coverage (s) | Sanitizers (s) | Remarks |
| --- | --- | --- | --- | --- |
| langevin_thermostat | 12.47 | 46.44 | 82.87 | |
| brownian_dynamics | 6.57 | 14.24 | 23.92 | |
| dpd | 8.28 | 31.25 | 43.49 | |
| integrator_npt | 15.65 | 77.96 | 122.07 | |
| lb | 10.76 | 21.44 | 47.73 | mass & momentum conservation |
| lb_pressure_tensor_acf | 21.60 | 24.73 | 41.54 | CPU LB is untested (too slow) |
| lb_stokes_sphere | 3.37 | 39.41 | 99.10 | |
| virtual_sites_tracers | 5.41 | 49.39 | 111.55 | |
| mass-and-rinertia_per_particle | 11.45 | 53.08 | 58.30 | |
| rotational-diffusion-aniso | 18.86 | 97.10 | 81.72 | |
| coulomb_tuning | 16.97 | 50.51 | 88.34 | P3M tuning is already tested in p3m_tuning_exceptions |
| wang_landau_reaction_ensemble | 115.69 | >300.0 | 532.04 | removed from Coverage builds |

Prior work
The sample and tutorial tests were stochastic from January 2019 to July 2020, using randomized numpy seeds and espresso seeds (in 4.1). These tests were too unstable and were eventually converted to deterministic tests due to the high failure rate in CI. Generating reliable tolerance intervals was also problematic due to the variety of probability distributions underlying the statistics being tested, which forced us to run every new test in a loop to bootstrap its tolerance interval. This strategy doesn't work for us, so we cannot simply convert statistical tests to pseudo-sample tests to get improved runtimes from Release builds.
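
To make the bootstrapping step concrete, here is a minimal sketch of that procedure, under the assumption that one seeded run can be wrapped in a function; run_once is a hypothetical stand-in, not a helper from the test suite.

```python
# Sketch of the bootstrap procedure described above: run the observable many times
# with different seeds and derive a tolerance interval from the empirical spread.
import numpy as np


def bootstrap_tolerance(run_once, n_runs=100, coverage=0.999):
    """Return an empirical (lower, upper) tolerance interval for the statistic."""
    samples = np.array([run_once(seed) for seed in range(n_runs)])
    lower, upper = np.quantile(samples, [(1 - coverage) / 2, (1 + coverage) / 2])
    return lower, upper


if __name__ == "__main__":
    # toy observable standing in for one seeded simulation returning the statistic
    def run_once(seed):
        rng = np.random.default_rng(seed)
        return rng.normal(loc=1.0, scale=0.05, size=500).mean()

    print(bootstrap_tolerance(run_once))
```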

Proposed solution
Re-purpose the unused (and no longer up-to-date) check_python_skip_long CMake target to skip the tests mentioned above in coverage and sanitizer builds. When a statistical test provides coverage, move the relevant part into a self-contained integration test. This should remove 11 min (= 22 min * cores) from the Clang job and provide a framework for statistical tests.
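
The change proposed here lives in the CMake logic (labelling the affected tests and wiring them to the repurposed check_python_skip_long target). Purely as an illustration of the intended effect, an equivalent guard expressed at the Python level could look like the sketch below; ESPRESSO_TEST_SKIP_SLOW is a hypothetical environment variable, not an existing switch in the test suite.

```python
# Illustration only: skip a statistical test when a (hypothetical) flag marks the
# build as a coverage or sanitizer build.
import os
import unittest


@unittest.skipIf(os.environ.get("ESPRESSO_TEST_SKIP_SLOW") == "1",
                 "statistical test skipped in coverage/sanitizer builds")
class LangevinStatisticsTest(unittest.TestCase):
    def test_diffusion_coefficient(self):
        # long stochastic run whose mean converges to the analytical value
        pass


if __name__ == "__main__":
    unittest.main()
```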

Shortcomings
One drawback is that integration tests and statistical tests will still run together under the umbrella term of Python tests. Moving statistical tests to their own CMake target would offer better separation and prevent statistical tests from running if the underlying features raise e.g. ASAN or UBSAN errors, at the expense of more CMake code and a bottleneck in CTest (right now CTest schedules GPU tests to run in serial and lets CPU tests take the remaining available cores; introducing a synchronization barrier for statistical tests makes resource allocation less efficient).

jngrad commented Sep 30, 2020

  • The wang_landau_reaction_ensemble test is timing out again in CI in sanitizer builds (logfile).
  • The lb_pressure_tensor_acf test on CPU required the introduction of a new workaround in the CMake logic (1c16301).

Both of these tests have introduced unmaintainable CMake workarounds that will have to be removed once we decide on a framework for statistical tests.

jngrad commented Oct 22, 2020

Builds are now timing out on ICP workstations. On giraffe with sanitizers:

 70/159 Test #107: wang_landau_reaction_ensemble .....................***Timeout 300.06 sec
 71/159 Test #150: stokesian_dynamics_cpu ............................***Timeout 300.08 sec

On lama without sanitizers and without coverage:

 52/159 Test #107: wang_landau_reaction_ensemble .....................***Timeout 300.04 sec
 89/159 Test  #67: lb_pressure_tensor ................................***Timeout 300.08 sec

This causes CI to fail in PRs (#3960) and nightly builds (#3951, #3963). I have removed lama from the list of active CI runners.

RudolfWeeber commented:

The following tests are deterministic:

  • lb_stokes_sphere
  • virtual_sites_tracers

The criterion used here: a test counts as statistical (rather than deterministic) if there is a non-zero probability that it fails even when the implementation is correct.

There are quite a few tests using random numbers which nevertheless have to pass every time.
E.g., the inertia tensor has to be correct whatever random particle positions are chosen.
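
To illustrate the distinction, here is a sketch of a test that draws random inputs yet must always pass, because the properties checked hold exactly for every draw. The inertia_tensor helper below is a plain numpy reference written for this example, not the espressomd implementation.

```python
# Random inputs, deterministic outcome: the assertions hold for any particle
# configuration, so the test can only fail if the implementation is wrong.
import unittest

import numpy as np


def inertia_tensor(masses, positions):
    """Inertia tensor of point masses about the origin (reference formula)."""
    I = np.zeros((3, 3))
    for m, r in zip(masses, positions):
        I += m * (np.dot(r, r) * np.eye(3) - np.outer(r, r))
    return I


class InertiaTensorPropertyTest(unittest.TestCase):
    def test_random_positions(self):
        rng = np.random.default_rng()  # deliberately unseeded: every draw must pass
        masses = rng.uniform(0.5, 2.0, size=50)
        positions = rng.uniform(-5.0, 5.0, size=(50, 3))
        I = inertia_tensor(masses, positions)
        # properties that hold exactly for every configuration
        np.testing.assert_allclose(I, I.T, atol=1e-12)  # symmetric
        self.assertTrue(np.all(np.linalg.eigvalsh(I) >= -1e-12))  # positive semi-definite
        expected_trace = 2.0 * np.sum(masses * np.einsum("ij,ij->i", positions, positions))
        np.testing.assert_allclose(np.trace(I), expected_trace, rtol=1e-10)


if __name__ == "__main__":
    unittest.main()
```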

IMO, the non-deterministic parts of the affected tests should be moved into separate files. As suggested, they can then be excluded from slow CI runs.
We should make a list of the individual test functions before beginning the implementation.

RudolfWeeber added this to the Espresso 4.2 milestone Oct 26, 2020
RudolfWeeber commented:

Langevin:

  • test_global_langevin*
  • test_langevin_per_particle*
  • test_diffusion
  • test_noise_correlation

Brownian: fully

dpd:

  • test_single*
  • test_binary

integrator_npt: fully

lb:

  • mass_momentum_thermostat

lb_pressure_tensor_acf: fully

lb_stokes_sphere: fully, but for speed, not because it is statistical

virtual_sites_tracers: fully, but for speed, not because it is statistical

mass-and-rinertia-per-particle: fully

rotational-diffusion-aniso: fully

coulomb_tuning: fully

wang_landau_reaction_ensemble: fully

RudolfWeeber commented:

virtual_sites_tracers:
The various surface restoring forces can be tested without LB; then the runtime is no longer an issue. See the test in the Walberla branch.

kodiakhq bot added a commit that referenced this issue Nov 10, 2020
Separate C++ coverage from Python coverage. This way, statistical tests (#3883) can appear in the coverage report even though they are excluded from C++ coverage builds, and tutorial/sample tests can also appear in the coverage report.
jngrad self-assigned this Nov 17, 2020
kodiakhq bot closed this as completed in #3999 Nov 24, 2020
kodiakhq bot added a commit that referenced this issue Nov 24, 2020
Fixes #3883

Description of changes:
- write more thorough RNG checks for thermostats
- skip slow/statistical tests in coverage and sanitizer builds
- fix broken ifdefs for Brownian Dynamics
- cleanup IBM code