
Need better tests of system tests #1640

Closed · rljacob opened this issue Jun 5, 2017 · 10 comments

@rljacob
Member

rljacob commented Jun 5, 2017

The discussion in issue #1638 shows that it's possible to put changes on master that will break a test without detection, in this case the widely used ERS test.

We need better "tests of system tests". We currently test the ERS.py test by deploying it with a couple of A cases in cime_developer. But if the test itself fails to exercise what it's supposed to, while still running, then it will always seem to "pass". If the models under test with ERS actually break restart, it could be weeks or months before anyone notices.

What we need is a case that by design will fail an ERS test. ERS detecting "fail" is then a pass for that test. We may need that for other system tests as well.
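
For illustration, such a "designed to fail" check could look roughly like the sketch below; run_ers_compare_phase() is a made-up stand-in for actually building and running an ERS test through create_test and reading its TestStatus.

```python
# Illustrative sketch only: run_ers_compare_phase() stands in for building and
# running an ERS test via create_test and reading its TestStatus.
def run_ers_compare_phase(case_name):
    """Pretend to run an ERS test on the given case and return the status of
    its restart-comparison phase ("PASS" or "FAIL")."""
    # For this sketch, assume the broken-restart case is correctly detected.
    return "FAIL" if case_name == "broken_restart_case" else "PASS"

def test_ers_detects_broken_restart():
    # ERS should report FAIL on a case designed to break exact restart;
    # if it reports PASS, the ERS test itself has stopped exercising restarts.
    status = run_ers_compare_phase("broken_restart_case")
    assert status == "FAIL", \
        "ERS passed on a case designed to break restart -- the ERS test is broken"

if __name__ == "__main__":
    test_ers_detects_broken_restart()
    print("meta-test passed: ERS reported FAIL for the broken-restart case")
```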

@billsacks
Member

I think the problem was ERP, not ERS.

See also the discussion in #290 - which I guess shouldn't have been closed, given that at least the ERP test is still prone to being broken in this way. I think I closed it because we couldn't come up with an easy way to test the system tests in this way.

Much of my motivation for creating the SystemTestsCompareTwo infrastructure was to allow us to test that infrastructure extensively (via unit tests) and then build some simple tests on top of it. It feels to me that the tests that most often break are ones that have two different configurations but are not handled via SystemTestsCompareTwo - and so require some resetting between each build/run. However, we never came up with a way to handle tests that require a restart using that framework (see #448 and #824). This leaves ERP as one of the most error-prone tests, which is unfortunate because it is also one of the most widely used tests.
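
For reference, a test built on SystemTestsCompareTwo only has to describe how its two cases differ; the shared infrastructure (which is what the unit tests cover) handles the two builds/runs and the history comparison. The sketch below follows the pattern of the real subclasses (ERP, PEM, etc.), but MYT is a made-up test type and the constructor arguments follow the system_tests_compare_two module of this era, so details may not match the current API exactly.

```python
# Sketch of a compare-two test; MYT is a made-up test type, and constructor
# arguments may not match the current SystemTestsCompareTwo API exactly.
from CIME.SystemTests.system_tests_compare_two import SystemTestsCompareTwo

class MYT(SystemTestsCompareTwo):
    def __init__(self, case):
        SystemTestsCompareTwo.__init__(
            self, case,
            separate_builds=False,
            run_two_suffix="modpes",
            run_one_description="default PE layout",
            run_two_description="halved MPI task counts (should be bit-for-bit)")

    def _case_one_setup(self):
        pass  # case one runs with out-of-the-box settings

    def _case_two_setup(self):
        # Halve the MPI task count for each component; answers should be
        # bit-for-bit with case one. (Roughly what the real PEM test does;
        # the real tests also re-run case_setup after changing the layout.)
        for comp in self._case.get_values("COMP_CLASSES"):
            ntasks = self._case.get_value("NTASKS_{}".format(comp))
            if ntasks > 1:
                self._case.set_value("NTASKS_{}".format(comp), ntasks // 2)
```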

@rljacob
Member Author

rljacob commented Jun 6, 2017

Can testmods include source code mods?

@billsacks
Member

Can testmods include source code mods?

Yes. But, as discussed in #290, this carries a maintenance cost, since the SourceMods will need to be updated whenever the production code is updated. (As an aside, we've had some discussions about storing SourceMods as diffs rather than the full file, which could then be applied via a patch command. That would help somewhat - but would require developing some infrastructure to accomplish it.)
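
As a rough sketch of that idea (none of this infrastructure exists today; the function and directory layout below are hypothetical): keep only *.patch files under SourceMods and apply them to the source tree with the patch command.

```python
# Hypothetical sketch -- this infrastructure does not exist in CIME today.
# Assumes SourceMods holds *.patch files (diffs against the production code)
# instead of full copies of modified source files.
import glob
import os
import subprocess

def apply_sourcemod_patches(caseroot, srcroot):
    """Apply every *.patch file under the case's SourceMods to the source tree."""
    patches = glob.glob(os.path.join(caseroot, "SourceMods", "src.*", "*.patch"))
    for patch_file in patches:
        # -p1 assumes the diffs were generated from the top of the source tree
        subprocess.check_call(["patch", "-p1", "-i", patch_file], cwd=srcroot)
```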

@rljacob
Member Author

rljacob commented Jun 8, 2017

I think it might be possible to generate an ERP fail with CIME code: Random numbers in the o-a flux. Since those are also part of CIME, it can all get updated at once.

@billsacks
Member

billsacks commented Jun 8, 2017

I think it might be possible to generate an ERP fail with CIME code: Random numbers in the o-a flux. Since those are also part of CIME, it can all get updated at once.

If I understand your suggestion, that would generate a fail in any test that compares two runs. This would test some aspects of the system test infrastructure, and so could be valuable in some respects. However, this wouldn't test the ERP-specific logic - for example, the bug that was introduced recently in the ERP test would still have gone undetected even with this in place. (This could still be worth doing; I'm just trying to point out what would & wouldn't be covered by this.)

@jgfouca jgfouca self-assigned this Jul 19, 2017
jgfouca added a commit that referenced this issue Aug 23, 2017
compare two test case names match

This PR makes the case2 casename in system_tests_compare_two match the case1 casename and nests both the case directory and the output directories inside that of case1. This is a precursor for a PR to address issue #1640

Addresses #1640

Test suite: scripts_regression_tests.py
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes

User interface changes?:

Update gh-pages html (Y/N)?: N

Code review:
jgfouca added a commit that referenced this issue Oct 17, 2017
…(PR #1640)

Update netcdf for sandia machines.

This is necessary to support the cdf5 file type which is needed
for internal CIME testing.

Machines that do not use PIO in pnetcdf mode do not need to worry
about updating netcdf. If they do, netcdf 4.4 is needed.

[BFB]

* jgfouca/update_netcdf_for_sandia_machines:
  Point to new cprnc build
  Update netcdf for sandia machines.
@jgfouca
Contributor

jgfouca commented Nov 13, 2017

@billsacks, I'm thinking this task may not be necessary now that nearly all the key test types use the compare-two infrastructure. Compare-two is well tested; does it include tests for failed comparisons? If so, I'm tempted to close this.

@billsacks
Member

@jgfouca I agree that this may not be necessary. Have you also checked with @rljacob since he's the one who opened this issue?

However, on this point:

Compare-two is well tested; does it include tests for failed comparisons?

All it's testing right now is that _component_compare_test is actually called at the right time. I was assuming that the logic in _component_compare_test was tested elsewhere.

However: Based on your question, I have decided to go one level further in this testing. I just made some changes (which I'll submit in a PR tomorrow, once scripts_regression_tests finishes) that add unit tests to cover _component_compare_test (success or failure) from system_tests_compare_two.

These new tests will cover the relevant logic in system_tests_compare_two and system_tests_common. However, they will NOT cover the logic in hist_utils: I'm stubbing out the actual call into hist_utils, under the assumption that this is - or should be - covered by other tests.

But I'm not sure if hist_utils is actually covered sufficiently by unit tests. If it isn't, it should be, and maybe that could be a specific issue opened to replace this one. There are some test files in tools/cprnc/test_inputs that could possibly be used for this purpose: pre-staging these files into the appropriate location, or (maybe better) generating files like this on the fly, then calling one of the routines in hist_utils.py.
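
For concreteness, the stubbing pattern described above looks roughly like the sketch below. The class and method names are illustrative stand-ins, not the actual code in the PR; the point is just that the hist_utils call is isolated behind a small hook that test subclasses can override, rather than relying on the mock module.

```python
# Illustrative only: these class and method names are made up, not the actual
# code in the PR. The hist_utils call is isolated behind a small hook that
# test subclasses override.
import unittest

class MinimalCompareTwo(object):
    """Toy stand-in for the compare-two flow, with the hist_utils call
    isolated in _compare_hists() so tests can stub it out."""

    def __init__(self):
        self.phase_status = {}

    def _compare_hists(self, suffix1, suffix2):
        # The real code would call into hist_utils here.
        raise NotImplementedError

    def run_compare_phase(self):
        success, _comments = self._compare_hists("base", "rest")
        self.phase_status["COMPARE"] = "PASS" if success else "FAIL"

class StubComparePass(MinimalCompareTwo):
    def _compare_hists(self, suffix1, suffix2):
        return True, "histories match"

class StubCompareFail(MinimalCompareTwo):
    def _compare_hists(self, suffix1, suffix2):
        return False, "histories differ"

class TestComparePassFail(unittest.TestCase):
    def test_pass_recorded(self):
        test = StubComparePass()
        test.run_compare_phase()
        self.assertEqual("PASS", test.phase_status["COMPARE"])

    def test_fail_recorded(self):
        test = StubCompareFail()
        test.run_compare_phase()
        self.assertEqual("FAIL", test.phase_status["COMPARE"])

if __name__ == "__main__":
    unittest.main()
```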

@rljacob
Member Author

rljacob commented Nov 14, 2017

Maybe this can be replaced with a bunch of more specific issues. We still don't have a test that could have caught the failure discussed in #1638: a change to the test's python code turned it into a test that would always pass. A test that forces a fail would have noticed that.

@billsacks
Member

@rljacob While I generally agree with the sentiment of making sure our tests are tested, in particular to make sure they pick up fails, I also feel that, now that most of our non-trivial tests use system_tests_compare_two, these sorts of problems are less likely to occur. (In particular, the error in #1638 involves code that no longer exists after the conversion of erp to system_tests_compare_two.) I feel that the best way to handle this is to have near-100% unit test coverage of the relevant code. I feel good about the unit test coverage of system_tests_compare_two itself, but I suspect we still need additional unit tests of some of the things it relies on (like hist_utils).

@jgfouca
Contributor

jgfouca commented Nov 14, 2017

I agree with @billsacks. I think hist_utils is pretty well covered in scripts_regression_tests. We do lots of tests that do hist comparisons and check for success, and we also do a few that verify failure (diff).

jgfouca added a commit that referenced this issue Nov 14, 2017
…ests

Add unit tests of compare_two's handling of pass/fail in comparison

Previously, I had figured that it was sufficient to ensure that
_component_compare_test was actually being called, on the assumption
that tests of _component_compare_test belong elsewhere. But, since this
is such a critical aspect of the system_tests_compare_two
infrastructure, I'm adding some tests covering _component_compare_test
here.

These new tests cover the logic related to in-test comparisons in
system_tests_compare_two and system_tests_common. However, they will
NOT cover the logic in hist_utils: I'm stubbing out the actual call into
hist_utils, under the assumption that this is - or should be - covered
by other tests. But I'm not sure if hist_utils is actually covered
sufficiently by unit tests. If it isn't, it should be.

This also required some minor refactoring of system_tests_common in
order to allow stubbing out the call into hist_utils. (This refactoring
would not have been needed if we allowed use of the mock module: see
#2056.)

Test suite: scripts_regression_tests on yellowstone
Passes other than the issues documented in #2057 and #2059

Also ensured that comparison failures are still reported correctly by
running a REP test that forces a comparison failure (due to missing
cprnc).
Test baseline: n/a
Test namelist changes: none
Test status: bit for bit

Fixes: Partially addresses #1640

User interface changes?: none

Update gh-pages html (Y/N)?: N

Code review: