To deal with system issues, automatically retry tests #1865

billsacks · 2017-09-03T12:17:45Z

This is a low-priority enhancement request; I just want to record some thoughts on it for the future.

When running a test suite, we often run into problems where a small number of tests fail due to system issues. Having to go back and manually rerun these tests slows down the testing/tagging process. It would be nice if CIME's testing facilities had an option to automatically retry tests N times before giving up.

@ekluzek raised this at a CSEG meeting June 12, 2017. @jedwards4b felt this could cause too much confusion. But I was just thinking: a way around the confusion problem would be for the test system to create a totally new version of the test with a new test id, say by appending "try2" to the testid (I realize that may not be trivial to implement, though). @mnlevy1981 suggests adding a flag to create_test (like --num-tries, which is my name, not his). This would default to 1, but you could set it to something like 3 if you want.
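
As a rough illustration of the idea above (not CIME's actual implementation), retrying under a fresh test id might look like the following sketch. The names `run_with_retries`, `run_test`, and `num_tries`, as well as the `_tryN` suffix format, are hypothetical and chosen only for this example.

```python
from typing import Callable, List, Tuple


def run_with_retries(
    run_test: Callable[[str], bool],
    test_id: str,
    num_tries: int = 1,
) -> Tuple[bool, List[str]]:
    """Run a test up to num_tries times, using a new test id for each retry.

    run_test is a hypothetical callable that runs the test under the given
    id and returns True on success. Returns (passed, ids_attempted).
    """
    if num_tries < 1:
        raise ValueError("num_tries must be at least 1")

    attempted: List[str] = []
    for attempt in range(1, num_tries + 1):
        # The first attempt keeps the original id; retries get a suffix such
        # as "_try2" so results from earlier attempts are never overwritten.
        current_id = test_id if attempt == 1 else f"{test_id}_try{attempt}"
        attempted.append(current_id)
        if run_test(current_id):
            return True, attempted
    return False, attempted
```

With the default of `num_tries=1` this behaves like a single ordinary run, which matches the suggestion that retrying be opt-in.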

gold2718 · 2017-09-03T16:20:58Z

It seems to me that this would only be useful (in speeding the process) if you always used an N > 1. My experience is that these failures happen a minority of the time. I would like a button to push when I look at failures and suspect a system problem. How about something like the --use-existing switch but called --retry which would create new version(s) of the failed test(s).

jedwards4b · 2017-09-03T16:40:14Z

A switch to create_test that would refer to an existing test set and only run new tests that had a failed status in the existing test suite seems like a worthy option.
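
A minimal sketch of that selection step, assuming the earlier suite's results are available as a mapping from test name to an overall status string. The `select_failed_tests` helper, the test names, and the "PASS"/"FAIL"/"PEND" values are illustrative assumptions, not CIME's API.

```python
from typing import Dict, List


def select_failed_tests(statuses: Dict[str, str]) -> List[str]:
    """Return, in sorted order, the names of tests recorded as "FAIL".

    statuses is a hypothetical mapping of test name -> overall status
    taken from a previous run of the test suite.
    """
    return sorted(name for name, status in statuses.items() if status == "FAIL")


if __name__ == "__main__":
    # Hypothetical results from an earlier run of a three-test suite.
    previous = {"test_a": "PASS", "test_b": "FAIL", "test_c": "PEND"}
    # Only test_b would be rerun; passing and pending tests are left alone.
    print(select_failed_tests(previous))
```

Whether pending or otherwise incomplete tests should also be rerun is a design choice this sketch leaves out.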

ekluzek · 2017-09-03T19:27:27Z

@gold2718 our experience with using the aux_clm test list is that these failures happen all the time. I have to rerun tests either on cheyenne or yellowstone (see #1793) whenever I run the test list.

billsacks · 2017-09-03T19:40:51Z

> It seems to me that this would only be useful (in speeding the process) if you always used an N > 1

Personally, I would always do this. I was just recording the earlier conversation, where it sounded like some people wouldn't want it.

> I would like a button to push when I look at failures and suspect a system problem. How about something like the --use-existing switch but called --retry which would create new version(s) of the failed test(s).

This would be a good incremental step towards what I want. Long-term, I still think it would be good to have an option to do this automatically. The problem is this: I often kick off CLM test suites overnight with the intention of making a CLM tag first thing in the morning. (The test suites currently take about 4-6 hours to turn around.) When there are system issues like this, it has these consequences:

- I spend more of my time checking on test results and running tests. While this doesn't take too long, it creates more context switches in my day, which causes productivity to suffer.
- It lengthens the time from start of testing until when a CLM tag can be made. This is especially problematic when there are a few CLM tags queued up.

That said, I'd be happy with your solution as an incremental step, since it would presumably be easier to implement, and would be a good stepping stone to this longer-term solution.

Add retry capability to create_test

Plus new regression test to exercise this capability.

Test suite: scripts_regression_tests --fast
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes #1865
User interface changes?: Yes, new --retry option to create_test
Update gh-pages html (Y/N)?: N
Code review: @jedwards4b @billsacks

billsacks added Low Priority tp: system tests ty: enhancement labels Sep 3, 2017

jgfouca self-assigned this Sep 5, 2017

jgfouca added the Assigned label Sep 8, 2017

jgfouca pushed a commit that referenced this issue Nov 7, 2017

Merge branch ACME-Climate/aarondonahue/cime/LC_20171024 (PR #1865)

7a4c551

Machine file change for Livermore Computing compatibility with new CIME [BFB]

jgfouca mentioned this issue Nov 7, 2017

Add retry capability to create_test #2034

Merged

jgfouca added in progress and removed Assigned labels Nov 7, 2017

jedwards4b closed this as completed in #2034 Nov 9, 2017

jedwards4b removed the in progress label Nov 9, 2017

jgfouca pushed a commit that referenced this issue Feb 23, 2018

Merge branch ACME-Climate/aarondonahue/cime/LC_20171024 (PR #1865)

2fb5433

Machine file change for Livermore Computing compatibility with new CIME [BFB]

jgfouca pushed a commit that referenced this issue Mar 13, 2018

Merge branch ACME-Climate/aarondonahue/cime/LC_20171024 (PR #1865)

7b49896

Machine file change for Livermore Computing compatibility with new CIME [BFB]
