To deal with system issues, automatically retry tests #1865

Closed
billsacks opened this issue Sep 3, 2017 · 4 comments

Comments

@billsacks
Member

This is a low-priority enhancement request; I just want to record some thoughts on it for the future.

When running a test suite, we often run into problems where a small number of tests failed due to system issues. It slows down the testing/tagging process to have to go back and manually rerun these tests. It would be nice if cime's testing facilities had an option to automatically retry tests N times before giving up.

@ekluzek raised this at a CSEG meeting on June 12, 2017. @jedwards4b felt this could cause too much confusion. But I was just thinking: a way around the confusion problem would be for the test system to create a totally new version of the test, with a new test id (say, by appending "try2" to the testid). I realize that may not be trivial to implement, though. @mnlevy1981 suggests having a flag to create_test (like --num-tries, which is my name, not his). This would default to 1, but you could set it to something like 3 if you want.
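To make the idea concrete, here is a minimal sketch of the retry loop, not CIME code. The names `retry_test_id`, `run_with_retries`, and `run_test`, and the `.tryN` suffix format, are all hypothetical; the only things taken from the discussion above are the default of 1 try and the idea of giving each retry a new test id so that earlier failures stay distinguishable.

```python
from typing import Callable, List, Tuple


def retry_test_id(test_id: str, attempt: int) -> str:
    """Return the test id for a given attempt (hypothetical naming scheme)."""
    # The first attempt keeps the original id; later attempts get a "tryN" suffix.
    return test_id if attempt == 1 else f"{test_id}.try{attempt}"


def run_with_retries(
    test_id: str,
    run_test: Callable[[str], bool],
    num_tries: int = 1,
) -> Tuple[bool, List[str]]:
    """Run a test up to num_tries times, each attempt under its own test id.

    run_test is a stand-in for whatever actually runs a test; it is assumed
    to return True on success and False on failure.
    """
    if num_tries < 1:
        raise ValueError("num_tries must be at least 1")
    attempted: List[str] = []
    for attempt in range(1, num_tries + 1):
        current_id = retry_test_id(test_id, attempt)
        attempted.append(current_id)
        if run_test(current_id):
            # Stop at the first success; earlier failed ids remain in the list.
            return True, attempted
    return False, attempted
```

With `num_tries=1` this behaves like a single run, so the default would not change existing behavior.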

@gold2718

gold2718 commented Sep 3, 2017

It seems to me that this would only be useful (in speeding the process) if you always used an N > 1. My experience is that these failures happen a minority of the time. I would like a button to push when I look at failures and suspect a system problem. How about something like the --use-existing switch, but called --retry, which would create new version(s) of the failed test(s)?

@jedwards4b
Contributor

A switch to create_test that would refer to an existing test set and only run new tests that had a failed status in the existing test suite seems like a worthy option.

@ekluzek
Contributor

ekluzek commented Sep 3, 2017

@gold2718 our experience with using the aux_clm test list is that these failures happen all the time. I have to rerun tests either on cheyenne or yellowstone (see #1793) whenever I run the test list.

@billsacks
Member Author

> It seems to me that this would only be useful (in speeding the process) if you always used an N > 1

Personally, I would always do this. I was just recording the earlier conversation, where it sounded like some people wouldn't want it.

> I would like a button to push when I look at failures and suspect a system problem. How about something like the --use-existing switch but called --retry which would create new version(s) of the failed test(s).

This would be a good incremental step towards what I want. Long-term, I still think it would be good to have an option to do this automatically. The problem is this: I often kick off CLM test suites overnight with the intention of making a CLM tag first thing in the morning. (The test suites currently take about 4-6 hours to turn around.) When there are system issues like this, it has these consequences:

  1. I spend more of my time checking on test results and running tests. While this doesn't take too long, it creates more context switches in my day, which causes productivity to suffer.

  2. It lengthens the time from start of testing until when a CLM tag can be made. This is especially problematic when there are a few CLM tags queued up.

That said, I'd be happy with your solution as an incremental step, since it would presumably be easier to implement, and would be a good stepping stone to this longer-term solution.

@jgfouca jgfouca self-assigned this Sep 5, 2017
jgfouca pushed a commit that referenced this issue Nov 7, 2017
Machine file change for Livermore Computing compatibility with new CIME

[BFB]
jedwards4b added a commit that referenced this issue Nov 9, 2017
Add retry capability to create_test
Plus new regression test to exercise this capability.

Test suite: scripts_regression_tests --fast
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #1865

User interface changes?: Yes, new --retry option to create_test

Update gh-pages html (Y/N)?: N

Code review: @jedwards4b @billsacks
jgfouca pushed a commit that referenced this issue Feb 23, 2018
Machine file change for Livermore Computing compatibility with new CIME

[BFB]
jgfouca pushed a commit that referenced this issue Mar 13, 2018
Machine file change for Livermore Computing compatibility with new CIME

[BFB]
5 participants