[stress] Allow for re-running only failed stress tests #5361
Comments
Basically the script can check something like this:
@ckairen, any opinion on this (or any of it, really)? It might not matter unless there's some traceability?
I would imagine users would want to identify and trace the logs and pod failures. Maybe the revision increment stays the same but with a retry suffix: retry1, retry2, etc.?
Oooh, that would be nice. That way subsequent retries keep whittling down until it's all running.
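To make the retry1, retry2 idea concrete, here is a minimal sketch of how a script might derive the next release name. The function name `retry_release_name` and the `-retryN` suffix format are assumptions for illustration, not something the existing tooling defines. The only outside fact relied on is Helm's 53-character limit on release names.

```python
import re

# Helm rejects release names longer than 53 characters.
HELM_MAX_RELEASE_NAME = 53


def retry_release_name(release: str, max_len: int = HELM_MAX_RELEASE_NAME) -> str:
    """Return the next retry name for a release (hypothetical naming scheme).

    "stress" becomes "stress-retry1", and "stress-retry1" becomes "stress-retry2".
    The base name is truncated if needed so the result fits within max_len.
    """
    match = re.fullmatch(r"(.+)-retry(\d+)", release)
    if match:
        base, attempt = match.group(1), int(match.group(2)) + 1
    else:
        base, attempt = release, 1
    suffix = f"-retry{attempt}"
    # Trim the base, then drop any trailing dash left behind by the cut.
    return base[: max_len - len(suffix)].rstrip("-") + suffix
```

Because each retry name is derived from the previous one, the logs and pods of every attempt stay traceable to a distinct release in the same namespace.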
Most stress testers are now running more than seven or so tests.
If a couple of these tests fail to run (for instance, the deployment fails), you end up with a partially completed deployment. Today, you can run the entire job again, but this will also destroy the existing pods and start a brand new set for each stress test.
What would be nice is if there were something simple I could run that relaunches only the pods that have failed, without forcing all the jobs to get redeployed, so that as I fix things I can get my stress tests to 100%. This is particularly valuable if a pod fails when the others have been running for a couple of hours or days and I don't want to lose that progress.
@benbp and I discussed a few ideas around this. The one that we want to go with is to just craft a separate helm deployment for the "failed" run, but still run it in the same namespace. When we create the new helm deployment we should filter down to only the tests that have failed (either in init, or in the pod itself). This is similar to what a lot of test runners offer where you can "rerun only failed tests".
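As a rough illustration of the "filter down to only the tests that have failed" step, the sketch below picks failed tests out of the JSON that `kubectl get pods -n <namespace> -o json` returns. It assumes each pod carries a label naming its test; the label key `testName` and the function names are hypothetical. How the resulting list is fed into the new helm deployment is left out, since that depends on the chart.

```python
from typing import Any

# Waiting reasons that indicate a container is not going to start on its own.
STUCK_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def _container_failed(status: dict[str, Any]) -> bool:
    """True if a container exited non-zero or is stuck in a failing wait state."""
    state = status.get("state", {})
    terminated = state.get("terminated")
    if terminated is not None:
        return terminated.get("exitCode", 0) != 0
    waiting = state.get("waiting")
    if waiting is not None:
        return waiting.get("reason") in STUCK_REASONS
    return False


def failed_tests(pod_list: dict[str, Any], label: str = "testName") -> list[str]:
    """Return sorted names of tests whose pods failed in init or in the pod itself.

    pod_list is the parsed output of `kubectl get pods -o json`.
    label is the (assumed) pod label that holds the test name.
    """
    failed = set()
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("labels", {}).get(label)
        if name is None:
            continue  # Not a stress test pod we can attribute to a test.
        status = pod.get("status", {})
        containers = (
            status.get("initContainerStatuses", [])
            + status.get("containerStatuses", [])
        )
        if status.get("phase") == "Failed" or any(
            _container_failed(c) for c in containers
        ):
            failed.add(name)
    return sorted(failed)
```

Checking both `initContainerStatuses` and `containerStatuses` covers the two failure cases mentioned above, and pods that are still running are never selected, so their progress is left alone.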
The ability to change the release name could be useful for other things as well but for now "rerun with failed" is the first use case.