
[stress] Allow for re-running only failed stress tests #5361

Closed
richardpark-msft opened this issue Feb 7, 2023 · 4 comments · Fixed by #5726
Labels
Central-EngSys This issue is owned by the Engineering System team. Stress This issue is related to stress testing, part of our reliability pillar.

Comments

@richardpark-msft
Member

Most stress testers are now running more than seven or so tests.

If a couple of these tests fail to run (for instance, if deployment fails), you end up with a partially completed deployment. Today you can run the entire job again, but that destroys and starts a brand-new set of pods for every stress test.

It would be nice if there were something simple I could run to relaunch the failed pods without forcing all the jobs to be redeployed, so that, as I fix things, I can get my stress tests to 100%. This is particularly valuable when a pod fails after the others have been running for hours or days and I don't want to lose that progress.

@benbp and I discussed a few ideas around this. The one we want to go with is to craft a separate helm deployment for the "failed" run, but still run it in the same namespace. When we create the new helm deployment, we should filter down to only the tests that have failed (either in init or in the pod itself). This is similar to what many test runners offer with a "rerun only failed tests" option.

The ability to change the release name could be useful for other things as well, but for now "rerun only failed" is the first use case.

@ghost ghost added the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Feb 7, 2023
@benbp
Member

benbp commented Feb 7, 2023

Basically the script can check something like this:

  • Is the RerunFailedJobs flag set?
  • What will the helm release name be (sans revision)?
  • Which jobs failed in the current release?
  • Generate a matrix DisplayNameFilter for those jobs (job1|job2|job3)
  • Rev the release name to -retry or something, keeping the revision increment the same?
  • Re-deploy with the updated command
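
The checklist above could be sketched roughly like this. This is a hypothetical Python illustration, not the actual stress tooling: the job-status shape and both function names are invented for clarity, and in practice the statuses would come from helm/kubectl.

```python
# Hypothetical sketch of the "rerun only failed" filter-building step;
# all names here are illustrative, not part of the real stress scripts.

def failed_job_names(jobs):
    """Return the names of jobs whose last status is 'Failed'."""
    return [name for name, status in jobs.items() if status == "Failed"]

def build_display_name_filter(failed):
    """Build a matrix DisplayNameFilter regex like 'job1|job2|job3'."""
    return "|".join(failed)

# Example: only job1 and job3 failed, so only they get redeployed.
jobs = {"job1": "Failed", "job2": "Running", "job3": "Failed"}
print(build_display_name_filter(failed_job_names(jobs)))  # job1|job3
```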

@benbp benbp added Central-EngSys This issue is owned by the Engineering System team. Stress This issue is related to stress testing, part of our reliability pillar. and removed needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. labels Feb 7, 2023
@benbp benbp moved this from 🤔Triage to 📋Backlog in Azure SDK EngSys 🚢🎉 Feb 7, 2023
@richardpark-msft
Member Author

Rev the release name to -retry or something and keep the revision increment the same?

@ckairen , any opinion on this? (or any of it really). It might not matter unless there's some traceability?

@ckairen
Member

ckairen commented Feb 9, 2023

I would imagine users want to be able to identify and trace the logs and pod failures? Maybe keep the revision increment the same but append retry1, retry2, etc.?

@richardpark-msft
Member Author

Oooh, that would be nice. That way subsequent retries keep whittling down until it's all running.
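
As a hypothetical sketch of that naming scheme (the helper and release names here are invented, not part of the actual tooling), each retry round could suffix the base release with an incrementing retryN and redeploy only the previous round's failures:

```python
# Illustrative sketch of the retry-naming idea floated above: keep the
# helm revision the same, but make each retry round traceable by name.

def next_retry_name(current_release, base_release):
    """'stress' -> 'stress-retry1' -> 'stress-retry2' -> ..."""
    prefix = f"{base_release}-retry"
    if current_release == base_release:
        return f"{prefix}1"
    return f"{prefix}{int(current_release[len(prefix):]) + 1}"

# Simulated rounds: each retry whittles down the remaining failures.
rounds = [
    {"job1": "Failed", "job2": "Succeeded", "job3": "Failed"},  # initial run
    {"job1": "Succeeded", "job3": "Failed"},                    # retry1
    {"job3": "Succeeded"},                                      # retry2
]
release = "stress-eventhubs"
for statuses in rounds:
    failed = [j for j, s in statuses.items() if s == "Failed"]
    if not failed:
        break  # everything is running; no more retries needed
    release = next_retry_name(release, "stress-eventhubs")
    print(release, "->", "|".join(failed))
# stress-eventhubs-retry1 -> job1|job3
# stress-eventhubs-retry2 -> job3
```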

@ckairen ckairen moved this from 📋Backlog to 🐝 Dev in Azure SDK EngSys 🚢🎉 Feb 23, 2023
@ghost ghost closed this as completed in #5726 Mar 28, 2023
ghost pushed a commit that referenced this issue Mar 28, 2023
@github-project-automation github-project-automation bot moved this from 🐝 Dev to 🎊Closed in Azure SDK EngSys 🚢🎉 Mar 28, 2023
This issue was closed.