
[stress] Allow for re-running only failed stress tests #5361

Closed
richardpark-msft opened this issue Feb 7, 2023 · 4 comments · Fixed by #5726
Labels
Central-EngSys This issue is owned by the Engineering System team. Stress This issue is related to stress testing, part of our reliability pillar.

Comments

@richardpark-msft
Member

Most stress testers are now running more than seven or so tests.

If a couple of these tests fail to run (for instance, if deployment fails), you end up with a partially completed deployment. Today you can run the entire job again, but that destroys and starts a brand-new set of pods for every stress test.

It would be nice if there were something simple I could run to relaunch the failed pods without forcing all the jobs to be redeployed, so that, as I fix things, I can get my stress tests to 100%. This is particularly valuable when a pod fails after the others have been running for hours or days and I don't want to lose that progress.

@benbp and I discussed a few ideas around this. The one we want to go with is to craft a separate helm deployment for the "failed" run, but still run it in the same namespace. When we create the new helm deployment, we should filter down to only the tests that have failed (either in init or in the pod itself). This is similar to what many test runners offer with a "rerun only failed tests" option.

The ability to change the release name could be useful for other things as well, but for now "rerun only failed" is the first use case.

@ghost ghost added the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Feb 7, 2023
@benbp
Member

benbp commented Feb 7, 2023

Basically the script can check something like this:

  • Is the RerunFailedJobs flag set?
  • What will the helm release name be (sans revision)?
  • Which jobs failed in the current release?
  • Generate a matrix DisplayNameFilter for those jobs (job1|job2|job3)
  • Rev the release name to -retry or something, keeping the revision increment the same?
  • Re-deploy with the updated command
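
The checklist above could be sketched roughly like this. This is a hypothetical Python illustration, not the actual stress tooling: the job-status shape and both function names are invented for clarity, and in practice the statuses would come from helm/kubectl.

```python
# Hypothetical sketch of the "rerun only failed" filter-building step;
# all names here are illustrative, not part of the real stress scripts.

def failed_job_names(jobs):
    """Return the names of jobs whose last status is 'Failed'."""
    return [name for name, status in jobs.items() if status == "Failed"]

def build_display_name_filter(failed):
    """Build a matrix DisplayNameFilter regex like 'job1|job2|job3'."""
    return "|".join(failed)

# Example: only job1 and job3 failed, so only they get redeployed.
jobs = {"job1": "Failed", "job2": "Running", "job3": "Failed"}
print(build_display_name_filter(failed_job_names(jobs)))  # job1|job3
```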

@benbp benbp added Central-EngSys This issue is owned by the Engineering System team. Stress This issue is related to stress testing, part of our reliability pillar. and removed needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. labels Feb 7, 2023
@benbp benbp moved this from 🤔Triage to 📋Backlog in Azure SDK EngSys 🚢🎉 Feb 7, 2023
@richardpark-msft
Member Author

Rev the release name to -retry or something and keep the revision increment the same?

@ckairen , any opinion on this? (or any of it really). It might not matter unless there's some traceability?

@ckairen
Member

ckairen commented Feb 9, 2023

I would imagine users want to be able to identify and trace the logs and pod failures? Maybe keep the revision increment the same but append retry1, retry2, etc.?

@richardpark-msft
Member Author

Oooh, that would be nice. That way subsequent retries keep whittling down until it's all running.
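
As a hypothetical sketch of that naming scheme (the helper and release names here are invented, not part of the actual tooling), each retry round could suffix the base release with an incrementing retryN and redeploy only the previous round's failures:

```python
# Illustrative sketch of the retry-naming idea floated above: keep the
# helm revision the same, but make each retry round traceable by name.

def next_retry_name(current_release, base_release):
    """'stress' -> 'stress-retry1' -> 'stress-retry2' -> ..."""
    prefix = f"{base_release}-retry"
    if current_release == base_release:
        return f"{prefix}1"
    return f"{prefix}{int(current_release[len(prefix):]) + 1}"

# Simulated rounds: each retry whittles down the remaining failures.
rounds = [
    {"job1": "Failed", "job2": "Succeeded", "job3": "Failed"},  # initial run
    {"job1": "Succeeded", "job3": "Failed"},                    # retry1
    {"job3": "Succeeded"},                                      # retry2
]
release = "stress-eventhubs"
for statuses in rounds:
    failed = [j for j, s in statuses.items() if s == "Failed"]
    if not failed:
        break  # everything is running; no more retries needed
    release = next_retry_name(release, "stress-eventhubs")
    print(release, "->", "|".join(failed))
# stress-eventhubs-retry1 -> job1|job3
# stress-eventhubs-retry2 -> job3
```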

@ckairen ckairen moved this from 📋Backlog to 🐝 Dev in Azure SDK EngSys 🚢🎉 Feb 23, 2023
@ghost ghost closed this as completed in #5726 Mar 28, 2023
ghost pushed a commit that referenced this issue Mar 28, 2023
@github-project-automation github-project-automation bot moved this from 🐝 Dev to 🎊Closed in Azure SDK EngSys 🚢🎉 Mar 28, 2023
This issue was closed.