[FEATURE]: Different exit codes for different failure modes in Garden commands #3297

Open
anna-yn opened this issue Oct 11, 2022 · 5 comments · Fixed by #3309 or #3388
Comments


anna-yn commented Oct 11, 2022

Feature Request

Background / Motivation

We're using Garden for CI and inner loop development, so we utilize the garden test and the garden deploy commands pretty heavily. There are a few reasons that these commands might fail:

  • The tests we wanted to run failed
  • The pods took too long to deploy
  • The Garden namespace couldn't be created
  • The Helm chart we use couldn't be pulled
  • Other Kubernetes errors unrelated to our tests: image pull errors, context deadline exceeded, etc.

We would like to be able to automatically retry some of these failures in CI. For example, if the failure is unrelated to the test run itself, say the Helm chart couldn't be pulled or the Garden namespace couldn't be created, we'd like to retry twice, because we've noticed those are usually flakes.

An easy way to achieve this would be to set retry rules based on the exit code of the Garden command. Right now the commands produce exit code 1 no matter what the failure is, so all of those failure modes get lumped together. It would be a huge help for us if, say, Kubernetes-related errors exited with 103, timeouts with 104, and user errors (such as failed tests) with 1.
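
For illustration, here is a minimal sketch of the kind of CI retry wrapper we have in mind, assuming the hypothetical exit codes above (103 for Kubernetes-related errors, 104 for timeouts, 1 for test failures). The codes and the wrapper are just an example of how we'd consume them, not Garden's current behaviour:

```ts
// retry-garden.ts — hypothetical CI wrapper; the exit codes are illustrative only.
import { spawnSync } from "node:child_process";

const RETRYABLE_EXIT_CODES = new Set([103, 104]); // infra flakes and timeouts
const MAX_ATTEMPTS = 3;

function runWithRetries(args: string[]): number {
  let code = 1;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const result = spawnSync("garden", args, { stdio: "inherit" });
    code = result.status ?? 1;
    if (code === 0) return 0;
    if (!RETRYABLE_EXIT_CODES.has(code)) {
      // e.g. exit code 1 = genuine test failure: fail the pipeline immediately
      return code;
    }
    console.error(`garden exited with ${code} (attempt ${attempt}/${MAX_ATTEMPTS}), retrying...`);
  }
  return code;
}

process.exit(runWithRetries(process.argv.slice(2)));
```

In the CI job we'd then invoke something like node retry-garden.js test instead of calling garden test directly.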

What should the user be able to do?

The user should be able to determine the type of error from the exit code of the Garden command.

Why do they want to do this? What problem does it solve?

We'd like to set different auto-retry rules for different failure modes. If the Garden command produced different exit codes, auto-retry would be very easy for us to set up. We don't want to simply retry everything three times, because the tests take around 20 minutes to run and we don't want flaky tests to make it into the main branch; but if it's a Kubernetes flake, then we'd like to retry freely so that our engineers don't see those errors if possible.

Suggested Implementation(s)

When a Garden command emits an error, check what kind of error it is (whether the command it ran failed, or some Kubernetes failure occurred) and produce a different exit code accordingly. To start, we'd be very happy if the commands just produced two different exit codes: command failure vs. everything else. More granularity would help, but we could get started with just two.
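
Purely as a sketch of that mapping, assuming hypothetical error classes — only DeploymentError is an actual Garden error type mentioned in this thread; the other names and code values are made up for illustration:

```ts
// Hypothetical sketch — not based on Garden's actual internals.
class GardenBaseError extends Error {}
class TestFailedError extends GardenBaseError {}     // the user's tests failed
class DeploymentError extends GardenBaseError {}     // e.g. deploy timed out
class KubernetesApiError extends GardenBaseError {}  // namespace/chart/image-pull issues

// Map the caught error to an exit code at the CLI entry point.
function exitCodeForError(err: Error): number {
  if (err instanceof TestFailedError) return 1;       // genuine command/test failure
  if (err instanceof DeploymentError) return 104;     // timeout-style failure
  if (err instanceof KubernetesApiError) return 103;  // infrastructure flake
  return 1; // everything else keeps today's default
}
```

For the two-bucket version described above, everything other than a test/command failure could simply return a single non-1 code.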

How important is this feature for you/your team?

🌵 Not having this feature makes using Garden painful

vvagaytsev added the devex and triage/accepted labels on Oct 12, 2022
vvagaytsev (Collaborator) commented:

Thanks for reporting this @anna-yn! We'll take a look soon.

vvagaytsev (Collaborator) commented:

This has been partially fixed in #3309. OOMs are now reported properly when running tests with artifacts.

Also, non-zero exit codes are now logged in the service sections. If an exit code is not available, Garden still returns the default one, which is 1.

vvagaytsev (Collaborator) commented:

Partially fixed in #3388. Reopening again, because it's still necessary to test and verify the error cases from the list above.

@anna-yn could you check and verify the cases from the list in the description, please?

  • The tests we wanted to run failed

This should be fixed now

  • The pods took too long to deploy

This fails with Garden's internal DeploymentError, and no exit code is available. The console error message looked informative. I got a message like this when I forced an overly long deployment of demo-project/backend by adding an artificial delay to the backend code:

Failed deploying service 'backend' (from module 'backend'). Here is the output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Timed out waiting for backend to deploy after 300 seconds
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1 deploy action(s) failed!

Please let us know if there are any missing details.

It would be very helpful if you could share example error messages and the expected output for the cases below:

  • The Garden namespace couldn't be created
  • The Helm chart we use couldn't be pulled
  • Other Kubernetes errors unrelated to our tests: image pull errors, context deadline exceeded, etc.

vvagaytsev reopened this on Dec 7, 2022

stale bot commented May 21, 2023

This issue has been automatically marked as stale because it hasn't had any activity in 90 days. It will be closed in 14 days if no further activity occurs (e.g. changing labels, comments, commits, etc.). Please feel free to tag a maintainer and ask them to remove the label if you think it doesn't apply. Thank you for submitting this issue and helping make Garden a better product!

stale bot added the stale label on May 21, 2023
Orzelius removed the stale label on May 25, 2023
vvagaytsev removed their assignment on Jun 2, 2023
christopherjameshoward commented:

We have a similar use case.

We are using garden run test to execute test jobs in a GitLab pipeline.

If a job fails because of a genuine test failure, then we just want to fail the pipeline as usual.

However, if a job fails because of a Garden deployment error (which could be caused by any sort of Kubernetes problem), then we would like to retry the pipeline job, because there is a good chance that it will pass on retry.

If garden run test returned different exit codes depending on the failure reason, this would be a really easy thing for us to implement in our pipelines.
