[FEATURE]: Different exit codes for different failure modes in Garden commands #3297

Open
anna-yn opened this issue Oct 11, 2022 · 5 comments · Fixed by #3309 or #3388
Comments


anna-yn commented Oct 11, 2022

Feature Request

Background / Motivation

We're using Garden for CI and inner loop development, so we utilize the garden test and the garden deploy commands pretty heavily. There are a few reasons that these commands might fail:

  • The tests we wanted to run failed
  • The pods took too long to deploy
  • The Garden namespace couldn't be created
  • The Helm chart we use couldn't be pulled
  • Other Kubernetes errors unrelated to our tests: image pull errors, context deadline exceeded, etc.

We would like to be able to automatically retry some of these failures in CI. For example, if the failure is unrelated to the test run itself, say the Helm chart couldn't be pulled or the Garden namespace couldn't be created, we'd like to retry twice, because we've noticed those are usually flakes.

An easy way to achieve this would be to set retry rules based on the exit code of the Garden command. Right now the commands produce exit code 1 no matter what the failure is, so all of those failure modes get lumped together. It would be a huge help for us if, say, Kubernetes-related errors exited with 103, timeouts with 104, and user errors (such as failed tests) with 1.
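
For illustration, here is a minimal sketch of the kind of CI retry wrapper we have in mind, assuming the hypothetical exit codes above (103 for Kubernetes-related errors, 104 for timeouts, 1 for test failures). The codes and the wrapper are just an example of how we'd consume them, not Garden's current behaviour:

```ts
// retry-garden.ts — hypothetical CI wrapper; the exit codes are illustrative only.
import { spawnSync } from "node:child_process";

const RETRYABLE_EXIT_CODES = new Set([103, 104]); // infra flakes and timeouts
const MAX_ATTEMPTS = 3;

function runWithRetries(args: string[]): number {
  let code = 1;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const result = spawnSync("garden", args, { stdio: "inherit" });
    code = result.status ?? 1;
    if (code === 0) return 0;
    if (!RETRYABLE_EXIT_CODES.has(code)) {
      // e.g. exit code 1 = genuine test failure: fail the pipeline immediately
      return code;
    }
    console.error(`garden exited with ${code} (attempt ${attempt}/${MAX_ATTEMPTS}), retrying...`);
  }
  return code;
}

process.exit(runWithRetries(process.argv.slice(2)));
```

In the CI job we'd then invoke something like node retry-garden.js test instead of calling garden test directly.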

What should the user be able to do?

The user should be able to determine the type of error from the exit code of the Garden command.

Why do they want to do this? What problem does it solve?

We'd like to set different auto-retry rules for different failure modes. If the Garden command produced different exit codes, auto-retry would be very easy for us to set up. We don't want to simply retry everything three times, because the tests take around 20 minutes to run and we don't want flaky tests to make it into the main branch; but if it's a Kubernetes flake, then we'd like to retry freely so that our engineers don't see those errors if possible.

Suggested Implementation(s)

When a Garden command emits an error, check what kind of error it is (whether the command it ran failed, or some Kubernetes failure occurred) and produce a different exit code accordingly. To start, we'd be very happy if the commands just produced two different exit codes: command failure vs. everything else. More granularity would help, but we could get started with just two.
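
Purely as a sketch of that mapping, assuming hypothetical error classes — only DeploymentError is an actual Garden error type mentioned in this thread; the other names and code values are made up for illustration:

```ts
// Hypothetical sketch — not based on Garden's actual internals.
class GardenBaseError extends Error {}
class TestFailedError extends GardenBaseError {}     // the user's tests failed
class DeploymentError extends GardenBaseError {}     // e.g. deploy timed out
class KubernetesApiError extends GardenBaseError {}  // namespace/chart/image-pull issues

// Map the caught error to an exit code at the CLI entry point.
function exitCodeForError(err: Error): number {
  if (err instanceof TestFailedError) return 1;       // genuine command/test failure
  if (err instanceof DeploymentError) return 104;     // timeout-style failure
  if (err instanceof KubernetesApiError) return 103;  // infrastructure flake
  return 1; // everything else keeps today's default
}
```

For the two-bucket version described above, everything other than a test/command failure could simply return a single non-1 code.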

How important is this feature for you/your team?

🌵 Not having this feature makes using Garden painful

vvagaytsev added the devex and triage/accepted labels on Oct 12, 2022
vvagaytsev (Collaborator) commented:

Thanks for reporting this @anna-yn! We'll take a look soon.

vvagaytsev (Collaborator) commented:

This has been partially fixed in #3309. OOMs are now reported properly when running tests with artifacts.

Also, non-zero exit codes are now logged in the service sections. If an exit code is not available, Garden still returns the default one, which is 1.

vvagaytsev (Collaborator) commented:

Partially fixed in #3388. Reopening again, because it's still necessary to test and verify the error cases from the list above.

@anna-yn could you check and verify the cases from the list in the description, please?

  • The tests we wanted to run failed

This should be fixed now

  • The pods took too long to deploy

This fails with Garden's internal DeploymentError, and no exit code is available. The console error message looked informative. I got a message like this when I forced an overly long deployment of demo-project/backend by adding an artificial delay to the backend code:

Failed deploying service 'backend' (from module 'backend'). Here is the output:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Timed out waiting for backend to deploy after 300 seconds
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1 deploy action(s) failed!

Please let us know if there are any missing details.

It would be very helpful if you could share example error messages and the expected output for the cases below:

  • The Garden namespace couldn't be created
  • The Helm chart we use couldn't be pulled
  • Other Kubernetes errors unrelated to our tests: image pull errors, context deadline exceeded, etc.

vvagaytsev reopened this on Dec 7, 2022

stale bot commented May 21, 2023

This issue has been automatically marked as stale because it hasn't had any activity in 90 days. It will be closed in 14 days if no further activity occurs (e.g. changing labels, comments, commits, etc.). Please feel free to tag a maintainer and ask them to remove the label if you think it doesn't apply. Thank you for submitting this issue and helping make Garden a better product!

stale bot added the stale label on May 21, 2023
Orzelius removed the stale label on May 25, 2023
vvagaytsev removed their assignment on Jun 2, 2023
christopherjameshoward commented:

We have a similar use case.

We are using garden run test to execute test jobs in a GitLab pipeline.

If a job fails because of a genuine test failure, then we just want to fail the pipeline as usual.

However, if a job fails because of a Garden deployment error (which could be caused by any sort of Kubernetes problem), then we would like to retry the pipeline job, because there is a good chance that it will pass on retry.

If garden run test returned different exit codes depending on the failure reason, this would be a really easy thing for us to implement in our pipelines.
