[FEATURE]: Enhancement of Helm deployment status in Garden CLI #6563

Closed · bxsx opened this issue Oct 21, 2024 · 8 comments · Fixed by #6626
bxsx (Contributor) commented Oct 21, 2024

Feature Request

Background / Motivation

Misconfigurations or missing Kubernetes resources (such as secrets) can cause Garden to time out with unclear error messages like context deadline exceeded. These messages do not pinpoint the actual issue or provide useful guidance for resolution.

What should the user be able to do?

The user should see the specific issue that caused the timeout, such as Error: secret "my-service-secret" not found.

Why do they want to do this? What problem does it solve?

This feature removes the need to retrieve useful information from the Kubernetes API using external tools (such as kubectl, which Garden already uses). It consolidates the relevant information into a single source of truth (the Garden CLI output), which should reduce the time required to resolve issues. Additionally, some clusters, such as CI clusters, may not be accessible to end-users. Without this information available in the CLI output, users may be blocked from taking further action.

Suggested Implementation(s)

The missing information can be retrieved from K8s events, such as:

$ kubectl get events -n "$NAMESPACE" --field-selector involvedObject.kind=Pod
LAST SEEN   TYPE      REASON      OBJECT                                MESSAGE
...
7s          Warning   Failed      pod/my-service-6c12345666-8htrd   Error: secret "my-service-secret" not found
...

# Show failures for the given service
$ kubectl get events -n "$NAMESPACE" --field-selector involvedObject.kind=Pod | grep Failed | grep -E "$(kubectl get pods -n "$NAMESPACE" -l "app.kubernetes.io/name=$SERVICE_NAME" -o jsonpath='{.items[*].metadata.name}')"
14s         Warning   Failed      pod/my-service-6c12345666-8htrd   Error: secret "my-service-secret" not found
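For illustration, here is a rough sketch of how a tool could collect this information programmatically by shelling out to kubectl, much like the commands above. This is only my sketch, not Garden's internals; namespace and serviceName are placeholder parameters.

// Sketch only: gather Warning events for the Pods behind a service,
// roughly what the kubectl commands above do.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function podWarningEvents(namespace: string, serviceName: string): Promise<string[]> {
  // Resolve the Pod names behind the service's label selector.
  const { stdout: podsOut } = await run("kubectl", [
    "get", "pods", "-n", namespace,
    "-l", `app.kubernetes.io/name=${serviceName}`,
    "-o", "jsonpath={.items[*].metadata.name}",
  ]);
  const podNames = new Set(podsOut.trim().split(/\s+/).filter(Boolean));

  // Fetch Warning events for Pods in the namespace and keep only ours.
  const { stdout: eventsOut } = await run("kubectl", [
    "get", "events", "-n", namespace,
    "--field-selector", "involvedObject.kind=Pod,type=Warning",
    "-o", "json",
  ]);
  const events = JSON.parse(eventsOut).items as Array<{
    involvedObject: { name: string };
    reason: string;
    message: string;
  }>;
  return events
    .filter((e) => podNames.has(e.involvedObject.name))
    .map((e) => `${e.involvedObject.name}: ${e.reason} - ${e.message}`);
}

With the example above, the CLI could then print a line like my-service-6c12345666-8htrd: Failed - Error: secret "my-service-secret" not found alongside the timeout error.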

How important is this feature for you/your team?

🌵 Not having this feature makes using Garden painful

eysi09 self-assigned this Oct 22, 2024
eysi09 (Collaborator) commented Oct 22, 2024

Thanks for the feature request!

We are in fact already working on improvements on this front and would love your feedback and suggestions once the first iteration is ready.

EDIT: Am I right in assuming you're having these issues with Helm charts?

In general we print the events and logs for the kubernetes Deploy action type, so I'm just trying to pinpoint the issue.

For the kubernetes Deploy action the output looks something like:

✖ deploy.api                → Failed (took 7.5 sec)
✖ deploy.api                → Failed processing Deploy type=kubernetes name=api (took 10.46 sec). This is what happened:

────────────────────────────────────────────────────────
Error deploying deploy.api: BackOff - Back-off restarting failed container init-b in pod api-6d6dc558c-zrzdb_vote-demo-eysi(cca2e3ce-9c9a-440f-9824-c4ab9e921316)

━━━ Latest events from Deployment api ━━━
Deployment api: ScalingReplicaSet - Scaled up replica set api-6d6dc558c to 1
Pod api-6d6dc558c-zrzdb: Scheduled - Successfully assigned vote-demo-eysi/api-6d6dc558c-zrzdb to gke-demo-3-default-node-pool-0ca0f1f8-i5ev
Pod api-6d6dc558c-zrzdb: Pulled - Container image "busybox:1.31.1" already present on machine
Pod api-6d6dc558c-zrzdb: Created - Created container init-a
Pod api-6d6dc558c-zrzdb: Started - Started container init-a
Pod api-6d6dc558c-zrzdb: Pulled - Container image "busybox:1.31.1" already present on machine
Pod api-6d6dc558c-zrzdb: Created - Created container init-b
Pod api-6d6dc558c-zrzdb: Started - Started container init-b
Pod api-6d6dc558c-zrzdb: BackOff - Back-off restarting failed container init-b in pod api-6d6dc558c-zrzdb_vote-demo-eysi(cca2e3ce-9c9a-440f-9824-c4ab9e921316)


━━━ Latest logs from failed containers in each Pod in Deployment api ━━━
api-6d6dc558c-zrzdb/init-b: bin/sh: eho: not found

💡 Garden hint: For complete Pod logs for this Deployment, run the following command:
kubectl -n vote-demo-eysi --context=gke_garden-demo-324810_europe-west3_demo-3 logs deployment/api --all-containers
────────────────────────────────────────────────────────
1 requested deploy action(s) failed!

bxsx (Contributor, Author) commented Oct 23, 2024

Hi @eysi09! Thanks for your quick reply!

Sorry for the confusion. I'll try to clarify everything.

Am I right in assuming you're having these issues with Helm charts?

I have this issue with the Garden helm module when running different Garden commands, like garden deploy or garden test (if the test depends on the helm module).

Some outputs for garden test:

ℹ test.my-service-image-integration → Testing my-service-image-integration (type container) at version v-fa17161ccc...
ℹ test.my-service-image-integration → [kubernetes] Waiting for resources to be ready...
✖ test.my-service-image-integration → Failed (took 626 sec)
✖ test.my-service-image-integration → Failed processing Test type=container name=my-service-image-integration (from module my-service-image) (took 655.72 sec). This is what happened:

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Timed out waiting for resources to deploy after 600 seconds
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 requested test action(s) failed!
12:39:22 ℹ deploy.my-service     → Deploying my-service (type helm) at version v-faaaa61ccc...
13:09:28 ✖ deploy.my-service     → Failed (took 1807 sec)
13:09:28 ℹ test.my-service-image-integration → Aborting because upstream dependency failed.
13:09:29 ✖ deploy.my-service     → Failed processing Deploy type=helm name=my-service (from module my-service) (took 1816.44 sec). This is what happened:

────────────────────────────────────────────────────────────────────────────────────────────────────
Command "/home/runner/.garden/tools/helm/860b3ebd51d875d3/linux-amd64/helm --kube-context my-context --namespace my-service-namespace install my-service /home/runner/work/my-service/my-service/.garden/build/my-service --wait --namespace my-service-namespace --timeout 1800s --values /tmp/aaaae88d60f64f59b9ebc43b0bc0aaa0" failed with code 1:

Error: INSTALLATION FAILED: context deadline exceeded
────────────────────────────────────────────────────────────────────────────────────────────────────
13:09:29 1 requested test action(s) aborted!

Output from garden deploy:

ℹ deploy.my-service     → Deploying my-service (type helm) at version v-3cb9e606da...
✖ deploy.my-service     → Failed (took 141 sec)
✖ deploy.my-service     → Failed processing Deploy type=helm name=my-service (from module my-service) (took 171.35 sec). This is what happened:

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Command "/Users/bxsx/.garden/tools/helm/c842c811165de5e4/darwin-arm64/helm --kube-context my-context --namespace my-service-namespace install my-service /Users/bxsx/+dev/src/my-service/.garden/build/my-service --wait --namespace my-service-namespace --timeout 120s --values /private/var/folders/7v/sl4jpq994k942951yvsp01pr0000gp/T/fcbbce8f62a9869c574ea53e4e6d997a" failed with code 1:

Error: INSTALLATION FAILED: context deadline exceeded
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 requested deploy action(s) failed!

See .garden/error.log for detailed error message

In general we print the events and logs for the kubernetes Deploy action type, so just trying to pinpoint the issue.

Not sure if it matters, but currently, I use Garden v0.13.42 (also tested on v0.13.39) with apiVersion: 0 and Garden modules.

For the kubernetes Deploy action the output looks something like:

This looks very promising, indeed. How did you manage to get this output? Is it due to log level settings? Unfortunately, I don't think I have ever seen K8s events in the Garden output.

eysi09 (Collaborator) commented Oct 29, 2024

Thanks for the added context! And sorry for the late reply, had a few days off last week.

This looks very promising, indeed. How did you manage to get this output? Is it due to log level settings? Unfortunately, I don't think I have ever seen K8s events in the Garden output.

You should always get this if you are using a Deploy action of kind kubernetes (and not kind helm). Is that not the case for you? In any case I haven't bumped into issues with it myself or seen any reports about it missing.

But we're currently working on adding this output to Deploy actions of kind helm, and in general making the output more readable and more helpful. Hoping to have a PR up this week.

eysi09 (Collaborator) commented Oct 29, 2024

Irrespective of the above (hence the separate comment), one other thing I noticed is that when using the helm kind we wait for Helm to time out before exiting with an error. That is basically how Helm works if you use the --wait flag (e.g. helm upgrade --install --wait).

This means that if a container won't start or an image pull secret is missing, the deploy won't fail until Helm times out, which by default is after 300 seconds.

So I'm also seeing if we can fail fast under those conditions. Unfortunately Helm itself isn't really helpful here, but we can use similar heuristics to the ones we use for "normal" K8s manifests, although it probably won't work for all cases because a Helm chart can be a lot of things. We might make this opt-in behaviour to begin with since it's technically a change in how we install Helm charts.
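For reference, one way such a heuristic could look (my assumption only, not the actual implementation): check the Pods behind the release for container waiting states that won't recover on their own while Helm is still waiting. The namespace and selector parameters below are placeholders.

// Sketch only: detect container states that won't recover by waiting,
// e.g. a missing Secret surfaces as CreateContainerConfigError.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

const FATAL_WAITING_REASONS = new Set([
  "ImagePullBackOff",
  "ErrImagePull",
  "CrashLoopBackOff",
  "CreateContainerConfigError",
]);

async function findFatalPodStates(namespace: string, selector: string): Promise<string[]> {
  const { stdout } = await run("kubectl", [
    "get", "pods", "-n", namespace, "-l", selector, "-o", "json",
  ]);
  const pods = JSON.parse(stdout).items as any[];
  const failures: string[] = [];
  for (const pod of pods) {
    // Include init containers, since those often hold the misconfiguration.
    const statuses = [
      ...(pod.status?.initContainerStatuses ?? []),
      ...(pod.status?.containerStatuses ?? []),
    ];
    for (const s of statuses) {
      const reason = s.state?.waiting?.reason;
      if (reason && FATAL_WAITING_REASONS.has(reason)) {
        failures.push(`${pod.metadata.name}/${s.name}: ${reason} - ${s.state.waiting.message ?? ""}`);
      }
    }
  }
  return failures;
}

For a Helm release, the selector could for instance target the app.kubernetes.io/instance label that most charts set, though as noted above a chart can be a lot of things, so this won't cover every case.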

bxsx (Contributor, Author) commented Oct 30, 2024

You should always get this if you are using a Deploy action of kind kubernetes (and not kind helm). Is that not the case for you? In any case I haven't bumped into issues with it myself or seen any reports about it missing.

Good catch! I've just tested it with kind kubernetes and can confirm that I'm getting output similar to what you shared!

But we're currently working on adding this output to Deploy actions of kind helm, and in general making the output more readable and more helpful. Hoping to have a PR up this week.

Great! I believe this is exactly what I'm looking for. Can't wait to hear any updates on this matter :)

Irrespective of the above (hence the separate comment), one other thing I noticed is that when using the helm kind we wait for Helm to time out before exiting with an error. That is basically how Helm works if you use the --wait flag (e.g. helm upgrade --install --wait).
This means that if a container won't start or an image pull secret is missing, the deploy won't fail until Helm times out, which by default is after 300 seconds.

Correct. This is the current behavior, and we're just missing any output that could help in diagnosing the timed-out action.

So I'm also seeing if we can fail fast under those conditions. Unfortunately Helm itself isn't really helpful here, but we can use similar heuristics to the ones we use for "normal" K8s manifests, although it probably won't work for all cases because a Helm chart can be a lot of things. We might make this opt-in behaviour to begin with since it's technically a change in how we install Helm charts.

I'm not sure if the helm kind should support fail fast. I guess, in some scenarios, this might be considered a breaking change unless it's configured in the manifest file. My understanding is that the helm kind should be considered a wrapper for Helm, without implicit functional changes to the expected behavior. Anyway, the fail-fast feature is nice to have, but the missing output from the events is more important (IMHO).

@eysi09 thanks for your detailed replies! I appreciate it!

bxsx (Contributor, Author) commented Oct 30, 2024

I'm renaming the issue title to avoid further confusion :)

bxsx changed the title from "[FEATURE]: Enhancement of K8s deployment status in Garden CLI" to "[FEATURE]: Enhancement of Helm deployment status in Garden CLI" on Oct 30, 2024
eysi09 (Collaborator) commented Nov 7, 2024

Just a quick update, we're making good progress and getting this into review.

I'm not sure if the helm kind should support fail fast. I guess, in some scenarios, this might be considered a breaking change unless it's configured in the manifest file.

On closer inspection, it turns out that this is what we used to do until recently, within the current Garden major version.

The reason we changed it was not Garden failing fast but rather Garden being "successful fast": not always waiting for the Helm command to complete once all resources became healthy. We never received any reports of issues related to Garden failing fast, only a certain edge case that came up in the "successful" case, and we were a bit overzealous in addressing it by applying --wait to everything.

With the changes I'm making I think we strike a good balance (see the sketch below the list):

  • We always use the --wait flag but also monitor resources at the same time.
  • If the resources are healthy, we always wait for the Helm command to complete.
  • If the resources are unhealthy, we break out early and fail fast. This is what we used to do and we haven't seen issues with it.
  • We introduce a config flag to override the fail-fast behaviour, but again, we don't expect this to be really needed.
  • We always wait for the Helm command to complete if the atomic option is set.
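To make that concrete, here is a rough sketch of the combination described above, under my own assumptions (this is not the actual PR; findFatalPodStates stands in for whatever health check is used, e.g. the illustrative helper from my earlier comment):

// Sketch only: run `helm upgrade --install --wait` while polling resource
// health, and kill the Helm process early if an unrecoverable state shows up.
import { spawn } from "node:child_process";

type HealthCheck = (namespace: string, selector: string) => Promise<string[]>;

async function installWithFailFast(
  release: string,
  chart: string,
  namespace: string,
  selector: string,
  findFatalPodStates: HealthCheck,
) {
  const helm = spawn(
    "helm",
    ["upgrade", "--install", release, chart, "--namespace", namespace, "--wait", "--timeout", "300s"],
    { stdio: "inherit" },
  );
  const helmDone = new Promise<number>((resolve) => helm.on("exit", (code) => resolve(code ?? 1)));

  // Poll while Helm is still waiting; break out early on fatal Pod states.
  const monitor = (async () => {
    while (helm.exitCode === null) {
      const failures = await findFatalPodStates(namespace, selector);
      if (failures.length > 0) {
        helm.kill("SIGTERM");
        throw new Error(`Deploy failed before the Helm timeout:\n${failures.join("\n")}`);
      }
      await new Promise((resolve) => setTimeout(resolve, 5000));
    }
  })();

  // If the resources stay healthy, we simply wait for Helm to finish.
  const code = await Promise.race([helmDone, monitor.then(() => helmDone)]);
  if (code !== 0) {
    throw new Error(`helm exited with code ${code}`);
  }
}

An opt-out config flag or the atomic case could then simply skip the monitor and await the Helm command directly.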

bxsx (Contributor, Author) commented Nov 12, 2024

Amazing! Looking forward to seeing this in action :)
