
Operator cancels rollout of ds/deployment #1300

Closed
andrewdinunzio opened this issue Dec 5, 2022 · 13 comments
Labels
area:collector Issues for deploying collector enhancement New feature or request

Comments

@andrewdinunzio
Contributor

I can't say for sure this is what's happening, but it seems like it to me. `kubectl rollout restart` restarts a workload by adding or updating an annotation on the pod template spec in order to trigger a rollout. It seems that when it does this, the controller immediately reconciles and removes the new annotation. I assume this is because it replaces the annotations field wholesale instead of doing some kind of merge patch / server-side apply (SSA).

This causes the OpenTelemetry Collector DaemonSet to restart maybe one pod, but the rollout stops after that.
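For context, `kubectl rollout restart` works by patching a timestamp annotation into the pod template, which changes the template and triggers a rolling update. A sketch of the equivalent patch (the DaemonSet name `my-collector` is hypothetical):

```shell
# Rough equivalent of `kubectl rollout restart daemonset my-collector`:
# kubectl sets the kubectl.kubernetes.io/restartedAt annotation on the pod
# template, which triggers a rolling update. If the operator's next reconcile
# replaces the annotations map wholesale, this key is dropped and the rollout
# is effectively cancelled.
kubectl patch daemonset my-collector --type strategic -p '{
  "spec": {"template": {"metadata": {"annotations": {
    "kubectl.kubernetes.io/restartedAt": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
  }}}}
}'
```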

@pavolloffay pavolloffay added area:collector Issues for deploying collector help wanted Extra attention is needed labels Dec 7, 2022
@pavolloffay
Member

If that is the case, shouldn't `kubectl rollout restart` be called on the OpenTelemetryCollector CR?

@andrewdinunzio
Contributor Author

I don't think that's possible. Just tried and got this error:

$ kubectl rollout restart otelcol otel
error: no kind "OpenTelemetryCollector" is registered for version "opentelemetry.io/v1alpha1" in scheme "pkg/scheme/scheme.go:28"

I think the resources that are "rollout"-able are limited to Deployments, DaemonSets, and StatefulSets. Not sure if there's a way to extend that.

@pavolloffay
Member

Anything that changes the collector pod spec in the deployment will cause a restart, e.g. adding a new annotation:

PodAnnotations map[string]string `json:"podAnnotations,omitempty"`
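In CR form, that field looks like this (a minimal sketch; the CR name and the annotation key are assumptions — bumping the value causes the operator to roll the pods):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  podAnnotations:
    # change this value to make the operator restart the collector pods
    restartedAt: "2024-01-12T10:00:00Z"
```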

@andrewdinunzio
Contributor Author

Yes, doing a rollout on the deployment should do that, but the controller immediately reverts the annotations to the way they were, which seemingly cancels the rollout.

@pavolloffay
Member

I meant adding the annotation on the collector CR; the operator will update the deployment, and the controller will restart the pods.

@andrewdinunzio
Contributor Author

Oh I see. That's a good workaround.
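A one-liner version of this workaround might look like the following (the CR name `otel` and the annotation key are assumptions):

```shell
# Bump a podAnnotations value on the collector CR; the operator copies it into
# the workload's pod template, which triggers a normal rolling restart that
# the operator will not fight, since the CR is the source of truth.
kubectl patch opentelemetrycollector otel --type merge -p '{
  "spec": {"podAnnotations": {
    "restartedAt": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
  }}
}'
```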

@verejoel

I think this needs to be tackled properly, not worked around. This is problematic, for example, when spot instances are taken down. We have seen pods on non-existent nodes, and rollouts stuck because of this.

How would a fix look? Naively I'd say we just exclude the rollout annotation from reconciliation. Would be happy to work on it.
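One possible shape for such a fix, sketched under the assumption that the reconciler builds the workload's pod template annotations from the CR: merge the desired annotations into the existing ones instead of replacing the map, so out-of-band keys like `kubectl.kubernetes.io/restartedAt` survive reconciliation. The helper name is hypothetical, not the operator's actual code:

```go
package main

import "fmt"

// mergeAnnotations is a hypothetical reconciler helper: it preserves
// annotations added out-of-band (e.g. kubectl.kubernetes.io/restartedAt from
// `kubectl rollout restart`) while letting the operator's desired annotations
// win on conflicts, instead of replacing the map wholesale.
func mergeAnnotations(existing, desired map[string]string) map[string]string {
	merged := make(map[string]string, len(existing)+len(desired))
	for k, v := range existing {
		merged[k] = v // keep out-of-band annotations
	}
	for k, v := range desired {
		merged[k] = v // desired state from the CR takes precedence
	}
	return merged
}

func main() {
	existing := map[string]string{
		"kubectl.kubernetes.io/restartedAt": "2024-01-12T10:00:00Z",
	}
	desired := map[string]string{
		"app.kubernetes.io/managed-by": "opentelemetry-operator",
	}
	merged := mergeAnnotations(existing, desired)
	fmt.Println(len(merged)) // prints 2: both annotations survive
}
```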

@jaronoff97
Contributor

@verejoel that would be great. Thank you!

@jaronoff97 jaronoff97 added area:controller enhancement New feature or request and removed help wanted Extra attention is needed labels Nov 28, 2023
@jaronoff97 jaronoff97 added this to the v1alpha2 CRD release milestone Nov 28, 2023
@jaronoff97
Contributor

@verejoel any chance you'd be able to take a look at this this week? If not, I'm happy to take a look.

@jaronoff97
Contributor

@verejoel I'm going to reassign this one, as it's something I just ran into. Let me know if you've already begun. If not, @Toaddyan, take a look.

@verejoel

verejoel commented Jan 12, 2024 via email

@Toaddyan
Contributor

Toaddyan commented Jan 17, 2024

So I was working on trying to reproduce the issue:

➜  otel-collector-charts git:(main) ✗ k get pods
NAME                                                       READY   STATUS    RESTARTS      AGE
kube-otel-stack-kube-state-metrics-66b58457bf-7bq9l        1/1     Running   1 (11m ago)   24m
kube-otel-stack-metrics-collector-0                        1/1     Running   1 (11m ago)   22m
kube-otel-stack-metrics-collector-4vfrt                    1/1     Running   0             2m22s
kube-otel-stack-metrics-collector-zh75g                    1/1     Running   0             2m22s
kube-otel-stack-metrics-targetallocator-6fb775b699-bhdbp   1/1     Running   1 (11m ago)   24m
kube-otel-stack-prometheus-node-exporter-4v8rh             1/1     Running   0             8m54s
kube-otel-stack-prometheus-node-exporter-j7lpm             1/1     Running   1 (11m ago)   24m
opentelemetry-operator-88dc494dc-6f5kp                     2/2     Running   2 (11m ago)   29m
➜  otel-collector-charts git:(main) ✗ k rollout restart daemonset kube-otel-stack-metrics-collector
daemonset.apps/kube-otel-stack-metrics-collector restarted
➜  otel-collector-charts git:(main) ✗ k get pods 
NAME                                                       READY   STATUS    RESTARTS      AGE
kube-otel-stack-kube-state-metrics-66b58457bf-7bq9l        1/1     Running   1 (11m ago)   24m
kube-otel-stack-metrics-collector-0                        1/1     Running   1 (11m ago)   22m
kube-otel-stack-metrics-collector-hvs7b                    1/1     Running   0             3s
kube-otel-stack-metrics-collector-mdlm2                    1/1     Running   0             3s
kube-otel-stack-metrics-targetallocator-6fb775b699-bhdbp   1/1     Running   1 (11m ago)   24m
kube-otel-stack-prometheus-node-exporter-4v8rh             1/1     Running   0             9m8s
kube-otel-stack-prometheus-node-exporter-j7lpm             1/1     Running   1 (11m ago)   24m
opentelemetry-operator-88dc494dc-6f5kp                     2/2     Running   2 (11m ago)   30m

From the above conversation, my understanding is that upon a rollout restart of a resource (here, a DaemonSet), I should see the problem: only a subset of pods restarting, where ALL of the pods SHOULD restart.

In my run above, I'm not seeing the restart behavior the OP describes; all of the collector pods were replaced.

Am I missing anything in particular here?

@jaronoff97
Contributor

jaronoff97 commented Jan 18, 2024

Similar to @Toaddyan, I tried to repro this and was unable to. I think this may have been resolved inadvertently in #1995. I'm going to close this for now; please let me know if I should re-open it. Thanks Todd for looking into this.
