
Operator cancels rollout of ds/deployment #1300

Closed
andrewdinunzio opened this issue Dec 5, 2022 · 13 comments
Labels
area:collector Issues for deploying collector enhancement New feature or request

Comments

@andrewdinunzio
Contributor

I can't say for sure this is what's happening, but it seems like it to me. `kubectl rollout restart` restarts a workload by adding or updating an annotation on the pod template spec in order to trigger a rollout. It seems that when it does this, the controller immediately reconciles and removes the new annotation. I assume this is because it replaces the annotations field wholesale instead of doing some kind of merge patch / server-side apply (SSA).

This causes the OpenTelemetry Collector DaemonSet to restart maybe one pod, but the rollout stops after that.
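For context, `kubectl rollout restart` works by patching a timestamp annotation into the pod template, which changes the template and triggers a rolling update. A sketch of the equivalent patch (the DaemonSet name `my-collector` is hypothetical):

```shell
# Rough equivalent of `kubectl rollout restart daemonset my-collector`:
# kubectl sets the kubectl.kubernetes.io/restartedAt annotation on the pod
# template, which triggers a rolling update. If the operator's next reconcile
# replaces the annotations map wholesale, this key is dropped and the rollout
# is effectively cancelled.
kubectl patch daemonset my-collector --type strategic -p '{
  "spec": {"template": {"metadata": {"annotations": {
    "kubectl.kubernetes.io/restartedAt": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
  }}}}
}'
```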

@pavolloffay pavolloffay added area:collector Issues for deploying collector help wanted Extra attention is needed labels Dec 7, 2022
@pavolloffay
Member

If that is the case, shouldn't `kubectl rollout restart` be called on the OpenTelemetryCollector CR?

@andrewdinunzio
Contributor Author

I don't think that's possible. Just tried and got this error:

$ kubectl rollout restart otelcol otel
error: no kind "OpenTelemetryCollector" is registered for version "opentelemetry.io/v1alpha1" in scheme "pkg/scheme/scheme.go:28"

I think the resources that are "rollout"-able are limited to Deployments, DaemonSets, and StatefulSets. Not sure if there's a way to extend that.

@pavolloffay
Member

Anything that changes the collector pod spec in the deployment will cause a restart, e.g. adding a new annotation:

PodAnnotations map[string]string `json:"podAnnotations,omitempty"`
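In CR form, that field looks like this (a minimal sketch; the CR name and the annotation key are assumptions — bumping the value causes the operator to roll the pods):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  podAnnotations:
    # change this value to make the operator restart the collector pods
    restartedAt: "2024-01-12T10:00:00Z"
```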

@andrewdinunzio
Contributor Author

Yes, doing a rollout on the deployment should do that, but the controller immediately reverts the annotations to the way they were, which seemingly cancels the rollout.

@pavolloffay
Member

I meant adding the annotation on the collector CR; the operator will update the deployment, and the controller will restart the pods.

@andrewdinunzio
Contributor Author

Oh I see. That's a good workaround.
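A one-liner version of this workaround might look like the following (the CR name `otel` and the annotation key are assumptions):

```shell
# Bump a podAnnotations value on the collector CR; the operator copies it into
# the workload's pod template, which triggers a normal rolling restart that
# the operator will not fight, since the CR is the source of truth.
kubectl patch opentelemetrycollector otel --type merge -p '{
  "spec": {"podAnnotations": {
    "restartedAt": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
  }}
}'
```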

@verejoel

I think this needs to be tackled properly, not worked around. This is problematic, for example, when spot instances are taken down. We have seen pods on non-existent nodes, and rollouts stuck because of this.

How would a fix look? Naively I'd say we just exclude the rollout annotation from reconciliation. Would be happy to work on it.
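One possible shape for such a fix, sketched under the assumption that the reconciler builds the workload's pod template annotations from the CR: merge the desired annotations into the existing ones instead of replacing the map, so out-of-band keys like `kubectl.kubernetes.io/restartedAt` survive reconciliation. The helper name is hypothetical, not the operator's actual code:

```go
package main

import "fmt"

// mergeAnnotations is a hypothetical reconciler helper: it preserves
// annotations added out-of-band (e.g. kubectl.kubernetes.io/restartedAt from
// `kubectl rollout restart`) while letting the operator's desired annotations
// win on conflicts, instead of replacing the map wholesale.
func mergeAnnotations(existing, desired map[string]string) map[string]string {
	merged := make(map[string]string, len(existing)+len(desired))
	for k, v := range existing {
		merged[k] = v // keep out-of-band annotations
	}
	for k, v := range desired {
		merged[k] = v // desired state from the CR takes precedence
	}
	return merged
}

func main() {
	existing := map[string]string{
		"kubectl.kubernetes.io/restartedAt": "2024-01-12T10:00:00Z",
	}
	desired := map[string]string{
		"app.kubernetes.io/managed-by": "opentelemetry-operator",
	}
	merged := mergeAnnotations(existing, desired)
	fmt.Println(len(merged)) // prints 2: both annotations survive
}
```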

@jaronoff97
Contributor

@verejoel that would be great. Thank you!

@jaronoff97 jaronoff97 added area:controller enhancement New feature or request and removed help wanted Extra attention is needed labels Nov 28, 2023
@jaronoff97 jaronoff97 added this to the v1alpha2 CRD release milestone Nov 28, 2023
@jaronoff97
Contributor

@verejoel any chance you'd be able to take a look at this this week? If not, I'm happy to take a look.

@jaronoff97
Contributor

@verejoel I'm going to reassign this one, as it's something I just ran into. Let me know if you've already begun. If not, @Toaddyan, take a look.

@verejoel

verejoel commented Jan 12, 2024 via email

@Toaddyan
Contributor

Toaddyan commented Jan 17, 2024

So I was working on trying to reproduce the issue:

➜  otel-collector-charts git:(main) ✗ k get pods
NAME                                                       READY   STATUS    RESTARTS      AGE
kube-otel-stack-kube-state-metrics-66b58457bf-7bq9l        1/1     Running   1 (11m ago)   24m
kube-otel-stack-metrics-collector-0                        1/1     Running   1 (11m ago)   22m
kube-otel-stack-metrics-collector-4vfrt                    1/1     Running   0             2m22s
kube-otel-stack-metrics-collector-zh75g                    1/1     Running   0             2m22s
kube-otel-stack-metrics-targetallocator-6fb775b699-bhdbp   1/1     Running   1 (11m ago)   24m
kube-otel-stack-prometheus-node-exporter-4v8rh             1/1     Running   0             8m54s
kube-otel-stack-prometheus-node-exporter-j7lpm             1/1     Running   1 (11m ago)   24m
opentelemetry-operator-88dc494dc-6f5kp                     2/2     Running   2 (11m ago)   29m
➜  otel-collector-charts git:(main) ✗ k rollout restart daemonset kube-otel-stack-metrics-collector
daemonset.apps/kube-otel-stack-metrics-collector restarted
➜  otel-collector-charts git:(main) ✗ k get pods 
NAME                                                       READY   STATUS    RESTARTS      AGE
kube-otel-stack-kube-state-metrics-66b58457bf-7bq9l        1/1     Running   1 (11m ago)   24m
kube-otel-stack-metrics-collector-0                        1/1     Running   1 (11m ago)   22m
kube-otel-stack-metrics-collector-hvs7b                    1/1     Running   0             3s
kube-otel-stack-metrics-collector-mdlm2                    1/1     Running   0             3s
kube-otel-stack-metrics-targetallocator-6fb775b699-bhdbp   1/1     Running   1 (11m ago)   24m
kube-otel-stack-prometheus-node-exporter-4v8rh             1/1     Running   0             9m8s
kube-otel-stack-prometheus-node-exporter-j7lpm             1/1     Running   1 (11m ago)   24m
opentelemetry-operator-88dc494dc-6f5kp                     2/2     Running   2 (11m ago)   30m

From the above conversation, my understanding is that upon a rollout restart of a resource (here, a DaemonSet), I should see the problem: only a subset of pods restarting, where ALL of the pods SHOULD restart.

In my run above, I'm not seeing the restart behavior the OP describes; all of the collector pods were replaced.

Am I missing anything in particular here?

@jaronoff97
Contributor

jaronoff97 commented Jan 18, 2024

Similar to @Toaddyan, I tried to repro this and was unable to. I think this may have been resolved inadvertently in #1995. I'm going to close this for now; please let me know if I should re-open it. Thanks Todd for looking into this.
