[operator] Collector fails with featureGate errors when Upgrading the Operator to chart version 0.68.1 #1320

Open
jlcrow opened this issue Aug 28, 2024 · 29 comments
Labels
chart:operator Issue related to opentelemetry-operator helm chart

Comments

jlcrow commented Aug 28, 2024

Performed a routine helm upgrade from chart version 0.65.1 to 0.68.1. After the upgrade, the OpenTelemetry collector created by the operator will not start. There are no errors in the operator; the collector errors out and goes into CrashLoopBackOff.

otel-prometheus-collector-0                        0/1     CrashLoopBackOff   7 (4m20s ago)   15m
 
Error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled
2024/08/28 19:23:44 collector server run finished with error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled

Collector config

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus
  namespace: monitoring
spec:
  mode: statefulset
  podAnnotations:
     sidecar.istio.io/inject: "false"
  targetAllocator:
    serviceAccount: opentelemetry-targetallocator-sa
    enabled: true
    prometheusCR:
      enabled: true
    observability:
      metrics:
        enableMetrics: true
    resources:
      requests:
        memory: 300Mi
        cpu: 300m
      limits:
        memory: 512Mi
        cpu: 500m
  priorityClassName: highest-priority
  resources:
    requests:
      memory: 600Mi
      cpu: 300m
    limits:
      memory: 1Gi
      cpu: 500m
  env:
    - name: K8S_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
  config: |
    processors:
      batch: {}
      memory_limiter:
        check_interval: 5s
        limit_percentage: 90    
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      memory_ballast: {}
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: 'otel-collector'
            scrape_interval: 10s
            static_configs:
            - targets: [ '0.0.0.0:8888' ]         
            metric_relabel_configs:
            - action: labeldrop
              regex: (id|name)
            - action: labelmap
              regex: label_(.+)
          - job_name: kubernetes-nodes-cadvisor
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            honor_timestamps: true
            kubernetes_sd_configs:
            - role: node
            relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              action: replace
              target_label: node
              regex: (.*)
              replacement: $$1         
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - replacement: kubernetes.default.svc:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$$1/proxy/metrics/cadvisor
              source_labels:
              - __meta_kubernetes_node_name
              target_label: __metrics_path__
            scheme: https
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              insecure_skip_verify: true
          - job_name: kube-state-metrics
            kubernetes_sd_configs:
            - role: endpoints
              selectors:
              - role: endpoints
                label: "app.kubernetes.io/name=kube-state-metrics" 
            relabel_configs:
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels:
              - __address__
              - __meta_kubernetes_service_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_service_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: exporter_namespace
            - action: replace
              source_labels:
              - __meta_kubernetes_service_name
              target_label: service_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_node_name
              target_label: node
            metric_relabel_configs:
            - source_labels: [__name__]
              regex: kube_pod_status_(reason|scheduled|ready)
              action: drop
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    exporters:
      prometheusremotewrite:
        endpoint: https://<internal mimir endpoint>
        retry_on_failure:
          enabled: true
          initial_interval: 1s
          max_interval: 10s
          max_elapsed_time: 30s
    service:
      telemetry:
          metrics:
            address: "0.0.0.0:8888"
            level: basic
          logs:
            level: "warn"  
      extensions:
      - health_check
      - memory_ballast
      pipelines:
        metrics:
          receivers:
          - prometheus
          - otlp
          processors:
          - memory_limiter
          - batch
          exporters:
          - prometheusremotewrite

dosubot bot commented Aug 28, 2024

Hey there, @jlcrow! 👋 I'm here to assist you with any bugs, questions, or contributions you have while waiting for a human maintainer.

The issue you're encountering with the OpenTelemetry collector failing with featureGate errors and crashing in a loop after upgrading to chart version 0.68.1 is due to changes in the OpenTelemetryCollector CRD. Specifically, the confmap.unifyEnvVarExpansion feature gate is now stable and cannot be disabled.

To resolve this issue, you need to ensure that your CRDs are properly annotated and labeled before upgrading. Here are the steps:

  1. Annotate and label the CRDs:

    RELEASE_NAME=my-opentelemetry-operator
    RELEASE_NAMESPACE=opentelemetry-operator-system
    kubectl annotate crds instrumentations.opentelemetry.io opentelemetrycollectors.opentelemetry.io opampbridges.opentelemetry.io \
      meta.helm.sh/release-name=${RELEASE_NAME} \
      meta.helm.sh/release-namespace=${RELEASE_NAMESPACE}
    kubectl label crds instrumentations.opentelemetry.io opentelemetrycollectors.opentelemetry.io opampbridges.opentelemetry.io app.kubernetes.io/managed-by=Helm
  2. Follow the TLS certificate requirements and upgrade instructions provided in the OpenTelemetry Operator Helm Chart documentation.

Additionally, if you are upgrading from version 0.67 to 0.68, note that the preset.kubernetesEvents preset now excludes DELETED watch types. If you want to keep ingesting DELETED updates for Kubernetes Events, you will need to configure the k8sobjectsreceiver manually [1][2][3].


@dosubot dosubot bot added the chart:operator Issue related to opentelemetry-operator helm chart label Aug 28, 2024
@jaronoff97
Contributor

@jlcrow this is a known bug and will be fixed by this PR in the operator: open-telemetry/opentelemetry-operator#3074

@jaronoff97
Contributor

separately, i would recommend upgrading the collector's apiVersion to v1beta1 when you get a chance :)

@jaronoff97
Contributor

solved by open-telemetry/opentelemetry-operator#3074

this will be fixed in the next operator helm release. Thank you for your patience :)

@jaronoff97
Contributor

@jlcrow can you upgrade to latest and let me know if that fixes things?

@jlcrow
Author

jlcrow commented Sep 10, 2024

@jaronoff97
Just ran helm repo update open-telemetry and tried upgrading to 0.69.0

open-telemetry/opentelemetry-operator  	0.69.0       	0.108.0    	OpenTelemetry Operator Helm chart for Kubernetes

Still seeing errors when the collector comes up

otel-prometheus-collector-0                       0/1     Error       1 (5s ago)    11s
otel-prometheus-targetallocator-7bb6d4d7b-bq8q7   1/1     Running     0             12s
➜  cluster-management git: klon monitoring-system otel-prometheus-collector-0                                                                                                                               
Error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled
2024/09/10 18:14:02 collector server run finished with error: invalid argument "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "confmap.unifyEnvVarExpansion" is stable, can not be disabled

@jaronoff97
Contributor

hmm any logs from the operator?

@jlcrow
Author

jlcrow commented Sep 10, 2024

@jaronoff97 Nothing on the operator but info logs for the manager container

{"level":"INFO","timestamp":"2024-09-10T18:37:56Z","message":"Starting workers","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","worker count":1}
{"level":"INFO","timestamp":"2024-09-10T18:37:56Z","message":"Starting workers","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","worker count":1}

@jaronoff97
Contributor

one note, i tried running your config and you should know that the memory_ballast extension is removed. testing this locally now though!

@jaronoff97
Contributor

jaronoff97 commented Sep 10, 2024

i saw this message from the otel operator:

{"level":"INFO","timestamp":"2024-09-10T18:41:10Z","logger":"collector-upgrade","message":"instance upgraded","name":"otel-prometheus","namespace":"default","version":"0.108.0"}

and this is working now:

⫸ k logs otel-prometheus-collector-0
2024-09-10T18:41:15.297Z	warn	[email protected]/warning.go:42	Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning.	{"kind": "extension", "name": "health_check", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-09-10T18:41:15.302Z	warn	[email protected]/warning.go:42	Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning.	{"kind": "receiver", "name": "otlp", "data_type": "metrics", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}

Note: the target allocator is failing to start up because it's missing permissions on its service account, but otherwise things worked fully as expected.

@jaronoff97
Contributor

before:

  Containers:
   otc-container:
    Image:       otel/opentelemetry-collector-k8s:0.104.0
    Ports:       8888/TCP, 4317/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --config=/conf/collector.yaml
      --feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost

After:

  Containers:
   otc-container:
    Image:       otel/opentelemetry-collector-k8s:0.108.0
    Ports:       8888/TCP, 4317/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --config=/conf/collector.yaml
      --feature-gates=-component.UseLocalHostAsDefaultHost
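
The only difference between the two pods is the list of gates that `--feature-gates` disables. As a rough sketch (the helper name `disabled_gates` is made up for illustration and is not part of any OpenTelemetry tooling), the value can be split to show which gates a pod is asked to turn off:

```shell
# List the gates that a --feature-gates value tries to disable
# (entries prefixed with "-"); "+gate" and bare "gate" entries enable a gate.
disabled_gates() {
  printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^-//p'
}

# Value taken from the "before" pod above:
disabled_gates "-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost"

# Against a live pod (requires cluster access; pod name as used in this thread):
# kubectl get pod otel-prometheus-collector-0 -o jsonpath='{.spec.containers[0].args}'
```

Any gate in that output that the collector version in use already treats as stable will trigger the "is stable, can not be disabled" error quoted earlier in this issue.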

@jlcrow
Author

jlcrow commented Sep 10, 2024

@jaronoff97 Should have provided my latest config:

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus
  namespace: monitoring-system
spec:
  mode: statefulset
  podAnnotations:
     sidecar.istio.io/inject: "false"
  targetAllocator:
    serviceAccount: opentelemetry-targetallocator-sa
    enabled: true
    prometheusCR:
      enabled: true
    observability:
      metrics:
        enableMetrics: true
    resources:
      requests:
        memory: 300Mi
        cpu: 300m
      limits:
        memory: 512Mi
        cpu: 500m
  priorityClassName: highest-priority
  resources:
    requests:
      memory: 600Mi
      cpu: 300m
    limits:
      memory: 1Gi
      cpu: 500m
  env:
    - name: K8S_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: K8S_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP          
  config:
    processors:
      batch: {}
      memory_limiter:
        check_interval: 5s
        limit_percentage: 90    
    extensions:
      health_check:
        endpoint: ${K8S_POD_IP}:13133
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: 'otel-collector'
            scrape_interval: 10s
            static_configs:
            - targets: [ "${K8S_POD_IP}:8888" ]         
            metric_relabel_configs:
            - action: labeldrop
              regex: (id|name)
            - action: labelmap
              regex: label_(.+)
          - job_name: kubernetes-nodes-cadvisor
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            honor_timestamps: true
            kubernetes_sd_configs:
            - role: node
            relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              action: replace
              target_label: node
              regex: (.*)
              replacement: $$1         
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - replacement: kubernetes.default.svc:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$$1/proxy/metrics/cadvisor
              source_labels:
              - __meta_kubernetes_node_name
              target_label: __metrics_path__
            scheme: https
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              insecure_skip_verify: true
          - job_name: kube-state-metrics
            kubernetes_sd_configs:
            - role: endpoints
              selectors:
              - role: endpoints
                label: "app.kubernetes.io/name=kube-state-metrics" 
            relabel_configs:
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels:
              - __address__
              - __meta_kubernetes_service_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_service_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: exporter_namespace
            - action: replace
              source_labels:
              - __meta_kubernetes_service_name
              target_label: service_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_node_name
              target_label: node
            metric_relabel_configs:
            - source_labels: [__name__]
              regex: kube_pod_status_(reason|scheduled|ready)
              action: drop
      otlp:
        protocols:
          grpc:
            endpoint: ${K8S_POD_IP}:4317
    exporters:
      prometheusremotewrite:
        endpoint: https://mimir/api/v1/push
        retry_on_failure:
          enabled: true
          initial_interval: 1s
          max_interval: 10s
          max_elapsed_time: 30s
    service:
      telemetry:
          metrics:
            address: "${K8S_POD_IP}:8888"
            level: basic
          logs:
            level: "warn"  
      extensions:
      - health_check
      pipelines:
        metrics:
          receivers:
          - prometheus
          - otlp
          processors:
          - memory_limiter
          - batch
          exporters:
          - prometheusremotewrite

@jaronoff97
Contributor

also note, i needed to get rid of the priority class name and the service account name which weren't provided. but thanks for updating, giving it a try...

@jaronoff97
Contributor

yeah i tested going from 0.65.0 -> 0.69.0 which was fully successful with this configuration:

Config

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-prometheus
spec:
  mode: statefulset
  podAnnotations:
    sidecar.istio.io/inject: "false"
  targetAllocator:
    enabled: true
    prometheusCR:
      enabled: true
    observability:
      metrics:
        enableMetrics: true
    resources:
      requests:
        memory: 300Mi
        cpu: 300m
      limits:
        memory: 512Mi
        cpu: 500m
  resources:
    requests:
      memory: 600Mi
      cpu: 300m
    limits:
      memory: 1Gi
      cpu: 500m
  env:
    - name: K8S_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: K8S_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  config:
    processors:
      batch: {}
      memory_limiter:
        check_interval: 5s
        limit_percentage: 90
    extensions:
      health_check:
        endpoint: ${K8S_POD_IP}:13133
    receivers:
      prometheus:
        config:
          scrape_configs:
          - job_name: "otel-collector"
            scrape_interval: 10s
            static_configs:
            - targets: ["${K8S_POD_IP}:8888"]
            metric_relabel_configs:
            - action: labeldrop
              regex: (id|name)
            - action: labelmap
              regex: label_(.+)
          - job_name: kubernetes-nodes-cadvisor
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            honor_timestamps: true
            kubernetes_sd_configs:
            - role: node
            relabel_configs:
            - source_labels: [__meta_kubernetes_pod_node_name]
              action: replace
              target_label: node
              regex: (.*)
              replacement: $$1
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - replacement: kubernetes.default.svc:443
              target_label: __address__
            - regex: (.+)
              replacement: /api/v1/nodes/$$1/proxy/metrics/cadvisor
              source_labels:
              - __meta_kubernetes_node_name
              target_label: __metrics_path__
            scheme: https
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              insecure_skip_verify: true
          - job_name: kube-state-metrics
            kubernetes_sd_configs:
            - role: endpoints
              selectors:
              - role: endpoints
                label: "app.kubernetes.io/name=kube-state-metrics"
            relabel_configs:
            - action: keep
              regex: true
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scrape
            - action: replace
              regex: (https?)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_scheme
              target_label: __scheme__
            - action: replace
              regex: (.+)
              source_labels:
              - __meta_kubernetes_service_annotation_prometheus_io_path
              target_label: __metrics_path__
            - action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              source_labels:
              - __address__
              - __meta_kubernetes_service_annotation_prometheus_io_port
              target_label: __address__
            - action: labelmap
              regex: __meta_kubernetes_service_label_(.+)
            - action: replace
              source_labels:
              - __meta_kubernetes_namespace
              target_label: exporter_namespace
            - action: replace
              source_labels:
              - __meta_kubernetes_service_name
              target_label: service_name
            - action: replace
              source_labels:
              - __meta_kubernetes_pod_node_name
              target_label: node
            metric_relabel_configs:
            - source_labels: [__name__]
              regex: kube_pod_status_(reason|scheduled|ready)
              action: drop
      otlp:
        protocols:
          grpc:
            endpoint: ${K8S_POD_IP}:4317
    exporters:
      debug: {}
    service:
      telemetry:
        metrics:
          address: "${K8S_POD_IP}:8888"
          level: basic
        logs:
          level: "warn"
      extensions:
      - health_check
      pipelines:
        metrics:
          receivers:
          - prometheus
          - otlp
          processors:
          - memory_limiter
          - batch
          exporters:
          - debug
---
# Source: opentelemetry-kube-stack/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-collector
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - nodes
  - nodes/proxy
  - nodes/metrics
  - nodes/stats
  - services
  - endpoints
  - pods
  - events
  - secrets
  verbs: ["get", "list", "watch"]
- apiGroups: ["monitoring.coreos.com"]
  resources:
  - servicemonitors
  - podmonitors
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- apiGroups:
  - apps
  resources:
  - daemonsets
  - deployments
  - replicasets
  - statefulsets
  verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- apiGroups: ["discovery.k8s.io"]
  resources:
  - endpointslices
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics", "/metrics/cadvisor"]
  verbs: ["get"]
- apiGroups:
  - ""
  resources:
  - events
  - namespaces
  - namespaces/status
  - nodes
  - nodes/spec
  - pods
  - pods/status
  - replicationcontrollers
  - replicationcontrollers/status
  - resourcequotas
  - services
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - daemonsets
  - deployments
  - replicasets
  - statefulsets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  - cronjobs
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - get
  - list
  - watch
- apiGroups: ["events.k8s.io"]
  resources: ["events"]
  verbs: ["watch", "list"]
---
# Source: opentelemetry-kube-stack/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: example-daemon
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: example-collector
subjects:
- kind: ServiceAccount
  # quirk of the Operator
  name: "otel-prometheus-collector"
  namespace: default
- kind: ServiceAccount
  name: otel-prometheus-targetallocator
  namespace: default

@jlcrow
Author

jlcrow commented Sep 10, 2024

@jaronoff97 idk man the feature gates seem to be sticking around for me when the operator is deploying the collector. I'm running on GKE don't think that should matter though.

  otc-container:
    Container ID:  containerd://724dfd2080e9b46afac3fde71cb9e56747d8c6d352cd7c82b9baf272ed40a301
    Image:         otel/opentelemetry-collector-contrib:0.106.1
    Image ID:      docker.io/otel/opentelemetry-collector-contrib@sha256:12a6cab81088666668e312f1e814698f14f205d879181ec5f770301ab17692c2
    Ports:         8888/TCP, 4317/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --config=/conf/collector.yaml
      --feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost
  otc-container:
    Container ID:  containerd://1cf06d1b6368d070ceb3a9f9448351b1638140a459ee9dbb2b9dbf7e3b173610
    Image:         otel/opentelemetry-collector-contrib:0.108.0
    Image ID:      docker.io/otel/opentelemetry-collector-contrib@sha256:923eb1cfae32fe09676cfd74762b2b237349f2273888529594f6c6ffe1fb3d7e
    Ports:         8888/TCP, 4317/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      --config=/conf/collector.yaml
      --feature-gates=-confmap.unifyEnvVarExpansion,-component.UseLocalHostAsDefaultHost

@jaronoff97
Contributor

what was the version before? I thought it was 0.65.1, but want to confirm. And did you install the operator helm chart with upgrades disabled or any other flags? If i can get a local repro, I can try to get a fix out ASAP, otherwise it would be helpful to enable debug logging on the operator.

@jlcrow
Author

jlcrow commented Sep 10, 2024

I was able to make it to 0.67.0, any version later breaks the same way

@jaronoff97
Contributor

yeah i just did this exact process:

  • install 0.67.0
  • install config above
  • check success ✅
  • upgrade to 0.69.0
  • check success ✅

One notable difference is that your 0.67.0 operator collector install has the -confmap.unifyEnvVarExpansion feature gate on it, whereas mine does not. If you delete and recreate the otelcol object, is it still present? Another option would be to upgrade to operator 0.69.0 and then delete and recreate the otelcol object, at which point it should be gone... If that doesn't work or isn't possible, let me know and we can sort out some other options.
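
A minimal sketch of the delete-and-recreate step, assuming the collector manifest is saved locally. The function name `recreate_collector`, the `KUBECTL` override and the file name in the example are illustrative assumptions, not something the operator provides. Deleting the object removes the running collector until the manifest is applied again.

```shell
# Delete an OpenTelemetryCollector and apply its manifest again so the
# operator renders the container args from scratch.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
recreate_collector() {
  name="$1"        # OpenTelemetryCollector name, e.g. otel-prometheus
  namespace="$2"   # namespace the object lives in
  manifest="$3"    # path to the collector manifest (hypothetical file)
  "${KUBECTL:-kubectl}" delete opentelemetrycollector "$name" -n "$namespace" --ignore-not-found &&
  "${KUBECTL:-kubectl}" apply -f "$manifest" -n "$namespace"
}

# Example (requires cluster access; the file name is a placeholder):
# recreate_collector otel-prometheus monitoring-system collector.yaml
```

Afterwards, the Args section of kubectl describe on the new pod shows whether -confmap.unifyEnvVarExpansion is still being passed.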

@jaronoff97
Contributor

another user reported a similar issue after doing a clean install of the operator: #1339 (comment)

@jlcrow
Author

jlcrow commented Sep 10, 2024

@jaronoff97

Looks like a full uninstall and reinstall did it: the flag is no longer present and the collector comes up successfully

@jaronoff97
Contributor

okay thats good, but im not satisfied with it. im going to keep investigating here and try to get a repro... im thinking maybe going from an older version to one that adds the flag, back to the previous version and then up to latest may cause it.

@jlcrow
Author

jlcrow commented Sep 10, 2024

@jaronoff97 I spoke too soon, somewhere along the lines the targetallocator stopped picking up my monitors and I lost almost all of my metrics, I just went back to the alpha spec and 67 to get things working again

@jaronoff97
Contributor

that's probably due to the permissions change i alluded to here. This was the error message I saw:

{"level":"error","ts":"2024-09-10T18:41:53Z","logger":"setup.prometheus-cr-watcher","msg":"Failed to create namespace informer in promOperator CRD watcher","error":"missing list/watch permissions on the 'namespaces' resource: missing \"list\" permission on resource \"namespaces\" (group: \"\") for all namespaces: missing \"watch\" permission on resource \"namespaces\" (group: \"\") for all namespaces","stacktrace":"github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator/watcher.NewPrometheusCRWatcher\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/cmd/otel-allocator/watcher/promOperator.go:115\nmain.main\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/cmd/otel-allocator/main.go:119\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.6/x64/src/runtime/proc.go:271"}

@jaronoff97
Contributor

apiGroups: [""]
resources:
- namespaces
verbs: ["get", "list", "watch"]

this block should do the trick, but I'm on mobile rn so sorry if it's not exactly right 😅
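
One possible way to add that rule without editing an existing ClusterRole is to create a separate one and bind it to the target allocator's service account, since RBAC rules are additive. This is only a sketch: the function name `grant_namespace_read`, the `KUBECTL` override and the role name in the example are made up for illustration, and the service account must be whichever one the target allocator pod actually runs as.

```shell
# Grant get/list/watch on namespaces through a separate ClusterRole and bind it
# to a service account. Existing roles and bindings stay untouched.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
grant_namespace_read() {
  role="$1"          # hypothetical name used for both the role and its binding
  sa_namespace="$2"  # namespace of the service account
  sa_name="$3"       # service account used by the target allocator
  "${KUBECTL:-kubectl}" create clusterrole "$role" \
    --verb=get,list,watch --resource=namespaces &&
  "${KUBECTL:-kubectl}" create clusterrolebinding "$role" \
    --clusterrole="$role" --serviceaccount="${sa_namespace}:${sa_name}"
}

# Example (requires cluster access; all names are placeholders):
# grant_namespace_read ta-namespace-reader monitoring-system otel-prometheus-targetallocator
```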

@jlcrow
Author

jlcrow commented Sep 24, 2024

> @jaronoff97 I spoke too soon, somewhere along the lines the targetallocator stopped picking up my monitors and I lost almost all of my metrics, I just went back to the alpha spec and 67 to get things working again

I'm still having weird issues with the targetallocator on one of my clusters: it consistently fails to pick up any servicemonitor or podmonitor resources. I tried a number of things, including a full uninstall and reinstall, working with chart version 0.69.0 and collector version 0.108.0. I checked the RBAC for the service account and the permissions appear to be there.

kubectl auth can-i get podmonitors --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator                                                             
yes

kubectl auth can-i get servicemonitors --as=system:serviceaccount:monitoring-system:otel-prometheus-targetallocator                                                   
yes

In the end, on a whim, I reverted the API version back to v1alpha1, and when I deployed the spec, the targetallocator/scrape_configs started showing all the podmonitors and servicemonitors instead of only the default prometheus config that's in the chart. I don't understand why this isn't working correctly, as I have another operator on another GKE cluster with the same config that doesn't seem to have an issue with the beta API.
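
The two checks above only cover the get verb, while the target allocator error quoted earlier in this thread complains about list and watch on namespaces. It may be worth checking every verb and resource pair. A rough sketch, where the function name `check_ta_permissions` and the `KUBECTL` override are illustrative assumptions and the service account should be the one the target allocator pod actually runs as:

```shell
# Check get/list/watch on the resources the target allocator reads,
# across all namespaces, while impersonating the given service account.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
check_ta_permissions() {
  sa="$1"  # e.g. system:serviceaccount:<namespace>:<service account name>
  for resource in namespaces servicemonitors.monitoring.coreos.com podmonitors.monitoring.coreos.com; do
    for verb in get list watch; do
      printf '%s %s: ' "$verb" "$resource"
      "${KUBECTL:-kubectl}" auth can-i "$verb" "$resource" --as="$sa" --all-namespaces
    done
  done
}

# Example (requires cluster access):
# check_ta_permissions system:serviceaccount:monitoring-system:otel-prometheus-targetallocator
```

A "no" on any line would point to a missing rule. If every line says "yes", permissions for that service account are probably not the cause and the difference between the v1alpha1 and v1beta1 specs deserves a closer look.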

@cabrinha
Contributor

cabrinha commented Dec 6, 2024

Can't upgrade past v109?

2024/12/06 01:08:40 collector server run finished with error: invalid argument "-component.UseLocalHostAsDefaultHost" for "--feature-gates" flag: feature gate "component.UseLocalHostAsDefaultHost" is stable, can not be disabled

@jaronoff97
Contributor

@cabrinha which operator version are you attempting to upgrade to? This should be fixed in the latest version.

@cabrinha
Contributor

cabrinha commented Dec 8, 2024

> @cabrinha which operator version are you attempting to upgrade to? This should be fixed in the latest version.

I'm trying to run all the latest images, 0.115.0 was my most recent try.

@jaronoff97
Contributor

hm, can you supply the following so i can try and reproduce?

  • previous operator version and any custom configuration
  • a collector configuration that is failing
  • any logs you are seeing on the collector, operator, collector events?
