
k8sattributes Returns information from Alloy, not from originating pod #1336

Open
jseiser opened this issue Jul 19, 2024 · 14 comments
Labels
bug Something isn't working

Comments


jseiser commented Jul 19, 2024

What's wrong?

When enabling k8sattributes on Grafana Alloy running in EKS, you end up getting information from Alloy, not from the originating pod.

So you end up with worthless attributes. Note: the log at the end is from an nginx ingress pod in the namespace nginx-ingress-internal, but all of the attributes are for a Grafana Alloy pod.

❯ kubectl describe pod/ingress-nginx-controller-76fb6f965b-6k6hm -n nginx-ingress-internal
Name:             ingress-nginx-controller-76fb6f965b-6k6hm
Namespace:        nginx-ingress-internal
Priority:         0
Service Account:  ingress-nginx
Node:             i-0e5f7e0fccbd428f5.us-gov-west-1.compute.internal/10.2.29.50
Start Time:       Fri, 19 Jul 2024 12:49:59 -0400
Labels:           app.kubernetes.io/component=controller
                  app.kubernetes.io/instance=ingress-nginx
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=ingress-nginx
                  app.kubernetes.io/part-of=ingress-nginx
                  app.kubernetes.io/version=1.11.1
                  helm.sh/chart=ingress-nginx-4.11.1
                  linkerd.io/control-plane-ns=linkerd
                  linkerd.io/proxy-deployment=ingress-nginx-controller
                  linkerd.io/workload-ns=nginx-ingress-internal
                  pod-template-hash=76fb6f965b
Annotations:      jaeger.linkerd.io/tracing-enabled: true
                  linkerd.io/created-by: linkerd/proxy-injector edge-24.7.3
                  linkerd.io/inject: enabled
                  linkerd.io/proxy-version: edge-24.7.3
                  linkerd.io/trust-root-sha256: 35504e48329c1792791907e06a50bbfe8a1dc2bc0217233d68eee3eb08bed27a
                  viz.linkerd.io/tap-enabled: true
Status:           Running
IP:               10.2.26.198

You can see the IP for the pod is correct in the trace below, but nothing else is.

e.g.

          {
            "key": "linkerd.io.proxy-daemonset",
            "value": {
              "stringValue": "alloy"
            }
          },
          {
            "key": "service.name",
            "value": {
              "stringValue": "nginx"
            }
          },
          {
            "key": "k8s.namespace.name",
            "value": {
              "stringValue": "grafana-alloy"
            }
          },
          {
            "key": "k8s.pod.name",
            "value": {
              "stringValue": "alloy-sp77n"
            }
          }
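
For reference, the k8sattributes processor decides which pod's metadata to attach via its pod_association rules; when none are configured, it falls back to matching on the source IP of the connection that delivered the telemetry. A minimal sketch that just makes that default explicit (illustrative only, not part of the configuration below):

otelcol.processor.k8sattributes "default" {
  // Sketch only: with no pod_association block the processor behaves
  // roughly like this, matching incoming telemetry to a pod by the
  // source IP of the connection it arrived on.
  pod_association {
    source {
      from = "connection"
    }
  }

  output {
    traces = [otelcol.processor.memory_limiter.default.input]
  }
}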

Steps to reproduce

  1. Deploy EKS
  2. Deploy Alloy

System information

No response

Software version

v1.2.1

Configuration

alloy:
  mode: "flow"
  configMap:
    create: true
    content: |-
      logging {
        level  = "info"
        format = "json"
      }

      otelcol.exporter.otlp "to_tempo" {
        client {
          endpoint = "tempo-distributed-distributor.tempo.svc.cluster.local:4317"
          tls {
              insecure             = true
              insecure_skip_verify = true
          }
        }
      }

      otelcol.receiver.otlp "default" {
        debug_metrics {
          disable_high_cardinality_metrics = true
        }
        grpc {
          endpoint = "0.0.0.0:4317"
          include_metadata = true
        }

        http {
          endpoint = "0.0.0.0:4318"
          include_metadata = true
        }
        output {
          traces = [otelcol.processor.resourcedetection.default.input]
        }
      }

      otelcol.receiver.opencensus "default" {
        debug_metrics {
          disable_high_cardinality_metrics = true
        }
        endpoint  = "0.0.0.0:55678"
        transport = "tcp"
        output {
          traces = [otelcol.processor.resourcedetection.default.input]
        }
      }

      otelcol.processor.resourcedetection "default" {
        detectors = ["env", "eks"]

        output {
          traces = [otelcol.processor.k8sattributes.default.input]
        }
      }

      otelcol.processor.k8sattributes "default" {
        extract {
          annotation {
            from      = "pod"
            key_regex = "(.*)/(.*)"
            tag_name  = "$1.$2"
          }
          label {
            from      = "pod"
            key_regex = "(.*)/(.*)"
            tag_name  = "$1.$2"
          }

          metadata = [
            "k8s.namespace.name",
            "k8s.deployment.name",
            "k8s.statefulset.name",
            "k8s.daemonset.name",
            "k8s.cronjob.name",
            "k8s.job.name",
            "k8s.node.name",
            "k8s.pod.name",
            "k8s.pod.uid",
            "k8s.pod.start_time",
          ]
        }

        output {
          traces  = [otelcol.processor.memory_limiter.default.input]
        }
      }

      otelcol.processor.memory_limiter "default" {
        check_interval = "5s"

        limit = "512MiB"

        output {
            traces  = [otelcol.processor.tail_sampling.default.input]
        }
      }

      otelcol.processor.tail_sampling "default" {
        policy {
          name = "ignore-health"
          type = "string_attribute"

          string_attribute {
            key                    = "http.url"
            values                 = ["/health", "/metrics", "/healthz", "/loki/api/v1/push"]
            enabled_regex_matching = true
            invert_match           = true
          }
        }

        policy {
          name = "ignore-health-target"
          type = "string_attribute"

          string_attribute {
            key                    = "http.target"
            values                 = ["/health", "/metrics", "/healthz", "/loki/api/v1/push"]
            enabled_regex_matching = true
            invert_match           = true
          }
        }


        policy {
          name = "ignore-health-path"
          type = "string_attribute"

          string_attribute {
            key                    = "http.path"
            values                 = ["/health", "/metrics", "/healthz", "/loki/api/v1/push"]
            enabled_regex_matching = true
            invert_match           = true
          }
        }

        policy {
          name = "all-errors"
          type = "status_code"

          status_code {
            status_codes = ["ERROR"]
          }
        }

        policy {
          name = "sample-percent"
          type = "probabilistic"

          probabilistic {
            sampling_percentage = 50
          }
        }

        output {
          traces =  [otelcol.processor.batch.default.input]
        }
      }


      otelcol.processor.batch "default" {
        send_batch_size = 16384
        send_batch_max_size = 0
        timeout = "2s"

        output {
            traces  = [otelcol.exporter.otlp.to_tempo.input]
        }
      }

  enableReporting: false
  extraPorts:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
      protocol: TCP
    - name: otlp-http
      port: 4318
      targetPort: 4318
      protocol: TCP
    - name: opencensus
      port: 55678
      targetPort: 55678
      protocol: TCP

controller:
  priorityClassName: "system-cluster-critical"
  tolerations:
    - operator: Exists

serviceMonitor:
  enabled: true
  additionalLabels:
    release: kube-prometheus-stack

ingress:
  enabled: true
  ingressClassName: "nginx-internal"
  annotations:
    nginx.ingress.kubernetes.io/service-upstream: "true"
    cert-manager.io/cluster-issuer: cert-manager-r53-qa
  labels:
    ingress: externaldns
  path: /
  pathType: Prefix
  hosts:
    - faro-${cluster_number}-${environment}.${base_domain}
  tls:
    - secretName: faro-${cluster_number}-${environment}.${base_domain}-tls
      hosts:
        - faro-${cluster_number}-${environment}.${base_domain}

Logs

{
  "batches": [
    {
      "resource": {
        "attributes": [
          {
            "key": "telemetry.sdk.version",
            "value": {
              "stringValue": "1.11.0"
            }
          },
          {
            "key": "telemetry.sdk.name",
            "value": {
              "stringValue": "opentelemetry"
            }
          },
          {
            "key": "telemetry.sdk.language",
            "value": {
              "stringValue": "cpp"
            }
          },
          {
            "key": "cloud.provider",
            "value": {
              "stringValue": "aws"
            }
          },
          {
            "key": "cloud.platform",
            "value": {
              "stringValue": "aws_eks"
            }
          },
          {
            "key": "k8s.pod.ip",
            "value": {
              "stringValue": "10.2.29.64"
            }
          },
          {
            "key": "linkerd.io.workload-ns",
            "value": {
              "stringValue": "grafana-alloy"
            }
          },
          {
            "key": "linkerd.io.inject",
            "value": {
              "stringValue": "enabled"
            }
          },
          {
            "key": "k8s.pod.uid",
            "value": {
              "stringValue": "021b9b57-49c7-4453-bcd6-099ea9ed6c05"
            }
          },
          {
            "key": "app.kubernetes.io.instance",
            "value": {
              "stringValue": "alloy"
            }
          },
          {
            "key": "linkerd.io.trust-root-sha256",
            "value": {
              "stringValue": "35504e48329c1792791907e06a50bbfe8a1dc2bc0217233d68eee3eb08bed27a"
            }
          },
          {
            "key": "viz.linkerd.io.tap-enabled",
            "value": {
              "stringValue": "true"
            }
          },
          {
            "key": "jaeger.linkerd.io.tracing-enabled",
            "value": {
              "stringValue": "true"
            }
          },
          {
            "key": "k8s.node.name",
            "value": {
              "stringValue": "i-0af9c8279dabb5258.us-gov-west-1.compute.internal"
            }
          },
          {
            "key": "linkerd.io.control-plane-ns",
            "value": {
              "stringValue": "linkerd"
            }
          },
          {
            "key": "app.kubernetes.io.name",
            "value": {
              "stringValue": "alloy"
            }
          },
          {
            "key": "linkerd.io.created-by",
            "value": {
              "stringValue": "linkerd/proxy-injector edge-24.5.1"
            }
          },
          {
            "key": "kubectl.kubernetes.io.default-container",
            "value": {
              "stringValue": "alloy"
            }
          },
          {
            "key": "linkerd.io.proxy-version",
            "value": {
              "stringValue": "edge-24.5.1"
            }
          },
          {
            "key": "k8s.pod.start_time",
            "value": {
              "stringValue": "2024-07-18T19:19:12Z"
            }
          },
          {
            "key": "k8s.daemonset.name",
            "value": {
              "stringValue": "alloy"
            }
          },
          {
            "key": "linkerd.io.proxy-daemonset",
            "value": {
              "stringValue": "alloy"
            }
          },
          {
            "key": "service.name",
            "value": {
              "stringValue": "nginx"
            }
          },
          {
            "key": "k8s.namespace.name",
            "value": {
              "stringValue": "grafana-alloy"
            }
          },
          {
            "key": "k8s.pod.name",
            "value": {
              "stringValue": "alloy-sp77n"
            }
          }
        ],
        "droppedAttributesCount": 0
      },
      "instrumentationLibrarySpans": [
        {
          "spans": [
            {
              "traceId": "0f413376449cc556e74d8fb427776954",
              "spanId": "0a13242128b666de",
              "parentSpanId": "0000000000000000",
              "traceState": "",
              "name": "",
              "kind": "SPAN_KIND_SERVER",
              "startTimeUnixNano": 1721408032421640000,
              "endTimeUnixNano": 1721408032450101000,
              "attributes": [
                {
                  "key": "http.flavor",
                  "value": {
                    "stringValue": "1.1"
                  }
                },
                {
                  "key": "http.target",
                  "value": {
                    "stringValue": "/v2/socs/e3ec6005-665d-4c29-9ac7-9effff9b423f/gateways/71352608-b0b5-4e13-a5ba-6efab206e256"
                  }
                },
                {
                  "key": "http.server_name",
                  "value": {
                    "stringValue": "console-api-qa1-dev-01.madeup.com"
                  }
                },
                {
                  "key": "http.host",
                  "value": {
                    "stringValue": "console-api-qa1-dev-01.madeup.com"
                  }
                },
                {
                  "key": "http.user_agent",
                  "value": {
                    "stringValue": "OpenAPI-Generator/0.13.0.post0/python"
                  }
                },
                {
                  "key": "http.scheme",
                  "value": {
                    "stringValue": "https"
                  }
                },
                {
                  "key": "net.host.port",
                  "value": {
                    "intValue": 443
                  }
                },
                {
                  "key": "net.peer.ip",
                  "value": {
                    "stringValue": "10.2.26.198"
                  }
                },
                {
                  "key": "net.peer.port",
                  "value": {
                    "intValue": 37944
                  }
                },
                {
                  "key": "ingress.namespace",
                  "value": {
                    "stringValue": "qa1-dev"
                  }
                },
                {
                  "key": "ingress.service_name",
                  "value": {
                    "stringValue": "console-api-qa1-dev"
                  }
                },
                {
                  "key": "ingress.name",
                  "value": {
                    "stringValue": "console-api-qa1-dev"
                  }
                },
                {
                  "key": "ingress.upstream",
                  "value": {
                    "stringValue": "qa1-dev-console-api-qa1-dev-8000"
                  }
                },
                {
                  "key": "http.method",
                  "value": {
                    "stringValue": "PATCH"
                  }
                },
                {
                  "key": "http.status_code",
                  "value": {
                    "intValue": 200
                  }
                }
              ],
              "droppedAttributesCount": 0,
              "droppedEventsCount": 0,
              "droppedLinksCount": 0,
              "status": {
                "code": 0,
                "message": ""
              }
            }
          ],
          "instrumentationLibrary": {
            "name": "nginx",
            "version": ""
          }
        }
      ]
    }
  ]
}
jseiser added the bug label on Jul 19, 2024

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!


tshuma1 commented Sep 5, 2024

@jseiser How are you populating your substitution values, e.g. faro-${cluster_number}-${environment}.${base_domain}? Is this possible in Alloy config files?


jseiser commented Sep 11, 2024

@tshuma1

Terraform is doing it, so the files are interpolated by the time the helm command is run.


jseiser commented Sep 25, 2024

Is there any other information I can provide here? We have tried running Alloy as a deployment and as a daemonset, with and without Alloy being in the service mesh. We have even hardcoded the OTEL attributes and removed the k8sattributes processor, but you still end up with the traces from linkerd not being matched.

We have not been able to find a working example of AWS EKS + Grafana Alloy. The issue also appears to extend to the upstream OpenTelemetry Collector itself:

open-telemetry/opentelemetry-collector-contrib#29630 (comment)


jseiser commented Nov 20, 2024

This is still an issue on the latest stable release.

@scottamation

@jseiser
I was having the same issues due to the linkerd sidecar causing the agent to think the source of the traffic was itself.
This was resolved by adding the config.linkerd.io/skip-inbound-ports annotation to the agent pod spec. We just gave it the value of the ports used for trace collection.


jseiser commented Dec 2, 2024

@scottamation

I ended up ripping out linkerd-tracing. I'll get it back into place and try your suggestion. We were hitting another issue where, if we removed Grafana Alloy from the mesh, linkerd would freak out and was unable to send any traces.

{"timestamp":"2024-11-22T16:23:21.908138Z","level":"WARN","fields":{"message":"Failed to connect","error":"endpoint 10.2.27.161:4317: received corrupt message of type InvalidContentType"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}
{"timestamp":"2024-11-22T16:23:22.007482Z","level":"WARN","fields":{"message":"Failed to connect","error":"endpoint 10.2.43.6:12347: Connection refused (os error 111)"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}
{"timestamp":"2024-11-22T16:23:22.029704Z","level":"WARN","fields":{"message":"Failed to connect","error":"endpoint 10.2.43.6:55678: Connection refused (os error 111)"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}

So I just ripped the tracing out altogether.

I'll get it back in place and test. Can you confirm roughly what your config looks like?

@scottamation

We're not using EKS; we just happen to also be using linkerd. We're also using the k8s-monitoring-helm chart.

Here's the relevant part of our values file:

cluster:
  name: ClusterName
metrics: ...
logs: ...
traces: ...
receivers: ...
alloy:
  controller:
    podAnnotations:
      linkerd.io/inject: enabled
      config.linkerd.io/skip-inbound-ports: 4317,4318
    replicas: 3


jseiser commented Dec 4, 2024

Ugh, I turned everything back on, added the skip-inbound-ports annotation, and linkerd just freaks out.

{"timestamp":"2024-12-04T16:30:47.984933Z","level":"WARN","fields":{"message":"Failed to connect","error":"endpoint 10.1.7.129:4317: received corrupt message of type InvalidContentType"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}
{"timestamp":"2024-12-04T16:30:48.202478Z","level":"WARN","fields":{"message":"Failed to connect","error":"endpoint 10.1.16.177:4317: received corrupt message of type InvalidContentType"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}
{"timestamp":"2024-12-04T16:30:48.229660Z","level":"WARN","fields":{"message":"Failed to connect","error":"endpoint 10.1.16.177:4318: received corrupt message of type InvalidContentType"},"target":"linkerd_reconnect","threadId":"ThreadId(1)"}


jseiser commented Dec 5, 2024

@scottamation

Any chance you do anything else, like marking those ports opaque or anything at the namespace level?

@scottamation

@jseiser That annotation was the only thing required for our configuration. The linkerd sidecar is running in the Alloy pods and joining the mesh. This allowed the pods to send telemetry via OTLP, but it was labeled with the Kubernetes data of the Alloy pod itself. The skip-inbound-ports setting allows the source linkerd sidecar to send directly to the Alloy pod instead of going through the Alloy pod's linkerd sidecar. This way the source of the connection has the correct IP, and Alloy can look up the correct information.
I'm not familiar with how your linkerd mesh is configured, but it may require additional configuration to allow this traffic to pass to the Alloy pod directly.

Hopefully this additional context helps.
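
As a rough alternative for setups where the sidecar can't be bypassed, the processor's pod_association rules can be pointed at a resource attribute before falling back to the connection IP. This is only a sketch and assumes the instrumented workloads actually set k8s.pod.ip on their resources, which nobody in this thread has verified:

otelcol.processor.k8sattributes "default" {
  // Prefer the pod IP reported by the instrumented workload itself, and
  // only fall back to the connection's source IP when it is missing.
  pod_association {
    source {
      from = "resource_attribute"
      name = "k8s.pod.ip"
    }
  }

  pod_association {
    source {
      from = "connection"
    }
  }

  output {
    traces = [otelcol.processor.memory_limiter.default.input]
  }
}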


jseiser commented Dec 5, 2024

@scottamation

Ya, I'm assuming there is a bug in linkerd at this point, since everything works except the actual linkerd-proxy sending traces to Alloy. If Alloy is fully meshed, it sends, but as you know you get the wrong information. If it's removed from the mesh, marked opaque, or marked to skip, it breaks.

Thanks for at least confirming it should work if linkerd operates correctly.

@FredrikAugust

Running into this as well, and it's really annoying to work around. I can't see the aforementioned resource attributes in the data either, using the OpenTelemetry protocol. I ended up just disabling linkerd-jaeger for now.


jseiser commented Feb 10, 2025

@FredrikAugust

While we have not been able to make this work at all, I was able to start getting linkerd's traces to contain some proper information.

linkerd/linkerd2#13427 (comment)

They still do not associate with the other traces like they are supposed to, but I guess it's progress.
