
Kubernetes input plugin not working (deprecated /stats/summary endpoint?) #6959

Closed
ghost opened this issue Jan 31, 2020 · 18 comments
Labels: area/k8s, docs

Comments

@ghost

ghost commented Jan 31, 2020

Relevant telegraf.conf:

[[inputs.kubernetes]]
      url = "https://kubernetes.default.svc"
      bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
      insecure_skip_verify = true

System info:

Ubuntu 18.04
k3s v1.17.2+k3s1
Telegraf image: telegraf:1.12.2

Steps to reproduce:

Configure the Kubernetes input plugin in a Telegraf container.

Expected behavior:

The plugin should collect the Kubernetes metrics.

Actual behavior:

The Telegraf plugin log shows that the Kubernetes API server returned a 403 Forbidden error code. After adding the following rules to the pod's RBAC ServiceAccount:

rules:
  - nonResourceURLs: ["/stats", "/stats/*"]
    verbs: ["get", "list"]

the error becomes 404. No metrics are being collected.

Additional info:

The kube_inventory input plugin seems to be working just fine, but the kubernetes plugin is not able to obtain any metrics, as described. Looking at the code, the kubernetes input plugin calls the /stats/summary Kubernetes API server endpoint.

The /stats/summary endpoint was planned to be deprecated (kubernetes/kubernetes#68522), but it seems that it has already been removed.

@danielnelson
Contributor

We should put together some documentation about what needs to be done to switch to the replacement and any way we can smooth the transition. I could definitely use some help from the community on this.

I am assuming similar metrics can be captured with the prometheus input plugin. It would be good to gather a listing of the new metrics because switching over will likely change all metrics and break dashboards/alerts.
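Something along these lines might be a starting point, though it is only an untested sketch; the kubelet port, the metrics paths, and the HOSTIP variable (from the downward API) are assumptions:

[[inputs.prometheus]]
  ## Sketch: scrape the kubelet's Prometheus endpoints instead of /stats/summary.
  ## HOSTIP would come from the downward API (status.hostIP); port and paths may differ per cluster.
  urls = ["https://$HOSTIP:10250/metrics", "https://$HOSTIP:10250/metrics/cadvisor"]
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  insecure_skip_verify = true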

It also looks like it should be possible to use the --enable-cadvisor-endpoints flag to re-enable the endpoint; it would be good to describe how this can be set as well.
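For clusters where you control the kubelet, that would presumably be passed as an extra kubelet argument; a rough sketch (the flag name is taken from above, and the exact file and mechanism vary by distro/installer):

# e.g. /etc/default/kubelet on kubeadm-style installs, then restart the kubelet
KUBELET_EXTRA_ARGS=--enable-cadvisor-endpoints=true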

@danielnelson added the area/k8s and docs labels Jan 31, 2020
@masual

masual commented Feb 3, 2020

Hello @danielnelson, thank you for your reply. cAdvisor endpoint support will be removed in Kubernetes 1.19 (kubernetes/kubernetes#76660), so I would recommend using the --enable-cadvisor-endpoints flag only as a temporary fix. I think the way to go is to query the metrics-server API (https://github.com/kubernetes-sigs/metrics-server) through the standard Kubernetes API to obtain pod metrics.
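If metrics-server is deployed, those metrics are exposed through the aggregated metrics.k8s.io API on the regular API server, e.g.:

# quick check that the metrics API is available (requires metrics-server)
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes
kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods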

@nsteinmetz
Contributor

@danielnelson for managed Kubernetes, I'm not sure you can ask to have this flag added, so even as a temporary fix it won't work for many (most?) people.

@masual: so it would mean we need to deploy the metrics server first and then use this plugin? Or should we only use the kube_inventory plugin?

@nsteinmetz
Contributor

I could make it work with the help of @rawkode:

As the endpoint, you need:

[[inputs.kubernetes]]
    url = "https://kubernetes.default.svc.cluster.local/api/v1/nodes/$NODE_NAME/proxy/"
    bearer_token = "/run/secrets/kubernetes.io/serviceaccount/token"
    insecure_skip_verify = true

be sure to have:

env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName

and as the ClusterRole (I use ClusterRole aggregation):

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: influx:stats:viewer
  labels:
    rbac.authorization.k8s.io/aggregate-view-telegraf-stats: "true"
rules:
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get", "watch", "list"]

Tested on k8s 1.17.0 on OVH K8S Managed Service
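For this to take effect, the role (or an aggregated role that selects the label above) still needs to be bound to the ServiceAccount that Telegraf runs as; a minimal direct-binding sketch (ServiceAccount name and namespace are illustrative, not copied from my setup):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: influx-stats-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: influx:stats:viewer
subjects:
  - kind: ServiceAccount
    name: telegraf       # illustrative: the ServiceAccount your DaemonSet uses
    namespace: telegraf  # illustrative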

@nsteinmetz
Contributor

... and soon available as a Helm chart for deploying Telegraf as a DaemonSet => influxdata/helm-charts#16

@jmorcar

jmorcar commented Apr 3, 2020

I have the same problem. I followed these recommendations, but get the same error:
Error:
2020-04-03T08:38:00Z E! [inputs.kubernetes] Error in plugin: https://kubernetes/stats/summary returned HTTP status 404 Not Found

Is there any solution or another documentation to fix the problem?

I checked that I have RBAC permissions configured; this is the output:

Name:         telegraf-cluster-reader
Labels:       rbac.authorization.k8s.io/aggregate-view-telegraf=true
rbac.authorization.k8s.io/aggregate-view-telegraf-stats=true
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"labels":{"rbac.authorization.k8s.io/aggreg...
PolicyRule:
Resources          Non-Resource URLs  Resource Names  Verbs

deployments        []                 []              [get watch list]
nodes/proxy        []                 []              [get watch list]
nodes              []                 []              [get watch list]
persistentvolumes  []                 []              [get watch list]
pods               []                 []              [get watch list]
statefulsets       []                 []              [get watch list]
                   [/stats/*]         []              [get]
                   [/stats]           []              [get]
                   [/stats/*]         []              [list]
                   [/stats]           []              [list]
                   [/stats/*]         []              [watch]
                   [/stats]           []              [watch]

I have this config applied in the YAML:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: telegraf-reader
  namespace: default
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: telegraf-cluster-reader
  labels:
    rbac.authorization.k8s.io/aggregate-view-telegraf: "true"
    rbac.authorization.k8s.io/aggregate-view-telegraf-stats: "true"
rules:
  - nonResourceURLs: ["/stats", "/stats/*"]
    verbs: ["get", "watch", "list"]
  - apiGroups: [""]
    resources: ["persistentvolumes", "nodes", "pods", "deployments", "statefulsets", "nodes/proxy"]
    verbs: ["get", "watch", "list"]
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: telegraf-reader-role
aggregationRule:
  clusterRoleSelectors:
    - matchLabels:
        rbac.authorization.k8s.io/aggregate-view-telegraf-stats: "true"
    - matchLabels:
        rbac.authorization.k8s.io/aggregate-view-telegraf: "true"
    - matchLabels:
        rbac.authorization.k8s.io/aggregate-to-view: "true"
rules: [] # Rules are automatically filled in by the controller manager.
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: telegraf-reader-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: telegraf-reader-role
subjects:
  - kind: ServiceAccount
    name: telegraf-reader
    namespace: default

My Pod uses this, plus the token via secrets applied in a ConfigMap; other plugins like kube_inventory work fine with this:

    spec:
      serviceAccountName: telegraf-reader
      containers:
        - env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName  

@nsteinmetz
Contributor

@jmorcar have a look at what we did for the telegraf-ds chart, as we got it working => https://github.com/influxdata/helm-charts/tree/master/charts/telegraf-ds

@ellieayla

[[inputs.kubernetes]]
      url = "https://kubernetes.default.svc"
      

I think the plugin is expecting a URL to the node's API, not the API server's API. So the Telegraf container runs on every node, in a DaemonSet, configured with something like url = "https://$NODEIP:10250", with the environment variable coming from the downward API.
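That variable would come from the downward API, something like (variable name assumed):

env:
  - name: NODEIP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP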

@jmorcar

jmorcar commented Apr 3, 2020

I have checked just now with the node IP variable, here HOSTIP, captured via fieldPath: status.hostIP, but the answer is Forbidden:

# curl https://$HOSTIP:10250/stats/summary --header "Authorization: Bearer $TOKEN" --insecure
Forbidden (user=system:serviceaccount:default:telegraf-reader, verb=get, resource=nodes, subresource=stats)

Whereas if I use the previous command I posted, the query is permitted and returns data:

# curl https://kubernetes/stats/summary --header "Authorization: Bearer $TOKEN" --insecure
{
  "paths": [
    "/apis",
    "/apis/",
    "/apis/apiextensions.k8s.io",
    "/apis/apiextensions.k8s.io/v1",
    "/apis/apiextensions.k8s.io/v1beta1",
    "/healthz",
    "/healthz/etcd",
    "/healthz/log",
    "/healthz/ping",
    "/healthz/poststarthook/crd-informer-synced",
    "/healthz/poststarthook/generic-apiserver-start-informers",
    "/healthz/poststarthook/start-apiextensions-controllers",
    "/healthz/poststarthook/start-apiextensions-informers",
    "/livez",
    "/livez/etcd",
    "/livez/log",
    "/livez/ping",
    "/livez/poststarthook/crd-informer-synced",
    "/livez/poststarthook/generic-apiserver-start-informers",
    "/livez/poststarthook/start-apiextensions-controllers",
    "/livez/poststarthook/start-apiextensions-informers",
    "/metrics",
    "/openapi/v2",
    "/readyz",
    "/readyz/etcd",
    "/readyz/log",
    "/readyz/ping",
    "/readyz/poststarthook/crd-informer-synced",
    "/readyz/poststarthook/generic-apiserver-start-informers",
    "/readyz/poststarthook/start-apiextensions-controllers",
    "/readyz/poststarthook/start-apiextensions-informers",
    "/readyz/shutdown",
    "/version"
  ]
}

(Both queries are executed inside the Telegraf container and use the service account created in the YAML definition.)

For the creation of the ServiceAccount, telegraf-reader, I followed the guide posted for the kube_inventory plugin on GitHub. I checked that telegraf-reader has privileges to query resources like /api/v1/namespaces/default/pods... for that I created the ClusterRole and role bindings.

Before that, every resource query was answered with Forbidden, but not anymore, so the URL should be the problem.

I checked "Kubernetes.default.svc" is same "kubernetes" short name, both are the ClusterIP for default to the Kubernetes cluster.

I will have to check the source code of the Telegraf kubernetes input plugin to find the exact query that returns the "404 Not Found".
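Note: the Forbidden message above names resource=nodes, subresource=stats, which would correspond to an RBAC rule roughly like this (sketch):

rules:
  - apiGroups: [""]
    resources: ["nodes/stats", "nodes/proxy"]
    verbs: ["get", "list", "watch"]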

@jmorcar

jmorcar commented Apr 3, 2020

@jmorcar have a look at what we did for the telegraf-ds chart, as we got it working => https://github.com/influxdata/helm-charts/tree/master/charts/telegraf-ds

I didn't find the ClusterRole or role binding definitions in the chart templates, so I think the deploy will have the Forbidden error. I posted a suggestion to include this documentation in the charts because the YAML definition referencing the service account is not sufficient if you haven't created the RBAC permissions beforehand.

@nsteinmetz
Contributor

@jmorcar,

here is the role and rolebinding

The telegraf-ds chart works fine for me - did you try it on your cluster?

@jmorcar

jmorcar commented Apr 3, 2020

Thanks! I have applied it now... and same problem:

2020-04-03T17:21:20Z E! [inputs.kubernetes] Error in plugin: https://kubernetes/stats/summary returned HTTP status 404 Not Found
2020-04-03T17:21:30Z E! [inputs.kubernetes] Error in plugin: https://kubernetes/stats/summary returned HTTP status 404 Not Found
2020-04-03T17:21:40Z E! [inputs.kubernetes] Error in plugin: https://kubernetes/stats/summary returned HTTP status 404 Not Found
2020-04-03T17:21:50Z E! [inputs.kubernetes] Error in plugin: https://kubernetes/stats/summary returned HTTP status 404 Not Found
2020-04-03T17:22:00Z E! [inputs.kubernetes] Error in plugin: https://kubernetes/stats/summary returned HTTP status 404 Not Found

@rawkode
Contributor

rawkode commented Apr 3, 2020

@jmorcar if you are going through the Kubernetes API, you need the proxy endpoint.

It's usually best to go through the NODEIP from the downward API.

I see mentions of that above, but I couldn't work out what problem you had with that approach.

By any chance are you on GKE? They do block access to the Kubelet this way (last time I checked)
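For reference, the two URL shapes discussed in this thread look roughly like this (values taken from the configs posted above):

[[inputs.kubernetes]]
  ## Option A: talk to the kubelet on the local node (NODEIP/HOSTIP from the downward API)
  url = "https://$HOSTIP:10250"

  ## Option B: go through the API server's node proxy endpoint instead
  # url = "https://kubernetes.default.svc.cluster.local/api/v1/nodes/$NODE_NAME/proxy/"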

@jmorcar

jmorcar commented Apr 6, 2020

Thanks all, I found the problem: I was using a Deployment definition instead of a DaemonSet. A related point when you change to a DaemonSet is, as @alanjcastonguay and @rawkode commented, that you have to use NODEIP:10250, like this:

[[inputs.kubernetes]]
  url = "https://$HOSTIP:10250"
  bearer_token = "/run/secrets/kubernetes.io/serviceaccount/token"
  insecure_skip_verify = true

So I have swapped my YAML for the official Helm chart, as @nsteinmetz recommended, because I had to change/add too many params in my YAML. The official chart is OK: deploy it in the namespace that you need and it collects all the metrics fine.

Conclusion:
If you need to monitor a Kubernetes cluster, the best option is to deploy the official telegraf-ds Helm chart. It monitors each node inside the cluster (deploying a Telegraf agent on each one via a DaemonSet) with only one deploy definition.

https://github.com/influxdata/helm-charts/tree/master/charts/telegraf-ds
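Installing it is roughly (sketch; the repo URL, release name, and namespace are assumptions):

# add the InfluxData charts repo and install telegraf-ds as a DaemonSet
helm repo add influxdata https://helm.influxdata.com/
helm repo update
helm upgrade --install telegraf-ds influxdata/telegraf-ds -n monitoring --create-namespace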

@hershdhillon

hershdhillon commented Sep 14, 2020

Try creating a ServiceAccount and ClusterRoleBinding for Telegraf using the YAML configuration below. Mind the namespace.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: telegraf
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: metric-scanner-kubelet-api-admin
subjects:
- kind: ServiceAccount
  name: telegraf
  namespace: influxdb
roleRef:
  kind: ClusterRole
  name: system:kubelet-api-admin
  apiGroup: rbac.authorization.k8s.io 

I faced a similar issue; after applying the YAML, Telegraf was able to authenticate in the cluster and scrape the metrics.
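To verify the binding took effect for the ServiceAccount, something like this can be used (namespace and name taken from the YAML above):

kubectl auth can-i get nodes --subresource=stats --as=system:serviceaccount:influxdb:telegraf
kubectl auth can-i get nodes --subresource=proxy --as=system:serviceaccount:influxdb:telegraf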

@manucloud9

I am using the telegraf-ds chart but getting the below error in the pod logs.

2021-02-11T17:32:50Z W! [inputs.kubernetes] Collection took longer than expected; not complete after interval of 10s
2021-02-11T17:33:00Z W! [inputs.kubernetes] Collection took longer than expected; not complete after interval of 10s

@JeongsikKang

JeongsikKang commented Dec 8, 2021

It worked fine for me.

#-----------------------------------------------
# 1. ServiceAccount
#-----------------------------------------------
apiVersion: v1
kind: ServiceAccount
metadata:
  name: telegraf-ds
  labels:
    app.kubernetes.io/name: telegraf-ds

---
#-----------------------------------------------
# 2. ClusterRole
#-----------------------------------------------
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: influx-stats-viewer
  labels:
    app.kubernetes.io/name: telegraf-ds
rules:
  - apiGroups: ["metrics.k8s.io"]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy", "nodes/stats"]
    verbs: ["get", "list", "watch"]
---
#-----------------------------------------------
# 3. ClusterRoleBinding
#-----------------------------------------------
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: influx-telegraf-viewer
  labels:
    app.kubernetes.io/name: telegraf-ds
subjects:
  - kind: ServiceAccount
    name: telegraf-ds 
    namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: influx-stats-viewer

---
#-----------------------------------------------
# 4. ConfigMap
#-----------------------------------------------
apiVersion: v1
kind: ConfigMap
metadata:
  name: telegraf-ds
  labels:
    app.kubernetes.io/name: telegraf-ds
data:
  telegraf.conf: |+

    [agent]
      collection_jitter = "0s"
      debug = false
      flush_interval = "10s"
      flush_jitter = "0s"
      hostname = "$HOSTNAME"
      interval = "10s"
      logfile = ""
      metric_batch_size = 1000
      metric_buffer_limit = 10000
      omit_hostname = false
      precision = ""
      quiet = false
      round_interval = true


    [[outputs.influxdb]]
      database = "telegraf-ds"
      insecure_skip_verify = false
      password = "blahblah"
      retention_policy = ""
      timeout = "5s"
      urls = [
        "http://xxx.xxx.xxx.xxx:8086"
      ]
      user_agent = "telegraf"
      username = "k8s"

    [[inputs.diskio]]
    [[inputs.kernel]]
    [[inputs.mem]]
    [[inputs.net]]
    [[inputs.processes]]
    [[inputs.swap]]
    [[inputs.system]]

    [[inputs.cpu]]
    percpu = true
    totalcpu = true
    collect_cpu_time = false
    report_active = false

    [[inputs.disk]]
    ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

    [[inputs.docker]]
    endpoint = "unix:///var/run/docker.sock"

    [[inputs.kubernetes]]
    url = "https://$HOSTIP:10250"
    bearer_token = "/run/secrets/kubernetes.io/serviceaccount/token"
    insecure_skip_verify = true
---
#-----------------------------------------------
# 5. DaemonSet
#-----------------------------------------------
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: telegraf-ds
  labels:
    app.kubernetes.io/name: telegraf-ds
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: telegraf-ds
  template:
    metadata:
      labels:
        app.kubernetes.io/name: telegraf-ds
    spec:
      serviceAccountName: telegraf-ds
      containers:
      - name: telegraf-ds
        image: telegraf:1.20.2
        imagePullPolicy: "IfNotPresent"
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 0.1
            memory: 256Mi
        env:
        - name: HOSTIP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: HOST_PROC
          value: /hostfs/proc
        - name: HOST_SYS
          value: /hostfs/sys
        - name: HOST_MOUNT_PREFIX
          value: /hostfs
        volumeMounts:
        - name: varrunutmpro
          mountPath: /var/run/utmp
          readOnly: true
        - name: hostfsro
          mountPath: /hostfs
          readOnly: true
        - name: docker-socket
          mountPath: /var/run/docker.sock
        - name: config
          mountPath: /etc/telegraf
      volumes:
      - name: hostfsro
        hostPath:
          path: /
      - name: docker-socket
        hostPath:
          path: /var/run/docker.sock
      - name: varrunutmpro
        hostPath:
          path: /var/run/utmp
      - name: config
        configMap:
          name:  telegraf-ds
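Assuming everything above is saved in a single file (filename assumed), apply it in kube-system so it matches the namespace referenced by the ClusterRoleBinding:

kubectl apply -n kube-system -f telegraf-ds.yaml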

@sspaink
Contributor

sspaink commented Jan 24, 2022

Closing: from the discussion it seems this issue is resolved (there have been significant changes to the k8s input plugin and its dependencies have been updated), and there is also a viable workaround using the official Helm chart. Please re-open if this isn't the case.

@sspaink sspaink closed this as completed Jan 24, 2022