health-checker not working as expected for containerd #683

Closed · balusarakesh opened this issue Jul 1, 2022 · 5 comments

@balusarakesh

Hi,
we are seeing errors while trying to enable monitoring for containerd:

I0701 21:00:30.390124       1 plugin.go:276] Start logs from plugin {Type:permanent Condition:ContainerRuntimeUnhealthy Reason:ContainerdUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=cri --enable-repair=true --cooldown-time=2m --health-check-timeout=60s] TimeoutString:0xc00030ee70 Timeout:3m0s} 
 I0701 21:00:20.388749     295 health_checker.go:172] command /usr/bin/crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock --image-endpoint=unix:///var/run/containerd/containerd.sock pods failed: fork/exec /usr/bin/crictl: no such file or directory, []
I0701 21:00:30.389453     295 health_checker.go:172] command /bin/systemctl show containerd --property=InactiveExitTimestamp failed: signal: killed, []
I0701 21:00:30.389501     295 health_checker.go:86] error in getting uptime for cri: signal: killed
I0701 21:00:30.390193       1 plugin.go:277] End logs from plugin {Type:permanent Condition:ContainerRuntimeUnhealthy Reason:ContainerdUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=cri --enable-repair=true --cooldown-time=2m --health-check-timeout=60s] TimeoutString:0xc00030ee70 Timeout:3m0s}
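
The "fork/exec /usr/bin/crictl: no such file or directory" line suggests crictl is simply not present in the image being used; one way to confirm (the pod name here is a placeholder) is to check from inside a running pod:

kubectl -n node-problem-detector exec -it <npd-pod> -- ls -l /usr/bin/crictl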

here's the daemonset:

# Source: node-problem-detector/templates/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: deliveryhero-node-problem-detector
  namespace: node-problem-detector
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    spec:
      serviceAccountName: deliveryhero-node-problem-detector
      hostNetwork: false
      hostPID: false
      terminationGracePeriodSeconds: 30
      containers:
        - name: node-problem-detector
          image:  "DOCKER_HOST/node-problem-detector:v0.8.9-test"
          imagePullPolicy: "IfNotPresent"
          command:
            - "/bin/sh"
            - "-c"
            - "exec /node-problem-detector --logtostderr --config.system-log-monitor=/config/kernel-monitor.json --config.custom-plugin-monitor=/config/systemd-monitor-counter.json,/config/kernel-monitor-counter.json,/config/health-checker-containerd.json --prometheus-address=0.0.0.0 --prometheus-port=20257 --k8s-exporter-heartbeat-period=5m0s"
          securityContext:
            privileged: true
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: log
              mountPath: /var/log/
              readOnly: true
            - name: localtime
              mountPath: /etc/localtime
              readOnly: true
            - name: custom-config
              mountPath: /custom-config
              readOnly: true
            - mountPath: /run/systemd/system
              name: systemd
            - mountPath: /var/run/dbus/
              mountPropagation: Bidirectional
              name: dbus
          ports:
            - containerPort: 20257
              name: exporter
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 100m
              memory: 100Mi
      tolerations:
        - effect: NoSchedule
          operator: Exists
      volumes:
        - name: log
          hostPath:
            path: /var/log/
        - name: localtime
          hostPath:
            path: /etc/localtime
            type: "FileOrCreate"
        - name: custom-config
          configMap:
            name: deliveryhero-node-problem-detector-custom-config
        - hostPath:
            path: /run/systemd/system/
            type: Directory
          name: systemd
        - hostPath:
            path: /var/run/dbus/
            type: Directory
          name: dbus

- the Docker image DOCKER_HOST/node-problem-detector:v0.8.9-test is created from this comment
- we use a Bottlerocket AMI (ami-0fd6126f25df4ba20 | bottlerocket-aws-k8s-1.21-x86_64-v1.5.3-f37bd7cb) on AWS EKS

kubectl version output:

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"3ddd0f45aa91e2f30c70734b175631bec5b5825a", GitTreeState:"clean", BuildDate:"2022-05-24T12:17:11Z", GoVersion:"go1.18.2", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.12-eks-a64ea69", GitCommit:"d4336843ba36120e9ed1491fddff5f2fec33eb77", GitTreeState:"clean", BuildDate:"2022-05-12T18:29:27Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
@diranged commented Sep 9, 2022

chiming in here, this is definitely broken even in the 0.8.12 release...

@karlhungus (Contributor) commented Sep 14, 2022

Adding crictl to the Docker image works for me, e.g.:

FROM registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.12 as builder
... # (may need to install wget)

# Install crictl
ARG TARGETOS
ARG TARGETARCH
# BUILDX_ARCH is used in the crictl download URL below.
# The required format is TARGETOS-TARGETARCH; default to linux-amd64
# so the Dockerfile works with and without buildkit.
ENV BUILDX_ARCH="${TARGETOS:-linux}-${TARGETARCH:-amd64}"


ARG VERSION="v1.24.1"
RUN yum install -y wget unzip && yum clean all
RUN wget https://github.com/kubernetes-sigs/cri-tools/releases/download/$VERSION/crictl-${VERSION}-${BUILDX_ARCH}.tar.gz && \
    tar zxvf crictl-$VERSION-${BUILDX_ARCH}.tar.gz -C /usr/bin && \
    rm -f crictl-$VERSION-${BUILDX_ARCH}.tar.gz
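
For reference, a minimal build-and-push sketch for the resulting image; the registry path and tag below simply mirror the values.yaml further down and should be adjusted to your own registry:

docker build -t <our_repo>/node-problem-detector/node-problem-detector:v0.8.12-modified .
docker push <our_repo>/node-problem-detector/node-problem-detector:v0.8.12-modified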

Then add the volume mounts (we use the Helm chart):

values.yaml
settings:
  # Custom monitor definitions to add to Node Problem Detector - to be
  # mounted at /custom-config. These are in addition to pre-packaged monitor
  # definitions provided within the default docker image available at /config:
  # https://github.com/kubernetes/node-problem-detector/tree/master/config
  # settings.custom_monitor_definitions -- Custom plugin monitor config files
  custom_monitor_definitions:
    health-checker-containerd.json: | # https://github.com/kubernetes/node-problem-detector/blob/1e8008bdedbeae39074c93cfe3fcdad7735f4db1/config/health-checker-containerd.json
      {
        "plugin": "custom",
        "pluginConfig": {
          "invoke_interval": "10s",
          "timeout": "3m",
          "max_output_length": 80,
          "concurrency": 1
        },
        "source": "health-checker",
        "metricsReporting": true,
        "conditions": [
          {
            "type": "ContainerRuntimeUnhealthy",
            "reason": "ContainerRuntimeIsHealthy",
            "message": "Container runtime on the node is functioning properly"
          }
        ],
        "rules": [
          {
            "type": "permanent",
            "condition": "ContainerRuntimeUnhealthy",
            "reason": "ContainerdUnhealthy",
            "path": "/home/kubernetes/bin/health-checker",
            "args": [
              "--component=cri",
              "--enable-repair=false",
              "--cooldown-time=2m",
              "--health-check-timeout=60s"
            ],
            "timeout": "3m"
          }
        ]
      }
    health-checker-kubelet.json: | # https://github.com/kubernetes/node-problem-detector/blob/1e8008bdedbeae39074c93cfe3fcdad7735f4db1/config/health-checker-kubelet.json
      {
        "plugin": "custom",
        "pluginConfig": {
          "invoke_interval": "10s",
          "timeout": "3m",
          "max_output_length": 80,
          "concurrency": 1
        },
        "source": "health-checker",
        "metricsReporting": true,
        "conditions": [
          {
            "type": "KubeletUnhealthy",
            "reason": "KubeletIsHealthy",
            "message": "kubelet on the node is functioning properly"
          }
        ],
        "rules": [
          {
            "type": "permanent",
            "condition": "KubeletUnhealthy",
            "reason": "KubeletUnhealthy",
            "path": "/home/kubernetes/bin/health-checker",
            "args": [
              "--component=kubelet",
              "--enable-repair=false",
              "--cooldown-time=1m",
              "--health-check-timeout=10s"
            ],
            "timeout": "3m"
          }
        ]
      }
    # docker-monitor-filelog.json: |
    #   {
    #     "plugin": "filelog",
    #     "pluginConfig": {
    #       "timestamp": "^time=\"(\\S*)\"",
    #       "message": "msg=\"([^\n]*)\"",
    #       "timestampFormat": "2006-01-02T15:04:05.999999999-07:00"
    #     },
    #     "logPath": "/var/log/docker.log",
    #     "lookback": "5m",
    #     "bufferSize": 10,
    #     "source": "docker-monitor",
    #     "conditions": [],
    #     "rules": [
    #       {
    #         "type": "temporary",
    #         "reason": "CorruptDockerImage",
    #         "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*"
    #       }
    #     ]
    #   }
  # settings.log_monitors -- User-specified custom monitor definitions
  log_monitors:
    - /config/kernel-monitor.json

    # An example of activating a custom log monitor definition in
    # Node Problem Detector
    # - /custom-config/docker-monitor-filelog.json
  custom_plugin_monitors:
    - /custom-config/health-checker-kubelet.json
    - /custom-config/health-checker-containerd.json

  # settings.prometheus_address -- Prometheus exporter address
  prometheus_address: 0.0.0.0
  # settings.prometheus_port -- Prometheus exporter port
  prometheus_port: 20257 # update prometheus.io/port below

  # The period at which k8s-exporter does forcibly sync with apiserver
  # settings.heartBeatPeriod -- Syncing interval with API server
  heartBeatPeriod: 5m0s

logDir:
  # logDir.host -- log directory on k8s host
  host: /var/log/
  # logDir.pod -- log directory in pod (volume mount), use logDir.host if empty
  pod: ""

image:
  repository: <our_repo>/node-problem-detector/node-problem-detector
  tag: v0.8.12-modified
  pullPolicy: IfNotPresent

imagePullSecrets: []

nameOverride: "node-problem-detector"
fullnameOverride: "node-problem-detector"

rbac:
  create: true
  pspEnabled: false

# hostNetwork -- Run pod on host network
# Flag to run Node Problem Detector on the host's network. This is typically
# not recommended, but may be useful for certain use cases.
hostNetwork: true
hostPID: false

priorityClassName: system-node-critical

securityContext:
  privileged: true

resources:
  limits:
    cpu: 10m
    memory: 80Mi
  requests:
    cpu: 10m
    memory: 80Mi


annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "20257" # should match prometheus_port above

labels: {}

tolerations:
  - effect: NoSchedule
    operator: Exists

serviceAccount:
  # Specifies whether a ServiceAccount should be created
  create: true
  # The name of the ServiceAccount to use.
  # If not set and create is true, a name is generated using the fullname template
  name:

affinity: {}

nodeSelector: {}

metrics:
  enabled: true
  annotations: {}
  serviceMonitor:
    enabled: false
    additionalLabels: {}
  prometheusRule:
    enabled: false
    defaultRules:
      create: true
      disabled: []
    additionalLabels: {}
    additionalRules: []

env:
#  - name: FOO
#    value: BAR
#  - name: POD_NAME
#    valueFrom:
#      fieldRef:
#        fieldPath: metadata.name

extraVolumes:
  - name: kmsg
    hostPath:
      path: /dev/kmsg
  - name: machine-id
    hostPath:
      path: /etc/machine-id
      type: "File"
  - name: systemd
    hostPath:
      path: /run/systemd/system/
      type: ""
  - name: dbus
    hostPath:
      path: /var/run/dbus/
      type: ""
  - name: containerd
    hostPath:
      path: /var/run/containerd
      type: ""

extraVolumeMounts:
  - name: kmsg
    mountPath: /dev/kmsg
    readOnly: true
  - mountPath: /etc/machine-id
    name: machine-id
    readOnly: true
  - mountPath: /run/systemd/system
    name: systemd
  - mountPath: /var/run/dbus/
    name: dbus
    mountPropagation: Bidirectional
  - mountPath: /var/run/containerd
    name: containerd
    readOnly: true
extraContainers: []

# updateStrategy -- Manage the daemonset update strategy
updateStrategy: RollingUpdate
# maxUnavailable -- The max pods unavailable during an update
maxUnavailable: 1
daemonset.yaml
---
# Source: node-problem-detector/templates/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  labels:
    app.kubernetes.io/name: node-problem-detector
    helm.sh/chart: node-problem-detector-2.2.4
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/managed-by: Helm
  namespace: node-problem-detector
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: node-problem-detector
      app.kubernetes.io/instance: release-name
      app: node-problem-detector
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-problem-detector
        app.kubernetes.io/instance: release-name
        app: node-problem-detector
      annotations:
        checksum/config: 802cfbb98adbbb0754fbc87e3ca04ca46623ed173c8ea4b33bc7b3148611a04a
        prometheus.io/port: "20257"
        prometheus.io/scrape: "true"
    spec:
      serviceAccountName: node-problem-detector
      hostNetwork: true
      hostPID: false
      terminationGracePeriodSeconds: 30
      priorityClassName: "system-node-critical"
      containers:
        - name: node-problem-detector
          image:  "<our_repo>/node-problem-detector/node-problem-detector:v0.8.12-modified"
          imagePullPolicy: "IfNotPresent"
          command:
            - "/bin/sh"
            - "-c"
            - "exec /node-problem-detector --logtostderr --config.system-log-monitor=/config/kernel-monitor.json --config.custom-plugin-monitor=/custom-config/health-checker-kubelet.json,/custom-config/health-checker-containerd.json --prometheus-address=0.0.0.0 --prometheus-port=20257 --k8s-exporter-heartbeat-period=5m0s"
          securityContext:
            privileged: true
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: log
              mountPath: /var/log/
              readOnly: true
            - name: localtime
              mountPath: /etc/localtime
              readOnly: true
            - name: custom-config
              mountPath: /custom-config
              readOnly: true
            - mountPath: /dev/kmsg
              name: kmsg
              readOnly: true
            - mountPath: /etc/machine-id
              name: machine-id
              readOnly: true
            - mountPath: /run/systemd/system
              name: systemd
            - mountPath: /var/run/dbus/
              mountPropagation: Bidirectional
              name: dbus
            - mountPath: /var/run/containerd
              name: containerd
              readOnly: true
          ports:
            - containerPort: 20257
              name: exporter
          resources:
            limits:
              cpu: 10m
              memory: 80Mi
            requests:
              cpu: 10m
              memory: 80Mi
      tolerations:
        - effect: NoSchedule
          operator: Exists
      volumes:
        - name: log
          hostPath:
            path: /var/log/
        - name: localtime
          hostPath:
            path: /etc/localtime
            type: "FileOrCreate"
        - name: custom-config
          configMap:
            name: node-problem-detector-custom-config
        - hostPath:
            path: /dev/kmsg
          name: kmsg
        - hostPath:
            path: /etc/machine-id
            type: File
          name: machine-id
        - hostPath:
            path: /run/systemd/system/
            type: ""
          name: systemd
        - hostPath:
            path: /var/run/dbus/
            type: ""
          name: dbus
        - hostPath:
            path: /var/run/containerd
            type: ""
          name: containerd

@balusarakesh (Author)

@karlhungus I'm still seeing a similar error:

I0921 22:49:35.952030       1 plugin.go:281] End logs from plugin {Type:permanent Condition:KubeletUnhealthy Reason:KubeletUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=kubelet --enable-repair=true --cooldown-time=1m --loopback-time=0 --health-check-timeout=10s] TimeoutString:0xc0004645a0 Timeout:3m0s}
I0921 22:49:45.966322       1 plugin.go:280] Start logs from plugin {Type:permanent Condition:KubeletUnhealthy Reason:KubeletUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=kubelet --enable-repair=true --cooldown-time=1m --loopback-time=0 --health-check-timeout=10s] TimeoutString:0xc0004645a0 Timeout:3m0s} 
 I0921 22:49:45.965359     356 health_checker.go:172] command /bin/systemctl show kubelet --property=InactiveExitTimestamp failed: signal: killed, []
I0921 22:49:45.965463     356 health_checker.go:86] error in getting uptime for kubelet: signal: killed
I0921 22:49:45.966396       1 plugin.go:281] End logs from plugin {Type:permanent Condition:KubeletUnhealthy Reason:KubeletUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=kubelet --enable-repair=true --cooldown-time=1m --loopback-time=0 --health-check-timeout=10s] TimeoutString:0xc0004645a0 Timeout:3m0s}
I0921 22:49:47.439907       1 plugin.go:280] Start logs from plugin {Type:permanent Condition:ContainerRuntimeUnhealthy Reason:ContainerdUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=cri --enable-repair=true --cooldown-time=2m --health-check-timeout=60s] TimeoutString:0xc000464680 Timeout:3m0s} 
 I0921 22:49:37.438546     343 health_checker.go:172] command /usr/bin/crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock pods --latest failed: exit status 1, []
I0921 22:49:47.439307     343 health_checker.go:172] command /bin/systemctl show containerd --property=InactiveExitTimestamp failed: signal: killed, []
I0921 22:49:47.439338     343 health_checker.go:86] error in getting uptime for cri: signal: killed

@karlhungus (Contributor)

This is completely a guess, but it sounds like your mounts aren't right -- it's hard to tell because crictl isn't outputting stderr (I've submitted a PR for that, #702, but it hasn't gotten much attention).

One thing I found helpful when debugging these issues is to shell into the container and see what the command does, e.g.:

kubectl exec <podname> -it -- /usr/bin/crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock pods --latest
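
The same approach can be used for the systemctl call that fails with "signal: killed" in the logs above (this assumes the /run/systemd/system and /var/run/dbus mounts from the daemonset are in place):

kubectl exec <podname> -it -- /bin/systemctl show containerd --property=InactiveExitTimestamp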

@balusarakesh (Author)

Thank you @karlhungus.
I finally fixed it by setting --cri-socket-path=unix:///var/run/containerd/dockershim.sock and mounting the socket with this volume (apparently Bottlerocket uses a different socket path):

  - name: containerd
    hostPath:
      path: /var/run/
      type: ""
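
For reference, a sketch of the containerd health-checker rule with that socket-path flag added; everything except the --cri-socket-path argument is unchanged from the health-checker-containerd.json shown in the values.yaml above:

    "rules": [
      {
        "type": "permanent",
        "condition": "ContainerRuntimeUnhealthy",
        "reason": "ContainerdUnhealthy",
        "path": "/home/kubernetes/bin/health-checker",
        "args": [
          "--component=cri",
          "--cri-socket-path=unix:///var/run/containerd/dockershim.sock",
          "--enable-repair=false",
          "--cooldown-time=2m",
          "--health-check-timeout=60s"
        ],
        "timeout": "3m"
      }
    ]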
