health-checker not working as expected for containerd #683

Closed · balusarakesh opened this issue Jul 1, 2022 · 5 comments

@balusarakesh

Hi,
we are seeing errors while trying to enable monitoring for containerd:

I0701 21:00:30.390124       1 plugin.go:276] Start logs from plugin {Type:permanent Condition:ContainerRuntimeUnhealthy Reason:ContainerdUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=cri --enable-repair=true --cooldown-time=2m --health-check-timeout=60s] TimeoutString:0xc00030ee70 Timeout:3m0s} 
 I0701 21:00:20.388749     295 health_checker.go:172] command /usr/bin/crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock --image-endpoint=unix:///var/run/containerd/containerd.sock pods failed: fork/exec /usr/bin/crictl: no such file or directory, []
I0701 21:00:30.389453     295 health_checker.go:172] command /bin/systemctl show containerd --property=InactiveExitTimestamp failed: signal: killed, []
I0701 21:00:30.389501     295 health_checker.go:86] error in getting uptime for cri: signal: killed
I0701 21:00:30.390193       1 plugin.go:277] End logs from plugin {Type:permanent Condition:ContainerRuntimeUnhealthy Reason:ContainerdUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=cri --enable-repair=true --cooldown-time=2m --health-check-timeout=60s] TimeoutString:0xc00030ee70 Timeout:3m0s}
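
The "fork/exec /usr/bin/crictl: no such file or directory" line suggests crictl is simply not present in the image being used; one way to confirm (the pod name here is a placeholder) is to check from inside a running pod:

kubectl -n node-problem-detector exec -it <npd-pod> -- ls -l /usr/bin/crictl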

here's the daemonset:

# Source: node-problem-detector/templates/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: deliveryhero-node-problem-detector
  namespace: node-problem-detector
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    spec:
      serviceAccountName: deliveryhero-node-problem-detector
      hostNetwork: false
      hostPID: false
      terminationGracePeriodSeconds: 30
      containers:
        - name: node-problem-detector
          image:  "DOCKER_HOST/node-problem-detector:v0.8.9-test"
          imagePullPolicy: "IfNotPresent"
          command:
            - "/bin/sh"
            - "-c"
            - "exec /node-problem-detector --logtostderr --config.system-log-monitor=/config/kernel-monitor.json --config.custom-plugin-monitor=/config/systemd-monitor-counter.json,/config/kernel-monitor-counter.json,/config/health-checker-containerd.json --prometheus-address=0.0.0.0 --prometheus-port=20257 --k8s-exporter-heartbeat-period=5m0s"
          securityContext:
            privileged: true
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: log
              mountPath: /var/log/
              readOnly: true
            - name: localtime
              mountPath: /etc/localtime
              readOnly: true
            - name: custom-config
              mountPath: /custom-config
              readOnly: true
            - mountPath: /run/systemd/system
              name: systemd
            - mountPath: /var/run/dbus/
              mountPropagation: Bidirectional
              name: dbus
          ports:
            - containerPort: 20257
              name: exporter
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 100m
              memory: 100Mi
      tolerations:
        - effect: NoSchedule
          operator: Exists
      volumes:
        - name: log
          hostPath:
            path: /var/log/
        - name: localtime
          hostPath:
            path: /etc/localtime
            type: "FileOrCreate"
        - name: custom-config
          configMap:
            name: deliveryhero-node-problem-detector-custom-config
        - hostPath:
            path: /run/systemd/system/
            type: Directory
          name: systemd
        - hostPath:
            path: /var/run/dbus/
            type: Directory
          name: dbus

- the Docker image DOCKER_HOST/node-problem-detector:v0.8.9-test is created from this comment
- we use a Bottlerocket AMI (ami-0fd6126f25df4ba20 | bottlerocket-aws-k8s-1.21-x86_64-v1.5.3-f37bd7cb) on AWS EKS

kubectl version output:

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.1", GitCommit:"3ddd0f45aa91e2f30c70734b175631bec5b5825a", GitTreeState:"clean", BuildDate:"2022-05-24T12:17:11Z", GoVersion:"go1.18.2", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.12-eks-a64ea69", GitCommit:"d4336843ba36120e9ed1491fddff5f2fec33eb77", GitTreeState:"clean", BuildDate:"2022-05-12T18:29:27Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
@diranged commented Sep 9, 2022

chiming in here, this is definitely broken even in the 0.8.12 release...

@karlhungus (Contributor) commented Sep 14, 2022

Adding crictl to the Docker image works for me, e.g.:

FROM registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.12 as builder
... # (may need to install wget)

# Install crictl
ARG TARGETOS
ARG TARGETARCH
# BUILDX_ARCH is used in the crictl download URL below.
# The required format is TARGETOS-TARGETARCH; default to linux-amd64
# so the Dockerfile works with and without buildkit.
ENV BUILDX_ARCH="${TARGETOS:-linux}-${TARGETARCH:-amd64}"


ARG VERSION="v1.24.1"
RUN yum install -y wget unzip && yum clean all
RUN wget https://github.com/kubernetes-sigs/cri-tools/releases/download/$VERSION/crictl-${VERSION}-${BUILDX_ARCH}.tar.gz && \
    tar zxvf crictl-$VERSION-${BUILDX_ARCH}.tar.gz -C /usr/bin && \
    rm -f crictl-$VERSION-${BUILDX_ARCH}.tar.gz
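
For reference, a minimal build-and-push sketch for the resulting image; the registry path and tag below simply mirror the values.yaml further down and should be adjusted to your own registry:

docker build -t <our_repo>/node-problem-detector/node-problem-detector:v0.8.12-modified .
docker push <our_repo>/node-problem-detector/node-problem-detector:v0.8.12-modified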

Then add the volume mounts (we use the Helm chart):

values.yaml
settings:
  # Custom monitor definitions to add to Node Problem Detector - to be
  # mounted at /custom-config. These are in addition to pre-packaged monitor
  # definitions provided within the default docker image available at /config:
  # https://github.com/kubernetes/node-problem-detector/tree/master/config
  # settings.custom_monitor_definitions -- Custom plugin monitor config files
  custom_monitor_definitions:
    health-checker-containerd.json: | # https://github.com/kubernetes/node-problem-detector/blob/1e8008bdedbeae39074c93cfe3fcdad7735f4db1/config/health-checker-containerd.json
      {
        "plugin": "custom",
        "pluginConfig": {
          "invoke_interval": "10s",
          "timeout": "3m",
          "max_output_length": 80,
          "concurrency": 1
        },
        "source": "health-checker",
        "metricsReporting": true,
        "conditions": [
          {
            "type": "ContainerRuntimeUnhealthy",
            "reason": "ContainerRuntimeIsHealthy",
            "message": "Container runtime on the node is functioning properly"
          }
        ],
        "rules": [
          {
            "type": "permanent",
            "condition": "ContainerRuntimeUnhealthy",
            "reason": "ContainerdUnhealthy",
            "path": "/home/kubernetes/bin/health-checker",
            "args": [
              "--component=cri",
              "--enable-repair=false",
              "--cooldown-time=2m",
              "--health-check-timeout=60s"
            ],
            "timeout": "3m"
          }
        ]
      }
    health-checker-kubelet.json: | # https://github.com/kubernetes/node-problem-detector/blob/1e8008bdedbeae39074c93cfe3fcdad7735f4db1/config/health-checker-kubelet.json
      {
        "plugin": "custom",
        "pluginConfig": {
          "invoke_interval": "10s",
          "timeout": "3m",
          "max_output_length": 80,
          "concurrency": 1
        },
        "source": "health-checker",
        "metricsReporting": true,
        "conditions": [
          {
            "type": "KubeletUnhealthy",
            "reason": "KubeletIsHealthy",
            "message": "kubelet on the node is functioning properly"
          }
        ],
        "rules": [
          {
            "type": "permanent",
            "condition": "KubeletUnhealthy",
            "reason": "KubeletUnhealthy",
            "path": "/home/kubernetes/bin/health-checker",
            "args": [
              "--component=kubelet",
              "--enable-repair=false",
              "--cooldown-time=1m",
              "--health-check-timeout=10s"
            ],
            "timeout": "3m"
          }
        ]
      }
    # docker-monitor-filelog.json: |
    #   {
    #     "plugin": "filelog",
    #     "pluginConfig": {
    #       "timestamp": "^time=\"(\\S*)\"",
    #       "message": "msg=\"([^\n]*)\"",
    #       "timestampFormat": "2006-01-02T15:04:05.999999999-07:00"
    #     },
    #     "logPath": "/var/log/docker.log",
    #     "lookback": "5m",
    #     "bufferSize": 10,
    #     "source": "docker-monitor",
    #     "conditions": [],
    #     "rules": [
    #       {
    #         "type": "temporary",
    #         "reason": "CorruptDockerImage",
    #         "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*"
    #       }
    #     ]
    #   }
  # settings.log_monitors -- User-specified custom monitor definitions
  log_monitors:
    - /config/kernel-monitor.json

    # An example of activating a custom log monitor definition in
    # Node Problem Detector
    # - /custom-config/docker-monitor-filelog.json
  custom_plugin_monitors:
    - /custom-config/health-checker-kubelet.json
    - /custom-config/health-checker-containerd.json

  # settings.prometheus_address -- Prometheus exporter address
  prometheus_address: 0.0.0.0
  # settings.prometheus_port -- Prometheus exporter port
  prometheus_port: 20257 # update prometheus.io/port below

  # The period at which k8s-exporter does forcibly sync with apiserver
  # settings.heartBeatPeriod -- Syncing interval with API server
  heartBeatPeriod: 5m0s

logDir:
  # logDir.host -- log directory on k8s host
  host: /var/log/
  # logDir.pod -- log directory in pod (volume mount), use logDir.host if empty
  pod: ""

image:
  repository: <our_repo>/node-problem-detector/node-problem-detector
  tag: v0.8.12-modified
  pullPolicy: IfNotPresent

imagePullSecrets: []

nameOverride: "node-problem-detector"
fullnameOverride: "node-problem-detector"

rbac:
  create: true
  pspEnabled: false

# hostNetwork -- Run pod on host network
# Flag to run Node Problem Detector on the host's network. This is typically
# not recommended, but may be useful for certain use cases.
hostNetwork: true
hostPID: false

priorityClassName: system-node-critical

securityContext:
  privileged: true

resources:
  limits:
    cpu: 10m
    memory: 80Mi
  requests:
    cpu: 10m
    memory: 80Mi


annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "20257" # should match prometheus_port above

labels: {}

tolerations:
  - effect: NoSchedule
    operator: Exists

serviceAccount:
  # Specifies whether a ServiceAccount should be created
  create: true
  # The name of the ServiceAccount to use.
  # If not set and create is true, a name is generated using the fullname template
  name:

affinity: {}

nodeSelector: {}

metrics:
  enabled: true
  annotations: {}
  serviceMonitor:
    enabled: false
    additionalLabels: {}
  prometheusRule:
    enabled: false
    defaultRules:
      create: true
      disabled: []
    additionalLabels: {}
    additionalRules: []

env:
#  - name: FOO
#    value: BAR
#  - name: POD_NAME
#    valueFrom:
#      fieldRef:
#        fieldPath: metadata.name

extraVolumes:
  - name: kmsg
    hostPath:
      path: /dev/kmsg
  - name: machine-id
    hostPath:
      path: /etc/machine-id
      type: "File"
  - name: systemd
    hostPath:
      path: /run/systemd/system/
      type: ""
  - name: dbus
    hostPath:
      path: /var/run/dbus/
      type: ""
  - name: containerd
    hostPath:
      path: /var/run/containerd
      type: ""

extraVolumeMounts:
  - name: kmsg
    mountPath: /dev/kmsg
    readOnly: true
  - mountPath: /etc/machine-id
    name: machine-id
    readOnly: true
  - mountPath: /run/systemd/system
    name: systemd
  - mountPath: /var/run/dbus/
    name: dbus
    mountPropagation: Bidirectional
  - mountPath: /var/run/containerd
    name: containerd
    readOnly: true
extraContainers: []

# updateStrategy -- Manage the daemonset update strategy
updateStrategy: RollingUpdate
# maxUnavailable -- The max pods unavailable during an update
maxUnavailable: 1
daemonset.yaml
---
# Source: node-problem-detector/templates/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  labels:
    app.kubernetes.io/name: node-problem-detector
    helm.sh/chart: node-problem-detector-2.2.4
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/managed-by: Helm
  namespace: node-problem-detector
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: node-problem-detector
      app.kubernetes.io/instance: release-name
      app: node-problem-detector
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-problem-detector
        app.kubernetes.io/instance: release-name
        app: node-problem-detector
      annotations:
        checksum/config: 802cfbb98adbbb0754fbc87e3ca04ca46623ed173c8ea4b33bc7b3148611a04a
        prometheus.io/port: "20257"
        prometheus.io/scrape: "true"
    spec:
      serviceAccountName: node-problem-detector
      hostNetwork: true
      hostPID: false
      terminationGracePeriodSeconds: 30
      priorityClassName: "system-node-critical"
      containers:
        - name: node-problem-detector
          image:  "<our_repo>/node-problem-detector/node-problem-detector:v0.8.12-modified"
          imagePullPolicy: "IfNotPresent"
          command:
            - "/bin/sh"
            - "-c"
            - "exec /node-problem-detector --logtostderr --config.system-log-monitor=/config/kernel-monitor.json --config.custom-plugin-monitor=/custom-config/health-checker-kubelet.json,/custom-config/health-checker-containerd.json --prometheus-address=0.0.0.0 --prometheus-port=20257 --k8s-exporter-heartbeat-period=5m0s"
          securityContext:
            privileged: true
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: log
              mountPath: /var/log/
              readOnly: true
            - name: localtime
              mountPath: /etc/localtime
              readOnly: true
            - name: custom-config
              mountPath: /custom-config
              readOnly: true
            - mountPath: /dev/kmsg
              name: kmsg
              readOnly: true
            - mountPath: /etc/machine-id
              name: machine-id
              readOnly: true
            - mountPath: /run/systemd/system
              name: systemd
            - mountPath: /var/run/dbus/
              mountPropagation: Bidirectional
              name: dbus
            - mountPath: /var/run/containerd
              name: containerd
              readOnly: true
          ports:
            - containerPort: 20257
              name: exporter
          resources:
            limits:
              cpu: 10m
              memory: 80Mi
            requests:
              cpu: 10m
              memory: 80Mi
      tolerations:
        - effect: NoSchedule
          operator: Exists
      volumes:
        - name: log
          hostPath:
            path: /var/log/
        - name: localtime
          hostPath:
            path: /etc/localtime
            type: "FileOrCreate"
        - name: custom-config
          configMap:
            name: node-problem-detector-custom-config
        - hostPath:
            path: /dev/kmsg
          name: kmsg
        - hostPath:
            path: /etc/machine-id
            type: File
          name: machine-id
        - hostPath:
            path: /run/systemd/system/
            type: ""
          name: systemd
        - hostPath:
            path: /var/run/dbus/
            type: ""
          name: dbus
        - hostPath:
            path: /var/run/containerd
            type: ""
          name: containerd

@balusarakesh (Author)

@karlhungus I'm still seeing a similar error:

I0921 22:49:35.952030       1 plugin.go:281] End logs from plugin {Type:permanent Condition:KubeletUnhealthy Reason:KubeletUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=kubelet --enable-repair=true --cooldown-time=1m --loopback-time=0 --health-check-timeout=10s] TimeoutString:0xc0004645a0 Timeout:3m0s}
I0921 22:49:45.966322       1 plugin.go:280] Start logs from plugin {Type:permanent Condition:KubeletUnhealthy Reason:KubeletUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=kubelet --enable-repair=true --cooldown-time=1m --loopback-time=0 --health-check-timeout=10s] TimeoutString:0xc0004645a0 Timeout:3m0s} 
 I0921 22:49:45.965359     356 health_checker.go:172] command /bin/systemctl show kubelet --property=InactiveExitTimestamp failed: signal: killed, []
I0921 22:49:45.965463     356 health_checker.go:86] error in getting uptime for kubelet: signal: killed
I0921 22:49:45.966396       1 plugin.go:281] End logs from plugin {Type:permanent Condition:KubeletUnhealthy Reason:KubeletUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=kubelet --enable-repair=true --cooldown-time=1m --loopback-time=0 --health-check-timeout=10s] TimeoutString:0xc0004645a0 Timeout:3m0s}
I0921 22:49:47.439907       1 plugin.go:280] Start logs from plugin {Type:permanent Condition:ContainerRuntimeUnhealthy Reason:ContainerdUnhealthy Path:/home/kubernetes/bin/health-checker Args:[--component=cri --enable-repair=true --cooldown-time=2m --health-check-timeout=60s] TimeoutString:0xc000464680 Timeout:3m0s} 
 I0921 22:49:37.438546     343 health_checker.go:172] command /usr/bin/crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock pods --latest failed: exit status 1, []
I0921 22:49:47.439307     343 health_checker.go:172] command /bin/systemctl show containerd --property=InactiveExitTimestamp failed: signal: killed, []
I0921 22:49:47.439338     343 health_checker.go:86] error in getting uptime for cri: signal: killed

@karlhungus (Contributor)

This is completely a guess, but it sounds like your mounts aren't right -- it's hard to tell because crictl isn't outputting stderr (I've submitted a PR for that, #702, but it hasn't gotten much attention).

One thing I found helpful when debugging these issues is to shell into the container and see what the command does, e.g.:

kubectl exec <podname> -it -- /usr/bin/crictl --runtime-endpoint=unix:///var/run/containerd/containerd.sock pods --latest
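
The same approach can be used for the systemctl call that fails with "signal: killed" in the logs above (this assumes the /run/systemd/system and /var/run/dbus mounts from the daemonset are in place):

kubectl exec <podname> -it -- /bin/systemctl show containerd --property=InactiveExitTimestamp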

@balusarakesh (Author)

Thank you @karlhungus.
I finally fixed it by setting --cri-socket-path=unix:///var/run/containerd/dockershim.sock and mounting the socket with this volume (apparently Bottlerocket uses a different socket path):

  - name: containerd
    hostPath:
      path: /var/run/
      type: ""
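
For reference, a sketch of the containerd health-checker rule with that socket-path flag added; everything except the --cri-socket-path argument is unchanged from the health-checker-containerd.json shown in the values.yaml above:

    "rules": [
      {
        "type": "permanent",
        "condition": "ContainerRuntimeUnhealthy",
        "reason": "ContainerdUnhealthy",
        "path": "/home/kubernetes/bin/health-checker",
        "args": [
          "--component=cri",
          "--cri-socket-path=unix:///var/run/containerd/dockershim.sock",
          "--enable-repair=false",
          "--cooldown-time=2m",
          "--health-check-timeout=60s"
        ],
        "timeout": "3m"
      }
    ]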
