
Unable to get the kubelet monitor to run #584

Closed
dylanlingelbach opened this issue Jun 17, 2021 · 7 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dylanlingelbach

I am running into a similar issue to #214 and #439.

I am using the k8s.gcr.io/node-problem-detector/node-problem-detector:v0.8.8 image (which doesn't have systemd installed) and deploying it to a Bottlerocket host that does have systemd.

Everything works great until I try to enable the kubelet monitor, since that shells out to systemctl to get the kubelet's uptime.

I've tried mounting /bin/systemctl and the other suggestions in those issues without luck.

Is mounting systemctl from the host the only way to get the kubelet monitor running? Or is installing node-problem-detector on the host itself a better solution?
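
For reference, the host-mount approach suggested in those issues looks roughly like the sketch below. The volume names and the /usr/bin/systemctl path are assumptions for a typical systemd host; Bottlerocket may not expose a systemctl binary at that path at all, and even where it does, the binary still needs the host's shared libraries and access to the host's D-Bus socket, which is why this route tends to be fragile:

    spec:
      containers:
      - name: node-problem-detector
        volumeMounts:
        # Sketch only: mount the host's systemctl plus what it needs to reach
        # the host's systemd over D-Bus; paths may differ per distro.
        - name: host-systemctl
          mountPath: /usr/bin/systemctl
          readOnly: true
        - name: host-run-systemd
          mountPath: /run/systemd
        - name: host-dbus
          mountPath: /var/run/dbus
      volumes:
      - name: host-systemctl
        hostPath:
          path: /usr/bin/systemctl
          type: File
      - name: host-run-systemd
        hostPath:
          path: /run/systemd
      - name: host-dbus
        hostPath:
          path: /var/run/dbus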

@hjkatz

hjkatz commented Jul 23, 2021

I ran into these same problems and was able to run node-problem-detector with access to the systemctl and docker binaries by building a wrapper image, like so:

FROM us.gcr.io/k8s-artifacts-prod/node-problem-detector/node-problem-detector:v0.8.9

RUN clean-install \
        curl \
        systemd \
        docker.io
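
(clean-install is the apt helper that ships with the image's Debian base; it is essentially apt-get install plus cache cleanup.) Build and push this image to your own registry, then point the DaemonSet's image: field at it; custom-registry/custom-image:0.8.9 below is just a placeholder for that wrapper image.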

I was also able to get the log-counter and health-checker scripts and custom plugins working (Node conditions show up and behave as expected in testing) with the following daemonset.yaml:

    spec:
      # health-checker for kubelet uses the local network to check kubelet's /healthz
      hostNetwork: true
      containers:
      - name: node-problem-detector
        command:
        - /node-problem-detector
        - --logtostderr
        - --config.system-log-monitor=[files]
        - --config.custom-plugin-monitor=[files]
        image: custom-registry/custom-image:0.8.9
        securityContext:
          privileged: true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: log
          mountPath: /var/log
          readOnly: true
        - name: kmsg
          mountPath: /dev/kmsg
          readOnly: true
        # Make sure container is in the same timezone with the host.
        - name: localtime
          mountPath: /etc/localtime
          readOnly: true
        - name: config
          mountPath: /config
        - mountPath: /etc/machine-id
          name: machine-id
          readOnly: true
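        # systemctl inside the container talks to the host's systemd over D-Bus,
        # which is why the host's /run/systemd/system and D-Bus socket are
        # mounted below; the Docker socket is only needed if you enable the
        # docker health check.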
        - mountPath: /run/systemd/system
          name: systemd
        - mountPath: /var/run/docker.sock
          name: docker-sock
        - mountPath: /var/run/dbus/
          name: dbus
          mountPropagation: Bidirectional
      volumes:
      - name: log
        # Point `log` at your system's log directory
        hostPath:
          path: /var/log/
      - name: kmsg
        hostPath:
          path: /dev/kmsg
      - name: localtime
        hostPath:
          path: /etc/localtime
      - name: config
        configMap:
          defaultMode: 0744
          name: node-problem-detector-config
      - name: machine-id
        hostPath:
          path: /etc/machine-id
          type: File
      - name: systemd
        hostPath:
          path: /run/systemd/system/
          type: Directory
      - name: dbus
        hostPath:
          path: /var/run/dbus/
          type: Directory
      - name: docker-sock
        hostPath:
          path: /var/run/docker.sock
          type: Socket
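
The node-problem-detector-config ConfigMap mounted at /config above needs to carry whatever monitor configs you list in --config.system-log-monitor and --config.custom-plugin-monitor (the [files] placeholders would be paths like /config/health-checker-kubelet.json). Below is a rough sketch of the kubelet health-checker entry, adapted from config/health-checker-kubelet.json in this repo; treat the upstream file as authoritative, since the field values and the /home/kubernetes/bin path (where the stock image ships the helper binaries, as far as I can tell) are approximations:

apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector-config
data:
  health-checker-kubelet.json: |
    {
      "plugin": "custom",
      "pluginConfig": {
        "invoke_interval": "10s",
        "timeout": "3m",
        "max_output_length": 80,
        "concurrency": 1
      },
      "source": "health-checker",
      "conditions": [
        {
          "type": "KubeletUnhealthy",
          "reason": "KubeletIsHealthy",
          "message": "kubelet on the node is functioning properly"
        }
      ],
      "rules": [
        {
          "type": "permanent",
          "condition": "KubeletUnhealthy",
          "reason": "KubeletUnhealthy",
          "path": "/home/kubernetes/bin/health-checker",
          "args": [
            "--component=kubelet",
            "--enable-repair=true",
            "--cooldown-time=1m",
            "--health-check-timeout=10s"
          ],
          "timeout": "3m"
        }
      ]
    }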

I hope this helps others who run into these challenges. Cheers!

Related Issue:

@peterrosell

peterrosell commented Aug 12, 2021

I ran into this issue because of error logs about a missing systemctl, and I now see that more binaries are missing.

I just played around with the Dockerfile in this repo to see what impact including these binaries in the default Docker image would have. The image size today is about 140 MB:

  • adding systemctl increases the image size by 18 MB
  • adding curl increases the image size by 4 MB
  • adding docker.io increases the image size by 146 MB

To me, adding systemctl and curl seems like no big deal. docker.io, on the other hand, I'm not sure about: many people have replaced docker with containerd, but if you enable the docker health check the binary is needed.

One idea could be to create a PR that adds systemctl and curl and let people build their own image if they need docker, but that's not very user friendly.

Any thoughts on this?

com6056 added a commit to com6056/node-problem-detector that referenced this issue Sep 3, 2021
A few issues have popped up where the provided image doesn't have the required packages for certain health checking operations (like kubernetes#584 (comment)).

This installs curl and systemd in the container to help alleviate these issues.
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 10, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 10, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vteratipally
Collaborator

/lgtm
/approve
