
buffer-metrics-sidecar should have a health check #1811

Closed
genofire opened this issue Sep 19, 2024 · 1 comment · Fixed by #1826
Assignees: csatib02
Labels: bug (Something isn't working), feature-request
Milestone: 4.11

Comments

genofire (Collaborator) commented Sep 19, 2024

Describe the bug:

  • The buffer-metrics-sidecar container in the Fluentd pod stops working
  • The container remains stuck in this state

Expected behaviour:

  • A health check runs against the container
  • The container is restarted when the health check fails

Steps to reproduce the bug:
No idea; there was no time to debug why this container (based on node-exporter) stopped.

Additional context:
The desired change, as a diff against the Fluentd pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: logging-operator-fluentd-0
spec:
  containers:
    - name: buffer-metrics-sidecar                                                                                                               
      ports:
        - containerPort: 9200
          name: buffer-metrics
          protocol: TCP
+     livenessProbe:
+       failureThreshold: 3
+       httpGet:
+         path: /
+         port: buffer-metrics
+         scheme: HTTP
+       periodSeconds: 10
+       successThreshold: 1
+       timeoutSeconds: 1
+     readinessProbe:
+       failureThreshold: 3
+       httpGet:
+         path: /
+         port: buffer-metrics
+         scheme: HTTP
+       periodSeconds: 10
+       successThreshold: 1
+       timeoutSeconds: 1
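
To see what such a probe would check, the metrics endpoint can also be queried by hand. A minimal sketch, assuming the pod runs in the logging namespace (adjust pod and namespace names to your setup):

# Forward the sidecar's buffer-metrics port locally and query the path the probe would use
kubectl -n logging port-forward pod/logging-operator-fluentd-0 9200:9200 &
curl -i http://localhost:9200/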

Some logs:

du: /buffers/logging::logging-operator:clusteroutput:logging:default.q6220910f64666f76f0e5d6a6941540c0.buffer.meta: No such file or directory
du: /buffers/logging::logging-operator:clusteroutput:logging:default.q62209308d4d386b74a031efee393d4a5.buffer.meta: No such file or directory
du: /buffers/main-fluentd-error.b62272e9971649f8879a4f3ec5490c5be.buffer.meta: No such file or directory
du: /buffers/logging::logging-operator:clusteroutput:logging:default.q622093edc2636962fab6211bb7203aff.buffer.meta: No such file or directory
du: /buffers/logging::logging-operator:clusteroutput:logging:default.b622093fb0e0b9055f27e2e3a1cb643e9.buffer.meta: No such file or directory
du: /buffers/flow:ingress:traefik:clusteroutput:logging:default.q6223847f2be1f4a363d29a01b90c2778.buffer: No such file or directory

Environment details:

  • Kubernetes version (e.g. v1.15.2):
  • Cloud-provider/provisioner (e.g. AKS, GKE, EKS, PKE etc):
  • logging-operator version (e.g. 2.1.1):
  • Install method (e.g. helm or static manifests):
  • Logs from the misbehaving component (and any other relevant logs):
  • Resource definition (possibly in YAML format) that caused the issue, without sensitive data:

/kind bug

genofire added the bug label on Sep 19, 2024
pepov changed the milestone from 4.x to 4.11 on Oct 7, 2024
csatib02 (Member) commented:

Hey @genofire,

First of all, thanks for using the Logging-operator!

I started investigating this issue and found that the sidecar container runs out of memory, and it happens quite regularly. This seems to have been a known issue before: prometheus/node_exporter#1008. The root cause there was that the wifi collector was turned on by default; it has since been changed to off by default.

A quick fix is to increase the memory request and limit.
(NOTE: I went ahead and tested by doubling both the memory request and limit, and the pod ran fine. The default value for both the limit and the request is 10M.)

Here's an example of such a configuration:

apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: logging-example
spec:
  controlNamespace: logging
  enableRecreateWorkloadOnImmutableFieldChange: true
  fluentd:
    bufferVolumeImage:
      repository: ghcr.io/kube-logging/node-exporter
    bufferVolumeMetrics: 
      prometheusRules: true
      serviceMonitor: true
    bufferVolumeResources: # Pick the values that fit your use case.
      requests:
        cpu: 2m
        memory: 20M
      limits:
        cpu: 100m
        memory: 20M
    metrics: 
      prometheusRules: true
      serviceMonitor: true
  fluentbit:
    metrics:
      prometheusRules: true
      serviceMonitor: true
    bufferStorage:
      storage.metrics: "On"
    healthCheck:
      hcErrorsCount: 15
      hcPeriod: 60
      hcRetryFailureCount: 5
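
If you want to confirm the OOM diagnosis on your own cluster and inspect the configured resources, something along these lines should work (a sketch only; the pod name and logging namespace are taken from the examples above, adjust as needed):

# Last termination reason of the sidecar container (OOMKilled would confirm the diagnosis)
kubectl -n logging get pod logging-operator-fluentd-0 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="buffer-metrics-sidecar")].lastState.terminated.reason}'

# Resource requests and limits currently set on the sidecar container
kubectl -n logging get pod logging-operator-fluentd-0 \
  -o jsonpath='{.spec.containers[?(@.name=="buffer-metrics-sidecar")].resources}'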

csatib02 linked pull request #1826 on Oct 11, 2024 that will close this issue
csatib02 self-assigned this on Nov 1, 2024