
buffer-metrics-sidecar should have a health check #1811

Closed
genofire opened this issue Sep 19, 2024 · 1 comment · Fixed by #1826
Assignees: csatib02
Labels: bug (Something isn't working), feature-request
Milestone: 4.11

Comments

genofire (Collaborator) commented Sep 19, 2024

Describe the bug:

  • The buffer-metrics-sidecar container in the Fluentd pod stops working
  • The container remains stuck in this state

Expected behaviour:

  • A health check runs against the container
  • The container is restarted when the health check fails

Steps to reproduce the bug:
No idea; there was no time to debug why this container (based on node-exporter) stopped.

Additional context:
The desired change, as a diff against the Fluentd pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: logging-operator-fluentd-0
spec:
  containers:
    - name: buffer-metrics-sidecar                                                                                                               
      ports:
        - containerPort: 9200
          name: buffer-metrics
          protocol: TCP
+     livenessProbe:
+       failureThreshold: 3
+       httpGet:
+         path: /
+         port: buffer-metrics
+         scheme: HTTP
+       periodSeconds: 10
+       successThreshold: 1
+       timeoutSeconds: 1
+     readinessProbe:
+       failureThreshold: 3
+       httpGet:
+         path: /
+         port: buffer-metrics
+         scheme: HTTP
+       periodSeconds: 10
+       successThreshold: 1
+       timeoutSeconds: 1
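
To see what such a probe would check, the metrics endpoint can also be queried by hand. A minimal sketch, assuming the pod runs in the logging namespace (adjust pod and namespace names to your setup):

# Forward the sidecar's buffer-metrics port locally and query the path the probe would use
kubectl -n logging port-forward pod/logging-operator-fluentd-0 9200:9200 &
curl -i http://localhost:9200/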

Some logs:

du: /buffers/logging::logging-operator:clusteroutput:logging:default.q6220910f64666f76f0e5d6a6941540c0.buffer.meta: No such file or directory
du: /buffers/logging::logging-operator:clusteroutput:logging:default.q62209308d4d386b74a031efee393d4a5.buffer.meta: No such file or directory
du: /buffers/main-fluentd-error.b62272e9971649f8879a4f3ec5490c5be.buffer.meta: No such file or directory
du: /buffers/logging::logging-operator:clusteroutput:logging:default.q622093edc2636962fab6211bb7203aff.buffer.meta: No such file or directory
du: /buffers/logging::logging-operator:clusteroutput:logging:default.b622093fb0e0b9055f27e2e3a1cb643e9.buffer.meta: No such file or directory
du: /buffers/flow:ingress:traefik:clusteroutput:logging:default.q6223847f2be1f4a363d29a01b90c2778.buffer: No such file or directory

Environment details:

  • Kubernetes version (e.g. v1.15.2):
  • Cloud-provider/provisioner (e.g. AKS, GKE, EKS, PKE etc):
  • logging-operator version (e.g. 2.1.1):
  • Install method (e.g. helm or static manifests):
  • Logs from the misbehaving component (and any other relevant logs):
  • Resource definition (possibly in YAML format) that caused the issue, without sensitive data:

/kind bug

genofire added the bug label on Sep 19, 2024
pepov changed the milestone from 4.x to 4.11 on Oct 7, 2024
csatib02 (Member) commented:

Hey @genofire,

First of all, thanks for using the Logging-operator!

I started investigating this issue and found that the sidecar container runs out of memory, and it happens quite regularly. This seems to have been a known issue before: prometheus/node_exporter#1008. The root cause there was that the wifi collector was turned on by default; it has since been changed to off by default.

A quick fix is to increase the memory request and limit.
(NOTE: I went ahead and tested by doubling both the memory request and limit, and the pod ran fine. The default value for both the limit and the request is 10M.)

Here's an example of such a configuration:

apiVersion: logging.banzaicloud.io/v1beta1
kind: Logging
metadata:
  name: logging-example
spec:
  controlNamespace: logging
  enableRecreateWorkloadOnImmutableFieldChange: true
  fluentd:
    bufferVolumeImage:
      repository: ghcr.io/kube-logging/node-exporter
    bufferVolumeMetrics: 
      prometheusRules: true
      serviceMonitor: true
    bufferVolumeResources: # Pick the values that fit your use case.
      requests:
        cpu: 2m
        memory: 20M
      limits:
        cpu: 100m
        memory: 20M
    metrics: 
      prometheusRules: true
      serviceMonitor: true
  fluentbit:
    metrics:
      prometheusRules: true
      serviceMonitor: true
    bufferStorage:
      storage.metrics: "On"
    healthCheck:
      hcErrorsCount: 15
      hcPeriod: 60
      hcRetryFailureCount: 5
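
If you want to confirm the OOM diagnosis on your own cluster and inspect the configured resources, something along these lines should work (a sketch only; the pod name and logging namespace are taken from the examples above, adjust as needed):

# Last termination reason of the sidecar container (OOMKilled would confirm the diagnosis)
kubectl -n logging get pod logging-operator-fluentd-0 \
  -o jsonpath='{.status.containerStatuses[?(@.name=="buffer-metrics-sidecar")].lastState.terminated.reason}'

# Resource requests and limits currently set on the sidecar container
kubectl -n logging get pod logging-operator-fluentd-0 \
  -o jsonpath='{.spec.containers[?(@.name=="buffer-metrics-sidecar")].resources}'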

csatib02 linked pull request #1826 on Oct 11, 2024 that will close this issue
csatib02 self-assigned this on Nov 1, 2024