
Filebeat doesn't collect logs of CronJob pods #34045

Open
ghost opened this issue Dec 14, 2022 · 25 comments
Assignees
Labels
bug Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team

Comments

@ghost

ghost commented Dec 14, 2022

We experience the same issue mentioned here #22718
We use the latest Filebeat 8.5.1 on AWS EKS with Amazon Linux 2 EC2 worker nodes.
Sometimes Filebeat loses log messages from CronJob pods.
For a CronJob that runs every minute we see only 54 messages within one hour instead of 60.
Around 9-10% of CronJob messages are lost.
In comparison, the legacy Filebeat 6.8.13 doesn't lose any messages.

Filebeat version: 8.5.1
AWS EKS version: v1.21.14

Filebeat configuration:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      cleanup_timeout: 24h
      add_resource_metadata:
        namespace:
          include_labels: [""]
          include_annotations: [""]
        node:
          include_labels: [""]
          include_annotations: [""]
        deployment: true
        cronjob: true
      templates:
        - config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log

Cronjob example:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cronjob-test
  namespace: default
spec:
  schedule: "* * * * *"
  successfulJobsHistoryLimit: 30
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cronjob
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date
          restartPolicy: OnFailure

Example of messages on Kibana.
Here you can see that the messages at 15:25, 15:30, and 15:31 are missing:

Dec 14, 2022 @ 15:34:04.596	cronjob-test-27850474-m6pnd	Wed Dec 14 14:34:04 UTC 2022
Dec 14, 2022 @ 15:33:04.570	cronjob-test-27850473-bxzdp	Wed Dec 14 14:33:04 UTC 2022
Dec 14, 2022 @ 15:32:04.823	cronjob-test-27850472-nrr7p	Wed Dec 14 14:32:04 UTC 2022
Dec 14, 2022 @ 15:29:03.972	cronjob-test-27850469-9gbkx	Wed Dec 14 14:29:03 UTC 2022
Dec 14, 2022 @ 15:28:03.997	cronjob-test-27850468-5l5c7	Wed Dec 14 14:28:03 UTC 2022
Dec 14, 2022 @ 15:27:03.710	cronjob-test-27850467-24g44	Wed Dec 14 14:27:03 UTC 2022
Dec 14, 2022 @ 15:26:03.730	cronjob-test-27850466-qdxk9	Wed Dec 14 14:26:03 UTC 2022
Dec 14, 2022 @ 15:24:06.376	cronjob-test-27850464-ztclr	Wed Dec 14 14:24:06 UTC 2022
Dec 14, 2022 @ 15:23:06.970	cronjob-test-27850463-f2qm6	Wed Dec 14 14:23:06 UTC 2022
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Dec 14, 2022
@ghost ghost closed this as completed Dec 14, 2022
@ghost ghost reopened this Dec 14, 2022
@belimawr belimawr added the Team:Integrations Label for the Integrations team label Dec 15, 2022
@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 15, 2022
@belimawr belimawr added bug needs_team Indicates that the issue/PR needs a Team:* label labels Dec 15, 2022
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 15, 2022
@botelastic

botelastic bot commented Dec 15, 2022

This issue doesn't have a Team:<team> label.

@dezmodue

dezmodue commented Feb 7, 2023

We're running filebeat 8.5.2 on k8s and it looks like we are facing the same issue

@tetianakravchenko tetianakravchenko added Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team and removed Team:Integrations Label for the Integrations team labels Feb 7, 2023
@marcoval123

We're running filebeat 7.10.2 on k8s and it looks like we are facing the same issue

@StephanErb

StephanErb commented Feb 21, 2023

Has anyone tried playing around with kube_client_options yet?

I am wondering if this could be related to the issue here. Assuming filebeat is running on a busy node, I could totally expect it to run into the query throttle. Discovery could then be delayed so much that we never end up picking up the logs before the container is terminated again.

Never looked into it deeper, but still a hypothesis I'd like to validate one day.
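
For anyone wanting to test that hypothesis: if I'm reading the kubernetes autodiscover provider docs correctly, the client-side throttle can be raised via `kube_client_options`. A sketch (the qps/burst values are illustrative guesses, not tested recommendations):

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      # Raise the client-side rate limits towards the API server so that
      # discovery on a busy node is less likely to lag behind short-lived pods.
      kube_client_options:
        qps: 50     # sustained queries per second
        burst: 100  # short burst allowance
```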

@ChrsMark
Member

Hey folks, sorry for the late response here.
This one should already be fixed by the work done for #22718. @gizas @rameshelastic could we have someone from the team take a look at this one? It could be a regression or another corner case. Both this issue and #22718 provide steps to reproduce, so it should be fairly easy to check.

@eedugon
Contributor

eedugon commented Feb 22, 2023

@StephanErb , I believe that's a nice hypothesis. I've seen a case where some CronJob-related pod logs were not being shipped and there was no trace at all of the beat even trying to harvest the log file. A delay in discovery would definitely explain the outcome.

@gizas
Contributor

gizas commented Feb 23, 2023

cc @gsantoro

@gsantoro gsantoro self-assigned this Feb 23, 2023
@gsantoro
Contributor

sorry for the late response. I've now assigned this issue to myself. It's now the highest priority item on my list.

@gsantoro
Contributor

I would like to start with a simplified setup so that we can exclude possible root causes:

  • K8s: local cluster with Kind with 1 node. Version = v1.26.0
  • Elastic: local cluster with version 8.7.0-SNAPSHOT
  • Filebeat: code from the main branch with the following config. Note that autodiscover is not enabled yet:

filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

Same cronjob manifest as in #22718; the only thing that changed is the apiVersion (which may have changed since then).

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
  namespace: kube-system
spec:
  schedule: "*/1 * * * *"
  failedJobsHistoryLimit: 10
  successfulJobsHistoryLimit: 20
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure

This is the list of k8s events for a single run of that cronjob. The order of events is reversed (the first event is the last to occur).

│ kube-system                  12m                         Normal                  SawCompletedJob                   cronjob/hello                              1                       │
│ kube-system                  12m                         Normal                  SuccessfulCreate                  job/hello-27952816                         1                       │
│ kube-system                  12m                         Normal                  Completed                         job/hello-27952816                         1                       │
│ kube-system                  12m                         Normal                  Started                           pod/hello-27952816-p2zxj                   1                       │
│ kube-system                  12m                         Normal                  Scheduled                         pod/hello-27952816-p2zxj                   1                       │
│ kube-system                  12m                         Normal                  Created                           pod/hello-27952816-p2zxj                   1                       │
│ kube-system                  12m                         Normal                  Pulled                            pod/hello-27952816-p2zxj                   1                       │
│ kube-system                  12m                         Normal                  Pulling                           pod/hello-27952816-p2zxj                   1                       │

and this is a screenshot of Kibana with the list of all logs for that single pod (and all the other cronjob runs as well).

[Screenshot 2023-02-23 at 16:31:26]

So this use case confirms that:

  1. logs are correctly collected by filebeat when running without autodiscover, with this combination of Elastic, k8s, and filebeat code
  2. the k8s event progression seems correct and I can't find any errors. I agree with the comment in Kubernetes autodiscover doesn't discover short living jobs (and pods?) #22718 (comment) that there is no Running event for cronjob pods, unlike for long-running pods. But everything seems to point to the fact that no Running event is raised by cron jobs at all, even when everything is working normally.
  3. there are no init containers in this example yet, but I'll leave that out for now. We might come back to this later.
➜ k describe cronjob hello
Name:                          hello
Namespace:                     kube-system
Labels:                        <none>
Annotations:                   <none>
Schedule:                      */1 * * * *
Concurrency Policy:            Allow
Suspend:                       False
Successful Job History Limit:  20
Failed Job History Limit:      10
Starting Deadline Seconds:     <unset>
Selector:                      <unset>
Parallelism:                   <unset>
Completions:                   <unset>
Pod Template:
  Labels:  <none>
  Containers:
   hello:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Args:
      /bin/sh
      -c
      date; echo Hello from the Kubernetes cluster
    Environment:     <none>
    Mounts:          <none>
  Volumes:           <none>
Last Schedule Time:  Thu, 23 Feb 2023 16:38:00 +0000
Active Jobs:         <none>
Events:
  Type    Reason            Age                   From                Message
  ----    ------            ----                  ----                -------
  Normal  SuccessfulCreate  22m                   cronjob-controller  Created job hello-27952816
  Normal  SawCompletedJob   22m                   cronjob-controller  Saw completed job: hello-27952816, status: Complete
  Normal  SuccessfulCreate  21m                   cronjob-controller  Created job hello-27952817
  Normal  SawCompletedJob   21m                   cronjob-controller  Saw completed job: hello-27952817, status: Complete
  Normal  SuccessfulCreate  20m                   cronjob-controller  Created job hello-27952818

I'll move now to replicate the same setup but only changing the filebeat config with autodiscover.

@gsantoro
Contributor

Here is the test with autodiscover. Everything else is the same as in the previous comment.

Filebeat config, replacing filebeat.inputs:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      hints.enabled: true
      hints.default_config:
        type: container
        paths:
          - /var/log/containers/*${data.kubernetes.container.id}.log

List of completed cron jobs

│ 6m30s            hello-27952850-gd26q                                          10.244.0.36             integrations-control-plane             Completed             0/1               │
│ 5m30s            hello-27952851-478zd                                          10.244.0.37             integrations-control-plane             Completed             0/1               │
│ 4m30s            hello-27952852-6nkr9                                          10.244.0.38             integrations-control-plane             Completed             0/1               │
│ 3m30s            hello-27952853-g6rhz                                          10.244.0.39             integrations-control-plane             Completed             0/1               │
│ 2m30s            hello-27952854-r6stl                                          10.244.0.40             integrations-control-plane             Completed             0/1               │
│ 90s              hello-27952855-cf55v                                          10.244.0.41             integrations-control-plane             Completed             0/1               │
│ 30s              hello-27952856-flk9w                                          10.244.0.42             integrations-control-plane             Completed             0/1

Screenshots from Kibana with logs ingested in ES

[Screenshot 2023-02-23 at 16:56:22]

The number of completed cronjobs matches the number of reported events in Kibana.

@gsantoro
Contributor

gsantoro commented Feb 23, 2023

How to go forward:

  1. Autodiscover with templates instead of hints, with the following configuration
Filebeat configuration:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      cleanup_timeout: 24h
      add_resource_metadata:
        namespace:
          include_labels: [""]
          include_annotations: [""]
        node:
          include_labels: [""]
          include_annotations: [""]
        deployment: true
        cronjob: true
      templates:
        - config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log
  2. Filebeat and Elastic stack version 8.5.1 with the same config as in point 1
  3. same k8s version as AWS EKS (v1.21.14) but with Kind
  4. setup in the cloud with AWS EKS v1.21.14, with everything else from the previous tests the same

@gsantoro
Contributor

This works as expected too

  1. Autodiscover with templates instead of hints, with the following configuration
│ 13m              hello-27952878-btqkg                                          10.244.0.50             integrations-control-plane             Completed             0/1               │
│ 12m              hello-27952879-4n6rt                                          10.244.0.51             integrations-control-plane             Completed             0/1               │
│ 11m              hello-27952880-f6pc2                                          10.244.0.52             integrations-control-plane             Completed             0/1               │
│ 10m              hello-27952881-d5kl9                                          10.244.0.53             integrations-control-plane             Completed             0/1               │
│ 9m34s            hello-27952882-xjpl4                                          10.244.0.54             integrations-control-plane             Completed             0/1               │
│ 8m34s            hello-27952883-whxvw                                          10.244.0.55             integrations-control-plane             Completed             0/1               │
│ 7m34s            hello-27952884-bh65s                                          10.244.0.56             integrations-control-plane             Completed             0/1               │
│ 6m34s            hello-27952885-fqfn8                                          10.244.0.57             integrations-control-plane             Completed             0/1               │
│ 5m34s            hello-27952886-7whwf                                          10.244.0.58             integrations-control-plane             Completed             0/1               │
│ 4m34s            hello-27952887-b75hs                                          10.244.0.59             integrations-control-plane             Completed             0/1               │
│ 3m34s            hello-27952888-sgv85                                          10.244.0.60             integrations-control-plane             Completed             0/1               │
│ 2m34s            hello-27952889-qqtcs                                          10.244.0.61             integrations-control-plane             Completed             0/1               │
│ 94s              hello-27952890-56hxq                                          10.244.0.62             integrations-control-plane             Completed             0/1               │
│ 34s              hello-27952891-gc9gn                                          10.244.0.63             integrations-control-plane             Completed             0/1               │

and the Kibana screenshot

[Screenshot 2023-02-23 at 17:30:52]

@gsantoro
Contributor

  2. Filebeat and Elastic stack version 8.5.1 with the same config as in point 1
│ 7m31s        kube-system               hello-27952913-c9z55                                      10.244.0.6         integrations-control-plane        Completed         0/1           │
│ 6m31s        kube-system               hello-27952914-rcfjx                                      10.244.0.7         integrations-control-plane        Completed         0/1           │
│ 5m31s        kube-system               hello-27952915-jlvrg                                      10.244.0.8         integrations-control-plane        Completed         0/1           │
│ 4m31s        kube-system               hello-27952916-k8s9w                                      10.244.0.9         integrations-control-plane        Completed         0/1           │
│ 3m31s        kube-system               hello-27952917-59ssr                                      10.244.0.10        integrations-control-plane        Completed         0/1           │
│ 2m31s        kube-system               hello-27952918-tzwlt                                      10.244.0.11        integrations-control-plane        Completed         0/1           │
│ 91s          kube-system               hello-27952919-j79zr                                      10.244.0.12        integrations-control-plane        Completed         0/1           │
│ 31s          kube-system               hello-27952920-hqj54                                      10.244.0.13        integrations-control-plane        Completed         0/1           │

and the usual Kibana page with all the cronjobs properly logged

[Screenshot 2023-02-23 at 18:00:26]

@gsantoro
Contributor

  3. same k8s version as AWS EKS (v1.21.14) but with Kind

Same thing here: all pods completed.

│ 3m6s         kube-system               hello-27952933-jvjr5                                      10.244.0.6        integrations-control-plane         Completed         0/1           │
│ 2m6s         kube-system               hello-27952934-lxpxh                                      10.244.0.7        integrations-control-plane         Completed         0/1           │
│ 66s          kube-system               hello-27952935-lm5r5                                      10.244.0.8        integrations-control-plane         Completed         0/1           │

all pods accounted for

[Screenshot 2023-02-23 at 18:15:56]

Before I move on and start a more complicated setup with AWS EKS, I am wondering if I should pause and consider any other settings, config, or setup that can still be replicated locally without using AWS EKS.

Does anyone here have a better idea?

@StephanErb

Does anyone here have a better idea?

I still believe this is either load related or has a statistical element to it. I would try starting 50 cron jobs (on a 1-2 node cluster) and see how that behaves.

@gizas
Contributor

gizas commented Feb 24, 2023

OK, I think @gsantoro verified that everything works at a small scale.

So let's choose one of the above tests and run it for a long time.

Let's see if something happens.

@gsantoro
Contributor

A couple of questions:

  1. if we want to leave it running at scale over the weekend, I suggest using the cloud instead of my local laptop. I have some automation to quickly set up a GCP k8s cluster with the latest k8s version, or at least the oldest version still supported by Google. This might be newer than v1.21.14 from AWS. Is this OK?
  2. I'm not 100% sure whether the hints-based configuration (which I normally use) or the templates configuration (from the original message) is a relevant factor. Maybe @ChrsMark can chime in on this specific topic.
  3. for the 50 different cronjobs, I'll write a script that matches the cronjobs started in k8s against the ones ingested in ES, so that we can check whether any logs are missing. I am not aware of anything that already does this.
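
A minimal sketch of what that matching script could look like (the example pod names are taken from the earlier comments; in practice `started` would be parsed from `kubectl get pods` output and `ingested` from a terms aggregation on `kubernetes.pod.name` in ES — both lists here are stand-ins):

```python
def find_missing_pods(started_pods, ingested_pods):
    """Return the pods that were started in k8s but never showed up in ES."""
    return sorted(set(started_pods) - set(ingested_pods))

# started: e.g. parsed from `kubectl get pods -o name`, filtered to the job prefix
started = ["hello-27952850-gd26q", "hello-27952851-478zd", "hello-27952852-6nkr9"]
# ingested: e.g. the unique values of kubernetes.pod.name returned by ES
ingested = ["hello-27952850-gd26q", "hello-27952852-6nkr9"]

missing = find_missing_pods(started, ingested)
print(f"{len(missing)} pod(s) missing: {missing}")  # → 1 pod(s) missing: ['hello-27952851-478zd']
```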

@gizas
Contributor

gizas commented Feb 24, 2023

  1. OK
  2. The config should not be a factor. OK, choose whatever is closer to the user's SDH
  3. OK

@StephanErb

There is a related thread on Filebeat autodiscover stopping too early when a kubernetes pod terminates over at discuss.elastic.co. It raises an interesting question: could the problem be related to multiple containers within the same pod, and how this affects pod termination?

At least in our case we have Istio running, so the container terminating within the cronjob is never the last container to terminate in the pod. Is that maybe similar for others here who face the problem?

@gsantoro
Contributor

Hello everyone, I have just finished a performance test on the cloud. This is the environment:

  • elastic stack: 8.7.0-SNAPSHOT
  • K8s version: v1.25.6
  • 3-node k8s cluster = 3 filebeat pods
  • filebeat with autodiscover

Instead of a cronjob I used the following job template, since it makes it easier to deal with parallelism.

apiVersion: batch/v1
kind: Job
metadata:
  name: process-item-$ITEM
  namespace: kube-system
  labels:
    jobgroup: jobexample
spec:
  completions: 500
  parallelism: 50
  template:
    metadata:
      name: jobexample
      labels:
        jobgroup: jobexample
    spec:
      containers:
      - name: process-item
        image: busybox
        command: ["sh", "-c", "echo Processing item $ITEM && sleep 5"]
      restartPolicy: Never

where the variable $ITEM has been replaced with the values a, b, c, d, so that I have a total of:

  • 50 parallel pods × 4 jobs (one per $ITEM value)
  • total = 2000 jobs (500 completions × 4)
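
For reproducibility, the $ITEM substitution can be sketched like this (the template is abbreviated and the `render` helper is my own; the real manifest is the one above):

```python
# Abbreviated Job template with a $ITEM placeholder, as in the manifest above
TEMPLATE = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: process-item-$ITEM
  namespace: kube-system
spec:
  completions: 500
  parallelism: 50
"""

def render(item: str) -> str:
    """Substitute $ITEM the way sed/envsubst would."""
    return TEMPLATE.replace("$ITEM", item)

# One manifest per item; each would then be applied with `kubectl apply -f -`
manifests = {item: render(item) for item in "abcd"}
print(sorted(manifests))  # → ['a', 'b', 'c', 'd']
```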

And these are the results:

  • logs from only 474 jobs were correctly ingested into Elasticsearch
  • over 800 jobs were started; some were still running (around 100) and the others had already completed

[Screenshot 2023-02-27 at 13:04:08]

As you can see from the previous screenshot, filebeat wasn't able to catch up with the amount of logs produced. I'll have another look at a smaller scale to check whether it's just a scaling issue of filebeat or something else.

@gsantoro
Contributor

I just finished another test with lower parallelism and a lower total number of job completions:

  • parallelism: 5 x 4 jobs
  • completions: 50
  • total jobs: 200
  • correctly ingested: 200
  • test duration: 2m

And this is what I noticed:

  • even at this scale, the k8s cluster CPU reached ~80% usage
  • with ~90% (critical) CPU usage for the last 30s
  • all events were ingested
  • I believe the issue might have to do with the scale of events to be ingested and the performance capacity of filebeat

@gsantoro
Contributor

Similar results: all events ingested at the current scale

  • parallelism 10 x 4 jobs
  • completions: 100
  • total jobs: 400
  • correctly ingested: 400
  • test duration: 3m

I'll run another test for a longer period of time with similar parallelism to see if anything changes, but I am running out of ideas to test.

@StephanErb
Copy link

StephanErb commented Mar 23, 2023

@gsantoro did the longer running test help to reproduce the issue?

Looks like another user filed a similar issue recently, #34789, including a hypothesis of what could cause missing logs for short-lived pods.

@gsantoro
Contributor

gsantoro commented Mar 24, 2023

Hello @StephanErb, sorry for not updating this issue, but the longer-running test wasn't successful in replicating the issue either.
Everything in my tests seems to point to a performance issue: filebeat not being able to catch up with the amount of logs to ingest.

I'll have a look at that issue. Thanks for pointing it out.

Development

No branches or pull requests

10 participants