
Filebeat doesn't collect logs of CronJob pods #34045

Open
ghost opened this issue Dec 14, 2022 · 25 comments
Assignees
Labels
bug Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team

Comments

@ghost

ghost commented Dec 14, 2022

We experience the same issue mentioned here #22718
We use the latest Filebeat 8.5.1 on AWS EKS with Amazon Linux 2 EC2 worker nodes.
Sometimes Filebeat loses log messages from CronJob pods.
For a CronJob that runs every minute we see only 54 messages within one hour instead of 60.
Around 9-10% of CronJob messages are lost.
In comparison, the legacy Filebeat 6.8.13 doesn't lose any messages.

Filebeat version: 8.5.1
AWS EKS version: v1.21.14

Filebeat configuration:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      cleanup_timeout: 24h
      add_resource_metadata:
        namespace:
          include_labels: [""]
          include_annotations: [""]
        node:
          include_labels: [""]
          include_annotations: [""]
        deployment: true
        cronjob: true
      templates:
        - config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log

Cronjob example:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cronjob-test
  namespace: default
spec:
  schedule: "* * * * *"
  successfulJobsHistoryLimit: 30
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cronjob
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date
          restartPolicy: OnFailure

Example of messages on Kibana.
Here you can see that the messages at 15:25, 15:30, and 15:31 are missing:

Dec 14, 2022 @ 15:34:04.596	cronjob-test-27850474-m6pnd	Wed Dec 14 14:34:04 UTC 2022
Dec 14, 2022 @ 15:33:04.570	cronjob-test-27850473-bxzdp	Wed Dec 14 14:33:04 UTC 2022
Dec 14, 2022 @ 15:32:04.823	cronjob-test-27850472-nrr7p	Wed Dec 14 14:32:04 UTC 2022
Dec 14, 2022 @ 15:29:03.972	cronjob-test-27850469-9gbkx	Wed Dec 14 14:29:03 UTC 2022
Dec 14, 2022 @ 15:28:03.997	cronjob-test-27850468-5l5c7	Wed Dec 14 14:28:03 UTC 2022
Dec 14, 2022 @ 15:27:03.710	cronjob-test-27850467-24g44	Wed Dec 14 14:27:03 UTC 2022
Dec 14, 2022 @ 15:26:03.730	cronjob-test-27850466-qdxk9	Wed Dec 14 14:26:03 UTC 2022
Dec 14, 2022 @ 15:24:06.376	cronjob-test-27850464-ztclr	Wed Dec 14 14:24:06 UTC 2022
Dec 14, 2022 @ 15:23:06.970	cronjob-test-27850463-f2qm6	Wed Dec 14 14:23:06 UTC 2022
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Dec 14, 2022
@ghost ghost closed this as completed Dec 14, 2022
@ghost ghost reopened this Dec 14, 2022
@belimawr belimawr added the Team:Integrations Label for the Integrations team label Dec 15, 2022
@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 15, 2022
@belimawr belimawr added bug needs_team Indicates that the issue/PR needs a Team:* label labels Dec 15, 2022
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Dec 15, 2022
@botelastic

botelastic bot commented Dec 15, 2022

This issue doesn't have a Team:<team> label.

@dezmodue

dezmodue commented Feb 7, 2023

We're running filebeat 8.5.2 on k8s and it looks like we are facing the same issue

@tetianakravchenko tetianakravchenko added Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team and removed Team:Integrations Label for the Integrations team labels Feb 7, 2023
@marcoval123

We're running filebeat 7.10.2 on k8s and it looks like we are facing the same issue

@StephanErb

StephanErb commented Feb 21, 2023

Has anyone tried playing around with kube_client_options yet?

I am wondering if this could be related to the issue here. Assuming filebeat is running on a busy node, I could totally expect it to run into the query throttle. Discovery could then be delayed so much that we never end up picking up the logs before the container is terminated again.

Never looked into it deeper, but still a hypothesis I'd like to validate one day.
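
For anyone wanting to test that hypothesis: if I'm reading the kubernetes autodiscover provider docs correctly, the client-side throttle can be raised via `kube_client_options`. A sketch (the qps/burst values are illustrative guesses, not tested recommendations):

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      # Raise the client-side rate limits towards the API server so that
      # discovery on a busy node is less likely to lag behind short-lived pods.
      kube_client_options:
        qps: 50     # sustained queries per second
        burst: 100  # short burst allowance
```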

@ChrsMark
Member

Hey folks, sorry for the late response here.
This one should already be fixed by the work done for #22718. @gizas @rameshelastic could we have someone from the team take a look at this one? It could be a regression or another corner case. Both this issue and #22718 provide steps to reproduce, so it should be fairly easy to check.

@eedugon
Contributor

eedugon commented Feb 22, 2023

@StephanErb , I believe that's a nice hypothesis. I've seen a case where some CronJob-related pod logs were not being shipped and there was no trace at all of the beat even trying to harvest the log file. A delay in discovery would definitely explain the outcome.

@gizas
Contributor

gizas commented Feb 23, 2023

cc @gsantoro

@gsantoro gsantoro self-assigned this Feb 23, 2023
@gsantoro
Contributor

sorry for the late response. I've now assigned this issue to myself. It's now the highest priority item on my list.

@gsantoro
Contributor

I would like to start with a simplified setup so that we can exclude possible root causes:

  • K8s: local cluster with Kind with 1 node. Version = v1.26.0
  • Elastic: local cluster with version 8.7.0-SNAPSHOT
  • Filebeat: code from the main branch with the following config. Note that autodiscover is not enabled yet:

filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

Same cronjob manifest as in #22718; the only thing that changed is the apiVersion (which may have changed since then).

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
  namespace: kube-system
spec:
  schedule: "*/1 * * * *"
  failedJobsHistoryLimit: 10
  successfulJobsHistoryLimit: 20
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure

This is the list of k8s events for a single run of that cronjob. The order of events is reversed (the first event is the last to occur).

│ kube-system                  12m                         Normal                  SawCompletedJob                   cronjob/hello                              1                       │
│ kube-system                  12m                         Normal                  SuccessfulCreate                  job/hello-27952816                         1                       │
│ kube-system                  12m                         Normal                  Completed                         job/hello-27952816                         1                       │
│ kube-system                  12m                         Normal                  Started                           pod/hello-27952816-p2zxj                   1                       │
│ kube-system                  12m                         Normal                  Scheduled                         pod/hello-27952816-p2zxj                   1                       │
│ kube-system                  12m                         Normal                  Created                           pod/hello-27952816-p2zxj                   1                       │
│ kube-system                  12m                         Normal                  Pulled                            pod/hello-27952816-p2zxj                   1                       │
│ kube-system                  12m                         Normal                  Pulling                           pod/hello-27952816-p2zxj                   1                       │

and this is a screenshot of Kibana with the list of all logs for that single pod (and all the other cronjob runs as well).

[Screenshot 2023-02-23 at 16:31:26]

So this use case confirms that:

  1. logs are correctly collected by filebeat when running without autodiscover, with this combination of Elastic, k8s, and filebeat code
  2. the k8s event progression seems correct and I can't find any errors. I agree with the comment in Kubernetes autodiscover doesn't discover short living jobs (and pods?) #22718 (comment) that there is no Running event for cronjob pods, unlike for long-running pods. But everything seems to point to the fact that no Running event is raised by cron jobs at all, even when everything is working normally.
  3. there are no init containers in this example yet, but I'll leave that out for now. We might come back to this later.
➜ k describe cronjob hello
Name:                          hello
Namespace:                     kube-system
Labels:                        <none>
Annotations:                   <none>
Schedule:                      */1 * * * *
Concurrency Policy:            Allow
Suspend:                       False
Successful Job History Limit:  20
Failed Job History Limit:      10
Starting Deadline Seconds:     <unset>
Selector:                      <unset>
Parallelism:                   <unset>
Completions:                   <unset>
Pod Template:
  Labels:  <none>
  Containers:
   hello:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Args:
      /bin/sh
      -c
      date; echo Hello from the Kubernetes cluster
    Environment:     <none>
    Mounts:          <none>
  Volumes:           <none>
Last Schedule Time:  Thu, 23 Feb 2023 16:38:00 +0000
Active Jobs:         <none>
Events:
  Type    Reason            Age                   From                Message
  ----    ------            ----                  ----                -------
  Normal  SuccessfulCreate  22m                   cronjob-controller  Created job hello-27952816
  Normal  SawCompletedJob   22m                   cronjob-controller  Saw completed job: hello-27952816, status: Complete
  Normal  SuccessfulCreate  21m                   cronjob-controller  Created job hello-27952817
  Normal  SawCompletedJob   21m                   cronjob-controller  Saw completed job: hello-27952817, status: Complete
  Normal  SuccessfulCreate  20m                   cronjob-controller  Created job hello-27952818

I'll move now to replicate the same setup but only changing the filebeat config with autodiscover.

@gsantoro
Contributor

Here is the test with autodiscover. Everything else is the same as in the previous comment.

Filebeat config, replacing filebeat.inputs:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      hints.enabled: true
      hints.default_config:
        type: container
        paths:
          - /var/log/containers/*${data.kubernetes.container.id}.log

List of completed cron jobs

│ 6m30s            hello-27952850-gd26q                                          10.244.0.36             integrations-control-plane             Completed             0/1               │
│ 5m30s            hello-27952851-478zd                                          10.244.0.37             integrations-control-plane             Completed             0/1               │
│ 4m30s            hello-27952852-6nkr9                                          10.244.0.38             integrations-control-plane             Completed             0/1               │
│ 3m30s            hello-27952853-g6rhz                                          10.244.0.39             integrations-control-plane             Completed             0/1               │
│ 2m30s            hello-27952854-r6stl                                          10.244.0.40             integrations-control-plane             Completed             0/1               │
│ 90s              hello-27952855-cf55v                                          10.244.0.41             integrations-control-plane             Completed             0/1               │
│ 30s              hello-27952856-flk9w                                          10.244.0.42             integrations-control-plane             Completed             0/1

Screenshots from Kibana with logs ingested in ES

[Screenshot 2023-02-23 at 16:56:22]

The number of completed cronjobs matches the number of reported events in Kibana.

@gsantoro
Contributor

gsantoro commented Feb 23, 2023

How to go forward:

  1. Autodiscover with templates instead of hints, with the following configuration
Filebeat configuration:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      cleanup_timeout: 24h
      add_resource_metadata:
        namespace:
          include_labels: [""]
          include_annotations: [""]
        node:
          include_labels: [""]
          include_annotations: [""]
        deployment: true
        cronjob: true
      templates:
        - config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log
  2. Filebeat and Elastic stack version 8.5.1 with the same config as in point 1
  3. same k8s version as AWS EKS (v1.21.14) but with Kind
  4. setup in the cloud with AWS EKS v1.21.14, with everything else from the previous tests the same

@gsantoro
Contributor

This works as expected too

  1. Autodiscover with templates instead of hints, with the following configuration
│ 13m              hello-27952878-btqkg                                          10.244.0.50             integrations-control-plane             Completed             0/1               │
│ 12m              hello-27952879-4n6rt                                          10.244.0.51             integrations-control-plane             Completed             0/1               │
│ 11m              hello-27952880-f6pc2                                          10.244.0.52             integrations-control-plane             Completed             0/1               │
│ 10m              hello-27952881-d5kl9                                          10.244.0.53             integrations-control-plane             Completed             0/1               │
│ 9m34s            hello-27952882-xjpl4                                          10.244.0.54             integrations-control-plane             Completed             0/1               │
│ 8m34s            hello-27952883-whxvw                                          10.244.0.55             integrations-control-plane             Completed             0/1               │
│ 7m34s            hello-27952884-bh65s                                          10.244.0.56             integrations-control-plane             Completed             0/1               │
│ 6m34s            hello-27952885-fqfn8                                          10.244.0.57             integrations-control-plane             Completed             0/1               │
│ 5m34s            hello-27952886-7whwf                                          10.244.0.58             integrations-control-plane             Completed             0/1               │
│ 4m34s            hello-27952887-b75hs                                          10.244.0.59             integrations-control-plane             Completed             0/1               │
│ 3m34s            hello-27952888-sgv85                                          10.244.0.60             integrations-control-plane             Completed             0/1               │
│ 2m34s            hello-27952889-qqtcs                                          10.244.0.61             integrations-control-plane             Completed             0/1               │
│ 94s              hello-27952890-56hxq                                          10.244.0.62             integrations-control-plane             Completed             0/1               │
│ 34s              hello-27952891-gc9gn                                          10.244.0.63             integrations-control-plane             Completed             0/1               │

and the Kibana screenshot

[Screenshot 2023-02-23 at 17:30:52]

@gsantoro
Contributor

  2. Filebeat and Elastic stack version 8.5.1 with the same config as in point 1
│ 7m31s        kube-system               hello-27952913-c9z55                                      10.244.0.6         integrations-control-plane        Completed         0/1           │
│ 6m31s        kube-system               hello-27952914-rcfjx                                      10.244.0.7         integrations-control-plane        Completed         0/1           │
│ 5m31s        kube-system               hello-27952915-jlvrg                                      10.244.0.8         integrations-control-plane        Completed         0/1           │
│ 4m31s        kube-system               hello-27952916-k8s9w                                      10.244.0.9         integrations-control-plane        Completed         0/1           │
│ 3m31s        kube-system               hello-27952917-59ssr                                      10.244.0.10        integrations-control-plane        Completed         0/1           │
│ 2m31s        kube-system               hello-27952918-tzwlt                                      10.244.0.11        integrations-control-plane        Completed         0/1           │
│ 91s          kube-system               hello-27952919-j79zr                                      10.244.0.12        integrations-control-plane        Completed         0/1           │
│ 31s          kube-system               hello-27952920-hqj54                                      10.244.0.13        integrations-control-plane        Completed         0/1           │

and the usual Kibana page with all the cronjobs properly logged

[Screenshot 2023-02-23 at 18:00:26]

@gsantoro
Contributor

  3. same k8s version as AWS EKS (v1.21.14) but with Kind

Same thing here: all pods completed.

│ 3m6s         kube-system               hello-27952933-jvjr5                                      10.244.0.6        integrations-control-plane         Completed         0/1           │
│ 2m6s         kube-system               hello-27952934-lxpxh                                      10.244.0.7        integrations-control-plane         Completed         0/1           │
│ 66s          kube-system               hello-27952935-lm5r5                                      10.244.0.8        integrations-control-plane         Completed         0/1           │

all pods accounted for

[Screenshot 2023-02-23 at 18:15:56]

Before I move on and start a more complicated setup with AWS EKS, I am wondering if I should pause and consider any other settings, config, or setup that can still be replicated locally without using AWS EKS.

Does anyone here have a better idea?

@StephanErb

Does anyone here have a better idea?

I still believe this is either load related or has a statistical element to it. I would try starting 50 cron jobs (on a 1-2 node cluster) and see how that behaves.

@gizas
Contributor

gizas commented Feb 24, 2023

OK, I think @gsantoro verified that everything works at a small scale.

So let's choose one of the above tests and run it for a long time.

Let's see if something happens.

@gsantoro
Contributor

A couple of questions:

  1. if we want to leave it running at scale over the weekend, I suggest using the cloud instead of my local laptop. I have some automation to quickly set up a GCP k8s cluster with the latest k8s version, or at least the oldest version still supported by Google. This might be newer than v1.21.14 from AWS. Is this OK?
  2. I'm not 100% sure whether the hints-based configuration (which I normally use) or the templates configuration (from the original message) is a relevant factor. Maybe @ChrsMark can chime in on this specific topic.
  3. for the 50 different cronjobs, I'll write a script that matches the cronjobs started in k8s against the ones ingested in ES, so that we can check whether any logs are missing. I am not aware of anything that already does this.
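
A minimal sketch of what that matching script could look like (the example pod names are taken from the earlier comments; in practice `started` would be parsed from `kubectl get pods` output and `ingested` from a terms aggregation on `kubernetes.pod.name` in ES — both lists here are stand-ins):

```python
def find_missing_pods(started_pods, ingested_pods):
    """Return the pods that were started in k8s but never showed up in ES."""
    return sorted(set(started_pods) - set(ingested_pods))

# started: e.g. parsed from `kubectl get pods -o name`, filtered to the job prefix
started = ["hello-27952850-gd26q", "hello-27952851-478zd", "hello-27952852-6nkr9"]
# ingested: e.g. the unique values of kubernetes.pod.name returned by ES
ingested = ["hello-27952850-gd26q", "hello-27952852-6nkr9"]

missing = find_missing_pods(started, ingested)
print(f"{len(missing)} pod(s) missing: {missing}")  # → 1 pod(s) missing: ['hello-27952851-478zd']
```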

@gizas
Contributor

gizas commented Feb 24, 2023

  1. OK
  2. The config should not be a factor. OK, choose whatever is closer to the user's SDH
  3. OK

@StephanErb

There is a related thread on Filebeat autodiscover stopping too early when a kubernetes pod terminates over at discuss.elastic.co. It raises an interesting question: could the problem be related to multiple containers within the same pod, and how this affects pod termination?

At least in our case we have Istio running, so the container terminating within the cronjob is never the last container to terminate in the pod. Is that maybe similar for others here who face the problem?

@gsantoro
Contributor

Hello everyone, I have just finished a performance test on the cloud. This is the environment:

  • elastic stack: 8.7.0-SNAPSHOT
  • K8s version: v1.25.6
  • 3-node k8s cluster = 3 filebeat pods
  • filebeat with autodiscover

Instead of a cronjob I used the following job template, since it makes it easier to deal with parallelism.

apiVersion: batch/v1
kind: Job
metadata:
  name: process-item-$ITEM
  namespace: kube-system
  labels:
    jobgroup: jobexample
spec:
  completions: 500
  parallelism: 50
  template:
    metadata:
      name: jobexample
      labels:
        jobgroup: jobexample
    spec:
      containers:
      - name: process-item
        image: busybox
        command: ["sh", "-c", "echo Processing item $ITEM && sleep 5"]
      restartPolicy: Never

where the variable $ITEM has been replaced with the values a, b, c, d, so that I have a total of:

  • 50 parallel pods × 4 jobs (one per $ITEM value)
  • total = 2000 jobs (500 completions × 4)
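
For reproducibility, the $ITEM substitution can be sketched like this (the template is abbreviated and the `render` helper is my own; the real manifest is the one above):

```python
# Abbreviated Job template with a $ITEM placeholder, as in the manifest above
TEMPLATE = """\
apiVersion: batch/v1
kind: Job
metadata:
  name: process-item-$ITEM
  namespace: kube-system
spec:
  completions: 500
  parallelism: 50
"""

def render(item: str) -> str:
    """Substitute $ITEM the way sed/envsubst would."""
    return TEMPLATE.replace("$ITEM", item)

# One manifest per item; each would then be applied with `kubectl apply -f -`
manifests = {item: render(item) for item in "abcd"}
print(sorted(manifests))  # → ['a', 'b', 'c', 'd']
```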

And these are the results:

  • logs from only 474 jobs were correctly ingested into Elasticsearch
  • over 800 jobs were started; some were still running (around 100) and the others had already completed

[Screenshot 2023-02-27 at 13:04:08]

As you can see from the previous screenshot, filebeat wasn't able to catch up with the amount of logs produced. I'll have another look at a smaller scale to check whether it's just a scaling issue of filebeat or something else.

@gsantoro
Contributor

I just finished another test with lower parallelism and a lower total number of job completions:

  • parallelism: 5 x 4 jobs
  • completions: 50
  • total jobs: 200
  • correctly ingested: 200
  • test duration: 2m

And this is what I noticed:

  • even at this scale, the k8s cluster CPU reached ~80% usage
  • with ~90% (critical) CPU usage for the last 30s
  • all events were ingested
  • I believe the issue might have to do with the scale of events to be ingested and the performance capacity of filebeat

@gsantoro
Contributor

Similar results: all events ingested at the current scale

  • parallelism 10 x 4 jobs
  • completions: 100
  • total jobs: 400
  • correctly ingested: 400
  • test duration: 3m

I'll run another test for a longer period of time with similar parallelism to see if anything changes, but I am running out of ideas to test.

@StephanErb
Copy link

StephanErb commented Mar 23, 2023

@gsantoro did the longer running test help to reproduce the issue?

Looks like another user filed a similar issue recently, #34789, including a hypothesis of what could cause missing logs for short-lived pods.

@gsantoro
Contributor

gsantoro commented Mar 24, 2023

Hello @StephanErb, sorry for not updating this issue, but the longer-running test wasn't successful in replicating the issue either.
Everything in my tests seems to point to a performance issue: filebeat not being able to catch up with the amount of logs to ingest.

I'll have a look at that issue. Thanks for pointing it out.

Development

No branches or pull requests

10 participants