Filebeat doesn't collect logs of CronJob pods #34045
Pinging @elastic/integrations (Team:Integrations)
We're running filebeat 8.5.2 on k8s and it looks like we are facing the same issue.
We're running filebeat 7.10.2 on k8s and it looks like we are facing the same issue.
Has anyone tried playing around with the query throttle settings? I am wondering if this could be related to the issue here. Assuming filebeat is running on a busy node, I could totally expect it to run into the query throttle. Discovery could then be delayed so much that we never end up picking up the logs before the container is terminated again. I never looked into it more deeply, but it's still a hypothesis I'd like to validate one day.
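If the throttling hypothesis holds, one knob that might be worth experimenting with is the client-side QPS/burst limit of the Kubernetes API client. A minimal sketch, assuming the kube_client_options setting (client QPS and burst) is available in the Filebeat version in use:

processors:
  - add_kubernetes_metadata:
      host: ${NODE_NAME}
      # Assumed knob: raise the client-side API throttle so pod discovery
      # is less likely to lag behind short-lived containers.
      kube_client_options:
        qps: 50
        burst: 100
      matchers:
        - logs_path:
            logs_path: "/var/log/containers/"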
Hey folks, sorry for the late response here.
@StephanErb, I believe that's a nice hypothesis. I've seen a case where some ...
cc @gsantoro
Sorry for the late response. I've now assigned this issue to myself; it's the highest priority item on my list.
I would like to start with a simplified setup so that we can exclude possible root causes. Filebeat configuration:
- type: container
paths:
- /var/log/containers/*.log
processors:
- add_kubernetes_metadata:
host: ${NODE_NAME}
matchers:
- logs_path:
logs_path: "/var/log/containers/" same cronjob manifest as in #22718. Only thing that changed is the apiVersion (which might have changed since then) apiVersion: batch/v1
kind: CronJob
metadata:
name: hello
namespace: kube-system
spec:
schedule: "*/1 * * * *"
failedJobsHistoryLimit: 10
successfulJobsHistoryLimit: 20
jobTemplate:
spec:
template:
spec:
containers:
- name: hello
image: busybox
args:
- /bin/sh
- -c
- date; echo Hello from the Kubernetes cluster
restartPolicy: OnFailure

This is the list of k8s events for a single run of that cronjob. The order of events is reversed (the first event listed is the last to occur):

│ kube-system 12m Normal SawCompletedJob cronjob/hello 1 │
│ kube-system 12m Normal SuccessfulCreate job/hello-27952816 1 │
│ kube-system 12m Normal Completed job/hello-27952816 1 │
│ kube-system 12m Normal Started pod/hello-27952816-p2zxj 1 │
│ kube-system 12m Normal Scheduled pod/hello-27952816-p2zxj 1 │
│ kube-system 12m Normal Created pod/hello-27952816-p2zxj 1 │
│ kube-system 12m Normal Pulled pod/hello-27952816-p2zxj 1 │
│ kube-system 12m Normal Pulling pod/hello-27952816-p2zxj 1 │

And this is a screenshot of Kibana with the list of all logs for that single pod (and all the other cronjob runs as well). So this use case confirms that logs of these short-lived cronjob pods are collected with the plain container input. For reference, here is the cronjob description:
➜ k describe cronjob hello
Name: hello
Namespace: kube-system
Labels: <none>
Annotations: <none>
Schedule: */1 * * * *
Concurrency Policy: Allow
Suspend: False
Successful Job History Limit: 20
Failed Job History Limit: 10
Starting Deadline Seconds: <unset>
Selector: <unset>
Parallelism: <unset>
Completions: <unset>
Pod Template:
Labels: <none>
Containers:
hello:
Image: busybox
Port: <none>
Host Port: <none>
Args:
/bin/sh
-c
date; echo Hello from the Kubernetes cluster
Environment: <none>
Mounts: <none>
Volumes: <none>
Last Schedule Time: Thu, 23 Feb 2023 16:38:00 +0000
Active Jobs: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 22m cronjob-controller Created job hello-27952816
Normal SawCompletedJob 22m cronjob-controller Saw completed job: hello-27952816, status: Complete
Normal SuccessfulCreate 21m cronjob-controller Created job hello-27952817
Normal SawCompletedJob 21m cronjob-controller Saw completed job: hello-27952817, status: Complete
Normal SuccessfulCreate 20m cronjob-controller Created job hello-27952818

I'll now replicate the same setup, changing only the filebeat config to use autodiscover.
Here is the test with autodiscover. Everything else is the same as in the previous comment. Filebeat configuration to use instead:

filebeat.autodiscover:
providers:
- type: kubernetes
node: ${NODE_NAME}
hints.enabled: true
hints.default_config:
type: container
paths:
- /var/log/containers/*${data.kubernetes.container.id}.log

List of completed cron jobs:

│ 6m30s hello-27952850-gd26q 10.244.0.36 integrations-control-plane Completed 0/1 │
│ 5m30s hello-27952851-478zd 10.244.0.37 integrations-control-plane Completed 0/1 │
│ 4m30s hello-27952852-6nkr9 10.244.0.38 integrations-control-plane Completed 0/1 │
│ 3m30s hello-27952853-g6rhz 10.244.0.39 integrations-control-plane Completed 0/1 │
│ 2m30s hello-27952854-r6stl 10.244.0.40 integrations-control-plane Completed 0/1 │
│ 90s hello-27952855-cf55v 10.244.0.41 integrations-control-plane Completed 0/1 │
│ 30s hello-27952856-flk9w 10.244.0.42 integrations-control-plane Completed 0/1 │

Screenshots from Kibana show the logs ingested in ES. The number of completed cronjobs matches the number of reported events in Kibana.
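As a side note on the hints-based setup above: per-pod behaviour can also be steered with the documented co.elastic.logs annotations on the cronjob's pod template. A minimal sketch (the annotation shown is just the standard enable switch; whether hints help with short-lived pods is exactly what this issue is trying to establish):

jobTemplate:
  spec:
    template:
      metadata:
        annotations:
          co.elastic.logs/enabled: "true"
      spec:
        containers:
          - name: hello
            image: busybox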
How to go forward:
Filebeat configuration:
filebeat.autodiscover:
providers:
- type: kubernetes
cleanup_timeout: 24h
add_resource_metadata:
namespace:
include_labels: [""]
include_annotations: [""]
node:
include_labels: [""]
include_annotations: [""]
deployment: true
cronjob: true
templates:
- config:
- type: container
paths:
- /var/log/containers/*-${data.kubernetes.container.id}.log
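As an aside, not part of the tested configuration above: a condition can be added to the templates block to scope it to specific pods. A minimal sketch assuming we only care about the kube-system namespace where the hello cronjob runs:

templates:
  - condition:
      equals:
        kubernetes.namespace: kube-system
    config:
      - type: container
        paths:
          - /var/log/containers/*-${data.kubernetes.container.id}.log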
This works as expected too
13m hello-27952878-btqkg 10.244.0.50 integrations-control-plane Completed 0/1 │
│ 12m hello-27952879-4n6rt 10.244.0.51 integrations-control-plane Completed 0/1 │
│ 11m hello-27952880-f6pc2 10.244.0.52 integrations-control-plane Completed 0/1 │
│ 10m hello-27952881-d5kl9 10.244.0.53 integrations-control-plane Completed 0/1 │
│ 9m34s hello-27952882-xjpl4 10.244.0.54 integrations-control-plane Completed 0/1 │
│ 8m34s hello-27952883-whxvw 10.244.0.55 integrations-control-plane Completed 0/1 │
│ 7m34s hello-27952884-bh65s 10.244.0.56 integrations-control-plane Completed 0/1 │
│ 6m34s hello-27952885-fqfn8 10.244.0.57 integrations-control-plane Completed 0/1 │
│ 5m34s hello-27952886-7whwf 10.244.0.58 integrations-control-plane Completed 0/1 │
│ 4m34s hello-27952887-b75hs 10.244.0.59 integrations-control-plane Completed 0/1 │
│ 3m34s hello-27952888-sgv85 10.244.0.60 integrations-control-plane Completed 0/1 │
│ 2m34s hello-27952889-qqtcs 10.244.0.61 integrations-control-plane Completed 0/1 │
│ 94s hello-27952890-56hxq 10.244.0.62 integrations-control-plane Completed 0/1 │
│ 34s hello-27952891-gc9gn 10.244.0.63 integrations-control-plane Completed 0/1 │

The corresponding Kibana screenshot again shows all runs ingested.
7m31s kube-system hello-27952913-c9z55 10.244.0.6 integrations-control-plane Completed 0/1 │
│ 6m31s kube-system hello-27952914-rcfjx 10.244.0.7 integrations-control-plane Completed 0/1 │
│ 5m31s kube-system hello-27952915-jlvrg 10.244.0.8 integrations-control-plane Completed 0/1 │
│ 4m31s kube-system hello-27952916-k8s9w 10.244.0.9 integrations-control-plane Completed 0/1 │
│ 3m31s kube-system hello-27952917-59ssr 10.244.0.10 integrations-control-plane Completed 0/1 │
│ 2m31s kube-system hello-27952918-tzwlt 10.244.0.11 integrations-control-plane Completed 0/1 │
│ 91s kube-system hello-27952919-j79zr 10.244.0.12 integrations-control-plane Completed 0/1 │
│ 31s kube-system hello-27952920-hqj54 10.244.0.13 integrations-control-plane Completed 0/1 │

And the usual Kibana page shows all the cronjobs properly logged.
Same thing here, all pods completed:

│ 3m6s kube-system hello-27952933-jvjr5 10.244.0.6 integrations-control-plane Completed 0/1 │
│ 2m6s kube-system hello-27952934-lxpxh 10.244.0.7 integrations-control-plane Completed 0/1 │
│ 66s kube-system hello-27952935-lm5r5 10.244.0.8 integrations-control-plane Completed 0/1 │

All pods are accounted for. Before I move on and start a more complicated setup with AWS EKS, I am wondering if I should pause and consider any other settings, config, or setup that can still be replicated locally without AWS EKS. Does anyone here have a better idea?
I still believe this is either load-related or has a statistical element to it. I would try starting 50 cron jobs (on a 1-2 node cluster) and see how that behaves.
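If the local tests run on kind (the node name integrations-control-plane suggests so, though that is an assumption), a worker node could be added to get closer to that 1-2 node scenario. A minimal sketch of such a kind cluster config:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker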
OK, I think @gsantoro verified that it all works at a small scale. So let's choose one of the above tests and run it for a long time. In more detail:
Let's see if something happens.
A couple of questions:
There is a related thread on Filebeat autodiscover stopping too early when a kubernetes pod terminates over at discuss.elastic.co. It raises an interesting question: could the problem be related to multiple containers within the same pod, and how does this affect pod termination? At least in our case we have Istio running, so the command terminating within the cronjob is never the last container to terminate in the pod. Is that maybe similar for others here who face the problem?
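To check that hypothesis without a full Istio install, the "main container is not the last one to exit" pattern could be reproduced with a plain dummy sidecar. A hypothetical pod template for the hello cronjob (the sidecar name and sleep duration are made up for illustration):

spec:
  containers:
    - name: hello
      image: busybox
      args: ["/bin/sh", "-c", "date; echo Hello from the Kubernetes cluster"]
    - name: fake-sidecar
      # Keeps running after the main container exits, mimicking a proxy
      # sidecar that is terminated later.
      image: busybox
      args: ["/bin/sh", "-c", "sleep 120"]
  restartPolicy: OnFailure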
Hello everyone, I have just finished a performance test on the cloud. This is the environment:
Instead of a cronjob I have used the following job template, since it is easier to deal with parallelism:

apiVersion: batch/v1
kind: Job
metadata:
name: process-item-$ITEM
namespace: kube-system
labels:
jobgroup: jobexample
spec:
completions: 500
parallelism: 50
template:
metadata:
name: jobexample
labels:
jobgroup: jobexample
spec:
containers:
- name: process-item
image: busybox
command: ["sh", "-c", "echo Processing item $ITEM && sleep 5"]
restartPolicy: Never

where the variable $ITEM is substituted for each generated job.
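For clarity, a sketch of what one expanded instance of that template looks like after substitution (assuming $ITEM = 1; only the fields that change are shown):

metadata:
  name: process-item-1
spec:
  template:
    spec:
      containers:
        - name: process-item
          command: ["sh", "-c", "echo Processing item 1 && sleep 5"]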
And these are the results:
As you can see from the previous screenshot, filebeat wasn't able to catch up with the amount of logs produced. I'll have another look at a smaller scale to check if it's just a scaling issue of filebeat or something else.
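One way to verify whether Filebeat itself is the bottleneck would be to watch its internal metrics while the jobs run. A minimal sketch, assuming the standard Beats local stats endpoint is acceptable in this test cluster:

# Expose Filebeat's local stats endpoint (queryable with curl on the node)
# to observe harvester and output metrics during the load test.
http:
  enabled: true
  host: localhost
  port: 5066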
I just finished another test with a lower parallelism and total number of job completions.
And this is what I noticed:
Similar results: all events ingested at the current scale
I'll run another test for a longer period of time but with similar parallelism to see if anything changes, but I am running out of ideas to test.
Hello @StephanErb, sorry for not updating this issue, but the longer-running test wasn't successful in replicating the issue either. I'll have a look at that thread. Thanks for pointing it out.
We experience the same issue mentioned in #22718.
We use the latest Filebeat 8.5.1 on AWS EKS with Amazon Linux 2 EC2 worker nodes.
Sometimes Filebeat loses CronJob Pods' log messages.
For CronJobs that run every minute, we see only 54 messages within one hour instead of 60.
Around 9-10% of CronJob messages are lost.
In comparison, the legacy Filebeat 6.8.13 doesn't lose any messages.
Filebeat version 8.5.1.
AWS EKS is v1.21.14
Filebeat configuration:
Cronjob example:
Example of messages on Kibana.
Here you can see that messages at 15:25, 15:30, and 15:31 are missing: