
Filebeat autodiscover stopping too early when kubernetes pod terminates #34789

Open · cpaton opened this issue Mar 9, 2023 · 4 comments
Labels: Team:Cloudnative-Monitoring, Team:Elastic-Agent

Comments

cpaton commented Mar 9, 2023

I am using Filebeat with autodiscover within a Kubernetes cluster to capture logs. When a Kubernetes pod terminates, Filebeat immediately stops reading log entries, which can result in log lines at the end of the log not being published.

The Filebeat configuration is set up as follows:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      resource: pod
      scope: node
      hints.enabled: true
      hints.default_config:
        type: container
        paths:
          - /var/log/containers/*-${data.container.id}.log
        fields_under_root: true          
      include_annotations:
        - app.kubernetes.io/version
        - app-release

This issue appears to be the same as the one reported in Filebeat autodiscover should not stop docker prospectors until it reads all lines · Issue #6694 · elastic/beats (github.com), but for the kubernetes input. That issue was fixed by the introduction of the close_timeout configuration option, which is still in place today.

From testing different versions, the behaviour works as expected up to Filebeat 7.12.1, but not from 7.13.0 onwards.

Logs from Filebeat 7.12.1 when a pod terminates:

2023-02-13T20:27:52.120Z	DEBUG	[autodiscover.pod]	kubernetes/pod.go:129	Watcher Pod update for pod: log-provider-0, status: Running
2023-02-13T20:27:52.854Z	DEBUG	[autodiscover.pod]	kubernetes/pod.go:145	Watcher Pod update (terminating)
2023-02-13T20:28:52.856Z	DEBUG	[autodiscover]	autodiscover/autodiscover.go:253	Got a stop event
2023-02-13T20:28:52.857Z	INFO	input/input.go:136	input ticker stopped

Note the 60-second gap between the pod update event received from Kubernetes as the pod terminates and the stop event being published within Filebeat to stop reading the logs.

Logs from Filebeat 7.13.0 when a pod terminates:

2023-02-13T20:26:10.123Z	DEBUG	[autodiscover.pod]	kubernetes/pod.go:145	Watcher Pod update
2023-02-13T20:26:10.128Z	DEBUG	[autodiscover]	autodiscover/autodiscover.go:253	Got a stop event
2023-02-13T20:26:10.129Z	INFO	input/input.go:136	input ticker stopped

Note that the stop event is published immediately.

From some investigation I think it relates to the changes in #24742 - Refactor kubernetes autodiscover to avoid skipping short-living pods. Before that change, the stop event on pod termination was never published immediately; it was always delayed by the close timeout. After that change, the stop event is published immediately if any container within the pod has a non-empty running container status. This is the bit of logic which I think has changed - below is the snippet from before the change:

	if pod.GetObjectMeta().GetDeletionTimestamp() != nil {
		p.logger.Debugf("Watcher Pod update (terminating): %+v", obj)
		// Pod is terminating, don't reload its configuration and ignore the event
		// if some pod is still running, we will receive more events when containers
		// terminate.
		for _, container := range pod.Status.ContainerStatuses {
			if container.State.Running != nil {
				return
			}
		}
		time.AfterFunc(p.config.CleanupTimeout, func() { p.emit(pod, "stop") })
		return
	}

Now the stop event is not delayed, because the pod has not terminated yet; it is only in the process of terminating:

delay := (flag == "stop" && podTerminated(pod, containers))
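
To make the difference concrete, here is a minimal, self-contained sketch of the two decision paths for the state described above (a pod marked for deletion whose single container still reports a running state). The types and function names (podStatus, oldHandleStop, newStopDelay) are illustrative stand-ins of my own, not the actual Beats structures:

package main

import "fmt"

// Simplified stand-ins for the sketch; the real code uses the Kubernetes API structs.
type containerStatus struct {
	Running bool // corresponds to State.Running != nil
}

type podStatus struct {
	Terminating bool // corresponds to DeletionTimestamp != nil
	Containers  []containerStatus
}

// oldHandleStop mirrors the pre-#24742 branch (it assumes the pod already has a
// deletion timestamp): while any container is still running the event is ignored
// (more events will follow); otherwise the stop is scheduled after the cleanup timeout.
func oldHandleStop(p podStatus) string {
	for _, c := range p.Containers {
		if c.Running {
			return "ignore the event and keep reading logs"
		}
	}
	return "emit the stop event after the cleanup timeout"
}

// newStopDelay mirrors the post-#24742 delay flag: delay only when the pod is
// terminating AND none of its containers are running.
func newStopDelay(p podStatus) bool {
	if !p.Terminating {
		return false
	}
	for _, c := range p.Containers {
		if c.Running {
			return false
		}
	}
	return true
}

func main() {
	// The situation from this issue: a terminating pod whose single container
	// still reports a running state in the first watcher update.
	p := podStatus{Terminating: true, Containers: []containerStatus{{Running: true}}}
	fmt.Println("old code:", oldHandleStop(p))      // keeps reading logs
	fmt.Println("new delay flag:", newStopDelay(p)) // false -> stop emitted immediately
}

Running this shows the old path keeps reading logs (and only schedules the stop after the cleanup timeout once no container is running), while the new delay flag comes out false, so the stop event is emitted straight away.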

Note: this was originally posted on the discussion forums but did not get any replies: https://discuss.elastic.co/t/filebeat-autodiscover-stopping-too-early-when-kubernetes-pod-terminates/325491

botelastic bot added the needs_team label Mar 9, 2023
@StephanErb

This could potentially be a duplicate of #34045

@StephanErb

@cpaton do your pods contain a single container or multiple? Your analysis sounds as if the error would only be present if there are multiple containers?


cpaton commented Mar 29, 2023

In this example the pod has a single container.

In the first Kubernetes event seen by Filebeat when the pod terminates, the single container's status has a non-empty running state. It's this non-empty running state which I think prevents the delay: the pod is deemed to be terminating but not terminated, so the input stops immediately. The code before #24742 explicitly checked for containers still running in a terminating pod and didn't emit an event; the current logic looks like this:

func (p *pod) emit(pod *kubernetes.Pod, flag string) {
	...
	delay := (flag == "stop" && podTerminated(pod, containers))
	...
}

// podTerminating returns true if a pod is marked for deletion or is in a phase beyond running.
func podTerminating(pod *kubernetes.Pod) bool {
	if pod.GetObjectMeta().GetDeletionTimestamp() != nil {
		return true
	}

	switch pod.Status.Phase {
	case kubernetes.PodRunning, kubernetes.PodPending:
	default:
		return true
	}

	return false
}

// podTerminated returns true if a pod is terminated, this method considers a
// pod as terminated if none of its containers are running (or going to be running).
func podTerminated(pod *kubernetes.Pod, containers []*containerInPod) bool {
	// Pod is not marked for termination, so it is not terminated.
	if !podTerminating(pod) {
		return false
	}

	// If any container is running, the pod is not terminated yet.
	for _, container := range containers {
		if container.status.State.Running != nil {
			return false
		}
	}

	return true
}
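
To see the values these helpers produce for the single-container case described above, here is a self-contained sketch that re-implements podTerminating and podTerminated against the upstream k8s.io/api types rather than the Beats kubernetes package, with the containers parameter replaced by the pod's own container statuses for simplicity. It is an illustrative re-implementation under those assumptions, not the Beats code itself:

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Standalone copy of the helper above, written against the upstream Kubernetes API types.
func podTerminating(pod *corev1.Pod) bool {
	if pod.GetObjectMeta().GetDeletionTimestamp() != nil {
		return true
	}
	switch pod.Status.Phase {
	case corev1.PodRunning, corev1.PodPending:
	default:
		return true
	}
	return false
}

// Standalone copy of podTerminated, using the pod's container statuses directly.
func podTerminated(pod *corev1.Pod) bool {
	if !podTerminating(pod) {
		return false
	}
	for _, container := range pod.Status.ContainerStatuses {
		if container.State.Running != nil {
			return false
		}
	}
	return true
}

func main() {
	// First watcher update after the pod is deleted: the deletion timestamp is
	// set, but the single container still reports a Running state.
	now := metav1.NewTime(time.Now())
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:              "log-provider-0",
			DeletionTimestamp: &now,
		},
		Status: corev1.PodStatus{
			Phase: corev1.PodRunning,
			ContainerStatuses: []corev1.ContainerStatus{
				{State: corev1.ContainerState{Running: &corev1.ContainerStateRunning{StartedAt: now}}},
			},
		},
	}

	fmt.Println("podTerminating:", podTerminating(pod)) // true: deletion timestamp is set
	fmt.Println("podTerminated:", podTerminated(pod))   // false: the container still reports Running
	fmt.Println("stop delayed:", podTerminated(pod))    // the delay flag -> false, input stops immediately
}

With k8s.io/api and k8s.io/apimachinery in the module, the delay flag comes out false, which matches the immediate "Got a stop event" in the 7.13.0 logs above.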

ycombinator added the Team:Elastic-Agent label Mar 27, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

botelastic bot removed the needs_team label Mar 27, 2024
cmacknz added the Team:Cloudnative-Monitoring label Mar 27, 2024