Refactor kubernetes autodiscover to avoid skipping short-living pods #24742

jsoriano · 2021-03-24T18:23:55Z

What does this PR do?

Refactor logic in kubernetes autodiscover that decides when to generate events to try to address #22718.

Kubernetes autodiscover can generate events without network information now (without host or port/ports). This allows to generate events for pods that haven't started yet, or have succeeded/failed before generating a running event. These events still include the container id, so they can be used to collect logs. Still, no start event is generated if no pod ip and no container ids are available.

Some helpers have been added to obtain relevant information from pods and their containers.

Some additional small refactors are done to improve readability.

Why is it important?

Current logic is checking similar things at the pod and container levels, try to simplify this logic focusing in containers only.

No matter what is the state of the pod, if there is a container running or trying to run, even if it is unhealthy, some configuration should be generated, so logs can be collected.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~I have made corresponding changes to the documentation~~
~~I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works. But more tests would be needed.
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

Fix update so it definitely stops configurations, respecting the cleanup timeout.

Double-check that nothing breaks with events without data.host.
This is logged at the debug level:

2021-04-08T12:53:34.453+0200	DEBUG	[autodiscover]	template/config.go:156	Configuration template cannot be resolved: field 'data.host' not available in event or environment accessing 'hosts' (source:'/home/jaime/tmp/metricbeat-autodiscover-k8s.yml')

Actually check that the refactor solves the issues and doesn't break anything else.
Try to add more testing. A couple of cases added to current tests.
Check that nothing breaks with hints including data.host or data.ports.
- data.ports.* doesn't seem to be working, check if this is a regression. Named ports were not added to container events. They are now.
- Multiple configurations generated for pods with multiple containers, expected? compare with master. False alarm.
Check autodiscover with heartbeat.

Author's notes for the future

When using autodiscover templates, if the condition matches pod conditions, container events will match too because they also include pod metadata. In pods with multiple containers this may lead to events with metadata of the incorrect container. This could be possibly avoided by matching per pod metadata only in the pod events. This is also happening in master, not solving in this PR.
data.ports is not included in container-level events, so it doesn't work when conditions for specific containers are used.

How to test this PR locally

Check that everything keeps running with some autodiscover happy cases.
Check that logs are collected from a pod in CrashLoopBackOff (kubectl run --image=redis redis -- echo foo).

Related issues

Closes Kubernetes autodiscover doesn't discover short living jobs (and pods?) #22718.

Use cases

Collect logs from containers in short-living or failing pods.

elasticmachine · 2021-03-24T20:14:36Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Build Cause: jsoriano commented: /test
Start Time: 2021-04-20T07:56:16.321+0000
Duration: 46 min 55 sec
Commit: 93995c7

Test stats 🧪

Test	Results
Failed	0
Passed	353
Skipped	115
Total	468

Trends 🧪

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test	Results
Failed	0
Passed	353
Skipped	115
Total	468

hukaixuan · 2021-03-31T06:30:39Z

Face the same issue, thanks for your great job! When will this refactor be done?

jsoriano · 2021-03-31T10:51:23Z

@hukaixuan thanks for your interest, I expect to dedicate some time to this in the following days, but I cannot give a fixed timeline, stay tunned to this PR 🙂

segevmatuti1 · 2021-04-01T08:24:52Z

Hey @jsoriano !
encountered the same issue as well, appreciate your work here!
do you have a docker image of your changes I can use to test on my end?

jsoriano · 2021-04-01T09:20:01Z

Hey @jsoriano !
encountered the same issue as well, appreciate your work here!
do you have a docker image of your changes I can use to test on my end?

Hi, this is still a draft, it still doesn't cover all cases, I actually know it doesn't work well with some basic cases.

libbeat/common/kubernetes/watcher.go

This reverts commit 8674201.

jsoriano · 2021-04-14T17:11:47Z

/test

ChrsMark

Overall looks good so far 🍻 :

spotted one tricky line that might be a bug (maybe not :)). I might need to have a second pass with a clearer mind for hidden bug hunting (I believe there are none though)
@jsoriano do you think you can improve testing notes a bit? I guess you have some specific scenarios you used to test this one. Would be nice to have them logged in this PR both for manual testing but also for future reference maybe.

ChrsMark · 2021-04-15T13:46:37Z

libbeat/autodiscover/providers/kubernetes/pod.go

+		// so templates that need network don't generate a config.
+		if c.status.State.Running != nil {
+			if port.Name != "" && port.ContainerPort != 0 {
+				event["ports"] = common.MapStr{port.Name: port.ContainerPort}


🤔 Will this just keep only the last "accessed" port of the loop because we create a new MapStr each time?

Yes, I have changed the behaviour here, I have moved data.ports from the pod event to each container/port event. Let me know if you think this can be a problem.

Before, data.ports was included only in the pod event, so when configurations matched they didn't include the container metadata, only the pod metadata.

Code in this PR doesn't set the whole collection of data.ports in the pod event, it sets a single data.ports.<portname> field in to each container/port event, so when configurations match they include also the container metadata. By doing this each event with a named port is going to contain a single port.

This also fixes something I found misleading during my tests. As data.ports was included only in pod events, templates with conditions matching container fields, like the following one, didn't work. They work now with the changes here.

templates: - condition: contains.kubernetes.container.image: redis config: - module: redis metricsets: ["info", "keyspace"] hosts: "${data.host}:${data.ports.redis}"

You can see more clearly how this is different now in these test cases: https://github.com/elastic/beats/pull/24742/files#diff-9f24ba274864b933a726842c2b9fb8ee487a13545579938f9e373d0463611142R557

Ok, what doesn't work now is this case. I will revert this change and do some more tests.

templates: - condition: contains.kubernetes.pod.name: redis config: - module: redis hosts: "${data.host}:${data.ports.redis}" - module: nginx hosts: "${data.host}:${data.ports.http}"

We need more testing to cover all these cases 😣

Ah I see, it makes sense!

Reverted in 3301827, I will follow with this in a separate PR.

jsoriano · 2021-04-15T14:12:47Z

2. @jroriano do you think you can improve testing notes a bit? I guess you have some specific scenarios you used to test this one. Would be nice to have them logged in this PR both for manual testing but also for future reference maybe.

@ChrsMark I have some quite simple kubernetes manifests, that represent some cases in a list in a private manual test case. I would like to translate this at some moment to system or e2e tests.

jsoriano · 2021-04-15T16:51:25Z

/test

ChrsMark

lgtm, thank you!

ChrsMark · 2021-04-19T07:22:28Z

/test

libbeat/autodiscover/providers/kubernetes/pod.go

exekias

LGTM! let's make sure we test this thoroughly before releasing 😇

jsoriano · 2021-04-19T12:18:27Z

@ChrsMark @exekias Thanks for the reviews!

I did some additional tests and I found that the issue was not actually solved for hints-based autodiscover in filebeat, c9a3515 was missing. Could any of you please do a quick review of this latest change?

mergify · 2021-04-19T22:41:25Z

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b short-living-pods upstream/short-living-pods
git merge upstream/master
git push upstream short-living-pods

jsoriano · 2021-04-20T07:55:56Z

/test

…lastic#24742) Refactor logic in kubernetes autodiscover that decides when to generate events to try to address issues with short-living containers. Kubernetes autodiscover can generate events without network information now (without host or port/ports). This allows to generate events for pods that haven't started yet, or have succeeded/failed before generating a running event. These events still include the container id, so they can be used to collect logs. Still, no start event is generated if no pod ip and no container ids are available. Some helpers have been added to obtain relevant information from pods and their containers. Some additional small refactors are done to improve readability. (cherry picked from commit b4b6e6e)

…-github-pr-comment-template * upstream/master: [Ingest Manager] Keep http and logging config during enroll (elastic#25132) Refactor kubernetes autodiscover to avoid skipping short-living pods (elastic#24742) [libbeat] New decode xml wineventlog processor (elastic#25115) Add svc to agent k8s clusterRole (elastic#25146) Add awsfargate module to collect container logs from Amazon ECS on Fargate (elastic#25041) [Filebeat][Cisco ASA] log enhancement and performance (elastic#24744) Watch kubernetes namespaces for autodiscover metadata for pods (elastic#25117) Cyberark Privileged Access Security module (elastic#24803) [Elastic Agent] Log the container command output with LOGS_PATH (elastic#25150) Fix for tests after `device...` field has been removed (elastic#25141) [Ingest Manager] Restart process on output change (elastic#24907) Set --insecure in container when FLEET_SERVER_ENABLE and FLEET_INSECURE set. (elastic#25137) [filebeat] Update documentation / changelog / beta warnings for the syslog input (elastic#25047) Add support for ignore_inactive in filestream input (elastic#25036) Fix bug with annotations dedot config on k8s not used (elastic#25111)

…24742) (#25167) Refactor logic in kubernetes autodiscover that decides when to generate events to try to address issues with short-living containers. Kubernetes autodiscover can generate events without network information now (without host or port/ports). This allows to generate events for pods that haven't started yet, or have succeeded/failed before generating a running event. These events still include the container id, so they can be used to collect logs. Still, no start event is generated if no pod ip and no container ids are available. Some helpers have been added to obtain relevant information from pods and their containers. Some additional small refactors are done to improve readability. (cherry picked from commit b4b6e6e)

…ng-versions-stack * upstream/master: (28 commits) Add support for parsers in filestream input (elastic#24763) Skip flaky test TestFilestreamTruncate (elastic#25218) backport: Add 7.13 branch (elastic#25189) Update decode_json_fields.asciidoc (elastic#25056) [Elastic Agent] Fix status and inspect command to work inside running container (elastic#25204) Check native environment before starting (elastic#25186) Change event.code and winlog.event_id type (elastic#25176) [Ingest Manager] Proxy processes/elastic-agent to stats (elastic#25193) Update mergify backporting to 7.x and 7.13 (elastic#25196) [Heartbeat]: ensure synthetics version co* [Heartbeat]: ensure synthetics version compatability for suites * address review and fix notice * fix lowercase struct * fix version conflict and rebase * update go.* stuff to master * fix notice.txt * move validate inside sourcempatability for suites (elastic#24777) [Filebeat] Ensure Kibana audit `event.category` and `event.type` are still processed as strings. (elastic#25101) Update replace.asciidoc (elastic#25055) Fix nil panic when overwriting metadata (elastic#24741) [Filebeat] Add Malware Bazaar to Threat Intel Module (elastic#24570) Fix k8s svc selectors mapping (elastic#25169) [Ingest Manager] Make agent retry values for bootstraping configurable (elastic#25163) [Metricbeat] Remove elasticsearc.index.created from the SM code (elastic#25113) [Ingest Manager] Keep http and logging config during enroll (elastic#25132) Refactor kubernetes autodiscover to avoid skipping short-living pods (elastic#24742) [libbeat] New decode xml wineventlog processor (elastic#25115) ...

jsoriano added the Team:Integrations Label for the Integrations team label Mar 24, 2021

jsoriano self-assigned this Mar 24, 2021

botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 24, 2021

jsoriano added 5 commits April 6, 2021 19:04

Initial simplification

3b8fa45

Common code to instantiate shared informer

f034193

Use normal workqueue

2cda241

Recover cleanup timeout logic

fcce3fd

Fix termination of normal containers

b3aef73

jsoriano force-pushed the short-living-pods branch from bc37e73 to b3aef73 Compare April 6, 2021 19:52

jsoriano added 9 commits April 7, 2021 20:53

Generate configs when not having a host and for completed pods

b5f3ab3

Merge remote-tracking branch 'origin/master' into short-living-pods

ed1107f

Fix comments

a84b170

Don't include empty hosts in pod events

a80b9d0

Adjust tests

e274057

Refactor annotations collection

96db105

Add comment on publishAll

ba02668

Add test cases for succeeded pod

43cf27f

Reorder pod terminating check

8674201

jsoriano added the test-plan Add this PR to be manual test plan label Apr 8, 2021

jsoriano added 2 commits April 8, 2021 13:18

Add changelog

8df4283

Merge remote-tracking branch 'origin/master' into short-living-pods

abe84c8

jsoriano commented Apr 8, 2021

View reviewed changes

libbeat/common/kubernetes/watcher.go Show resolved Hide resolved

jsoriano added 2 commits April 8, 2021 13:44

Remove network info from pods if there is no running container

22266a7

Revert "Reorder pod terminating check"

000b1b9

This reverts commit 8674201.

ChrsMark reviewed Apr 15, 2021

View reviewed changes

jsoriano added 2 commits April 15, 2021 17:14

Revert change in data.ports.

3301827

Merge remote-tracking branch 'origin/master' into short-living-pods

964795b

ChrsMark approved these changes Apr 19, 2021

View reviewed changes

exekias reviewed Apr 19, 2021

View reviewed changes

libbeat/autodiscover/providers/kubernetes/pod.go Show resolved Hide resolved

exekias approved these changes Apr 19, 2021

View reviewed changes

Fix filebeat hints

c9a3515

ChrsMark approved these changes Apr 19, 2021

View reviewed changes

Merge remote-tracking branch 'origin/master' into short-living-pods

c935db5

Merge remote-tracking branch 'origin/master' into short-living-pods

93995c7

jsoriano added the needs_backport PR is waiting to be backported to other branches. label Apr 19, 2021

jsoriano merged commit b4b6e6e into elastic:master Apr 20, 2021

jsoriano deleted the short-living-pods branch April 20, 2021 08:56

jsoriano mentioned this pull request Apr 20, 2021

Cherry-pick #24742 to 7.x: Refactor kubernetes autodiscover to avoid skipping short-living pods #25167

Merged

14 tasks

jsoriano added v7.13.0 and removed needs_backport PR is waiting to be backported to other branches. labels Apr 20, 2021

jsoriano mentioned this pull request Apr 21, 2021

Kubernetes autodiscover suite elastic/e2e-testing#1064

Merged

13 tasks

andresrc added the test-plan-added This PR has been added to the test plan label Apr 22, 2021

cpaton mentioned this pull request Mar 9, 2023

Filebeat autodiscover stopping too early when kubernetes pod terminates #34789

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor kubernetes autodiscover to avoid skipping short-living pods #24742

Refactor kubernetes autodiscover to avoid skipping short-living pods #24742

jsoriano commented Mar 24, 2021 •

edited

Loading

elasticmachine commented Mar 24, 2021 •

edited by jenkins-beats-ci bot

Loading

Build stats

Test stats 🧪

Trends 🧪

Test stats 🧪

hukaixuan commented Mar 31, 2021

jsoriano commented Mar 31, 2021

segevmatuti1 commented Apr 1, 2021

jsoriano commented Apr 1, 2021

jsoriano commented Apr 14, 2021

ChrsMark left a comment •

edited

Loading

ChrsMark Apr 15, 2021

jsoriano Apr 15, 2021

jsoriano Apr 15, 2021

jsoriano Apr 15, 2021

ChrsMark Apr 15, 2021

jsoriano Apr 15, 2021

jsoriano commented Apr 15, 2021

jsoriano commented Apr 15, 2021

ChrsMark left a comment

ChrsMark commented Apr 19, 2021

exekias left a comment

jsoriano commented Apr 19, 2021 •

edited

Loading

mergify bot commented Apr 19, 2021

jsoriano commented Apr 20, 2021

Refactor kubernetes autodiscover to avoid skipping short-living pods #24742

Refactor kubernetes autodiscover to avoid skipping short-living pods #24742

Conversation

jsoriano commented Mar 24, 2021 • edited Loading

What does this PR do?

Why is it important?

Checklist

Author's Checklist

Author's notes for the future

How to test this PR locally

Related issues

Use cases

elasticmachine commented Mar 24, 2021 • edited by jenkins-beats-ci bot Loading

💚 Build Succeeded

Build stats

Test stats 🧪

Trends 🧪

💚 Flaky test report

Test stats 🧪

hukaixuan commented Mar 31, 2021

jsoriano commented Mar 31, 2021

segevmatuti1 commented Apr 1, 2021

jsoriano commented Apr 1, 2021

jsoriano commented Apr 14, 2021

ChrsMark left a comment • edited Loading

Choose a reason for hiding this comment

ChrsMark Apr 15, 2021

Choose a reason for hiding this comment

jsoriano Apr 15, 2021

Choose a reason for hiding this comment

jsoriano Apr 15, 2021

Choose a reason for hiding this comment

jsoriano Apr 15, 2021

Choose a reason for hiding this comment

ChrsMark Apr 15, 2021

Choose a reason for hiding this comment

jsoriano Apr 15, 2021

Choose a reason for hiding this comment

jsoriano commented Apr 15, 2021

jsoriano commented Apr 15, 2021

ChrsMark left a comment

Choose a reason for hiding this comment

ChrsMark commented Apr 19, 2021

exekias left a comment

Choose a reason for hiding this comment

jsoriano commented Apr 19, 2021 • edited Loading

mergify bot commented Apr 19, 2021

jsoriano commented Apr 20, 2021

jsoriano commented Mar 24, 2021 •

edited

Loading

elasticmachine commented Mar 24, 2021 •

edited by jenkins-beats-ci bot

Loading

ChrsMark left a comment •

edited

Loading

jsoriano commented Apr 19, 2021 •

edited

Loading