Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor kubernetes autodiscover to avoid skipping short-living pods #24742

Merged
merged 36 commits into from
Apr 20, 2021

Conversation

jsoriano
Copy link
Member

@jsoriano jsoriano commented Mar 24, 2021

What does this PR do?

Refactor logic in kubernetes autodiscover that decides when to generate events to try to address #22718.

Kubernetes autodiscover can generate events without network information now (without host or port/ports). This allows to generate events for pods that haven't started yet, or have succeeded/failed before generating a running event. These events still include the container id, so they can be used to collect logs. Still, no start event is generated if no pod ip and no container ids are available.

Some helpers have been added to obtain relevant information from pods and their containers.

Some additional small refactors are done to improve readability.

Why is it important?

Current logic is checking similar things at the pod and container levels, try to simplify this logic focusing in containers only.

No matter what is the state of the pod, if there is a container running or trying to run, even if it is unhealthy, some configuration should be generated, so logs can be collected.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. But more tests would be needed.
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Fix update so it definitely stops configurations, respecting the cleanup timeout.
  • Double-check that nothing breaks with events without data.host.
    This is logged at the debug level:
    2021-04-08T12:53:34.453+0200	DEBUG	[autodiscover]	template/config.go:156	Configuration template cannot be resolved: field 'data.host' not available in event or environment accessing 'hosts' (source:'/home/jaime/tmp/metricbeat-autodiscover-k8s.yml')
    
  • Actually check that the refactor solves the issues and doesn't break anything else.
  • Try to add more testing. A couple of cases added to current tests.
  • Check that nothing breaks with hints including data.host or data.ports.
    • data.ports.* doesn't seem to be working, check if this is a regression. Named ports were not added to container events. They are now.
    • Multiple configurations generated for pods with multiple containers, expected? compare with master. False alarm.
  • Check autodiscover with heartbeat.

Author's notes for the future

  • When using autodiscover templates, if the condition matches pod conditions, container events will match too because they also include pod metadata. In pods with multiple containers this may lead to events with metadata of the incorrect container. This could be possibly avoided by matching per pod metadata only in the pod events. This is also happening in master, not solving in this PR.
  • data.ports is not included in container-level events, so it doesn't work when conditions for specific containers are used.

How to test this PR locally

  • Check that everything keeps running with some autodiscover happy cases.
  • Check that logs are collected from a pod in CrashLoopBackOff (kubectl run --image=redis redis -- echo foo).

Related issues

Use cases

Collect logs from containers in short-living or failing pods.

@jsoriano jsoriano added the Team:Integrations Label for the Integrations team label Mar 24, 2021
@jsoriano jsoriano self-assigned this Mar 24, 2021
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 24, 2021
@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 24, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: jsoriano commented: /test

  • Start Time: 2021-04-20T07:56:16.321+0000

  • Duration: 46 min 55 sec

  • Commit: 93995c7

Test stats 🧪

Test Results
Failed 0
Passed 353
Skipped 115
Total 468

Trends 🧪

Image of Build Times

Image of Tests

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 353
Skipped 115
Total 468

@hukaixuan
Copy link

Face the same issue, thanks for your great job! When will this refactor be done?

@jsoriano
Copy link
Member Author

@hukaixuan thanks for your interest, I expect to dedicate some time to this in the following days, but I cannot give a fixed timeline, stay tunned to this PR 🙂

@segevmatuti1
Copy link

Hey @jsoriano !
encountered the same issue as well, appreciate your work here!
do you have a docker image of your changes I can use to test on my end?

@jsoriano
Copy link
Member Author

jsoriano commented Apr 1, 2021

Hey @jsoriano !
encountered the same issue as well, appreciate your work here!
do you have a docker image of your changes I can use to test on my end?

Hi, this is still a draft, it still doesn't cover all cases, I actually know it doesn't work well with some basic cases.

@jsoriano jsoriano force-pushed the short-living-pods branch from bc37e73 to b3aef73 Compare April 6, 2021 19:52
@jsoriano jsoriano added the test-plan Add this PR to be manual test plan label Apr 8, 2021
@jsoriano
Copy link
Member Author

/test

Copy link
Member

@ChrsMark ChrsMark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good so far 🍻 :

  1. spotted one tricky line that might be a bug (maybe not :)). I might need to have a second pass with a clearer mind for hidden bug hunting (I believe there are none though)
  2. @jsoriano do you think you can improve testing notes a bit? I guess you have some specific scenarios you used to test this one. Would be nice to have them logged in this PR both for manual testing but also for future reference maybe.

// so templates that need network don't generate a config.
if c.status.State.Running != nil {
if port.Name != "" && port.ContainerPort != 0 {
event["ports"] = common.MapStr{port.Name: port.ContainerPort}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Will this just keep only the last "accessed" port of the loop because we create a new MapStr each time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have changed the behaviour here, I have moved data.ports from the pod event to each container/port event. Let me know if you think this can be a problem.

Before, data.ports was included only in the pod event, so when configurations matched they didn't include the container metadata, only the pod metadata.

Code in this PR doesn't set the whole collection of data.ports in the pod event, it sets a single data.ports.<portname> field in to each container/port event, so when configurations match they include also the container metadata. By doing this each event with a named port is going to contain a single port.

This also fixes something I found misleading during my tests. As data.ports was included only in pod events, templates with conditions matching container fields, like the following one, didn't work. They work now with the changes here.

      templates:
        - condition:
            contains.kubernetes.container.image: redis
          config:
            - module: redis
              metricsets: ["info", "keyspace"]
              hosts: "${data.host}:${data.ports.redis}"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, what doesn't work now is this case. I will revert this change and do some more tests.

      templates:
        - condition:
            contains.kubernetes.pod.name: redis
          config:
            - module: redis
              hosts: "${data.host}:${data.ports.redis}"
            - module: nginx
              hosts: "${data.host}:${data.ports.http}"

We need more testing to cover all these cases 😣

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, it makes sense!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted in 3301827, I will follow with this in a separate PR.

@jsoriano
Copy link
Member Author

2. @jroriano do you think you can improve testing notes a bit? I guess you have some specific scenarios you used to test this one. Would be nice to have them logged in this PR both for manual testing but also for future reference maybe.

@ChrsMark I have some quite simple kubernetes manifests, that represent some cases in a list in a private manual test case. I would like to translate this at some moment to system or e2e tests.

@jsoriano
Copy link
Member Author

/test

Copy link
Member

@ChrsMark ChrsMark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thank you!

@ChrsMark
Copy link
Member

/test

Copy link
Contributor

@exekias exekias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! let's make sure we test this thoroughly before releasing 😇

@jsoriano
Copy link
Member Author

jsoriano commented Apr 19, 2021

@ChrsMark @exekias Thanks for the reviews!

I did some additional tests and I found that the issue was not actually solved for hints-based autodiscover in filebeat, c9a3515 was missing. Could any of you please do a quick review of this latest change?

@mergify
Copy link
Contributor

mergify bot commented Apr 19, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b short-living-pods upstream/short-living-pods
git merge upstream/master
git push upstream short-living-pods

@jsoriano jsoriano added the needs_backport PR is waiting to be backported to other branches. label Apr 19, 2021
@jsoriano
Copy link
Member Author

/test

@jsoriano jsoriano merged commit b4b6e6e into elastic:master Apr 20, 2021
@jsoriano jsoriano deleted the short-living-pods branch April 20, 2021 08:56
jsoriano added a commit to jsoriano/beats that referenced this pull request Apr 20, 2021
…lastic#24742)

Refactor logic in kubernetes autodiscover that decides when
to generate events to try to address issues with short-living
containers.

Kubernetes autodiscover can generate events without network
information now (without host or port/ports). This allows to generate
events for pods that haven't started yet, or have succeeded/failed
before generating a running event. These events still include the
container id, so they can be used to collect logs. Still, no start event
is generated if no pod ip and no container ids are available.

Some helpers have been added to obtain relevant information from
pods and their containers.

Some additional small refactors are done to improve readability.

(cherry picked from commit b4b6e6e)
@jsoriano jsoriano added v7.13.0 and removed needs_backport PR is waiting to be backported to other branches. labels Apr 20, 2021
v1v added a commit to v1v/beats that referenced this pull request Apr 20, 2021
…-github-pr-comment-template

* upstream/master:
  [Ingest Manager] Keep http and logging config during enroll (elastic#25132)
  Refactor kubernetes autodiscover to avoid skipping short-living pods (elastic#24742)
  [libbeat] New decode xml wineventlog processor (elastic#25115)
  Add svc to agent k8s clusterRole (elastic#25146)
  Add awsfargate module to collect container logs from Amazon ECS on Fargate (elastic#25041)
  [Filebeat][Cisco ASA] log enhancement and performance (elastic#24744)
  Watch kubernetes namespaces for autodiscover metadata for pods (elastic#25117)
  Cyberark Privileged Access Security module (elastic#24803)
  [Elastic Agent] Log the container command output with LOGS_PATH (elastic#25150)
  Fix for tests after `device...` field has been removed (elastic#25141)
  [Ingest Manager] Restart process on output change (elastic#24907)
  Set --insecure in container when FLEET_SERVER_ENABLE and FLEET_INSECURE set. (elastic#25137)
  [filebeat] Update documentation / changelog / beta warnings for the syslog input (elastic#25047)
  Add support for ignore_inactive in filestream input (elastic#25036)
  Fix bug with annotations dedot config on k8s not used (elastic#25111)
jsoriano added a commit that referenced this pull request Apr 20, 2021
…24742) (#25167)

Refactor logic in kubernetes autodiscover that decides when
to generate events to try to address issues with short-living
containers.

Kubernetes autodiscover can generate events without network
information now (without host or port/ports). This allows to generate
events for pods that haven't started yet, or have succeeded/failed
before generating a running event. These events still include the
container id, so they can be used to collect logs. Still, no start event
is generated if no pod ip and no container ids are available.

Some helpers have been added to obtain relevant information from
pods and their containers.

Some additional small refactors are done to improve readability.

(cherry picked from commit b4b6e6e)
@andresrc andresrc added the test-plan-added This PR has been added to the test plan label Apr 22, 2021
v1v added a commit to v1v/beats that referenced this pull request Apr 22, 2021
…ng-versions-stack

* upstream/master: (28 commits)
  Add support for parsers in filestream input (elastic#24763)
  Skip flaky test TestFilestreamTruncate (elastic#25218)
  backport: Add 7.13 branch (elastic#25189)
  Update decode_json_fields.asciidoc (elastic#25056)
  [Elastic Agent] Fix status and inspect command to work inside running container (elastic#25204)
  Check native environment before starting (elastic#25186)
  Change event.code and winlog.event_id type (elastic#25176)
  [Ingest Manager] Proxy processes/elastic-agent to stats (elastic#25193)
  Update mergify backporting to 7.x and 7.13 (elastic#25196)
  [Heartbeat]: ensure synthetics version co* [Heartbeat]: ensure synthetics version compatability for suites  * address review and fix notice  * fix lowercase struct  * fix version conflict and rebase  * update go.* stuff to master  * fix notice.txt  * move validate inside sourcempatability for suites (elastic#24777)
  [Filebeat] Ensure Kibana audit `event.category` and `event.type` are still processed as strings. (elastic#25101)
  Update replace.asciidoc (elastic#25055)
  Fix nil panic when overwriting metadata (elastic#24741)
  [Filebeat] Add Malware Bazaar to Threat Intel Module (elastic#24570)
  Fix k8s svc selectors mapping (elastic#25169)
  [Ingest Manager] Make agent retry values for bootstraping configurable (elastic#25163)
  [Metricbeat] Remove elasticsearc.index.created from the SM code (elastic#25113)
  [Ingest Manager] Keep http and logging config during enroll (elastic#25132)
  Refactor kubernetes autodiscover to avoid skipping short-living pods (elastic#24742)
  [libbeat] New decode xml wineventlog processor (elastic#25115)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Integrations Label for the Integrations team test-plan Add this PR to be manual test plan test-plan-added This PR has been added to the test plan v7.13.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Kubernetes autodiscover doesn't discover short living jobs (and pods?)
7 participants