Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cherry-pick #23722 to 7.x: Fix leak caused by input runners created when checking their configuration #23803

Merged
merged 1 commit into from
Feb 2, 2021

Conversation

jsoriano
Copy link
Member

@jsoriano jsoriano commented Feb 2, 2021

Cherry-pick of PR #23722 to 7.x branch. Original message:

What does this PR do?

Stop input v1 runners created to check config.

CheckConfig for v1 inputs actually calls the constructors of the inputs. In some cases, as in the log input, the constructor creates resources that are never released unless the runner is stopped. This causes goroutines leaks with autodiscover or other dynamic configurations.

As discovered in #23658, this started to cause leaks since 7.8.0 (specifically since #16715), but I am not sure why this was not an issue before as config checkers were moved there but not really changed. Perhaps before this change input configs were not actually checked.

Considered alternatives

  • Avoid creating goroutines in v1 input builders. I may give a try to this option, but the problem I see is that Start() Run() doesn't return errors, and it may be too late to do some checks expected now in the builder.
  • Add a runner builder that creates a runner using fake contexts and connectors that can be controlled from CheckConfig. Not sure if this would be very different at the end, and we would need to rely on inputs releasing their resources if the context is done.
  • Migrate all inputs to v2. Long term is the best solution, but it can be a big effort to be done now, and we would still need a fix to backport to released branches.
  • Do nothing on CheckConfig for inputs. A feature would be lost.

Why is it important?

Avoid goroutines and other possible leaks with autodiscover in Filebeat.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works.
    • Not easy to add a test for this. It needs at least the module loader, the autodiscover process, and a v1 input. Doing it as system test would be flaky because there are several other goroutines.
    • Added tests to check that v1 inputs don't leak goroutines if their context is stopped after being created.
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Confirm that all v1 inputs can be stopped without being started before.

How to test this PR locally

  • Start filebeat with docker (or kubernetes) autodiscover and -httpprof :6060.
  • Collect a base profile (curl http://localhost:6060/debug/pprof/goroutine -o /tmp/goroutines-base.prof).
  • Start a container, and check that filebeat starts collecting its logs.
  • Check with pprof that new goroutines have been started
    • go tool pprof -top -base /tmp/goroutines-base.prof http://localhost:6060/debug/pprof/goroutine
    • Look for goroutines like CloseOnSignal or SubOutlets.
  • Stop the container.
  • Confirm that eventually (~after cleanup_timeout) the started goroutines are not in pprof anymore.

Related issues

@jsoriano jsoriano added [zube]: In Review backport Team:Integrations Label for the Integrations team labels Feb 2, 2021
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 2, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Feb 2, 2021
…ation (elastic#23722)

Stop input v1 runners created to check config.

`CheckConfig` for v1 inputs actually calls the constructors of the inputs.
In some cases, as in the log input, the constructor creates resources that
are never released unless the runner is stopped. This causes goroutines
leaks with autodiscover or other dynamic configurations.

(cherry picked from commit e6bb5c9)
urso
urso approved these changes Feb 2, 2021
@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #23803 updated

    • Start Time: 2021-02-02T12:03:32.512+0000
  • Duration: 45 min 53 sec

  • Commit: 89aadab

Test stats 🧪

Test Results
Failed 0
Passed 12947
Skipped 2047
Total 14994

Steps errors 1

Expand to view the steps failures

filebeat-build - Install Go/Mage/Python/Docker/Terraform 1.15.7
  • Took 0 min 1 sec . View more details on here
  • Description: .ci/scripts/install-tools.sh

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 12947
Skipped 2047
Total 14994

@jsoriano jsoriano merged commit 8cb3644 into elastic:7.x Feb 2, 2021
@jsoriano jsoriano deleted the backport_23722_7.x branch February 2, 2021 13:38
@zube zube bot removed the [zube]: Done label May 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Team:Integrations Label for the Integrations team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants