[Elastic Agent] Set status Failed if configuration applying fails #23537

Merged: blakerouse merged 3 commits into elastic:master from fix-libbeat-agent-degraded on Jan 20, 2021

@blakerouse blakerouse commented Jan 15, 2021

What does this PR do?

Adjusts libbeat to report a Failed status when reloading the configuration fails.

Why is it important?

Without this, the running beat stays degraded until the next configuration reload. A failure to apply the configuration is a real error, so Elastic Agent should kill and restart the beat, which it does with this change.

Checklist

  • [x] My code follows the style guidelines of this project
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [x] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

@blakerouse blakerouse added the Team:Elastic-Agent Label for the Agent team label Jan 15, 2021
@blakerouse blakerouse self-assigned this Jan 15, 2021
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jan 15, 2021
@blakerouse blakerouse marked this pull request as ready for review January 15, 2021 18:32
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

ph previously approved these changes Jan 15, 2021
Contributor

@ph ph left a comment

LGTM. I haven't tested it or reproduced the mentioned issue with Filebeat.

@elasticmachine
Collaborator

Pinging @elastic/ingest-management (Team:Ingest Management)

@elasticmachine
Collaborator

elasticmachine commented Jan 15, 2021

💚 Build Succeeded

Build stats

  • Build Cause: Pull request #23537 updated
  • Start Time: 2021-01-19T20:49:31.909+0000
  • Duration: 50 min 31 sec
  • Commit: ccd8bea

Test stats 🧪

Test Results
Failed 0
Passed 5468
Skipped 358
Total 5826

💚 Flaky test report

Tests succeeded.

@michalpristas
Contributor

LGTM.
Small question though: can this lead to a restart loop when the beat is incapable of recognizing the config?

@blakerouse
Contributor Author

@michalpristas Yes, that would be the case.

@EricDavisX
Contributor

/package

@blakerouse
Contributor Author

This seems to cause a restart loop when Filebeat reports the failure back.

[elastic_agent][warn] Elastic Agent status changed to: 'degraded'
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to RESTARTING: Restarting
[elastic_agent][error] 2021-01-19T14:34:40-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to FAILED: 1 error: 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::25689473-64768, Finished: false, Fileinfo: &{secure 2537 384 {230053202 63746680987 0x6827760} {64768 25689473 1 33152 0 0 0 0 2537 4096 8 {1600454102 163846939} {1611084187 230053202} {1611084187 230053202} [0 0 0]}}, Source: /var/log/secure, Offset: 5301, Timestamp: 2021-01-19 14:30:01.136606369 -0500 EST m=+401.357986507, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 25689473-64768}
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to STARTING: Starting
[elastic_agent][error] 2021-01-19T14:34:40-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to CRASHED: exited with code: 1
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to STARTING: Starting
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to RESTARTING: Restarting
[elastic_agent][error] 2021-01-19T14:34:40-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to CRASHED: exited with code: 1
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to STARTING: Starting
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to RESTARTING: Restarting
[elastic_agent][error] 2021-01-19T14:34:40-05:00: type: 'ERROR': sub_type: 'FAILED' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to CRASHED: exited with code: 1
[elastic_agent][info] 2021-01-19T14:34:40-05:00: type: 'STATE': sub_type: 'STARTING' message: Application: filebeat--8.0.0-SNAPSHOT[c13bd550-5a8b-11eb-bc07-4d3e7d66164c]: State changed to STARTING: Starting
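
When reading a log like the one above, it can help to tally the state transitions rather than scan them by eye. The following is a small sketch that counts `State changed to X` occurrences; the `countStates` helper is a hypothetical name, and the sample lines in `main` are shortened stand-ins for the real log lines.

```go
package main

import (
	"fmt"
	"regexp"
)

// stateRe captures the state name in messages such as
// "State changed to CRASHED: exited with code: 1".
var stateRe = regexp.MustCompile(`State changed to ([A-Z]+):`)

// countStates tallies the "State changed to X" transitions found in lines.
// Lines without a transition are ignored.
func countStates(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		if m := stateRe.FindStringSubmatch(line); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Shortened sample lines, not the full log entries.
	lines := []string{
		"message: Application: filebeat: State changed to RESTARTING: Restarting",
		"message: Application: filebeat: State changed to FAILED: 1 error",
		"message: Application: filebeat: State changed to STARTING: Starting",
		"message: Application: filebeat: State changed to CRASHED: exited with code: 1",
	}
	for state, n := range countStates(lines) {
		fmt.Printf("%s=%d\n", state, n)
	}
}
```

Repeated STARTING, RESTARTING, and CRASHED counts within the same second are a quick indicator of the restart loop described here.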

@EricDavisX
Contributor

Filebeat shows this error when changing the Agent policy:
{"log.level":"error","@timestamp":"2021-01-19T15:06:02.971-0500","log.origin":{"file.name":"instance/beat.go","file.line":952},"message":"Exiting: could not start the HTTP server for the API: listen unix /tmp/elastic-agent/default/filebeat/filebeat.sock: bind: no such file or directory","ecs.version":"1.6.0"}

Testing on a clean system, the Default Agent config was up and running on the CentOS Agent, and it was healthy and had all logs monitoring in place as expected.

After changing the policy to one with Endpoint included, the connection to ES seemed to drop for one of the Filebeats, which got the host into a bad state.

@ph
Contributor

ph commented Jan 19, 2021

@EricDavisX @blakerouse Well, it seems that without moving to filestream we cannot fix that problem?

@blakerouse blakerouse force-pushed the fix-libbeat-agent-degraded branch from 95e588a to ccd8bea Compare January 19, 2021 20:48
@blakerouse
Contributor Author

/package

@blakerouse
Contributor Author

@ph No, I think there was another issue in the code that, combined with the restart, caused a restart loop. I think with that fixed this will work correctly.

@EricDavisX I'm going to give it a run-through in the AM.

@mdelapenya
Contributor

mdelapenya commented Jan 19, 2021

Just in case you need to manually re-run the e2e tests for a PR that broke them, possibly because of flakiness: https://github.com/elastic/e2e-testing/tree/master/e2e#running-tests-for-a-beats-pull-request

Besides that, if you need to run them locally:

$> git clone https://github.com/elastic/e2e-testing.git
$> cd e2e-testing
# TAGS is optional and allows you to filter by scenario/test suite.
# BEATS_USE_CI_SNAPSHOTS=true consumes CI artifacts from the GCP bucket.
# ELASTIC_AGENT_VERSION takes the form pr-ID.
# DEVELOPER_MODE=true does not destroy services after the tests run, to allow SSH'ing into them for logs.
# TIMEOUT_FACTOR is applied when waiting for resources, number of hits, or processes (default: 1 * 3 minutes).
# The comments sit above the command because a comment after a trailing backslash breaks the line continuation.
$> SUITE="fleet" \
    TAGS="fleet_mode_agent" \
    BEATS_USE_CI_SNAPSHOTS=true \
    ELASTIC_AGENT_VERSION="pr-23537" \
    DEVELOPER_MODE=true \
    TIMEOUT_FACTOR=3 \
    LOG_LEVEL=TRACE \
    make -C e2e functional-test

@ph ph dismissed their stale review January 20, 2021 19:22

Let's get that tested. I will remove my review.

@ph ph requested a review from EricDavisX January 20, 2021 21:06
@ph
Contributor

ph commented Jan 20, 2021

@EricDavisX Can you approve this PR?

Contributor

@EricDavisX EricDavisX left a comment

I pulled the Agent file generated by beats-ci on GCP, tested it on a Linux CentOS system, and found that the Fleet UI always shows Healthy when I think it should still be healthy, so it is working in this regard. Other issues are logged separately and being triaged. This one is good to go; it is being released in concert with the newer System package, which has the conditional inputs needed.

@blakerouse blakerouse merged commit e0881de into elastic:master Jan 20, 2021
@blakerouse blakerouse deleted the fix-libbeat-agent-degraded branch January 20, 2021 21:18
blakerouse added a commit to blakerouse/beats that referenced this pull request Jan 20, 2021
…astic#23537)

* Set status to Failed if configuration applying fails.

* Add changelog.

* Don't cleanup paths on crash, as it will be restart. Fix ownership.

(cherry picked from commit e0881de)
blakerouse added a commit that referenced this pull request Jan 21, 2021
…3537) (#23600)

blakerouse added a commit that referenced this pull request Jan 21, 2021
…3537) (#23601)
