[Agent] [Filebeat] when Agent changes policy Filebeat config can trip up and Agent gets stuck on 'unhealthy' #23518

Closed
EricDavisX opened this issue Jan 14, 2021 · 14 comments
Labels: impact:high (Short-term priority; add to current release, or definitely next), Team:Elastic-Agent (Label for the Agent team), v7.11.0

Comments

@EricDavisX
Contributor


Debugging with Blake, we determined this is the relevant log from Elastic Agent:
{"log.level":"info","@timestamp":"2021-01-14T14:59:11.293-0500","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2021-01-14T14:59:11-05:00: type: 'STATE': sub_type: 'RUNNING' message: Application: filebeat--7.11.0-SNAPSHOT[2ab73e20-5695-11eb-9ccb-e9bbb39979c5]: State changed to DEGRADED: 1 error: 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::1179673-2049, Finished: false, Fileinfo: &{syslog 676387 416 {18666649 63746251076 0x67387c0} {2049 1179673 1 33184 0 4 0 0 676387 4096 1328 {1610651828 288434822} {1610654276 18666649} {1610654276 18666649} [0 0 0]}}, Source: /var/log/syslog, Offset: 676387, Timestamp: 2021-01-14 14:58:02.86236151 -0500 EST m=+3168.279766984, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 1179673-2049}","ecs.version":"1.5.0"}

Please include configurations and logs if available.
7.11 BC3 Kibana in cloud + BC3 Agent Debian 9

Steps to Reproduce:

  • Install the Agent; sometimes that is all it takes for it to go unhealthy with this error.
  • If that does not reproduce it, switch to a new policy (even one with the same integrations) and wait.
  • If that doesn't work, switch to yet another policy fairly quickly, to try to 'trap' Filebeat in the middle of a state change when a new request comes in.
EricDavisX added the impact:high, v7.11.0, and Team:Elastic-Agent labels on Jan 14, 2021
@EricDavisX
Contributor Author

Blake's theory / fix:
That is the issue: Elastic Agent tells Filebeat to reload with a new config, and the reload fails.

Filebeat reports this as degraded when it should report it as a failure.

I need to look into it, but I believe that if it were reported as a failure, the Agent would retry and restart Filebeat if needed.

@blakerouse
Contributor

We originally set Beats to report this type of error as a failure, but this caused issues because we did not have conditions on inputs. So we switched to degraded status reporting, so that an input failing to load because it was not supported on that platform would not cause a failure.

Now with proper status reporting and conditions this is the wrong choice. We need to do the following:

  1. Report configuration reload failures inside of Beats as failures (see the sketch after this list).
  2. Ensure that on a failure report the Agent retries the configuration reload. (I believe this is already the case today.)
  3. Filebeat should also be looked at so it doesn't have this issue internally.
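
To make item 1 concrete, here is a minimal sketch of the intended distinction, assuming a status-reporting callback in the spirit of libbeat's management status reporting; the type and constant names below are illustrative, not the actual libbeat API:

```go
package main

import "log"

// Status mirrors the kind of states a Beat can report back to the Agent.
// These names are illustrative; the real constants live in libbeat's
// management package, not here.
type Status int

const (
	Running Status = iota
	Degraded
	Failed
)

// reportReloadResult shows the proposed behavior from item 1: a config
// reload error becomes a hard Failed status (so the Agent restarts the
// Beat) instead of the current Degraded status.
func reportReloadResult(update func(Status, string), err error) {
	if err != nil {
		update(Failed, "configuration reload failed: "+err.Error())
		return
	}
	update(Running, "configuration applied")
}

func main() {
	update := func(s Status, msg string) { log.Printf("status=%d msg=%s", s, msg) }
	reportReloadResult(update, nil)
}
```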

@EricDavisX
Contributor Author

Testing notes: in most cases the Agent process may be in a fine state and you can continue testing. If it is sending data, carry on. If not, you can try changing the policy and changing it back, and see if it applies better the second time.

Confirmed, though: any time we see 'DEGRADED' in the logs, it can be assumed to be the culprit for 'unhealthy' until we get the fix in.

@ph
Contributor

ph commented Jan 15, 2021

@blakerouse Concerning items 1 and 2, this seems like something we can do on our side? The last item, 3, is a more core problem with Filebeat. @urso, do you have an idea of how we could fix that?

@blakerouse
Contributor

I have confirmed that #2 from my list is already handled in the Agent. It will force a restart of Filebeat because the configuration failed to load.

@ph correct, #3 is a core libbeat/Filebeat issue; it would be great if @urso could take a look. I have tracked the error down; it seems to bubble up to:

https://github.com/elastic/beats/blob/master/libbeat/cfgfile/list.go#L95

and the source of the error bubbling up to that level comes from:

https://github.com/elastic/beats/blob/master/filebeat/input/log/input.go#L177
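
For context, here is a simplified approximation of the startup check in the log input that produces the error in the log line above. This is a sketch, not the actual beats source; the field names are illustrative:

```go
package main

import "fmt"

// State is a trimmed-down stand-in for the registry state attached to a
// file; the fields follow the log line above but are illustrative.
type State struct {
	ID       string
	Source   string
	Finished bool
}

// checkStates mirrors the idea behind the error: a new input may only start
// once every related state from the previous run is marked Finished.
func checkStates(states []State) error {
	for _, st := range states {
		if !st.Finished {
			return fmt.Errorf(
				"Can only start an input when all related states are finished: %+v", st)
		}
	}
	return nil
}

func main() {
	err := checkStates([]State{{
		ID:       "native::1179673-2049",
		Source:   "/var/log/syslog",
		Finished: false,
	}})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("input can start")
}
```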

@urso

urso commented Jan 18, 2021

From the logs it looks like the typical "logs" input problem. The logs input, its coordination, and registry updates are all "linked" to the output making progress. If the output is not fast enough or is blocking, then stopping, restarting, reconfiguring, or similar operations can block because the state hasn't made it to the registry yet.

@kvch did refactor the logs input based on the v2 API, but the new input type is named filestream. The new input tries to decouple management updates from the output, such that changes can be applied independently of the output state. If 2 inputs try to read the same file, the second instance waits instead of failing. A file has one unique "harvester" at all times.
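
A minimal sketch of that "wait instead of fail" behavior, assuming one lock per file identity so a file only ever has a single active harvester (purely illustrative; this is not the filestream implementation):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// harvesterLocks holds one mutex per file identity, so a file only ever has
// one active harvester; a second reader blocks until the first releases it.
var (
	mu             sync.Mutex
	harvesterLocks = map[string]*sync.Mutex{}
)

func lockFor(fileID string) *sync.Mutex {
	mu.Lock()
	defer mu.Unlock()
	if _, ok := harvesterLocks[fileID]; !ok {
		harvesterLocks[fileID] = &sync.Mutex{}
	}
	return harvesterLocks[fileID]
}

// harvest waits for exclusive access to the file instead of returning an
// error like "Can only start an input when all related states are finished".
func harvest(fileID string) {
	l := lockFor(fileID)
	l.Lock()
	defer l.Unlock()
	fmt.Println("harvesting", fileID)
	time.Sleep(100 * time.Millisecond)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); harvest("native::1179673-2049") }()
	}
	wg.Wait()
}
```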

The file being collected is /var/log/syslog. Was it configured by the system module?
Updating the mapping in the agent might help.

@kvch are there any non-backwards-compatible settings in the filestream input (settings renamed, added, removed)? If so, someone needs to check our integrations.

@ph
Contributor

ph commented Jan 18, 2021

Thanks @urso,

The file being collected is /var/log/syslog. Was it configured by the system module?
Updating the mapping in the agent might help.

Yes, this is collected by the system module. The problem described has existed for a really long time. The issue is that the code we had to collect errors was subpar, and now it surfaces these problems.

When you say updating the mapping, are you suggesting just aliasing the log input to filestream?

@urso

urso commented Jan 18, 2021

When you say updating the mapping, are you suggesting just aliasing the log input to filestream?

This might be a breaking change, but yes.

But you won't be able to get rid of the log input completely. The event/docker translates to the docker input, which configures the log input plus a few hacks. By the way, the docker input has been superseded by the container input for quite some time, I think, but it does the same tricks the docker input does, so you will still be stuck with the log input for container use cases.

@ruflin
Contributor

ruflin commented Jan 19, 2021

The unfinished state indicates that the previous harvester is still reading the file. As @urso described, this is caused by the ES / output blocking. Do we know in the first place why this happens? Is ES available to receive the data? If yes, I would expect the issue to resolve itself after the harvester completes.

In the case of a new policy with the same input config inside, I would also expect that the reload is not triggered in the first place, as the hash of the input would still be identical?

Do we know what is blocking the input from finishing?
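
To illustrate the reload-skipping expectation above: a minimal sketch, assuming reload logic that hashes each input configuration and only restarts runners whose hash changed (illustrative only; the actual reload logic lives in libbeat/cfgfile):

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// hashConfig derives a stable hash from an input configuration, so that
// reapplying an identical policy yields the same hash.
func hashConfig(cfg map[string]interface{}) string {
	b, _ := json.Marshal(cfg) // good enough for a sketch; real code hashes the parsed config
	return fmt.Sprintf("%x", sha256.Sum256(b))
}

// reload only starts runners for inputs whose configuration hash is new.
func reload(running map[string]bool, incoming []map[string]interface{}) {
	for _, cfg := range incoming {
		h := hashConfig(cfg)
		if running[h] {
			fmt.Println("unchanged input, no restart:", h[:8])
			continue
		}
		fmt.Println("starting runner for new/changed input:", h[:8])
		running[h] = true
	}
}

func main() {
	syslog := map[string]interface{}{"type": "log", "paths": []string{"/var/log/syslog"}}
	running := map[string]bool{}
	reload(running, []map[string]interface{}{syslog}) // first policy: starts the runner
	reload(running, []map[string]interface{}{syslog}) // same input in a new policy: skipped
}
```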

@EricDavisX
Contributor Author

EricDavisX commented Jan 19, 2021

PR to fix item 1 from above, on the Agent side: #23537

  • Requires the latest .10.x package of the System integration to avoid failures due to mismatched design on the package side. I'm glad to see the discussion above, as I was concerned we'd need to coordinate.

@ph
Contributor

ph commented Jan 20, 2021

@EricDavisX I do not understand the following, can you clarify?

Requires the latest .10.x package of the System integration to avoid failures due to mismatched design on the package side. I'm glad to see the discussion above, as I was concerned we'd need to coordinate.

@EricDavisX
Contributor Author

I can clarify: without the latest System integration that has the conditional inputs, we expect a fail-then-restart loop to repeat, making usage / debugging harder if not impossible. The package is available for 8.0 Cloud Kibana usage, and is in the staging / snapshot repos of package storage for those who want to set the Kibana EPR URL manually instead of using Cloud to test.

@EricDavisX
Contributor Author

This is merged into 7.11 now and will be picked up in the next build candidate, BC5, currently slated for next week. The 0.10.7 System integration package has also been merged.

@EricDavisX
Contributor Author

I am seeing this fixed nicely in the 7.11 BC5 build so far.
