[Agent] [Filebeat] when Agent changes policy Filebeat config can trip up and Agent gets stuck on 'unhealthy' #23518
Comments
Blake's theory / fix: Filebeat reports this as degraded when it should report it as a failure. I need to look into it, but I believe that if it were reported as a failure, Agent would try again and then restart Filebeat if it needs to. |
We originally set beats to report this type of error as a Now with proper status reporting and conditions this is the wrong choice. We need to do the following:
|
Testing notes: in most cases the Agent process may be in a fine state and you can continue testing. If it is sending data, carry on. If not, you can try changing the policy and changing it back to see whether it applies better the second time. Confirmed, though: any time we see ‘DEGRADED’ in the logs, it can now be assumed to be the culprit for ‘unhealthy’ until we get the fix in. |
@blakerouse Concerning items 1 and 2: this seems like something we can do on our side? The last item, 3, is more of a core problem with Filebeat. @urso Do you have an idea of how we could fix that? |
I have confirmed that #2 from my list is already handled in Agent. It will force a restart of Filebeat because the configuration failed to load. @ph Correct, #3 is a core libbeat/filebeat issue; it would be great if @urso could take a look. I have tracked the error to here, where it seems to bubble up to: https://github.com/elastic/beats/blob/master/libbeat/cfgfile/list.go#L95 and the source of the error bubbling up to that level is: https://github.com/elastic/beats/blob/master/filebeat/input/log/input.go#L177 |
From the logs it looks like the typical "logs" input problem. The logs input, its coordination, and its registry updates are all "linked" to the output making progress. If the output is not fast enough or is blocking, then stopping, restarting, reconfiguring, or similar operations can block because the state didn't make it to the registry yet. @kvch did refactor the logs input based on the v2 API. But the new input type is named The file being collected is @kvch any non-bc settings in the filestream input (settings renamed, added, removed)? If so, someone needs to check our integrations. |
Thanks @urso,
Yes, this is collected by the system module. The problem described has existed for a really long time. The code we had for collecting errors was subpar, and now that it is better these problems show up. When you said updating the mapping, are you suggesting just aliasing the log input to the filestream input? |
This might be a breaking change, but yes. But you won't be able to get rid of the log input completely. The |
The unfinished state indicates that the previous harvester is still reading the file. As @urso described, this is caused by ES / the output blocking. Do we know why this happens in the first place? Is ES available to receive the data? If yes, I would expect the issue to resolve itself once the harvester completes. In the case of a new policy with the same input config inside, I would also expect the reload not to be triggered at all, since the hash of the input would still be identical. Do we know what is blocking the input from finishing? |
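The expectation about identical hashes can be illustrated with a small sketch. It assumes the input config is available as a plain map and uses a SHA-256 over its JSON encoding; `configHash` and `needsReload` are hypothetical helpers, and libbeat's actual hashing of input configs may work differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// configHash returns a stable hash of a config map. encoding/json sorts
// map keys when marshaling, so equal maps produce equal hashes.
func configHash(cfg map[string]interface{}) (string, error) {
	b, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// needsReload reports whether the new input config differs from the old one.
func needsReload(oldCfg, newCfg map[string]interface{}) (bool, error) {
	oldHash, err := configHash(oldCfg)
	if err != nil {
		return false, err
	}
	newHash, err := configHash(newCfg)
	if err != nil {
		return false, err
	}
	return oldHash != newHash, nil
}

func main() {
	oldCfg := map[string]interface{}{"type": "log", "paths": []string{"/var/log/syslog"}}
	same := map[string]interface{}{"paths": []string{"/var/log/syslog"}, "type": "log"}
	changed := map[string]interface{}{"type": "log", "paths": []string{"/var/log/auth.log"}}

	fmt.Println(needsReload(oldCfg, same))    // same content, no reload expected
	fmt.Println(needsReload(oldCfg, changed)) // different paths, reload expected
}
```

If a policy change leaves an input's hash unchanged, the running input could be kept as-is and the unfinished-state check would never be reached for it.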
PR to fix item 1 from above, on the Agent side: #23537
|
@EricDavisX I do not understand the following. Can you clarify?
|
I can clarify: without the latest System Integration, which has the conditional inputs, we expect a fail-and-then-restart loop to repeat, making usage and debugging harder if not impossible. The package is available in 8.0 cloud Kibana, and is in the staging / snapshot repos of package storage for those who want to set the Kibana EPR URL manually instead of using cloud to test. |
This is merged into 7.11 now and will be picked up in the next build candidate, BC5, currently slated for next week. The 0.10.7 System Integration package has been merged. |
I am seeing this fixed nicely in the 7.11 BC5 build so far |
[Agent] [Filebeat] when Agent changes policy Filebeat config can trip up and Agent gets stuck on 'unhealthy'
While debugging with Blake, we decided this is the relevant log line from Elastic Agent:
{"log.level":"info","@timestamp":"2021-01-14T14:59:11.293-0500","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2021-01-14T14:59:11-05:00: type: 'STATE': sub_type: 'RUNNING' message: Application: filebeat--7.11.0-SNAPSHOT[2ab73e20-5695-11eb-9ccb-e9bbb39979c5]: State changed to DEGRADED: 1 error: 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::1179673-2049, Finished: false, Fileinfo: &{syslog 676387 416 {18666649 63746251076 0x67387c0} {2049 1179673 1 33184 0 4 0 0 676387 4096 1328 {1610651828 288434822} {1610654276 18666649} {1610654276 18666649} [0 0 0]}}, Source: /var/log/syslog, Offset: 676387, Timestamp: 2021-01-14 14:58:02.86236151 -0500 EST m=+3168.279766984, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 1179673-2049}","ecs.version":"1.5.0"}
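For anyone scanning Agent logs for this condition during testing, a small sketch like the following could flag the relevant lines. It assumes the log is one JSON object per line, as in the line above; `agentLogEntry` and `isDegraded` are hypothetical names, and the sample line in `main` is a shortened version of the real one.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// agentLogEntry holds the only fields this sketch needs from one log line.
type agentLogEntry struct {
	Level   string `json:"log.level"`
	Message string `json:"message"`
}

// isDegraded reports whether a JSON log line records a change to DEGRADED.
func isDegraded(line string) (bool, error) {
	var e agentLogEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return false, err
	}
	return strings.Contains(e.Message, "State changed to DEGRADED"), nil
}

func main() {
	// Shortened sample; the message is truncated for brevity.
	line := `{"log.level":"info","message":"Application: filebeat--7.11.0-SNAPSHOT: State changed to DEGRADED: 1 error","ecs.version":"1.5.0"}`
	degraded, err := isDegraded(line)
	fmt.Println(degraded, err)
}
```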
Please include configurations and logs if available.
7.11 BC3 Kibana in cloud + BC3 Agent Debian 9
Steps to Reproduce: