[Agent] [Filebeat] when Agent changes policy Filebeat config can trip up and Agent gets stuck on 'unhealthy' #23518

Closed
EricDavisX opened this issue Jan 14, 2021 · 14 comments
Labels: impact:high (Short-term priority; add to current release, or definitely next), Team:Elastic-Agent (Label for the Agent team), v7.11.0

Comments

@EricDavisX
Contributor


Debugging with Blake, we determined this is the relevant log from Elastic Agent:
{"log.level":"info","@timestamp":"2021-01-14T14:59:11.293-0500","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2021-01-14T14:59:11-05:00: type: 'STATE': sub_type: 'RUNNING' message: Application: filebeat--7.11.0-SNAPSHOT[2ab73e20-5695-11eb-9ccb-e9bbb39979c5]: State changed to DEGRADED: 1 error: 1 error: Error creating runner from config: Can only start an input when all related states are finished: {Id: native::1179673-2049, Finished: false, Fileinfo: &{syslog 676387 416 {18666649 63746251076 0x67387c0} {2049 1179673 1 33184 0 4 0 0 676387 4096 1328 {1610651828 288434822} {1610654276 18666649} {1610654276 18666649} [0 0 0]}}, Source: /var/log/syslog, Offset: 676387, Timestamp: 2021-01-14 14:58:02.86236151 -0500 EST m=+3168.279766984, TTL: -1ns, Type: log, Meta: map[], FileStateOS: 1179673-2049}","ecs.version":"1.5.0"}

Please include configurations and logs if available.
7.11 BC3 Kibana in cloud + BC3 Agent Debian 9

Steps to Reproduce:

  • Install the Agent; sometimes that is all it takes for it to go unhealthy with this error.
  • If that does not reproduce it, switch to a new policy (even one with the same integrations) and wait.
  • If that doesn't work, switch to yet another policy fairly quickly, to try to 'trap' Filebeat in the middle of a state change when a new request comes in.
EricDavisX added the impact:high, v7.11.0, and Team:Elastic-Agent labels on Jan 14, 2021
@EricDavisX
Contributor Author

Blake's theory / fix:
That is the issue: Elastic Agent tells Filebeat to reload with a new config, and the reload fails.

Filebeat reports this as degraded when it should report it as a failure.

I need to look into it, but I believe that if it were reported as a failure, the Agent would retry and restart Filebeat if needed.

@blakerouse
Contributor

We originally set Beats to report this type of error as a failure, but this caused issues because we did not have conditions on inputs. So we switched to degraded status reporting, so that an input failing to load because it was not supported on that platform would not cause a failure.

Now with proper status reporting and conditions this is the wrong choice. We need to do the following:

  1. Report configuration reload failures inside of Beats as failures (see the sketch after this list).
  2. Ensure that on a failure report the Agent retries the configuration reload. (I believe this is already the case today.)
  3. Filebeat should also be looked at so it doesn't have this issue internally.
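
To make item 1 concrete, here is a minimal sketch of the intended distinction, assuming a status-reporting callback in the spirit of libbeat's management status reporting; the type and constant names below are illustrative, not the actual libbeat API:

```go
package main

import "log"

// Status mirrors the kind of states a Beat can report back to the Agent.
// These names are illustrative; the real constants live in libbeat's
// management package, not here.
type Status int

const (
	Running Status = iota
	Degraded
	Failed
)

// reportReloadResult shows the proposed behavior from item 1: a config
// reload error becomes a hard Failed status (so the Agent restarts the
// Beat) instead of the current Degraded status.
func reportReloadResult(update func(Status, string), err error) {
	if err != nil {
		update(Failed, "configuration reload failed: "+err.Error())
		return
	}
	update(Running, "configuration applied")
}

func main() {
	update := func(s Status, msg string) { log.Printf("status=%d msg=%s", s, msg) }
	reportReloadResult(update, nil)
}
```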

@EricDavisX
Contributor Author

Testing notes: in most cases the Agent process may be in a fine state and you can continue testing. If it is sending data, carry on. If not, you can try changing the policy and changing it back, and see if it applies better the second time.

Confirmed, though: any time we see 'DEGRADED' in the logs, it can be assumed to be the culprit for 'unhealthy' until we get the fix in.

@ph
Contributor

ph commented Jan 15, 2021

@blakerouse Concerning items 1 and 2, this seems like something we can do on our side? The last item, 3, is a more core problem with Filebeat. @urso, do you have an idea of how we could fix that?

@blakerouse
Contributor

I have confirmed that #2 from my list is already handled in the Agent. It will force a restart of Filebeat because the configuration failed to load.

@ph correct, #3 is a core libbeat/Filebeat issue; it would be great if @urso could take a look. I have tracked the error down; it seems to bubble up to:

https://github.com/elastic/beats/blob/master/libbeat/cfgfile/list.go#L95

and the source of the error bubbling up to that level comes from:

https://github.com/elastic/beats/blob/master/filebeat/input/log/input.go#L177
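
For context, here is a simplified approximation of the startup check in the log input that produces the error in the log line above. This is a sketch, not the actual beats source; the field names are illustrative:

```go
package main

import "fmt"

// State is a trimmed-down stand-in for the registry state attached to a
// file; the fields follow the log line above but are illustrative.
type State struct {
	ID       string
	Source   string
	Finished bool
}

// checkStates mirrors the idea behind the error: a new input may only start
// once every related state from the previous run is marked Finished.
func checkStates(states []State) error {
	for _, st := range states {
		if !st.Finished {
			return fmt.Errorf(
				"Can only start an input when all related states are finished: %+v", st)
		}
	}
	return nil
}

func main() {
	err := checkStates([]State{{
		ID:       "native::1179673-2049",
		Source:   "/var/log/syslog",
		Finished: false,
	}})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("input can start")
}
```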

@urso

urso commented Jan 18, 2021

From the logs it looks like the typical "logs" input problem. The logs input, its coordination, and registry updates are all "linked" to the output making progress. If the output is not fast enough or is blocking, then stopping, restarting, reconfiguring, or similar operations can block because the state hasn't made it to the registry yet.

@kvch did refactor the logs input based on the v2 API, but the new input type is named filestream. The new input tries to decouple management updates from the output, such that changes can be applied independently of the output state. If 2 inputs try to read the same file, the second instance waits instead of failing. A file has one unique "harvester" at all times.
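
A minimal sketch of that "wait instead of fail" behavior, assuming one lock per file identity so a file only ever has a single active harvester (purely illustrative; this is not the filestream implementation):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// harvesterLocks holds one mutex per file identity, so a file only ever has
// one active harvester; a second reader blocks until the first releases it.
var (
	mu             sync.Mutex
	harvesterLocks = map[string]*sync.Mutex{}
)

func lockFor(fileID string) *sync.Mutex {
	mu.Lock()
	defer mu.Unlock()
	if _, ok := harvesterLocks[fileID]; !ok {
		harvesterLocks[fileID] = &sync.Mutex{}
	}
	return harvesterLocks[fileID]
}

// harvest waits for exclusive access to the file instead of returning an
// error like "Can only start an input when all related states are finished".
func harvest(fileID string) {
	l := lockFor(fileID)
	l.Lock()
	defer l.Unlock()
	fmt.Println("harvesting", fileID)
	time.Sleep(100 * time.Millisecond)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); harvest("native::1179673-2049") }()
	}
	wg.Wait()
}
```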

The file being collected is /var/log/syslog. Was it configured by the system module?
Updating the mapping in the agent might help.

@kvch are there any non-backwards-compatible settings in the filestream input (settings renamed, added, removed)? If so, someone needs to check our integrations.

@ph
Contributor

ph commented Jan 18, 2021

Thanks @urso,

The file being collected is /var/log/syslog. Was it configured by the system module?
Updating the mapping in the agent might help.

Yes, this is collected by the system module. The problem described has existed for a really long time. The issue is that the code we had to collect errors was subpar, and now it surfaces these problems.

When you say updating the mapping, are you suggesting just aliasing the log input to filestream?

@urso

urso commented Jan 18, 2021

When you say updating the mapping, are you suggesting just aliasing the log input to filestream?

This might be a breaking change, but yes.

But you won't be able to get rid of the log input completely. The event/docker translates to the docker input, which configures the log input plus a few hacks. By the way, the docker input has been superseded by the container input for quite some time, I think, but it does the same tricks the docker input does, so you will still be stuck with the log input for container use cases.

@ruflin
Contributor

ruflin commented Jan 19, 2021

The unfinished state indicates that the previous harvester is still reading the file. As @urso described, this is caused by the ES / output blocking. Do we know in the first place why this happens? Is ES available to receive the data? If yes, I would expect the issue to resolve itself after the harvester completes.

In the case of a new policy with the same input config inside, I would also expect that the reload is not triggered in the first place, as the hash of the input would still be identical?

Do we know what is blocking the input from finishing?
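
To illustrate the reload-skipping expectation above: a minimal sketch, assuming reload logic that hashes each input configuration and only restarts runners whose hash changed (illustrative only; the actual reload logic lives in libbeat/cfgfile):

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// hashConfig derives a stable hash from an input configuration, so that
// reapplying an identical policy yields the same hash.
func hashConfig(cfg map[string]interface{}) string {
	b, _ := json.Marshal(cfg) // good enough for a sketch; real code hashes the parsed config
	return fmt.Sprintf("%x", sha256.Sum256(b))
}

// reload only starts runners for inputs whose configuration hash is new.
func reload(running map[string]bool, incoming []map[string]interface{}) {
	for _, cfg := range incoming {
		h := hashConfig(cfg)
		if running[h] {
			fmt.Println("unchanged input, no restart:", h[:8])
			continue
		}
		fmt.Println("starting runner for new/changed input:", h[:8])
		running[h] = true
	}
}

func main() {
	syslog := map[string]interface{}{"type": "log", "paths": []string{"/var/log/syslog"}}
	running := map[string]bool{}
	reload(running, []map[string]interface{}{syslog}) // first policy: starts the runner
	reload(running, []map[string]interface{}{syslog}) // same input in a new policy: skipped
}
```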

@EricDavisX
Contributor Author

EricDavisX commented Jan 19, 2021

PR to fix item 1 from above, on the Agent side: #23537

  • Requires the latest .10.x package of the System integration to avoid failures due to mismatched design on the package side. I'm glad to see the discussion above, as I was concerned we'd need to coordinate.

@ph
Contributor

ph commented Jan 20, 2021

@EricDavisX I do not understand the following, can you clarify?

Requires the latest .10.x package of the System integration to avoid failures due to mismatched design on the package side. I'm glad to see the discussion above, as I was concerned we'd need to coordinate.

@EricDavisX
Contributor Author

I can clarify: without the latest System integration that has the conditional inputs, we expect a fail-then-restart loop to repeat, making usage / debugging harder if not impossible. The package is available for 8.0 Cloud Kibana usage, and is in the staging / snapshot repos of package storage for those who want to set the Kibana EPR URL manually instead of using Cloud to test.

@EricDavisX
Contributor Author

This is merged into 7.11 now and will be picked up in the next build candidate, BC5, currently slated for next week. The 0.10.7 System integration package has also been merged.

@EricDavisX
Contributor Author

I am seeing this fixed nicely in the 7.11 BC5 build so far.
