Display which integration is causing the agent to become unhealthy #100

jlind23 · 2021-12-15T07:56:05Z

When an Elastic Agent becomes unhealthy due to an integration, the only way to understand which integration is causing it is to remove integrations one by one and/or check logs for a particular error.

Elastic Agent should be able to catch when an integration is failing and must be able to report this in the output of the status command and the diagnostics command.

This is a design task shared between Elastic Agent Data Plane and Elastic Agent Control Plane.

elasticmachine · 2021-12-15T08:41:02Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

elasticmachine · 2022-01-13T15:24:21Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

jlind23 · 2022-01-13T15:24:47Z

ping @ph @cmacknz this is a design task to prepare the 8.3 implementation, as it's a cross-team topic.

cmacknz · 2022-03-01T19:42:12Z

@ph is this something that can be handled as part of the V2 control protocol design?

ph · 2022-03-01T19:50:41Z

I think there are 3 parts:

1. Allow the Elastic Agent to hold the state of each input and return it. This is part of the v2 control protocol.
2. Modify the inputs to correctly record their state. This requires changes in the inputs framework, so maybe on the data plane side?
3. Change the Fleet status to map multiple inputs to a single integration.
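
To illustrate the third part, here is a minimal sketch of rolling up per-input states into one status per integration. It is not the Elastic Agent or Fleet implementation: the `State` values, the `InputStatus` fields, and the `rollup` and `unhealthy` helpers are all hypothetical names chosen for the example, and "worst state wins" is an assumed aggregation rule.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import IntEnum


class State(IntEnum):
    # Hypothetical states, ordered by severity so the worst one compares highest.
    HEALTHY = 0
    DEGRADED = 1
    FAILED = 2


@dataclass(frozen=True)
class InputStatus:
    input_id: str      # hypothetical identifier of a single input
    integration: str   # hypothetical name of the integration that owns the input
    state: State
    message: str = ""


def rollup(statuses):
    """Group input statuses by integration and keep the worst state of each group."""
    grouped = defaultdict(list)
    for status in statuses:
        grouped[status.integration].append(status)
    report = {}
    for integration, items in grouped.items():
        worst = max(items, key=lambda s: s.state)
        report[integration] = (worst.state, worst.message)
    return report


def unhealthy(report):
    """Return the sorted names of integrations whose rolled-up state is not healthy."""
    return sorted(name for name, (state, _) in report.items() if state != State.HEALTHY)
```

With a roll-up like this, an agent with one failing input out of many can name the affected integration directly instead of only reporting that the agent as a whole is unhealthy.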

ph · 2022-03-01T19:50:58Z

@blakerouse and I have been discussing 1.

cmacknz · 2022-03-01T19:55:47Z

> Is to modify the input to correctly record their state, this requires changes in the inputs framework, so maybe on data plane side?

Yes, I think this is (or will be) part of the input V2 proposal.

ph · 2022-03-01T20:00:59Z

We are looking at the inputs too, we will have to sync with you and @kvch.

jlind23 · 2022-03-21T08:32:54Z

@ph should I keep it in 8.2 or should I rather reconsider it for another release?

ph · 2022-03-21T12:11:34Z

@jlind23 Agree, we need changes in v2 and in the inputs so they can better report their state. It's a joint effort between the data plane and the control plane.

jlind23 · 2022-04-12T06:40:18Z

@cmacknz @kvch if we start working on the inputs in 8.4, do you think this can be part of that work?
Whatever input we take, we should give more meaningful information about that particular input's state. Thoughts?

cc @ph

philippkahr · 2022-05-02T17:59:15Z

Will the agent go unhealthy only if the integration is not "installable", or will it also catch if an integration is writing error.message fields since it cannot connect to xyz? (e.g. elastic/integrations#3074)

jlind23 · 2022-05-03T06:30:13Z

@philippkahr the goal is to also catch when the integration is failing. But then it's going to be up to the integration developers to properly report statuses.

zez3 · 2022-05-25T05:36:30Z

I get the same abnormal behavior when I change the output from the ES default to a copy of my default with dead letters enabled.

WiegerElastic · 2022-06-20T12:13:47Z

While deploying Agent at Elastic, we have noticed an increasing need for better logging between Agent and Fleet and within Fleet itself. Please let me know if this is the correct place to put these FRs.

We really need some sort of clear indicator of the agent's status. For example, if the agent cannot connect to the Fleet server or to the output ES cluster because the API keys have been revoked, or because of certificate issues, network issues, etc., that should be clearly indicated in the elastic-agent status output. Currently, it mainly says degraded and that's about it. (Generic errors like Authentication using apikey failed - api key has been invalidated aren't specific enough and can also mean something else has failed within ES instead of just something between Agent and Fleet.)
There should be a clear failure status when the agent is unable to connect to the fleet-server due to being unenrolled (or other failure modes).
It would be nice if modifications made in the Fleet UI were better reflected in the Kibana audit log. For example, you can see that we unenrolled some agents with the current logs, but not how many, or which ones.
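
As a rough illustration of the first two requests, the sketch below formats a status report that names each failing component and its reason instead of a bare "degraded". It does not reproduce the real elastic-agent status output; `ComponentStatus`, `format_status`, the component names, and the output layout are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ComponentStatus:
    name: str          # hypothetical component name, e.g. "fleet-server connection"
    healthy: bool
    reason: str = ""   # human-readable cause, reported by the component itself


def format_status(components):
    """Render an overall status line followed by one line per component."""
    overall = "HEALTHY" if all(c.healthy for c in components) else "DEGRADED"
    lines = [f"Status: {overall}"]
    for c in components:
        if c.healthy:
            lines.append(f"  * {c.name}: HEALTHY")
        else:
            # Fall back to an explicit placeholder so a missing reason is visible.
            lines.append(f"  * {c.name}: FAILED ({c.reason or 'no reason reported'})")
    return "\n".join(lines)
```

For example, passing a component named "fleet-server connection" with `healthy=False` and `reason="agent is unenrolled"` yields a line that states the unenrollment explicitly.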

ghost · 2022-07-29T12:28:01Z

Hi @jlind23,

As per the feedback from @kevinlog, we have created test case scenarios for the Endpoint Security integration, as changes are available for Endpoint Security only.

[C167461]: Validate that the errors are shown on the 'Endpoint Security' integration dropdown on Agent details page for Mac OS when 'System extension' and 'Full Disk Access' are not granted
[C167465]: Validate that no error is shown on the 'Endpoint Security' integration dropdown on Agent details page for Mac OS when 'System extension' and 'Full Disk Access' are granted

CC: @joshdover

Thanks!

kevinlog · 2022-07-31T14:01:10Z

@prachigupta-qasource

Sounds good, thank you! Also note that @muskangulati-qasource and @harshitgupta-qasource are testing this functionality in Fleet as a part of OLM testing efforts. I think it's a good idea for both teams to be aware of the functionality and have a test plan, but we may also be able to work together and avoid duplicating too much effort.

cc @manishgupta-qasource

ghost · 2022-08-12T06:40:49Z

Hi Team,

We have executed 2 test cases for this feature under our Fleet test run in the Fleet 8.4.0-BC3 feature test plan and found that it's working fine.

Build details:

Version: 8.4.0-BC3
Build: 55281
Commit: e42c547d7ab545472fd978383c2c43fa203a5b06

Thanks!

aleksmaus · 2022-08-15T17:19:30Z

I pushed the feature-arch-v2 branch to fleet server with an initial cut of propagating the detailed status information
https://github.com/elastic/fleet-server/tree/feature-arch-v2

The corresponding agent side PR is posted against the agent branch
#916

This should cover the agent/fleet-server side of things for:
#100

@joshdover @blake.rouse @ph let me know if there is anything else that needs to be addressed for this feature; I can iterate and update things as needed.

The .fleet-agent document looks like the following at the moment:

aleksmaus · 2022-08-16T13:06:39Z

The fleet server and elastic agent work is complete for now: the agent posts the new extended health information to the stack.

The work is merged to the fleet server and the elastic agent branches respectively:
fleet server feature-arch-v2
elastic agent feature-arch-v2

It's expected that we will iterate on this feature a few more times before release.
@pierrehilbert let me know if you want to keep this open or reassign etc.

jlind23 mentioned this issue Dec 15, 2021

[Elastic Agent] Improve Elastic Agent debuggability elastic/beats#26930

Open

30 tasks

jlind23 added the Team:Elastic-Agent-Control-Plane label Dec 15, 2021

jlind23 added the 8.2-candidate label Dec 15, 2021

jlind23 added the Team:Elastic-Agent-Data-Plane label Jan 13, 2022

jlind23 added v8.2.0 and removed 8.2-candidate labels Jan 28, 2022

jlind23 transferred this issue from elastic/beats Mar 7, 2022

jlind23 removed the v8.2.0 label Mar 23, 2022

jlind23 added the 8.4-candidate label Apr 12, 2022

cmacknz mentioned this issue Apr 14, 2022

Propagate java-attacher errors to Kibana elastic/apm-server#7832

Closed

jlind23 mentioned this issue Apr 27, 2022

Improve Elastic Agent integrations logging #380

Open

ph mentioned this issue Apr 27, 2022

[DESIGN][Enhancement][Elastic Agent] Write down Elastic Agent status report structure #79

Closed

jlind23 added v8.4.0 and removed 8.4-candidate labels May 24, 2022

zez3 mentioned this issue May 29, 2022

Fleet policy deployment on multiple Agents in a consecutive/sequenced 1by1 way #474

Open

jlind23 added the estimation:Month label Jun 1, 2022

eyalkraft mentioned this issue Jun 23, 2022

Support health status interface elastic/cloudbeat#239

Closed

3 tasks

cmacknz mentioned this issue Jul 12, 2022

[Meta] Elastic Agent Inputs elastic/elastic-agent-inputs#1

Closed

33 tasks

aleksmaus mentioned this issue Aug 12, 2022

Expand check-in payload for V2 #916

Merged

2 tasks

aleksmaus mentioned this issue Aug 15, 2022

Support for Elastic Agent V2 status elastic/fleet-server#1747

Merged

2 tasks

pierrehilbert closed this as completed Aug 16, 2022

joshdover mentioned this issue Aug 24, 2022

Update .fleet-agents mappings for integration health status elastic/elasticsearch#89574

Closed

cmacknz mentioned this issue Sep 22, 2022

[Meta] V2 Feature Architecture (8.5/8.6) #836

Closed

52 tasks

juliaElastic mentioned this issue Sep 29, 2022

Report the status message field to Fleet Server checkins #1151

Closed
