Display which integration is causing the agent to become unhealthy #100

Closed
Tracked by #26930 ...
jlind23 opened this issue Dec 15, 2021 · 20 comments
Labels
estimation:Month (Task that represents a month of work.)
Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)
Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team)
v8.4.0

Comments

@jlind23
Contributor

jlind23 commented Dec 15, 2021

When an Elastic Agent becomes unhealthy due to an integration, the only way to find out which integration is responsible is to remove integrations one by one and/or check the logs for a particular error.

Elastic Agent should be able to detect when an integration is failing and must report this in the output of the status command and the diagnostics command.

This is a design task shared between Elastic Agent Data Plane and Elastic Agent Control Plane.

@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@jlind23 jlind23 added the Team:Elastic-Agent-Data-Plane label on Jan 13, 2022
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@jlind23
Contributor Author

jlind23 commented Jan 13, 2022

ping @ph @cmacknz this is a design task to prepare the 8.3 implementation, as it's a cross-team topic.

@cmacknz
Member

cmacknz commented Mar 1, 2022

@ph is this something that can be handled as part of the V2 control protocol design?

@ph
Contributor

ph commented Mar 1, 2022

I think there are 3 parts:

  1. Allow the Elastic Agent to hold the state of each input and return it. This is part of the v2 control protocol.
  2. Modify the inputs to correctly record their state. This requires changes in the inputs framework, so maybe on the data plane side?
  3. Change the Fleet status to map multiple inputs to a single integration.
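
To make the three parts concrete, here is a minimal sketch of how per-input states could be rolled up into a per-integration status. Everything in it is hypothetical: the `State` levels, the `InputStatus` fields, and the `rollup` helper are illustrative names, not the v2 control protocol or the Fleet data model.

```python
from dataclasses import dataclass
from enum import IntEnum


class State(IntEnum):
    """Hypothetical health levels, ordered so that a larger value is worse."""
    HEALTHY = 0
    DEGRADED = 1
    FAILED = 2


@dataclass(frozen=True)
class InputStatus:
    """State reported by one input (parts 1 and 2). Field names are assumptions."""
    input_id: str
    integration: str
    state: State
    message: str = ""


def rollup(inputs):
    """Map several inputs onto one status per integration (part 3).

    Returns {integration: (worst_state, [messages from non-healthy inputs])}.
    """
    result = {}
    for inp in inputs:
        state, messages = result.get(inp.integration, (State.HEALTHY, []))
        if inp.state is not State.HEALTHY:
            messages = messages + [f"{inp.input_id}: {inp.message}"]
        # The integration is only as healthy as its worst input.
        result[inp.integration] = (max(state, inp.state), messages)
    return result


def unhealthy_integrations(inputs):
    """Names of integrations with at least one non-healthy input, sorted."""
    return sorted(
        name for name, (state, _) in rollup(inputs).items()
        if state is not State.HEALTHY
    )
```

With a roll-up like this, a status or diagnostics command could name the failing integration directly instead of reporting only an overall unhealthy agent.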

@ph
Contributor

ph commented Mar 1, 2022

@blakerouse and I have been discussing 1.

@cmacknz
Member

cmacknz commented Mar 1, 2022

  2. Modify the inputs to correctly record their state. This requires changes in the inputs framework, so maybe on the data plane side?

Yes, I think this is (or will be) part of the input V2 proposal.

@ph
Contributor

ph commented Mar 1, 2022

We are looking at the inputs too, we will have to sync with you and @kvch.

@jlind23 jlind23 transferred this issue from elastic/beats on Mar 7, 2022
@jlind23
Contributor Author

jlind23 commented Mar 21, 2022

@ph should I keep it in 8.2 or should I rather reconsider it for another release?

@ph
Contributor

ph commented Mar 21, 2022

@jlind23 Agreed. We need changes in v2 and in the inputs so that they can better report their state. It's a joint effort between the data plane and control plane teams.

@jlind23
Contributor Author

jlind23 commented Apr 12, 2022

@cmacknz @kvch if we start working on the inputs in 8.4, do you think this can be part of that work?
Whichever input we pick, we should provide more meaningful information about that input's state. Thoughts?

cc @ph

@philippkahr
Contributor

Will the agent go unhealthy only if the integration is not "installable", or will it also catch if an integration is writing error.message fields since it cannot connect to xyz? (e.g. elastic/integrations#3074)

@jlind23
Copy link
Contributor Author

jlind23 commented May 3, 2022

@philippkahr the goal is also to catch when the integration is failing. But then it will be up to the integration developers to report statuses properly.

@zez3

zez3 commented May 25, 2022

I've got the same abnormal behavior when I change the output from the ES default to a copy of my default with dead letters enabled.

@WiegerElastic

While deploying Agent at Elastic, we have noticed an increasing need for better logging between Agent and Fleet and within Fleet itself. Please let me know if this is the correct place to put these FRs.

  • We really need a clear indicator of the agent's status. For example, if the agent cannot connect to the Fleet server or to the output ES cluster because the API keys have been revoked, or because of certificate issues, network issues, etc., that should be clearly indicated in the elastic-agent status output. Currently, it mainly says degraded and that's about it. (Generic errors like Authentication using apikey failed - api key has been invalidated aren't specific enough and can also mean something else has failed within ES rather than just something between Agent and Fleet.)
  • There should be a clear failure status when the agent is unable to connect to the fleet-server due to being unenrolled (or other failure modes).
  • It would be nice if modifications made in the Fleet UI were better reflected in the Kibana audit log. For example, you can see that we unenrolled some agents with the current logs, but not how many, or which ones.
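
As a rough illustration of the first two requests, the sketch below turns a raw error string plus the component that produced it into a more specific status line. It is only a sketch under stated assumptions: the `Reason` categories, the matched substrings, and the output format are all hypothetical and do not reflect what elastic-agent status prints. A real implementation would more likely rely on typed errors than on string matching.

```python
from enum import Enum


class Reason(Enum):
    """Hypothetical failure categories for a status line."""
    API_KEY_INVALIDATED = "api key invalidated"
    CERTIFICATE = "certificate problem"
    NETWORK = "network unreachable"
    UNENROLLED = "agent unenrolled"
    UNKNOWN = "unknown error"


# Illustrative substrings only; real error texts may differ.
_PATTERNS = [
    ("api key has been invalidated", Reason.API_KEY_INVALIDATED),
    ("x509", Reason.CERTIFICATE),
    ("certificate", Reason.CERTIFICATE),
    ("connection refused", Reason.NETWORK),
    ("no such host", Reason.NETWORK),
    ("unenrolled", Reason.UNENROLLED),
]


def classify(error_message):
    """Return the first matching Reason, or Reason.UNKNOWN."""
    lowered = error_message.lower()
    for needle, reason in _PATTERNS:
        if needle in lowered:
            return reason
    return Reason.UNKNOWN


def describe_failure(target, error_message):
    """Build a status line naming both the failing target and the reason.

    `target` is a free-form label such as "fleet-server" or "output";
    naming it keeps a Fleet problem distinguishable from an ES problem.
    """
    return f"degraded: {target} ({classify(error_message).value})"
```

For example, `describe_failure("fleet-server", "Authentication using apikey failed - api key has been invalidated")` yields `degraded: fleet-server (api key invalidated)`, which says where the failure occurred as well as what it was.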


@ghost

ghost commented Jul 29, 2022

@kevinlog

@prachigupta-qasource

Sounds good, thank you! Also note that @muskangulati-qasource and @harshitgupta-qasource are testing this functionality in Fleet as part of OLM testing efforts. I think it's a good idea for both teams to be aware of the functionality and have a test plan, but we may also be able to work together and avoid duplicating too much effort.

cc @manishgupta-qasource

@ghost

ghost commented Aug 12, 2022

Hi Team,

We have executed 2 test cases for this feature under our Fleet test run, as part of the Fleet 8.4.0-BC3 feature test plan, and found that it's working fine.

Build details:

Version: 8.4.0-BC3
Build: 55281
Commit: e42c547d7ab545472fd978383c2c43fa203a5b06

Thanks!

@aleksmaus
Member

I pushed the feature-arch-v2 branch to fleet server with an initial cut of propagating the detailed status information:
https://github.com/elastic/fleet-server/tree/feature-arch-v2

The corresponding agent-side PR is posted against the agent branch:
#916

This should cover the agent/fleet-server side of things for:
#100

@joshdover @blake.rouse @ph let me know if there is anything else that needs to be addressed for this feature. We can iterate and update things as needed.

The .fleet-agent document looks like the following at the moment:
Screen Shot 2022-08-15 at 12 53 36 PM

@aleksmaus
Member

The fleet server and elastic agent work is complete for now. The agent posts the new extended health information to the stack.

The work is merged to the fleet server and the elastic agent branches respectively:
fleet server feature-arch-v2
elastic agent feature-arch-v2

We expect to iterate on this feature a few more times before release.
@pierrehilbert let me know if you want to keep this open, reassign it, etc.

10 participants