[Fleet] Show agent as Unhealthy if agent reports an error about incompatible input(s) #76841
Comments
Pinging @elastic/ingest-management (Team:Ingest Management) |
@ruflin I see some complexity and I am not sure how we can resolve it. Based on elastic/beats#21000, the restriction is a limit that an Agent decides and imposes on Fleet. With that proposal, this means that two agents associated with the same Agent policy could have different restrictions. Where could we possibly show that information:
For the Agent policy related views we probably need some caching to be efficient. But @ruflin should that limitation be set from the Agent? cc @mostlyjason |
There can indeed be conflicting input configs. I think we have 2 options here:
I would rather go with the second approach and show a warning that not all Agents support parts of a specific integration. As for which locations should show what, here is a brain dump:
Can a user add an integration for an unsupported input? I think the answer is likely yes, as it will make the UI simpler, but we should show the user a warning that it will only run on a subset of Agents. This feature is also important in case one day we have certain Agents in a policy that only support apm-server, for example, so that not all of them can run a certain input. |
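A minimal sketch of the warning check described above, assuming Fleet already knows which input types each Agent in a policy reports as supported. The `AgentCapabilities` shape, the `supportedInputs` field, and the `inputCoverage` function are hypothetical names used only for illustration; they are not Fleet's actual types or API.

```typescript
// Hypothetical shape for illustration; not Fleet's actual data model.
interface AgentCapabilities {
  agentId: string;
  // Input types this agent reports it is allowed to run (assumed field).
  supportedInputs: string[];
}

type Coverage = 'all' | 'subset' | 'none';

interface CoverageResult {
  coverage: Coverage;
  // Agents in the policy that cannot run the input.
  unsupportedAgentIds: string[];
}

// Decide whether an input would run on all, some, or none of the agents
// in a policy, so the UI can warn without blocking the user.
function inputCoverage(inputType: string, agents: AgentCapabilities[]): CoverageResult {
  const unsupportedAgentIds = agents
    .filter((agent) => agent.supportedInputs.indexOf(inputType) === -1)
    .map((agent) => agent.agentId);

  let coverage: Coverage;
  if (unsupportedAgentIds.length === 0) {
    // Also covers a policy with no agents enrolled yet: nothing to warn about.
    coverage = 'all';
  } else if (unsupportedAgentIds.length === agents.length) {
    coverage = 'none';
  } else {
    coverage = 'subset';
  }
  return { coverage, unsupportedAgentIds };
}
```

With this shape, the UI would show a warning for `'subset'` or `'none'` and still let the user add the integration.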
Use case
Show an unhealthy agent status

An enhancement would be to additionally show a warning message to users when they attempt to add an integration with blocked inputs in Fleet. It would inform them that the integration will only run on a subset of agents, or on no agents. This reduces the steps needed to identify and fix the issue. However, I'm on the fence about whether this should be in the MVP, because it will require extra effort to implement. Also, it's an uncommon scenario to optimize for. On ESS/ECE it will mainly happen if the user adds an integration that doesn't make sense (user error), and for self-managed we don't know how often operators will use this feature. It would require extra work to pass the allowlist from the agent to Fleet and to duplicate the business logic that evaluates it. We could do it later if we see users running into this scenario often.

Health status
For integrations with deployment type "one", we want the coordinator to allocate inputs to eligible agents. The integration would only be unhealthy if the coordinator cannot allocate an input to any agent. One option is to mark all the agents in the policy as unhealthy to communicate the problem. Would Fleet Server send the status for these agents? Alternatively, we could mark the integration policy itself as unhealthy. We don't have any status information on integration policies today, and I'm assuming it'd be more work to add it?

Where we show status info
I agree with @ruflin that the agent details page is a good place to show the blocked inputs. We'll show more detailed status information about each integration and input soon. This is also a convenient location since users will hopefully be able to see inputs disabled due to input conditions here too. I think we also have a message field, or at least a link to logs, so the user can learn why an input is disabled? It'd be nice to show health status on the agent and integration policies, but we could do it as a separate enhancement if we have a good way to show it at the agent level now. |
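The deployment type "one" rule above can be sketched as follows. This is only an illustration under stated assumptions: `PolicyAgent`, `supportedInputs`, and `allocateSingleInput` are hypothetical names, the "first eligible agent wins" choice is arbitrary, and nothing here reflects how the coordinator is actually implemented.

```typescript
// Hypothetical shape for illustration; not the coordinator's real types.
interface PolicyAgent {
  id: string;
  // Input types the agent is allowed to run (assumed field).
  supportedInputs: string[];
}

interface Allocation {
  healthy: boolean;
  agentId?: string; // set when an eligible agent was found
  reason?: string; // set when no agent can run the input
}

// Allocate a deployment type "one" input to a single eligible agent.
// The result is unhealthy only when no agent in the policy can run it.
function allocateSingleInput(inputType: string, agents: PolicyAgent[]): Allocation {
  for (const agent of agents) {
    if (agent.supportedInputs.indexOf(inputType) !== -1) {
      // Arbitrary choice for the sketch: take the first eligible agent.
      return { healthy: true, agentId: agent.id };
    }
  }
  return {
    healthy: false,
    reason: `no agent in the policy can run input "${inputType}"`,
  };
}
```

Whether the unhealthy result is then surfaced on every agent in the policy or on the integration policy itself is the open question raised in the comment.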
@mukeshelastic @ph @ruflin my proposal above is for an MVP solution that shows blocked inputs in the agent status information. I'm suggesting we implement the minimal solution first, and later enhance it. Do you agree? |
I do agree that having the Agent report blocked inputs via the status information is an OK MVP. @ruflin to be clear, what @mostlyjason proposes means that we don't have anything to do in the UI. |
@ph Do we already have everything in place needed in the UI for the above? |
This will also be important for managed policies/agents on cloud.
I don't believe we should expect users to know which integrations are supported for managed policies in which versions, but rather provide better user feedback - as a follow-up to this MVP. |
@ruflin It could be reported as a normal status error, @michalpristas Can you confirm or correct me if it's possible? |
we can definitely make information about filtering an input part of the health status. |
Looking at the Kibana code, it won't allow us to send any additional text with the status. With the upcoming status work, we will refine the status API either way. |
After discussion in our sync, we have decided that the error returned by the Elastic Agent code would be enough for an MVP. |
@ruflin @mostlyjason @hbharding Can we confirm what's included in the I created a quick loom https://www.loom.com/share/54b034d8351545c0b89293928694f840 showing the UI and confirming some assumptions and asking some questions like: Agent list page
Agent details page
Agent overview page
Can you take a look at that video and let me know what you think and if I've missed anything or have something incorrect? |
@jfsiii No, since we don't have input/integration-level status reporting yet, overall Unhealthy status reporting is the interim solution for this. The rest of the list looks accurate. Edit: I would also check that we get some error logging back from the agent that can be viewed in Agent details > Logs. |
Thanks for making the Loom @jfsiii! ++ on what Jen said. Also, I'm surprised anything is needed on the fleet side, because I thought it just reported the status given to it from the agent. I'm hoping this doesn't require some special case in the code. You know the internals better than I do though. |
@mostlyjason We're not sure whether it needs additional work yet or not, so the initial step for this ticket is to double check that this type of error does already bubble up to the UI. |
As you say, Kibana should report the status the agent sends to Fleet server, so I'm confirming that (a) the agent sends an I'm still investigating but the initial debugging seems to indicate at least one of those isn't happening; maybe neither. I ran this in one terminal:
and this was logged in another (I added some logging to a local checkout of
The
I'm still confirming, but we might need to make some updates in one/both fleet-server & elastic-agent /cc @nchaulet & @michalpristas |
While working on this I discovered that a property Kibana uses to determine status ( However, any agents which hit these capability restrictions aren't reflected in the counts on the Agent overview page, because their status is @jen-huang @mostlyjason Should we add the count for agents in a |
@jen-huang ok, cool. I still see it in that PR, but I haven't seen the designs and you have more context about that change than I do. Does that mean we can close this as resolved or should we add tests? If so are they a) Fleet tests which assert that an API response of |
I think #102821 incidentally added the test for this ("Fleet tests which assert that an API response of "status": "degraded" shows an Unhealthy badge"), so closing this. |
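For reference, the behaviour that test asserts can be sketched as a small mapping. Only the `degraded` to Unhealthy pairing comes from this thread; the `error` and `online` cases, the `Unknown` fallback, and the function names are assumptions for illustration, not Kibana's actual code.

```typescript
type Badge = 'Healthy' | 'Unhealthy' | 'Unknown';

// Map an agent status string from the API to the badge shown in the UI.
function badgeForStatus(status: string): Badge {
  switch (status) {
    case 'degraded': // from the thread: "status": "degraded" shows Unhealthy
    case 'error': // assumption: errors are also shown as Unhealthy
      return 'Unhealthy';
    case 'online': // assumption: a normally running agent
      return 'Healthy';
    default:
      return 'Unknown'; // assumption: fallback for statuses not modelled here
  }
}

// Count badges across agents, e.g. for totals on an overview page.
function countBadges(statuses: string[]): Record<Badge, number> {
  const counts: Record<Badge, number> = { Healthy: 0, Unhealthy: 0, Unknown: 0 };
  for (const status of statuses) {
    counts[badgeForStatus(status)] += 1;
  }
  return counts;
}
```

Counting by badge rather than by raw status means agents that hit capability restrictions are included in the Unhealthy total.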
The requirements below from PH should already work with our current Agent/Fleet Server/Kibana workflow. This issue is now a dev task to test and confirm that if the agent reports an error, it checks in to Fleet Server with that error, and Kibana indeed displays the agent as unhealthy.
Originally posted by @ph in #76841 (comment)
After discussion in our sync, we have decided that the error returned by the Elastic Agent code would be enough for an MVP.
We need to test that elastic/beats#23848 correctly reports the error back in the log and changes the status of the Elastic Agent.
Some of the Agents connected to Fleet might only support a subset of the available inputs: elastic/beats#21000. This information is sent up to Fleet. Fleet must be able to use this information to decide what the user can configure, and to show the user a notice if certain inputs are not supported.