
[Fleet] Show agent as Unhealthy if agent reports an error about incompatible input(s) #76841

Closed
ruflin opened this issue Sep 7, 2020 · 22 comments
Labels: Team:Fleet, v7.14.0

ruflin commented Sep 7, 2020

The requirements below from PH should already work with our current Agent/Fleet Server/Kibana workflow. This issue is now a dev task to test and confirm that if the agent reports an error, it checks in with that error to Fleet Server, and Kibana indeed displays the agent as unhealthy.

Originally posted by @ph in #76841 (comment)
After discussion in our sync, we have decided that the error returned by the Elastic Agent code would be enough for an MVP.
We need to test to make sure that elastic/beats#23848 correctly reports the error back in the log and changes the status of the Elastic Agent.

Some of the Agents connected to Fleet might only support a subset of the available inputs: elastic/beats#21000. This information is sent up to Fleet. Fleet must be able to make decisions based on this information about what the user can configure, and show the user a notice if certain inputs are not supported.
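
As a rough illustration of that Fleet-side decision (all names here are hypothetical and only sketch the idea; the actual capability format is defined in elastic/beats#21000):

```ts
// Hypothetical sketch, not Fleet's actual code: compare the inputs a policy
// defines against the input types an agent reports as supported, so the UI
// can warn about anything the agent would block.
interface AgentCapabilities {
  supportedInputs: string[]; // e.g. ["logfile", "system/metrics"]
}

interface PolicyInput {
  type: string;        // e.g. "apm", "endpoint"
  integration: string; // integration the input belongs to
}

function findUnsupportedInputs(
  agent: AgentCapabilities,
  policyInputs: PolicyInput[]
): PolicyInput[] {
  const supported = new Set(agent.supportedInputs);
  return policyInputs.filter((input) => !supported.has(input.type));
}

// Example: an agent restricted to apm would have to block a system/metrics input.
const blocked = findUnsupportedInputs(
  { supportedInputs: ["apm"] },
  [{ type: "system/metrics", integration: "system" }]
);
// blocked.length > 0 → warn the user / expect the agent to report unhealthy
```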

ruflin added the Team:Fleet label on Sep 7, 2020
@elasticmachine

Pinging @elastic/ingest-management (Team:Ingest Management)

ph commented Dec 18, 2020

@ruflin I see some complexity and I am not sure how we can resolve it: based on elastic/beats#21000, the restriction is a limit that an Agent decides and imposes on Fleet. This means that, with that proposal, two agents associated with the same Agent policy could have different restrictions.

Where we could possibly show that information:

  • When editing the Agent policy?
  • On the Agent details page
  • Agent list
  • Agent policy list

For the Agent policy related views we probably need some caching to be efficient.

But @ruflin should that limitation be set from the Agent?

cc @mostlyjason

ruflin commented Dec 21, 2020

There can indeed be conflicting input configs. I think we have 2 options here:

  • Block the full policy from being shipped to all Elastic Agents
  • Ship the full policy to all Agents but show a warning

I would rather go with the second approach and show a warning that not all Agents support parts of a specific integration.

As for where to show what, here is a brain dump:

  • Which inputs an Agent supports: Elastic Agent details page
  • An Agent gets a policy that is not fully compatible: Policy list, Agent list, Agent details

Can a user add an integration for an unsupported input? I think the answer is likely yes, as it will make the UI simpler, but we should show the user a warning that it will only run on a subset of Agents. This feature is also important in case one day we have certain Agents in a policy that only support apm-server, for example, so not all of them can run a certain input.

mostlyjason commented Jan 15, 2021

Use case
As an integration user, I want to know when one of my inputs is blocked on the agent side so I can adjust my integration policy to run on allowed agents.

Show an unhealthy agent status
As an MVP solution, if a user saves a blocked integration to a hosted agent policy, they will see status information saying the agent is unhealthy because the inputs are blocked. To make those agents healthy, they'll have to remove or edit the integration. This satisfies our minimal requirements for the use case. Also, it is the minimal effort solution. We decided in elastic/beats#21000 that this allowlist needs to be enforced on the agent so that we can restrict what inputs run on Elastic Cloud. We already have the ability to pass agent status information back to Fleet, so no extra work is needed on the Fleet side.

An enhancement would be to additionally show a warning message to users when they attempt to add an integration with blocked inputs in Fleet. It would inform them that it will only run on a subset of agents, or on no agents. This reduces the steps to identify and fix the issue. However, I'm on the fence about whether this should be in the MVP because it will require extra effort to implement. Also, it's an uncommon scenario to optimize for. On ESS/ECE it will mainly happen if the user adds an integration that doesn't make sense (user error), and for self-managed we don't know how often operators will use this feature. It would require extra work to pass the allowlist from the agent to Fleet and duplicate business logic to evaluate it. We could do it later if we see users running into this scenario often.

Health status
Another decision is how the allowlist impacts agent health status. I think a service owner would expect a host-based integration they added to run on every host. This is different from input conditions because the service owner will be able to view/change those, but if an operator added a blocklist it may be unexpected. Marking the agent as unhealthy would communicate that there is a mismatch.

For integrations with deployment type "one", we want the coordinator to allocate inputs to eligible agents. It would only be unhealthy if it cannot allocate an input to any agent. One option is to mark all the agents in the policy as unhealthy to communicate the problem. Would Fleet Server send the status for these agents? Alternatively, we could mark the integration policy itself as unhealthy. We don't have any status information on integration policies today, and I'm assuming it'd be more work to add it?

Where we show status info
The agent list page is a place to show blocked integrations, because they will show up with unhealthy status. Users can drill down into agent details to learn why.

I agree with @ruflin that the agent details page is a good place to show the blocked inputs. We'll show more detailed status information about each integration and input soon. This is also a convenient location since users will hopefully be able to see inputs disabled due to input conditions here too. I think we also have a message field or at least a link to logs so the user can learn why an input is disabled?

It'd be nice to show health status on the agent and integration policies, but we could do it as a separate enhancement if we have a good way to show it at the agent level now.

@mostlyjason

@mukeshelastic @ph @ruflin my proposal above is for an MVP solution that shows blocked inputs in the agent status information. I'm suggesting we implement the minimal solution first, and later enhance it. Do you agree?

ph commented Jan 28, 2021

I do agree that reporting blocked inputs via the Agent status information is an OK MVP. @ruflin, to be clear, what @mostlyjason proposes means that we don't have anything to do in the UI.

ruflin commented Jan 29, 2021

@ph Do we already have everything in place needed in the UI for the above?

simitt commented Feb 1, 2021

This will also be important for managed policies/agents on cloud.

On ESS/ECE it will mainly happen if the user adds an integration that doesn't make sense (user error)

I don't believe we should expect users to know which integrations are supported for managed policies in which versions; rather, we should provide better user feedback - as a follow-up to this MVP.

ph commented Feb 1, 2021

@ruflin It could be reported as a normal status error. @michalpristas, can you confirm whether this is possible, or correct me?

@michalpristas

We can definitely make information about filtering an input part of the health status.
The agent will be unhealthy, the error will be that the agent blocked these inputs, and more detail will be in the logs.

@michalpristas

Looking at the Kibana code, it won't allow us to send any additional text with the status.
I will report unhealthy and the details will be in the logs as an MVP.

Then, with the upcoming status work, we will refine the status API either way.

ph commented Feb 8, 2021

After discussion in our sync, we have decided that the error returned by the Elastic Agent code would be enough for an MVP.
We need to test to make sure that elastic/beats#23848 correctly reports the error back in the log and changes the status of the Elastic Agent.

ruflin added the v7.14.0 label and removed the v7.12.0 label on Feb 22, 2021
jen-huang changed the title from "[Fleet] Support allowlist / blocklist of inputs" to "[Fleet] Show agent as Unhealthy if agent reports an error about incompatible input(s)" on Apr 27, 2021
jfsiii commented Jun 1, 2021

@ruflin @mostlyjason @hbharding Can we confirm what's included in "Kibana indeed displays the agent as unhealthy" from the description?

I created a quick Loom https://www.loom.com/share/54b034d8351545c0b89293928694f840 showing the UI, confirming some assumptions, and asking some questions like:

Agent list page

  1. Display Unhealthy badge in table row

Agent details page

  1. Display Unhealthy badge in page header
  2. Should there be any UI changes in the Integrations section? e.g. at the integration and/or input level?

Agent overview page

  1. Affected agents are included in Error count

Can you take a look at that video and let me know what you think and if I've missed anything or have something incorrect?

jen-huang commented Jun 1, 2021

Should there be any UI changes in the Integrations section? e.g. at the integration and/or input level?

@jfsiii No, since we don't have input/integration-level status reporting yet, overall Unhealthy status reporting is the interim solution for this. The rest of the list looks accurate.

Edit: I would also check that we get some error logging back from the agent that can be viewed in Agent details > Logs.

@mostlyjason

Thanks for making the Loom @jfsiii! ++ on what Jen said. Also, I'm surprised anything is needed on the Fleet side, because I thought it just reported the status given to it by the agent. I'm hoping this doesn't require a special case in the code. You know the internals better than I do though.

@jen-huang

@mostlyjason We're not sure whether it needs additional work yet or not, so the initial step for this ticket is to double check that this type of error does already bubble up to the UI.

jfsiii commented Jun 2, 2021

As you say, Kibana should report the status the agent sends to Fleet Server, so I'm confirming that (a) the agent sends an error status to Fleet Server and (b) Fleet Server records that status in ES.

I'm still investigating but the initial debugging seems to indicate at least one of those isn't happening; maybe neither.

I ran this in one terminal:

sudo ./elastic-agent install -f --url=http://localhost:8220 --enrollment-token=dTFpMXgza0JaRUZyLTNDdGg4cTg6X3FtcnBhZkVTTXFsLXFYZXJhSFMzZw== --insecure

and this was logged in another terminal (I added some logging to a local checkout of fleet-server and rebuilt with make release):

{"log.level":"debug","url.full":"/api/fleet/agents/enroll?","http.version":"1.1","http.request.method":"POST","http.response.status_code":200,"http.request.body.bytes":1309,"http.response.body.bytes":1646,"client.address":"127.0.0.1:59867","client.ip":"127.0.0.1","client.port":59867,"tls.established":false,"event.duration":637783498,"@timestamp":"2021-06-01T20:27:07.252Z","message":"HTTP handler"}
{"log.level":"debug","index":".fleet-actions","ctx":"index monitor","index":".fleet-actions","@timestamp":"2021-06-01T20:27:07.411Z","message":"index not found"}
{"log.level":"debug","@timestamp":"2021-06-01T20:27:08.742Z","message":"JFSIII CHECKIN GOT BODY: {{\"status\":\"online\",\"events\":[],\"local_metadata\":{\"elastic\":{\"agent\":{\"id\":\"0656e231-7c02-4c9f-b0fc-5ad545c7a08b\",\"version\":\"8.0.0\",\"snapshot\":true,\"build.original\":\"8.0.0-SNAPSHOT (build: 2ee21d95aef89af7f7e7aef8d07f679a24d690b4 at 2021-05-27 16:09:41 +0000 UTC)\",\"upgradeable\":true,\"log_level\":\"info\"}},\"host\":{\"architecture\":\"x86_64\",\"hostname\":\"JFSIII.local\",\"name\":\"JFSIII.local\",\"id\":\"209252E1-587B-5756-ADBC-E72BF11A8C98\",\"ip\":[\"127.0.0.1/8\",\"::1/128\",\"fe80::1/64\",\"fe80::aede:48ff:fe00:1122/64\",\"fe80::1001:6878:9d12:ae80/64\",\"2601:155:8300:b360:e4:ffc8:ee04:99c8/64\",\"2601:155:8300:b360:c8f6:6246:450:2b90/64\",\"2601:155:8300:b360::dc93/64\",\"10.0.0.183/24\",\"2601:155:8300:b360:1bc:b0be:3913:e825/64\",\"2601:155:8300:b360:7848:e011:8f6:87e2/64\",\"2601:155:8300:b360:2869:6e62:4e68:188f/64\",\"2601:155:8300:b360:4c80:b1e1:ff66:72ae/64\",\"fe80::ec4d:c7ff:fea0:d969/64\",\"fe80::ec4d:c7ff:fea0:d969/64\",\"fe80::bdf8:8015:2827:6af2/64\",\"fe80::225d:ea4:8e7f:cf83/64\"],\"mac\":[\"3a:f9:d3:a6:29:52\",\"ac:de:48:00:11:22\",\"38:f9:d3:a6:29:52\",\"82:1d:9f:e5:90:05\",\"82:1d:9f:e5:90:04\",\"82:1d:9f:e5:90:01\",\"82:1d:9f:e5:90:00\",\"82:1d:9f:e5:90:01\",\"ee:4d:c7:a0:d9:69\",\"ee:4d:c7:a0:d9:69\"]},\"os\":{\"family\":\"darwin\",\"kernel\":\"20.4.0\",\"platform\":\"darwin\",\"version\":\"10.16\",\"name\":\"Mac OS X\",\"full\":\"Mac OS X(10.16)\"}}}}"}
{"log.level":"info","error.message":"EOF","id":"0656e231-7c02-4c9f-b0fc-5ad545c7a08b","http.response.status_code":400,"http.request.id":"","event.duration":252339782,"@timestamp":"2021-06-01T20:27:08.994Z","message":"fail checkin"}
{"log.level":"debug","url.full":"/api/fleet/agents/0656e231-7c02-4c9f-b0fc-5ad545c7a08b/checkin?","http.version":"1.1","http.request.method":"POST","http.response.status_code":400,"http.request.body.bytes":1296,"http.response.body.bytes":39,"client.address":"[::1]:59880","client.ip":"::1","client.port":59880,"tls.established":false,"event.duration":252375883,"@timestamp":"2021-06-01T20:27:08.994Z","message":"HTTP handler"}

The JFSIII CHECKIN GOT BODY log line is in the handler for the checkin route and shows a payload with "status":"online", which I believe is incorrect.

The checkin endpoint also responds with a 400 status code, which seems to be ignored by the install command.

I'm still confirming, but we might need to make some updates in one or both of fleet-server and elastic-agent.
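
For contrast, if the agent did propagate the blocked-input problem, the same check-in payload would be expected to carry a non-online status. A rough sketch (reusing the payload shape from the log above, with no claim about additional fields):

```ts
// Sketch only: the shape of a check-in body that actually reflects the blocked
// inputs, based on the payload logged above. "degraded" is the status value
// Kibana later maps to Unhealthy; whether additional error text can travel
// with the status is still open per the earlier comments.
const expectedCheckinBody = {
  status: "degraded", // instead of "online"
  events: [],
  local_metadata: {
    // ...same elastic/agent, host, and os metadata as in the logged body
  },
};
```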

/cc @nchaulet & @michalpristas

jfsiii commented Jun 3, 2021

While working on this I discovered that a property Kibana uses to determine status (last_checkin_status in the .fleet-agents index) was missing. There's a PR to restore it. When I ran that locally, without changing anything in Kibana, the two "Display Unhealthy badge" items were resolved.

Agent list page (screenshot)

Agent details page (screenshot)

However, any agents which hit these capability restrictions aren't reflected in the counts on the Agent overview page, because their status is degraded, not error.
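
A minimal sketch of that distinction (hypothetical helper names, not the actual Fleet UI code):

```ts
// Illustration only: how the statuses observed above map to what the UI shows.
type AgentStatus = "online" | "degraded" | "error" | "offline";

// A degraded agent (blocked inputs) renders with the Unhealthy badge on the
// Agent list and Agent details pages (an error status is assumed to as well).
function showsUnhealthyBadge(status: AgentStatus): boolean {
  return status === "degraded" || status === "error";
}

// The Agent overview "Error" count only includes agents in the error state,
// so degraded agents are not reflected there.
function countsTowardErrorTotal(status: AgentStatus): boolean {
  return status === "error";
}
```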

Agent overview page (screenshot)

@jen-huang @mostlyjason Should we add the count for agents in a degraded state or leave it as-is?

@jen-huang

@jfsiii I would leave that as-is because the Overview page is going away with the move to top-level Integrations UI (#99848).

jfsiii commented Jun 4, 2021

@jen-huang ok, cool. I still see it in that PR, but I haven't seen the designs and you have more context about that change than I do.

Does that mean we can close this as resolved, or should we add tests? If so, are they:

a) Fleet tests which assert that an API response of "status": "degraded" shows an Unhealthy badge (see the sketch after this list)
b) E2E tests(?) which have a capabilities.yml file with a deny rule, run the install or enroll command, and check for certain side effects
c) something else?
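
Option (a) could look roughly like the jest-style sketch below; the helper is hypothetical and stands in for the real Fleet component that maps API status to the badge:

```ts
// Hypothetical jest-style sketch of option (a); not the actual Fleet test.
type AgentStatus = "online" | "degraded" | "error" | "offline";

// Stand-in for the UI logic that turns an API status into a badge label.
function healthBadgeLabel(status: AgentStatus): "Healthy" | "Unhealthy" | "Offline" {
  if (status === "offline") return "Offline";
  return status === "online" ? "Healthy" : "Unhealthy";
}

describe("agent health badge", () => {
  it('shows Unhealthy when the API reports "status": "degraded"', () => {
    expect(healthBadgeLabel("degraded")).toBe("Unhealthy");
  });
});
```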

@jen-huang

@jfsiii Let's do (a) to close out this issue. For (b), it would be great to check whether we already have similar tests in the e2e suite and file an issue if not.

@jen-huang

I think #102821 incidentally added the test for this ("Fleet tests which assert that an API response of "status": "degraded" shows an Unhealthy badge"), so closing this.
