
[Fleet] Show agent as Unhealthy if agent reports an error about incompatible input(s) #76841

Closed
ruflin opened this issue Sep 7, 2020 · 22 comments
Labels: Team:Fleet, v7.14.0

ruflin commented Sep 7, 2020

The requirements below from PH should already work with our current Agent/Fleet Server/Kibana workflow. This issue is now a dev task to test and confirm that if the agent reports an error, it checks in with that error to Fleet Server, and Kibana indeed displays the agent as unhealthy.

Originally posted by @ph in #76841 (comment)
After discussion in our sync, we have decided that the error returned by the Elastic Agent code would be enough for an MVP.
We need to test to make sure that elastic/beats#23848 correctly reports the error back in the log and changes the status of the Elastic Agent.

Some of the Agents connected to Fleet might only support a subset of the available inputs: elastic/beats#21000. This information is sent up to Fleet. Fleet must be able to make decisions based on this information about what the user can configure, and show the user a notice if certain inputs are not supported.
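
As a rough illustration of that Fleet-side decision (all names here are hypothetical and only sketch the idea; the actual capability format is defined in elastic/beats#21000):

```ts
// Hypothetical sketch, not Fleet's actual code: compare the inputs a policy
// defines against the input types an agent reports as supported, so the UI
// can warn about anything the agent would block.
interface AgentCapabilities {
  supportedInputs: string[]; // e.g. ["logfile", "system/metrics"]
}

interface PolicyInput {
  type: string;        // e.g. "apm", "endpoint"
  integration: string; // integration the input belongs to
}

function findUnsupportedInputs(
  agent: AgentCapabilities,
  policyInputs: PolicyInput[]
): PolicyInput[] {
  const supported = new Set(agent.supportedInputs);
  return policyInputs.filter((input) => !supported.has(input.type));
}

// Example: an agent restricted to apm would have to block a system/metrics input.
const blocked = findUnsupportedInputs(
  { supportedInputs: ["apm"] },
  [{ type: "system/metrics", integration: "system" }]
);
// blocked.length > 0 → warn the user / expect the agent to report unhealthy
```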

ruflin added the Team:Fleet label on Sep 7, 2020
@elasticmachine

Pinging @elastic/ingest-management (Team:Ingest Management)

ph commented Dec 18, 2020

@ruflin I see some complexity and I am not sure how we can resolve it: based on elastic/beats#21000, the restriction is a limit that an Agent decides and imposes on Fleet. This means that, with that proposal, two agents associated with the same Agent policy could have different restrictions.

Where we could possibly show that information:

  • When editing the Agent policy?
  • On the Agent details page
  • Agent list
  • Agent policy list

For the Agent policy related views we probably need some caching to be efficient.

But @ruflin should that limitation be set from the Agent?

cc @mostlyjason

ruflin commented Dec 21, 2020

There can indeed be conflicting input configs. I think we have 2 options here:

  • Block the full policy from being shipped to all Elastic Agents
  • Ship the full policy to all Agents but show a warning

I would rather go with the second approach and show a warning that not all Agents support parts of a specific integration.

As for where to show what, here is a brain dump:

  • Which inputs an Agent supports: Elastic Agent details page
  • An Agent gets a policy that is not fully compatible: Policy list, Agent list, Agent details

Can a user add an integration for an unsupported input? I think the answer is likely yes, as it will make the UI simpler, but we should show the user a warning that it will only run on a subset of Agents. This feature is also important in case one day we have certain Agents in a policy that only support apm-server, for example, so not all of them can run a certain input.

mostlyjason commented Jan 15, 2021

Use case
As an integration user, I want to know when one of my inputs is blocked on the agent side so I can adjust my integration policy to run on allowed agents.

Show an unhealthy agent status
As an MVP solution, if a user saves a blocked integration to a hosted agent policy, they will see status information saying the agent is unhealthy because the inputs are blocked. To make those agents healthy, they'll have to remove or edit the integration. This satisfies our minimal requirements for the use case. Also, it is the minimal effort solution. We decided in elastic/beats#21000 that this allowlist needs to be enforced on the agent so that we can restrict what inputs run on Elastic Cloud. We already have the ability to pass agent status information back to Fleet, so no extra work is needed on the Fleet side.

An enhancement would be to additionally show a warning message to users when they attempt to add an integration with blocked inputs in Fleet. It would inform them that it will only run on a subset of agents, or on no agents. This reduces the steps to identify and fix the issue. However, I'm on the fence about whether this should be in the MVP because it will require extra effort to implement. Also, it's an uncommon scenario to optimize for. On ESS/ECE it will mainly happen if the user adds an integration that doesn't make sense (user error), and for self-managed we don't know how often operators will use this feature. It would require extra work to pass the allowlist from the agent to Fleet and duplicate business logic to evaluate it. We could do it later if we see users running into this scenario often.

Health status
Another decision is how the allowlist impacts agent health status. I think a service owner would expect a host-based integration they added to run on every host. This is different from input conditions because the service owner will be able to view/change those, but if an operator added a blocklist it may be unexpected. Marking the agent as unhealthy would communicate that there is a mismatch.

For integrations with deployment type "one", we want the coordinator to allocate inputs to eligible agents. It would only be unhealthy if it cannot allocate an input to any agent. One option is to mark all the agents in the policy as unhealthy to communicate the problem. Would Fleet Server send the status for these agents? Alternatively, we could mark the integration policy itself as unhealthy. We don't have any status information on integration policies today, and I'm assuming it'd be more work to add it?

Where we show status info
The agent list page is a place to show blocked integrations, because they will show up with unhealthy status. Users can drill down into agent details to learn why.

I agree with @ruflin that the agent details page is a good place to show the blocked inputs. We'll show more detailed status information about each integration and input soon. This is also a convenient location since users will hopefully be able to see inputs disabled due to input conditions here too. I think we also have a message field or at least a link to logs so the user can learn why an input is disabled?

It'd be nice to show health status on the agent and integration policies, but we could do it as a separate enhancement if we have a good way to show it at the agent level now.

@mostlyjason

@mukeshelastic @ph @ruflin my proposal above is for an MVP solution that shows blocked inputs in the agent status information. I'm suggesting we implement the minimal solution first, and later enhance it. Do you agree?

ph commented Jan 28, 2021

I do agree that reporting blocked inputs via the Agent status information is an OK MVP. @ruflin, to be clear, what @mostlyjason proposes means that we don't have anything to do in the UI.

ruflin commented Jan 29, 2021

@ph Do we already have everything in place needed in the UI for the above?

simitt commented Feb 1, 2021

This will also be important for managed policies/agents on cloud.

On ESS/ECE it will mainly happen if the user adds an integration that doesn't make sense (user error)

I don't believe we should expect users to know which integrations are supported for managed policies in which versions; rather, we should provide better user feedback - as a follow-up to this MVP.

ph commented Feb 1, 2021

@ruflin It could be reported as a normal status error. @michalpristas, can you confirm whether this is possible, or correct me?

@michalpristas

We can definitely make information about filtering an input part of the health status.
The agent will be unhealthy, the error will be that the agent blocked these inputs, and more detail will be in the logs.

@michalpristas

Looking at the Kibana code, it won't allow us to send any additional text with the status.
I will report unhealthy and the details will be in the logs as an MVP.

Then, with the upcoming status work, we will refine the status API either way.

ph commented Feb 8, 2021

After discussion in our sync, we have decided that the error returned by the Elastic Agent code would be enough for an MVP.
We need to test to make sure that elastic/beats#23848 correctly reports the error back in the log and changes the status of the Elastic Agent.

ruflin added the v7.14.0 label and removed the v7.12.0 label on Feb 22, 2021
jen-huang changed the title from "[Fleet] Support allowlist / blocklist of inputs" to "[Fleet] Show agent as Unhealthy if agent reports an error about incompatible input(s)" on Apr 27, 2021
jfsiii commented Jun 1, 2021

@ruflin @mostlyjason @hbharding Can we confirm what's included in "Kibana indeed displays the agent as unhealthy" from the description?

I created a quick Loom https://www.loom.com/share/54b034d8351545c0b89293928694f840 showing the UI, confirming some assumptions, and asking some questions like:

Agent list page

  1. Display Unhealthy badge in table row

Agent details page

  1. Display Unhealthy badge in page header
  2. Should there be any UI changes in the Integrations section? e.g. at the integration and/or input level?

Agent overview page

  1. Affected agents are included in Error count

Can you take a look at that video and let me know what you think and if I've missed anything or have something incorrect?

jen-huang commented Jun 1, 2021

Should there be any UI changes in the Integrations section? e.g. at the integration and/or input level?

@jfsiii No, since we don't have input/integration-level status reporting yet, overall Unhealthy status reporting is the interim solution for this. The rest of the list looks accurate.

Edit: I would also check that we get some error logging back from the agent that can be viewed in Agent details > Logs.

@mostlyjason

Thanks for making the Loom @jfsiii! ++ on what Jen said. Also, I'm surprised anything is needed on the Fleet side, because I thought it just reported the status given to it by the agent. I'm hoping this doesn't require a special case in the code. You know the internals better than I do though.

@jen-huang

@mostlyjason We're not sure whether it needs additional work yet or not, so the initial step for this ticket is to double check that this type of error does already bubble up to the UI.

jfsiii commented Jun 2, 2021

As you say, Kibana should report the status the agent sends to Fleet Server, so I'm confirming that (a) the agent sends an error status to Fleet Server and (b) Fleet Server records that status in ES.

I'm still investigating but the initial debugging seems to indicate at least one of those isn't happening; maybe neither.

I ran this in one terminal:

sudo ./elastic-agent install -f --url=http://localhost:8220 --enrollment-token=dTFpMXgza0JaRUZyLTNDdGg4cTg6X3FtcnBhZkVTTXFsLXFYZXJhSFMzZw== --insecure

and this was logged in another terminal (I added some logging to a local checkout of fleet-server and rebuilt with make release):

{"log.level":"debug","url.full":"/api/fleet/agents/enroll?","http.version":"1.1","http.request.method":"POST","http.response.status_code":200,"http.request.body.bytes":1309,"http.response.body.bytes":1646,"client.address":"127.0.0.1:59867","client.ip":"127.0.0.1","client.port":59867,"tls.established":false,"event.duration":637783498,"@timestamp":"2021-06-01T20:27:07.252Z","message":"HTTP handler"}
{"log.level":"debug","index":".fleet-actions","ctx":"index monitor","index":".fleet-actions","@timestamp":"2021-06-01T20:27:07.411Z","message":"index not found"}
{"log.level":"debug","@timestamp":"2021-06-01T20:27:08.742Z","message":"JFSIII CHECKIN GOT BODY: {{\"status\":\"online\",\"events\":[],\"local_metadata\":{\"elastic\":{\"agent\":{\"id\":\"0656e231-7c02-4c9f-b0fc-5ad545c7a08b\",\"version\":\"8.0.0\",\"snapshot\":true,\"build.original\":\"8.0.0-SNAPSHOT (build: 2ee21d95aef89af7f7e7aef8d07f679a24d690b4 at 2021-05-27 16:09:41 +0000 UTC)\",\"upgradeable\":true,\"log_level\":\"info\"}},\"host\":{\"architecture\":\"x86_64\",\"hostname\":\"JFSIII.local\",\"name\":\"JFSIII.local\",\"id\":\"209252E1-587B-5756-ADBC-E72BF11A8C98\",\"ip\":[\"127.0.0.1/8\",\"::1/128\",\"fe80::1/64\",\"fe80::aede:48ff:fe00:1122/64\",\"fe80::1001:6878:9d12:ae80/64\",\"2601:155:8300:b360:e4:ffc8:ee04:99c8/64\",\"2601:155:8300:b360:c8f6:6246:450:2b90/64\",\"2601:155:8300:b360::dc93/64\",\"10.0.0.183/24\",\"2601:155:8300:b360:1bc:b0be:3913:e825/64\",\"2601:155:8300:b360:7848:e011:8f6:87e2/64\",\"2601:155:8300:b360:2869:6e62:4e68:188f/64\",\"2601:155:8300:b360:4c80:b1e1:ff66:72ae/64\",\"fe80::ec4d:c7ff:fea0:d969/64\",\"fe80::ec4d:c7ff:fea0:d969/64\",\"fe80::bdf8:8015:2827:6af2/64\",\"fe80::225d:ea4:8e7f:cf83/64\"],\"mac\":[\"3a:f9:d3:a6:29:52\",\"ac:de:48:00:11:22\",\"38:f9:d3:a6:29:52\",\"82:1d:9f:e5:90:05\",\"82:1d:9f:e5:90:04\",\"82:1d:9f:e5:90:01\",\"82:1d:9f:e5:90:00\",\"82:1d:9f:e5:90:01\",\"ee:4d:c7:a0:d9:69\",\"ee:4d:c7:a0:d9:69\"]},\"os\":{\"family\":\"darwin\",\"kernel\":\"20.4.0\",\"platform\":\"darwin\",\"version\":\"10.16\",\"name\":\"Mac OS X\",\"full\":\"Mac OS X(10.16)\"}}}}"}
{"log.level":"info","error.message":"EOF","id":"0656e231-7c02-4c9f-b0fc-5ad545c7a08b","http.response.status_code":400,"http.request.id":"","event.duration":252339782,"@timestamp":"2021-06-01T20:27:08.994Z","message":"fail checkin"}
{"log.level":"debug","url.full":"/api/fleet/agents/0656e231-7c02-4c9f-b0fc-5ad545c7a08b/checkin?","http.version":"1.1","http.request.method":"POST","http.response.status_code":400,"http.request.body.bytes":1296,"http.response.body.bytes":39,"client.address":"[::1]:59880","client.ip":"::1","client.port":59880,"tls.established":false,"event.duration":252375883,"@timestamp":"2021-06-01T20:27:08.994Z","message":"HTTP handler"}

The JFSIII CHECKIN GOT BODY log line is in the handler for the checkin route and shows a payload with "status":"online", which I believe is incorrect.

The checkin endpoint also responds with a 400 status code, which seems to be ignored by the install command.

I'm still confirming, but we might need to make some updates in one or both of fleet-server and elastic-agent.
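
For contrast, if the agent did propagate the blocked-input problem, the same check-in payload would be expected to carry a non-online status. A rough sketch (reusing the payload shape from the log above, with no claim about additional fields):

```ts
// Sketch only: the shape of a check-in body that actually reflects the blocked
// inputs, based on the payload logged above. "degraded" is the status value
// Kibana later maps to Unhealthy; whether additional error text can travel
// with the status is still open per the earlier comments.
const expectedCheckinBody = {
  status: "degraded", // instead of "online"
  events: [],
  local_metadata: {
    // ...same elastic/agent, host, and os metadata as in the logged body
  },
};
```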

/cc @nchaulet & @michalpristas

jfsiii commented Jun 3, 2021

While working on this I discovered that a property Kibana uses to determine status (last_checkin_status in the .fleet-agents index) was missing. There's a PR to restore it. When I ran that locally, without changing anything in Kibana, the two "Display Unhealthy badge" items were resolved.

Agent list page (screenshot)

Agent details page (screenshot)

However, any agents which hit these capability restrictions aren't reflected in the counts on the Agent overview page, because their status is degraded, not error.
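
A minimal sketch of that distinction (hypothetical helper names, not the actual Fleet UI code):

```ts
// Illustration only: how the statuses observed above map to what the UI shows.
type AgentStatus = "online" | "degraded" | "error" | "offline";

// A degraded agent (blocked inputs) renders with the Unhealthy badge on the
// Agent list and Agent details pages (an error status is assumed to as well).
function showsUnhealthyBadge(status: AgentStatus): boolean {
  return status === "degraded" || status === "error";
}

// The Agent overview "Error" count only includes agents in the error state,
// so degraded agents are not reflected there.
function countsTowardErrorTotal(status: AgentStatus): boolean {
  return status === "error";
}
```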

Agent overview page (screenshot)

@jen-huang @mostlyjason Should we add the count for agents in a degraded state or leave it as-is?

@jen-huang

@jfsiii I would leave that as-is because the Overview page is going away with the move to top-level Integrations UI (#99848).

jfsiii commented Jun 4, 2021

@jen-huang ok, cool. I still see it in that PR, but I haven't seen the designs and you have more context about that change than I do.

Does that mean we can close this as resolved, or should we add tests? If so, are they:

a) Fleet tests which assert that an API response of "status": "degraded" shows an Unhealthy badge (see the sketch after this list)
b) E2E tests(?) which have a capabilities.yml file with a deny rule, run the install or enroll command, and check for certain side effects
c) something else?
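
Option (a) could look roughly like the jest-style sketch below; the helper is hypothetical and stands in for the real Fleet component that maps API status to the badge:

```ts
// Hypothetical jest-style sketch of option (a); not the actual Fleet test.
type AgentStatus = "online" | "degraded" | "error" | "offline";

// Stand-in for the UI logic that turns an API status into a badge label.
function healthBadgeLabel(status: AgentStatus): "Healthy" | "Unhealthy" | "Offline" {
  if (status === "offline") return "Offline";
  return status === "online" ? "Healthy" : "Unhealthy";
}

describe("agent health badge", () => {
  it('shows Unhealthy when the API reports "status": "degraded"', () => {
    expect(healthBadgeLabel("degraded")).toBe("Unhealthy");
  });
});
```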

@jen-huang

@jfsiii Let's do (a) to close out this issue. For (b), it would be great to check whether we already have similar tests in the e2e suite and file an issue if not.

@jen-huang

I think #102821 incidentally added the test for this ("Fleet tests which assert that an API response of "status": "degraded" shows an Unhealthy badge"), so closing this.
