Introduced failureThreshold causes agents to become unhealthy #42672
Comments
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
cc @pchila
Referenced code: beats/metricbeat/mb/module/wrapper.go, lines 254 to 269 at 1f033c9
As @Alphayeeeet mentions, the errors can make the datastream unhealthy, but this is not due to the introduction of failureThreshold.
The first one with the context cancelled error was a bug in the status reporting that was fixed in #42435.
This is something that should be fixed in our prometheus input, not at the Elastic Agent level. If this can happen as part of normal operation, it probably shouldn't mark the input as degraded. If we were to do something generic to help with this problem of noisy status changes, it would probably be adding the ability to mute or ignore particular error types. That is probably best done from the Beats, as they are the source of most of these errors.
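As a rough illustration of that idea (purely hypothetical, this is not an existing Beats or Elastic Agent API), an input could check fetch errors against a mute list before reporting a degraded state. The error values and the `setDegraded` callback below are assumptions made for the sketch:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"syscall"
)

// mutedErrors lists error conditions that are considered part of normal
// operation (for example, scraping a pod that has already terminated) and
// therefore should not degrade the unit's reported health.
var mutedErrors = []error{
	context.Canceled,
	syscall.ECONNREFUSED,
}

// reportFetchError decides whether a fetch error should change the unit state.
// setDegraded is a stand-in for whatever status-reporting callback the input
// uses; it is not the real Beats API.
func reportFetchError(err error, setDegraded func(msg string)) {
	for _, muted := range mutedErrors {
		if errors.Is(err, muted) {
			// Log the expected error but keep the unit HEALTHY.
			fmt.Printf("ignoring expected error: %v\n", err)
			return
		}
	}
	setDegraded(err.Error())
}

func main() {
	degrade := func(msg string) { fmt.Println("DEGRADED:", msg) }

	// Muted: connection refused while scraping a pod that already completed.
	reportFetchError(fmt.Errorf("error making http request: %w", syscall.ECONNREFUSED), degrade)

	// Not muted: anything else still degrades the unit.
	reportFetchError(errors.New("unable to decode response from prometheus endpoint"), degrade)
}
```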
Due to the introduced failureThreshold (#41570 & elastic/elastic-agent#5999), our agents now become unhealthy due to runtime errors in Kubernetes environments.
Expected runtime errors include the following:
[elastic_agent][warn] Unit state changed filestream-default-filestream-container-logs-3b56171a-c46b-475e-9a80-02708c67ce0c-kubernetes-755830fc-44b9-4b02-9333-bb2653b9ae47.podxyz (HEALTHY->DEGRADED): error while reading from source: context cancel
This might be caused by Kubernetes removing log files of containers that don't necessarily exist anymore.
[elastic_agent][warn] Unit state changed prometheus/metrics-default-prometheus/metrics-prometheus-cfa93471-3f87-4d04-babb-ef2a62a85cd4-kubernetes-3009cd60-7f0c-44d8-8b85-4257683395ef (HEALTHY->DEGRADED): Error fetching data for metricset prometheus.collector: unable to decode response from prometheus endpoint: error making http request: Get "http://255.255.255.255:8888/metrics": dial tcp 255.255.255.255:8888: connect: connection refused
This can be caused by the Prometheus integration trying to scrape completed jobs, and therefore completed pods (see elastic/elastic-agent#6154 for reference).
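For context on the mechanism being blamed here, below is a minimal sketch of how a consecutive-failure threshold of the kind introduced in #41570 could turn these expected, transient errors into a DEGRADED unit state. It is not the actual wrapper.go implementation; the names (fetchTracker, recordFetch, markHealthy, markDegraded) and the threshold value of 1 are assumptions for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// fetchTracker counts consecutive fetch failures and only reports a degraded
// state once a configurable threshold is reached. A single successful fetch
// resets the counter.
type fetchTracker struct {
	failureThreshold    uint // 1 means the first error already degrades the unit
	consecutiveFailures uint
}

func (t *fetchTracker) recordFetch(err error, markHealthy, markDegraded func(string)) {
	if err == nil {
		t.consecutiveFailures = 0
		markHealthy("fetch succeeded")
		return
	}
	t.consecutiveFailures++
	if t.failureThreshold > 0 && t.consecutiveFailures >= t.failureThreshold {
		markDegraded(fmt.Sprintf("Error fetching data: %v", err))
	}
}

func main() {
	healthy := func(msg string) { fmt.Println("HEALTHY:", msg) }
	degraded := func(msg string) { fmt.Println("DEGRADED:", msg) }

	// With a threshold of 1 (the behaviour this issue reports), one transient
	// "connection refused" against a completed pod is enough to flip the unit
	// to DEGRADED.
	strict := &fetchTracker{failureThreshold: 1}
	strict.recordFetch(errors.New("dial tcp: connect: connection refused"), healthy, degraded)

	// A higher threshold tolerates short-lived errors, such as a pod finishing
	// between scrapes.
	tolerant := &fetchTracker{failureThreshold: 3}
	tolerant.recordFetch(errors.New("connection refused"), healthy, degraded) // no state change yet
	tolerant.recordFetch(nil, healthy, degraded)                              // counter resets
}
```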