
Introduced failureThreshold causes agents to become unhealthy #42672

Open
Alphayeeeet opened this issue Feb 11, 2025 · 4 comments
Labels
bug, Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Comments

@Alphayeeeet

Due to the newly introduced failureThreshold (#41570 & elastic/elastic-agent#5999), our agents now become unhealthy due to runtime errors in Kubernetes environments.

Expected runtime errors include the following:

[elastic_agent][warn] Unit state changed filestream-default-filestream-container-logs-3b56171a-c46b-475e-9a80-02708c67ce0c-kubernetes-755830fc-44b9-4b02-9333-bb2653b9ae47.podxyz (HEALTHY->DEGRADED): error while reading from source: context cancel

This might be caused by Kubernetes removing log files of containers that don't necessarily exist anymore.

[elastic_agent][warn] Unit state changed prometheus/metrics-default-prometheus/metrics-prometheus-cfa93471-3f87-4d04-babb-ef2a62a85cd4-kubernetes-3009cd60-7f0c-44d8-8b85-4257683395ef (HEALTHY->DEGRADED): Error fetching data for metricset prometheus.collector: unable to decode response from prometheus endpoint: error making http request: Get "http://255.255.255.255:8888/metrics": dial tcp 255.255.255.255:8888: connect: connection refused

This can be caused by the Prometheus integration trying to scrape completed jobs, and therefore completed pods (see elastic/elastic-agent#6154 for reference).

For confirmed bugs, please report:

  • Version: 8.16.X-8.17.X
  • Operating System: RHEL 8.9
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Feb 11, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pierrehilbert
Collaborator

cc @pchila

@pchila
Member

pchila commented Feb 11, 2025

The failureThreshold introduced in #41570 and used in elastic/elastic-agent#5332 applies only to elastic-agent monitoring inputs.
The mechanism for marking the stream as DEGRADED when encountering errors while fetching metrics was already implemented in:

case mb.ReportingMetricSetV2Error:
	reporter.StartFetchTimer()
	err := fetcher.Fetch(reporter.V2())
	if err != nil {
		reporter.V2().Error(err)
		if errors.As(err, &mb.PartialMetricsError{}) {
			// mark module as running if metrics are partially available and display the error message
			msw.module.UpdateStatus(status.Running, fmt.Sprintf("Error fetching data for metricset %s.%s: %v", msw.module.Name(), msw.MetricSet.Name(), err))
		} else {
			// mark it as degraded for any other issue encountered
			msw.module.UpdateStatus(status.Degraded, fmt.Sprintf("Error fetching data for metricset %s.%s: %v", msw.module.Name(), msw.MetricSet.Name(), err))
		}
		logp.Err("Error fetching data for metricset %s.%s: %s", msw.module.Name(), msw.Name(), err)
	} else {
		msw.module.UpdateStatus(status.Running, "")
	}

The failureThreshold makes it possible to wait for a given number of consecutive errors before marking the stream as degraded, but it defaults to 1 to keep the existing behaviour introduced in PR #40400; see the snippet https://github.com/elastic/beats/pull/40400/files#diff-b51accf349a564390dca49aaee107c0a4c2c89cdd480fe5880121205598a243bR258
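For illustration, the gating behaviour can be sketched as a simple consecutive-failure counter. This is a simplified sketch, not the actual Beats implementation; streamHealth and its fields are illustrative names only:

// Sketch of the consecutive-failure gating idea: a stream is only reported
// as Degraded once failureThreshold consecutive fetch errors have occurred;
// any successful fetch resets the counter. With the default threshold of 1,
// the first error already degrades the stream, matching the earlier behaviour.
type streamHealth struct {
	failureThreshold    int // defaults to 1
	consecutiveFailures int
}

func (s *streamHealth) onFetchResult(err error) (degraded bool) {
	if err == nil {
		s.consecutiveFailures = 0
		return false
	}
	s.consecutiveFailures++
	return s.consecutiveFailures >= s.failureThreshold
}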

Expected runtime errors include the following:

[elastic_agent][warn] Unit state changed filestream-default-filestream-container-logs-3b56171a-c46b-475e-9a80-02708c67ce0c-kubernetes-755830fc-44b9-4b02-9333-bb2653b9ae47.podxyz (HEALTHY->DEGRADED): error while reading from source: context cancel

This might be caused by Kubernetes removing log files of containers that don't necessarily exist anymore.

[elastic_agent][warn] Unit state changed prometheus/metrics-default-prometheus/metrics-prometheus-cfa93471-3f87-4d04-babb-ef2a62a85cd4-kubernetes-3009cd60-7f0c-44d8-8b85-4257683395ef (HEALTHY->DEGRADED): Error fetching data for metricset prometheus.collector: unable to decode response from prometheus endpoint: error making http request: Get "http://255.255.255.255:8888/metrics": dial tcp 255.255.255.255:8888: connect: connection refused

This can be caused by the Prometheus integration trying to scrape completed jobs, and therefore completed pods (see #6154 for reference).

As @Alphayeeeet mentions, these errors can make the data stream unhealthy, but this is not due to the introduction of failureThreshold; rather, it is caused by PR #40400, which changes the state to DEGRADED because of the connection error Get "http://255.255.255.255:8888/metrics": dial tcp 255.255.255.255:8888: connect: connection refused

@cmacknz
Member

cmacknz commented Feb 11, 2025

The first one, with the context cancelled error, was a bug in the status reporting that was fixed in #42435.

[elastic_agent][warn] Unit state changed prometheus/metrics-default-prometheus/metrics-prometheus-cfa93471-3f87-4d04-babb-ef2a62a85cd4-kubernetes-3009cd60-7f0c-44d8-8b85-4257683395ef (HEALTHY->DEGRADED): Error fetching data for metricset prometheus.collector: unable to decode response from prometheus endpoint: error making http request: Get "http://255.255.255.255:8888/metrics": dial tcp 255.255.255.255:8888: connect: connection refused

This can be caused by the Prometheus integration trying to scrape completed jobs, and therefore completed pods (see elastic/elastic-agent#6154 for reference).

This is something that should be fixed in our Prometheus input, not at the Elastic Agent level. If this can happen as part of normal operation, it probably shouldn't mark the input as degraded.

If we were to do something generic to help with this problem of noisy status changes, it would probably be adding the ability to mute or ignore particular error types. That is probably best done in the Beats, as they are the source of most of these errors.
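A hypothetical sketch of such an error filter in Go (illustrative only; shouldIgnoreForStatus is not an existing Beats API, and the choice of muted error types is an assumption based on the log lines above):

import (
	"context"
	"errors"
	"syscall"
)

// shouldIgnoreForStatus reports whether an error is expected during normal
// pod lifecycle and therefore should not degrade the unit status.
func shouldIgnoreForStatus(err error) bool {
	switch {
	case errors.Is(err, context.Canceled):
		// e.g. a log file disappearing while its container shuts down
		return true
	case errors.Is(err, syscall.ECONNREFUSED):
		// e.g. scraping a pod whose job has already completed
		return true
	default:
		return false
	}
}

In the metricset wrapper quoted earlier, a check like this could keep the status at Running (or skip the UpdateStatus call) for errors that are part of normal pod churn.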

@cmacknz cmacknz transferred this issue from elastic/elastic-agent Feb 11, 2025