
Introduced failureThreshold causes agents to become unhealthy #42672

Open
Alphayeeeet opened this issue Feb 11, 2025 · 4 comments
Labels
bug, Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

Comments

@Alphayeeeet

Due to the newly introduced failureThreshold (#41570 & elastic/elastic-agent#5999), our agents now become unhealthy due to runtime errors in Kubernetes environments.

Expected runtime errors include the following:

[elastic_agent][warn] Unit state changed filestream-default-filestream-container-logs-3b56171a-c46b-475e-9a80-02708c67ce0c-kubernetes-755830fc-44b9-4b02-9333-bb2653b9ae47.podxyz (HEALTHY->DEGRADED): error while reading from source: context cancel

This might be caused by Kubernetes removing log files of containers that don't necessarily exist anymore.

[elastic_agent][warn] Unit state changed prometheus/metrics-default-prometheus/metrics-prometheus-cfa93471-3f87-4d04-babb-ef2a62a85cd4-kubernetes-3009cd60-7f0c-44d8-8b85-4257683395ef (HEALTHY->DEGRADED): Error fetching data for metricset prometheus.collector: unable to decode response from prometheus endpoint: error making http request: Get "http://255.255.255.255:8888/metrics": dial tcp 255.255.255.255:8888: connect: connection refused

This can be caused by the Prometheus integration trying to scrape completed jobs, and therefore completed pods (see elastic/elastic-agent#6154 for reference).

For confirmed bugs, please report:

  • Version: 8.16.X-8.17.X
  • Operating System: RHEL 8.9
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Feb 11, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pierrehilbert
Collaborator

cc @pchila

@pchila
Member

pchila commented Feb 11, 2025

The failureThreshold introduced in #41570 and used in elastic/elastic-agent#5332 applies only to elastic-agent monitoring inputs.
The mechanism for marking the stream as DEGRADED when encountering errors while fetching metrics was already implemented in:

case mb.ReportingMetricSetV2Error:
	reporter.StartFetchTimer()
	err := fetcher.Fetch(reporter.V2())
	if err != nil {
		reporter.V2().Error(err)
		if errors.As(err, &mb.PartialMetricsError{}) {
			// mark module as running if metrics are partially available and display the error message
			msw.module.UpdateStatus(status.Running, fmt.Sprintf("Error fetching data for metricset %s.%s: %v", msw.module.Name(), msw.MetricSet.Name(), err))
		} else {
			// mark it as degraded for any other issue encountered
			msw.module.UpdateStatus(status.Degraded, fmt.Sprintf("Error fetching data for metricset %s.%s: %v", msw.module.Name(), msw.MetricSet.Name(), err))
		}
		logp.Err("Error fetching data for metricset %s.%s: %s", msw.module.Name(), msw.Name(), err)
	} else {
		msw.module.UpdateStatus(status.Running, "")
	}

The failureThreshold makes it possible to wait for a given number of consecutive errors before marking the stream as degraded, but it defaults to 1 to keep the existing behaviour introduced in PR #40400; see the snippet https://github.com/elastic/beats/pull/40400/files#diff-b51accf349a564390dca49aaee107c0a4c2c89cdd480fe5880121205598a243bR258
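For illustration, the gating behaviour can be sketched as a simple consecutive-failure counter. This is a simplified sketch, not the actual Beats implementation; streamHealth and its fields are illustrative names only:

// Sketch of the consecutive-failure gating idea: a stream is only reported
// as Degraded once failureThreshold consecutive fetch errors have occurred;
// any successful fetch resets the counter. With the default threshold of 1,
// the first error already degrades the stream, matching the earlier behaviour.
type streamHealth struct {
	failureThreshold    int // defaults to 1
	consecutiveFailures int
}

func (s *streamHealth) onFetchResult(err error) (degraded bool) {
	if err == nil {
		s.consecutiveFailures = 0
		return false
	}
	s.consecutiveFailures++
	return s.consecutiveFailures >= s.failureThreshold
}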

Expected runtime errors include the following:

[elastic_agent][warn] Unit state changed filestream-default-filestream-container-logs-3b56171a-c46b-475e-9a80-02708c67ce0c-kubernetes-755830fc-44b9-4b02-9333-bb2653b9ae47.podxyz (HEALTHY->DEGRADED): error while reading from source: context cancel

This might be caused by Kubernetes removing log files of containers that don't necessarily exist anymore.

[elastic_agent][warn] Unit state changed prometheus/metrics-default-prometheus/metrics-prometheus-cfa93471-3f87-4d04-babb-ef2a62a85cd4-kubernetes-3009cd60-7f0c-44d8-8b85-4257683395ef (HEALTHY->DEGRADED): Error fetching data for metricset prometheus.collector: unable to decode response from prometheus endpoint: error making http request: Get "http://255.255.255.255:8888/metrics": dial tcp 255.255.255.255:8888: connect: connection refused

This can be caused by the Prometheus integration trying to scrape completed jobs, and therefore completed pods (see #6154 for reference).

As @Alphayeeeet mentions, these errors can make the data stream unhealthy, but this is not due to the introduction of failureThreshold; rather, it is caused by PR #40400, which changes the state to DEGRADED because of the connection error Get "http://255.255.255.255:8888/metrics": dial tcp 255.255.255.255:8888: connect: connection refused

@cmacknz
Member

cmacknz commented Feb 11, 2025

The first one, with the context cancelled error, was a bug in the status reporting that was fixed in #42435.

[elastic_agent][warn] Unit state changed prometheus/metrics-default-prometheus/metrics-prometheus-cfa93471-3f87-4d04-babb-ef2a62a85cd4-kubernetes-3009cd60-7f0c-44d8-8b85-4257683395ef (HEALTHY->DEGRADED): Error fetching data for metricset prometheus.collector: unable to decode response from prometheus endpoint: error making http request: Get "http://255.255.255.255:8888/metrics": dial tcp 255.255.255.255:8888: connect: connection refused

This can be caused by the Prometheus integration trying to scrape completed jobs, and therefore completed pods (see elastic/elastic-agent#6154 for reference).

This is something that should be fixed in our Prometheus input, not at the Elastic Agent level. If this can happen as part of normal operation, it probably shouldn't mark the input as degraded.

If we were to do something generic to help with this problem of noisy status changes, it would probably be adding the ability to mute or ignore particular error types. That is probably best done in the Beats, as they are the source of most of these errors.
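A hypothetical sketch of such an error filter in Go (illustrative only; shouldIgnoreForStatus is not an existing Beats API, and the choice of muted error types is an assumption based on the log lines above):

import (
	"context"
	"errors"
	"syscall"
)

// shouldIgnoreForStatus reports whether an error is expected during normal
// pod lifecycle and therefore should not degrade the unit status.
func shouldIgnoreForStatus(err error) bool {
	switch {
	case errors.Is(err, context.Canceled):
		// e.g. a log file disappearing while its container shuts down
		return true
	case errors.Is(err, syscall.ECONNREFUSED):
		// e.g. scraping a pod whose job has already completed
		return true
	default:
		return false
	}
}

In the metricset wrapper quoted earlier, a check like this could keep the status at Running (or skip the UpdateStatus call) for errors that are part of normal pod churn.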

@cmacknz cmacknz transferred this issue from elastic/elastic-agent Feb 11, 2025