Scrapes hang when deadlocked #2413

Closed
Pokom opened this issue Jun 5, 2024 · 6 comments · Fixed by #2412
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

Pokom (Contributor) commented Jun 5, 2024

What happened:

Occasionally in larger clusters, kube-state-metrics will fail to be scraped by both Prometheus and Grafana's Alloy. When attempting to curl the /metrics endpoint, the response gets to a certain point (usually pod disruption budgets) and then just hangs. The only way to recover is to restart the impacted kube-state-metrics pods. This is similar to #995 and #1028, but the difference is that we don't have high pod churn. Our ksm deployment is sharded, and not all of the shards fail at the same time.

What you expected to happen:
I would expect the server to eventually time out the connections and not block future scrapes.

How to reproduce it (as minimally and precisely as possible):

Run the tip of kube-state-metrics against a decently large cluster. With kube-state-metrics up and running, launch the following Go program, which is meant to emulate a client failing to fully fetch metrics:

package main

import (
	"log"
	"net/http"
	"sync"
	"time"
)

func main() {
	wg := sync.WaitGroup{}
	wg.Add(25)
	// Launch 25 clients in batches of 5, 15 seconds apart.
	for i := 0; i < 5; i++ {
		for j := 0; j < 5; j++ {
			go func() {
				defer wg.Done()
				now := time.Now()
				c := http.Client{}
				req, err := http.NewRequest("GET", "http://localhost:8080/metrics", nil)
				if err != nil {
					log.Printf("Error creating request: %s", err)
					return
				}
				req.Header.Set("Range", "bytes=0-500")
				resp, err := c.Do(req)
				if err != nil {
					log.Printf("Error sending request: %s", err)
					return
				}
				// Deliberately never read or close resp.Body, to emulate a
				// client that stalls before fully fetching the metrics.
				log.Printf("Downloaded %d bytes in %fs", resp.ContentLength, time.Since(now).Seconds())
			}()
		}
		time.Sleep(15 * time.Second)
	}
	wg.Wait()
	log.Println("Done, going to sleep for a bit")
	time.Sleep(1 * time.Minute)
}

After a few clients fail to close their body, accessing the /metrics endpoint will stall and you'll only get partial results from curl. If you look at /debug/pprof/goroutines you'll notice that the number of goroutines keeps increasing. You'll find a single goroutine, blocked waiting for the metrics store's write lock, that in turn holds up all the other goroutines; it looks like:

goroutine 101 [sync.RWMutex.Lock, 17 minutes]:
sync.runtime_SemacquireRWMutex(0xc002454000?, 0xb0?, 0x0?)
	/usr/local/go/src/runtime/sema.go:87 +0x25
sync.(*RWMutex).Lock(0x1ab02c0?)
	/usr/local/go/src/sync/rwmutex.go:152 +0x6a
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add(0xc00052e140, {0x1ab02c0, 0xc002454000})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:68 +0x74
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Update(0x1ab02c0?, {0x1ab02c0?, 0xc002454000?})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:86 +0x1d
k8s.io/client-go/tools/cache.watchHandler({0x0?, 0x0?, 0x2abb480?}, {0x1d6e550, 0xc004995b00}, {0x1d86780, 0xc00052e140}, {0x1d93688?, 0x1ab02c0}, 0x0, ...)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:761 +0x91b
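
To make the failure mode concrete, here is a minimal, hypothetical sketch (not the actual kube-state-metrics code; the type and metric names are made up) of the locking pattern the trace points at: the scrape handler holds the store's read lock while flushing the response, a client that never drains the body keeps that read lock pinned, the reflector's Add then blocks on the write lock, and because Go's sync.RWMutex blocks new readers once a writer is waiting, every subsequent scrape hangs as well.

package main

import (
	"log"
	"net/http"
	"sync"
	"time"
)

type store struct {
	mu      sync.RWMutex
	metrics []string
}

// writeAll mimics the scrape path: it holds the read lock for as long as it
// takes to flush the full response to the client.
func (s *store) writeAll(w http.ResponseWriter) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for _, m := range s.metrics {
		// If the client stops reading, this write eventually blocks once the
		// TCP buffers fill, and the read lock stays held the whole time.
		if _, err := w.Write([]byte(m + "\n")); err != nil {
			return
		}
	}
}

// add mimics the reflector's Add/Update path from the dump above: it needs the
// write lock and waits behind every stalled reader. Once a writer is waiting,
// sync.RWMutex makes new RLock calls wait too, so all later scrapes hang.
func (s *store) add(m string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.metrics = append(s.metrics, m)
}

func main() {
	s := &store{metrics: []string{"kube_example_metric 1"}}
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		s.writeAll(w)
	})
	go func() {
		for {
			s.add("kube_example_metric 1")
			time.Sleep(time.Second)
		}
	}()
	log.Fatal(http.ListenAndServe(":8080", nil))
}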

Anything else we need to know?:

Here are goroutine dumps that show that all goroutines are stuck and blocked on reading.
ksm-groutine-dump-2024-06-04-regular.txt
ksm-groutine-dump-2024-06-04-debug-2.txt

Environment:

  • kube-state-metrics version: v2.12.0
  • Kubernetes version (use kubectl version): v1.28.7-gke.1026000
  • Cloud provider or hardware configuration: GKE + EKS
  • Other info: Running in sharded mode with upwards of 7 shards.
@Pokom Pokom added the kind/bug Categorizes issue or PR as related to a bug. label Jun 5, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 5, 2024
tkent commented Jun 9, 2024

👋 We're also seeing this on some very small clusters that we just upgraded to k8s 1.29 (3-4 nodes, less than 100 pods).

For us....

  1. It's happening on all versions we've tested (2.10.1 through 2.12.0)
  2. We didn't notice this happening until after the upgrade
  3. We don't have a clear cause for it yet, but the symptom is the same as described here and can be reproduced by just scraping in a for loop for a bit.

Maybe we've just been getting lucky, until now?

Pokom (Contributor, Author) commented Jun 12, 2024

@tkent I've created a PR (#2412) that I believe resolves the root cause. By adding timeouts to the metrics server's requests, it ensures that hung client connections are closed in a timely fashion and don't block other writers.
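
For anyone hitting this before the fix is released, the shape of the mitigation is to put timeouts on the metrics HTTP server so a stalled client has its connection closed. Below is a hedged sketch using the standard net/http server fields; the exact fields, values, and wiring in #2412 may differ.

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("# metric families would be written here\n"))
	})

	srv := &http.Server{
		Addr:    ":8080",
		Handler: mux,
		// WriteTimeout is the important one for this issue: it bounds how long
		// a hung client can keep a response (and, in ksm, the store's read
		// lock) open. The other timeouts are included for completeness; all
		// values here are illustrative, not the ones used by the PR.
		ReadHeaderTimeout: 5 * time.Second,
		ReadTimeout:       10 * time.Second,
		WriteTimeout:      30 * time.Second,
		IdleTimeout:       120 * time.Second,
	}
	log.Fatal(srv.ListenAndServe())
}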

tkent commented Jun 12, 2024

@Pokom - thanks for both the write up and addressing it in a PR!

I'm still puzzled about why we only started seeing this on clusters after upgrading them from 1.27 -> 1.29, and why it happens pretty much all the time (the pods are unusable for us in each cluster we've upgraded). But that's probably something ignorable about our configuration.

Anyway, just recording it here in case somebody else happens across it. Looking forward to your PR making it in!


Update

While our symptoms match up exactly with the issue described here, the frequency was much higher. We found that the cause of our hangs turned out to be different (a bit related, but different). During our 1.27 -> 1.29 k8s upgrades, we also upgraded our cilium installations, and a problem with the transparent IPsec feature in the new version caused larger transfers (those over about 200K) to frequently stall. The only thing regularly doing transfers of that size in our small test clusters was the ksm pods, and that's how we ended up here. For now, we've disabled cilium's IPsec feature and we're back to working again.

Pokom (Contributor, Author) commented Jun 13, 2024

@tkent I appreciate you confirming that you're experiencing the same symptoms I am! I'm running dozens of sharded ksm instances, so we have probably close to 100 pods running in the wild, and the hang is very sporadic. Even when it occurs, it's usually only a few shards at once, never all of them. So I believe the problem has existed for quite some time; it's just rare enough that it looks transient.

dgrisonnet (Member) commented:

/triage accepted
/help wanted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 13, 2024
dgrisonnet (Member) commented:

/assign @Pokom
