Scrapes hang when deadlocked #2413

Closed
Pokom opened this issue Jun 5, 2024 · 6 comments · Fixed by #2412
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

Pokom (Contributor) commented Jun 5, 2024

What happened:

Occasionally in larger clusters, kube-state-metrics will fail to be scraped by both Prometheus and Grafana's Alloy. When attempting to curl the /metrics endpoint, the response gets to a certain point (usually pod disruption budgets) and then just hangs. The only way to recover is to restart the impacted kube-state-metrics pods. This is similar to #995 and #1028, but the difference is that we don't have high pod churn. Our ksm deployment is sharded, and not all of the shards fail at the same time.

What you expected to happen:
I would expect the server to eventually time out the connections and not block future scrapes.

How to reproduce it (as minimally and precisely as possible):

Run the tip of kube-state-metrics against a decently large cluster. With kube-state-metrics up and running, launch the following Go program, which is meant to emulate a client failing to fully fetch metrics:

package main

import (
	"log"
	"net/http"
	"sync"
	"time"
)

func main() {
	wg := sync.WaitGroup{}
	wg.Add(25)
	// Launch 25 clients in batches of 5, 15 seconds apart.
	for i := 0; i < 5; i++ {
		for j := 0; j < 5; j++ {
			go func() {
				defer wg.Done()
				now := time.Now()
				c := http.Client{}
				req, err := http.NewRequest("GET", "http://localhost:8080/metrics", nil)
				if err != nil {
					log.Printf("Error creating request: %s", err)
					return
				}
				req.Header.Set("Range", "bytes=0-500")
				resp, err := c.Do(req)
				if err != nil {
					log.Printf("Error sending request: %s", err)
					return
				}
				// Deliberately never read or close resp.Body, to emulate a
				// client that stalls before fully fetching the metrics.
				log.Printf("Downloaded %d bytes in %fs", resp.ContentLength, time.Since(now).Seconds())
			}()
		}
		time.Sleep(15 * time.Second)
	}
	wg.Wait()
	log.Println("Done, going to sleep for a bit")
	time.Sleep(1 * time.Minute)
}

After a few clients fail to close their body, accessing the /metrics endpoint will stall and you'll only get partial results from curl. If you look at /debug/pprof/goroutines you'll notice that the number of goroutines keeps increasing. You'll find a single goroutine, blocked waiting for the metrics store's write lock, that in turn holds up all the other goroutines; it looks like:

goroutine 101 [sync.RWMutex.Lock, 17 minutes]:
sync.runtime_SemacquireRWMutex(0xc002454000?, 0xb0?, 0x0?)
	/usr/local/go/src/runtime/sema.go:87 +0x25
sync.(*RWMutex).Lock(0x1ab02c0?)
	/usr/local/go/src/sync/rwmutex.go:152 +0x6a
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Add(0xc00052e140, {0x1ab02c0, 0xc002454000})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:68 +0x74
k8s.io/kube-state-metrics/v2/pkg/metrics_store.(*MetricsStore).Update(0x1ab02c0?, {0x1ab02c0?, 0xc002454000?})
	/go/src/k8s.io/kube-state-metrics/pkg/metrics_store/metrics_store.go:86 +0x1d
k8s.io/client-go/tools/cache.watchHandler({0x0?, 0x0?, 0x2abb480?}, {0x1d6e550, 0xc004995b00}, {0x1d86780, 0xc00052e140}, {0x1d93688?, 0x1ab02c0}, 0x0, ...)
	/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:761 +0x91b
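
To make the failure mode concrete, here is a minimal, hypothetical sketch (not the actual kube-state-metrics code; the type and metric names are made up) of the locking pattern the trace points at: the scrape handler holds the store's read lock while flushing the response, a client that never drains the body keeps that read lock pinned, the reflector's Add then blocks on the write lock, and because Go's sync.RWMutex blocks new readers once a writer is waiting, every subsequent scrape hangs as well.

package main

import (
	"log"
	"net/http"
	"sync"
	"time"
)

type store struct {
	mu      sync.RWMutex
	metrics []string
}

// writeAll mimics the scrape path: it holds the read lock for as long as it
// takes to flush the full response to the client.
func (s *store) writeAll(w http.ResponseWriter) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for _, m := range s.metrics {
		// If the client stops reading, this write eventually blocks once the
		// TCP buffers fill, and the read lock stays held the whole time.
		if _, err := w.Write([]byte(m + "\n")); err != nil {
			return
		}
	}
}

// add mimics the reflector's Add/Update path from the dump above: it needs the
// write lock and waits behind every stalled reader. Once a writer is waiting,
// sync.RWMutex makes new RLock calls wait too, so all later scrapes hang.
func (s *store) add(m string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.metrics = append(s.metrics, m)
}

func main() {
	s := &store{metrics: []string{"kube_example_metric 1"}}
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		s.writeAll(w)
	})
	go func() {
		for {
			s.add("kube_example_metric 1")
			time.Sleep(time.Second)
		}
	}()
	log.Fatal(http.ListenAndServe(":8080", nil))
}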

Anything else we need to know?:

Here are goroutine dumps that show that all goroutines are stuck and blocked on reading.
ksm-groutine-dump-2024-06-04-regular.txt
ksm-groutine-dump-2024-06-04-debug-2.txt

Environment:

  • kube-state-metrics version: v2.12.0
  • Kubernetes version (use kubectl version): v1.28.7-gke.1026000
  • Cloud provider or hardware configuration: GKE + EKS
  • Other info: Running in sharded mode with upwards of 7 shards.
@Pokom Pokom added the kind/bug Categorizes issue or PR as related to a bug. label Jun 5, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 5, 2024
tkent commented Jun 9, 2024

👋 We're also seeing this on some very small clusters that we just upgraded to k8s 1.29 (3-4 nodes, less than 100 pods).

For us....

  1. It's happening on all versions we've tested (2.10.1 through 2.12.0)
  2. We didn't notice this happening until after the upgrade
  3. We don't have a clear cause for it yet, but the symptom is the same as described here and can be reproduced by just scraping in a for loop for a bit.

Maybe we've just been getting lucky, until now?

Pokom (Contributor, Author) commented Jun 12, 2024

@tkent I've created a PR (#2412) that I believe resolves the root cause. By adding timeouts to the metrics server's requests, it ensures that hung client connections are closed in a timely fashion and don't block other writers.
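
For anyone hitting this before the fix is released, the shape of the mitigation is to put timeouts on the metrics HTTP server so a stalled client has its connection closed. Below is a hedged sketch using the standard net/http server fields; the exact fields, values, and wiring in #2412 may differ.

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("# metric families would be written here\n"))
	})

	srv := &http.Server{
		Addr:    ":8080",
		Handler: mux,
		// WriteTimeout is the important one for this issue: it bounds how long
		// a hung client can keep a response (and, in ksm, the store's read
		// lock) open. The other timeouts are included for completeness; all
		// values here are illustrative, not the ones used by the PR.
		ReadHeaderTimeout: 5 * time.Second,
		ReadTimeout:       10 * time.Second,
		WriteTimeout:      30 * time.Second,
		IdleTimeout:       120 * time.Second,
	}
	log.Fatal(srv.ListenAndServe())
}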

tkent commented Jun 12, 2024

@Pokom - thanks for both the write up and addressing it in a PR!

I'm still puzzled about why we only started seeing this on clusters after upgrading them from 1.27 -> 1.29, and why it happens pretty much all the time (the pods are unusable for us in each cluster we've upgraded). But that's probably something ignorable about our configuration.

Anyway, just recording it here in case somebody else happens across it. Looking forward to your PR making it in!


Update

While our symptoms match up exactly with the issue described here, the frequency was much higher. We found that the cause of our hangs turned out to be different (a bit related, but different). During our 1.27 -> 1.29 k8s upgrades, we also upgraded our cilium installations, and a problem with the transparent IPsec feature in the new version caused larger transfers (those over about 200K) to frequently stall. The only thing regularly doing transfers of that size in our small test clusters was the ksm pods, and that's how we ended up here. For now, we've disabled cilium's IPsec feature and we're back to working again.

Pokom (Contributor, Author) commented Jun 13, 2024

@tkent I appreciate you confirming that you're experiencing the same symptoms I am! I'm running dozens of sharded ksm instances, so we have probably close to 100 pods running in the wild, and the hang is very sporadic. Even when it occurs, it's usually only a few shards at once, never all of them. So I believe the problem has existed for quite some time; it's just rare enough that it looks transient.

dgrisonnet (Member) commented:

/triage accepted
/help wanted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 13, 2024
dgrisonnet (Member) commented:

/assign @Pokom
