fix(server): Add read and write timeouts #2412
Conversation
There are a few documented scenarios where `kube-state-metrics` will lock up (kubernetes#995, kubernetes#1028). I believe a much simpler solution to ensure `kube-state-metrics` doesn't lock up and require a restart to serve `/metrics` requests is to add default read and write timeouts and to allow them to be configurable. At Grafana, we've experienced a few scenarios where `kube-state-metrics` running in larger clusters falls behind and starts getting scraped multiple times. When this occurs, `kube-state-metrics` becomes completely unresponsive and requires a reboot. This is somewhat easily reproducible (I'll provide a script in an issue) and causes other critical workloads (KEDA, VPA) to fail in weird ways.

Adds two flags:
- `server-read-timeout`
- `server-write-timeout`

Updates the metrics HTTP server to set the `ReadTimeout` and `WriteTimeout` to the configured values.
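To make the mechanism concrete, here is a minimal, self-contained sketch of the idea. This is not the actual kube-state-metrics wiring: the flag names match the PR, but the server setup, defaults, and addresses here are illustrative only.

```go
// Illustrative sketch, not the actual kube-state-metrics code: expose the two
// durations as flags and pass them through to the metrics http.Server.
package main

import (
	"flag"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	readTimeout := flag.Duration("server-read-timeout", 60*time.Second,
		"The maximum duration for reading the entire request, including the body.")
	writeTimeout := flag.Duration("server-write-timeout", 60*time.Second,
		"The maximum duration before timing out writes of the response.")
	flag.Parse()

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())

	srv := &http.Server{
		Addr:         ":8080",
		Handler:      mux,
		ReadTimeout:  *readTimeout,  // a stalled request can no longer block the connection forever
		WriteTimeout: *writeTimeout, // a stalled scrape is cut off once the write deadline passes
	}
	log.Fatal(srv.ListenAndServe())
}
```

With the timeouts set, a scrape that stalls has its connection closed once the deadline passes instead of hanging indefinitely, which is the failure mode described above.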
Welcome @Pokom!
pkg/options/options.go (Outdated)

@@ -146,6 +151,11 @@ func (o *Options) AddFlags(cmd *cobra.Command) {
 o.cmd.Flags().Var(&o.Namespaces, "namespaces", fmt.Sprintf("Comma-separated list of namespaces to be enabled. Defaults to %q", &DefaultNamespaces))
 o.cmd.Flags().Var(&o.NamespacesDenylist, "namespaces-denylist", "Comma-separated list of namespaces not to be enabled. If namespaces and namespaces-denylist are both set, only namespaces that are excluded in namespaces-denylist will be used.")
 o.cmd.Flags().Var(&o.Resources, "resources", fmt.Sprintf("Comma-separated list of Resources to be enabled. Defaults to %q", &DefaultResources))
+
+o.cmd.Flags().DurationVar(&o.ServerReadTimeout, "server-read-timeout", 30*time.Second, "The maximum duration for reading the entire request, including the body.")
Can we move the durations into separate variables at the top of the doc? I feel like it's easier to spot when they are not hiding inside the Flag setting
Just to make sure I'm following: are you suggesting moving the default (`30*time.Second`) to a variable and placing it at the top? That seems entirely reasonable to me.
@mrueg I pushed a new commit that moves the default values into separate variables at the top of the file. Let me know how they look 👍🏻

I've also changed the defaults to 60 seconds, as I think that is more closely aligned with the default scrape intervals from Prometheus. I'm wary of setting them to 10s, which is the default timeout value for Prometheus scrape configs.

Given how long `kube-state-metrics` has existed without these values, this could be a problematic change for folks with scrape intervals over 60s. I'm not sure how you'd like to handle that, or whether we opt to have the default be 0, which would effectively mean there is no timeout.
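For reference, roughly what "moving the defaults into variables at the top" could look like. The variable and function names below are illustrative, not necessarily what the commit uses.

```go
// Sketch of the reviewer's suggestion: keep the default durations in
// package-level variables so they are easy to spot, instead of hiding them
// inside the flag registration calls. Names here are hypothetical.
package options

import (
	"time"

	"github.com/spf13/cobra"
)

var (
	defaultServerReadTimeout  = 60 * time.Second
	defaultServerWriteTimeout = 60 * time.Second
)

type Options struct {
	ServerReadTimeout  time.Duration
	ServerWriteTimeout time.Duration
}

// AddServerTimeoutFlags registers the two timeout flags with their defaults.
func (o *Options) AddServerTimeoutFlags(cmd *cobra.Command) {
	cmd.Flags().DurationVar(&o.ServerReadTimeout, "server-read-timeout",
		defaultServerReadTimeout,
		"The maximum duration for reading the entire request, including the body.")
	cmd.Flags().DurationVar(&o.ServerWriteTimeout, "server-write-timeout",
		defaultServerWriteTimeout,
		"The maximum duration before timing out writes of the response.")
}
```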
Should we set timeouts for the telemetry server as well? https://github.com/kubernetes/kube-state-metrics/pull/2412/files#diff-dc8fe36bd1edf1b4cd03bacbd94b02cd4226717fe7e1a6474c534f5b5db30227R315 How did you choose the duration for each timeout setting? Should we provide any additional suggestions?
I'm sure it would be a good idea to set timeouts for the telemetry server as well, but I don't have enough experience with that type of scraping to provide meaningful defaults 😅 I can follow up with an issue to set those in a future PR, how does that sound?
These were picked somewhat arbitrarily, but the intent is to align them with sane scrape times. When running in our dev cluster with dozens of nodes, I have both aligned with our scrape interval and timeout (15s). As far as guidance goes, I think the minimum value should be the scrape timeout. The safest option for the defaults would be to follow the default `scrape_interval` from Prometheus, which is 60s.
/ok-to-test

What happens if a request takes longer than a timeout? Does the server send any log output? Is it visible on the telemetry server metrics?
That works for me as well. Just want to make sure that we don't have
That makes sense to me, thanks for updating!
Co-authored-by: Manuel Rüger <[email protected]>
I'm not sure if anything is logged on
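One rough way to check is a standalone experiment along the lines below. This is not kube-state-metrics itself, and the behavior should be verified: my understanding of `net/http` is that the handler is not interrupted when `WriteTimeout` fires; the response write simply fails when the server flushes it, the connection is closed, the client typically sees an EOF or reset, and nothing is logged by default unless extra instrumentation or an `ErrorLog` is wired up.

```go
// Hypothetical standalone experiment (not kube-state-metrics): observe what a
// client sees when the handler takes longer than the server's WriteTimeout.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:         "127.0.0.1:9999",
		WriteTimeout: 1 * time.Second, // deliberately shorter than the handler below
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			time.Sleep(2 * time.Second)          // simulate a slow /metrics render
			io.WriteString(w, "some_metric 1\n") // buffered; the flush after the handler returns fails
		}),
	}
	go srv.ListenAndServe() // error ignored in this sketch

	time.Sleep(100 * time.Millisecond) // crude wait for the listener to come up

	resp, err := http.Get("http://127.0.0.1:9999/metrics")
	if err != nil {
		// Expected outcome: an EOF / connection-reset style error rather than a response.
		fmt.Println("client error:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("client got %d bytes\n", len(body))
}
```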
/hold for others to review. Changes /lgtm

/assign @dgrisonnet @richabanker

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrueg, Pokom, richabanker

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/hold cancel
What this PR does / why we need it:

There are a few documented scenarios where `kube-state-metrics` will lock up (#995, #1028). I believe a much simpler solution to ensure `kube-state-metrics` doesn't lock up and require a restart to serve `/metrics` requests is to add default read and write timeouts and to allow them to be configurable. At Grafana, we've experienced a few scenarios where `kube-state-metrics` running in larger clusters falls behind and starts getting scraped multiple times. When this occurs, `kube-state-metrics` becomes completely unresponsive and requires a reboot. This is somewhat easily reproducible (I'll provide a script in an issue) and causes other critical workloads (KEDA, VPA) to fail in weird ways.

Adds two flags:
- `server-read-timeout`
- `server-write-timeout`

Updates the metrics HTTP server to set the `ReadTimeout` and `WriteTimeout` to the configured values.

How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)

Does not change.

Fixes #2413