Add additional metrics for worker saturation analysis and scaling #11
base: master
Conversation
Hi! Thank you for contributing to Yabeda!
My main concern here is that every worker process reports aggregated data about all workers (and there may be hundreds of them).
Isn't it better to report data about only this worker and let the monitoring system aggregate it? (By the way, what are you using?)
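For illustration, the per-process alternative could look roughly like this sketch. The process hashes mimic the entries yielded by Sidekiq::ProcessSet ("identity", "concurrency", and "busy" keys); the `own_metrics` helper and the identity strings are hypothetical, not part of this PR:

```ruby
# Sketch: each process reports gauges for itself only, leaving
# aggregation (e.g. Prometheus sum()) to the monitoring system.
# `processes` mimics the hashes yielded by Sidekiq::ProcessSet;
# `own_metrics` and the identity values are hypothetical.
def own_metrics(processes, identity)
  me = processes.find { |p| p["identity"] == identity }
  return nil unless me

  {
    concurrency: me["concurrency"],
    available_workers: me["concurrency"] - me["busy"]
  }
end

processes = [
  { "identity" => "host1:100:aaaa", "concurrency" => 10, "busy" => 4 },
  { "identity" => "host2:101:bbbb", "concurrency" => 10, "busy" => 9 }
]

own_metrics(processes, "host1:100:aaaa")
# => { concurrency: 10, available_workers: 6 }
```

With one time series per process, the monitoring system can still recover the fleet totals with a simple sum over the label.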
@@ -34,6 +34,10 @@ module Sidekiq
  gauge :active_processes, tags: [], comment: "The number of active Sidekiq worker processes."
  gauge :queue_latency, tags: %i[queue], comment: "The queue latency, the difference in seconds since the oldest job in the queue was enqueued"

+ gauge :concurrency, tags: [], comment: "The total number of jobs that can be run at a time across all processes."
+ gauge :available_workers, tags: [], comment: "The number of workers available for new jobs across all processes."
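As a rough sketch of how such aggregate gauges can be derived, assuming process hashes shaped like Sidekiq::ProcessSet entries with "concurrency" and "busy" keys (`saturation_metrics` is a hypothetical helper, not the PR's actual collector):

```ruby
# Hypothetical aggregation over all worker processes.
# Each hash mimics a Sidekiq::ProcessSet entry: "concurrency" is the
# thread pool size, "busy" is the number of threads running jobs.
def saturation_metrics(processes)
  concurrency = processes.sum { |p| p["concurrency"] }
  busy        = processes.sum { |p| p["busy"] }

  {
    concurrency: concurrency,
    available_workers: concurrency - busy,
    saturation: concurrency.zero? ? 0.0 : busy.to_f / concurrency
  }
end

processes = [
  { "concurrency" => 10, "busy" => 7 },
  { "concurrency" => 10, "busy" => 3 }
]

saturation_metrics(processes)
# => { concurrency: 20, available_workers: 10, saturation: 0.5 }
```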
From the metric name I can't understand whether it is about processes or threads. I need to read the collector's code below to get it. Let's clarify:
- gauge :available_workers, tags: [], comment: "The number of workers available for new jobs across all processes."
+ gauge :available_threads, tags: [], comment: "The number of threads available for job processing across all processes."
This name was chosen to be consistent with how it’s reporting active workers — it looks like Sidekiq uses the term “worker” to mean “thread”.
I originally tried to do it per-process and let Prometheus aggregate it, but Sidekiq isn’t reporting it that way, so I had to do it like this. Scaling is done by having the Prometheus Exporter process make these stats available as custom metrics for the Kubernetes Horizontal Pod Autoscaler. To avoid overloading Prometheus with queries, it’s set up as a service-level metric instead of a per-pod one.
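For context, a service-level metric like this can be consumed by the HPA as an External metric, roughly like the manifest below. This is a hypothetical sketch: the Deployment name, metric name, and target value are assumptions, and the exact shape depends on the Prometheus adapter in use.

```yaml
# Hypothetical HPA consuming a service-level Sidekiq saturation metric
# exposed to Kubernetes via a Prometheus adapter (names are assumed).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sidekiq-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sidekiq-workers
  minReplicas: 1
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: sidekiq_scaling_target # hypothetical adapter-exposed name
        target:
          type: Value
          value: "100" # scale until the saturation expression drops to 100
```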
Hi! I saw the feature, and especially the total number of threads is interesting (we already have the busy
From what I saw, when this PR was opened there was no
@Envek are you open to considering this feature again? Maybe a bit smaller (just a
This adds three new metrics that can be used to implement scaling rules:

- `sidekiq_concurrency`: the total number of jobs that can be run at a time across all processes
- `sidekiq_available_workers`: the number of workers available for new jobs across all processes
- `sidekiq_saturation`: the fraction of the total concurrency currently in use
We then scale in Kubernetes using the following metric:
(max(sidekiq_jobs_waiting_count) / max(sidekiq_concurrency) + max(sidekiq_saturation)) * 100
This lets us scale pods based on the current saturation plus the additional saturation required for jobs waiting in the queue.
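Plugging assumed values into that expression shows how it behaves: with 30 waiting jobs, a total concurrency of 20, and 50% saturation, the target comes out at 200, i.e. the autoscaler would aim to double capacity.

```ruby
# Worked example of the scaling expression above (all values assumed):
# (jobs_waiting / concurrency + saturation) * 100
jobs_waiting = 30
concurrency  = 20
saturation   = 0.5 # half of all threads are busy

target = (jobs_waiting.to_f / concurrency + saturation) * 100
# => 200.0
```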