feat!: add operator-metrics port #171

Merged
6 commits merged into rlex:master from fstr/add-operator-metrics-port on Nov 14, 2024

Conversation

@fstr (Contributor) commented Sep 2, 2024

Description

Expose the Pyrra Kubernetes operator container's metrics on port 8080. With these metrics, we also get kubebuilder metrics such as controller_runtime_reconcile_errors_total, on which we can build an alert.

The alert can optionally be enabled. Due to the operator's reconciliation loop interval, we use a 20-minute range on the rate function, as it is long enough to keep the rate from dropping to 0 and causing a flapping alert while the operator is not reconciling (and therefore not reporting a reconciliation error). The alert will resolve slightly delayed because of this, but that is much better than having no alert at all.
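
For illustration, a minimal sketch of such a rule (the group name, severity label, and annotation below are assumptions, not the chart's exact rendered output; the alert name and 20m range are from this PR):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pyrra
spec:
  groups:
    - name: pyrra-operator
      rules:
        - alert: PyrraReconciliationError
          # The 20m range keeps the rate from dropping to 0 between
          # reconciliation attempts, which would make the alert flap.
          expr: rate(controller_runtime_reconcile_errors_total[20m]) > 0
          labels:
            severity: warning
          annotations:
            description: The Pyrra operator is failing to reconcile SLO objects.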

This feature is especially useful, as the Pyrra WebUI currently breaks and shows nothing if a broken SLO has been applied to your Kubernetes cluster.

How can this be tested

Enable the PyrraReconciliationError alert via prometheusRule.enabled: true and deploy a broken/invalid SLO. Even if the ValidatingWebhook is active, the following SLO will be accepted but won't reconcile, as the errors.metric is not a valid vector selector due to the or expression:

---
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: broken-slo-test
  namespace: monitoring
spec:
  description: ""
  alerting:
    absent: false
  indicator:
    ratio:
      errors:
        # This (or clause) is not supported and will lead to an error
        metric: prometheus_notifications_errors_total{job="prometheus-k8s"} or up{job="prometheus-k8s"} == 0
      total:
        metric: prometheus_notifications_sent_total{job="prometheus-k8s"}
  target: "99"
  window: 4w
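
After applying the manifest, you can confirm the operator is reporting errors by scraping its new metrics port directly. A sketch, assuming the manifest above is saved as broken-slo.yaml; the deployment name pyrra is a placeholder, adjust it to your release:

kubectl apply -f broken-slo.yaml
kubectl -n monitoring port-forward deploy/pyrra 8080
curl -s localhost:8080/metrics | grep controller_runtime_reconcile_errors_total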

BREAKING CHANGE: This is a breaking change, as both containers now expose the default Go runtime metrics. Users who built monitoring on the existing metrics must now separate them by container label.
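
For example, a query that previously matched a single set of Go runtime series may now need a container matcher (the job and container label values below are illustrative, not the chart's actual labels):

go_goroutines{job="pyrra"}                      # before: matched one container's Go metrics
go_goroutines{job="pyrra", container="pyrra"}   # after: exclude the operator's duplicate series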

@rlex (Owner) commented Sep 2, 2024

Might be worth bumping the version in Chart.yaml (the major one?) and re-running helm-docs.

Review threads (outdated, resolved):
charts/pyrra/templates/prometheusrule.yaml
charts/pyrra/templates/servicemonitor.yaml
charts/pyrra/templates/servicemonitor.yaml
@fstr (Contributor, Author) commented Sep 9, 2024

What are your thoughts on the "shared" ServiceMonitor? We now have a single ServiceMonitor for the pyrra-server and the pyrra-operator, since both containers run as part of a single pod. I think it's acceptable, but the jobLabel is pyrra-server, which no longer fully matches, as it now also covers the pyrra-operator metrics.

Changing the jobLabel to something generic like pyrra would be another backwards-incompatible change. Introducing a second ServiceMonitor just for the operator would allow it to have jobLabel: pyrra-operator.

@rlex (Owner) commented Sep 10, 2024

Maybe we should just split the configuration between the operator and Pyrra?

i.e. pyrra.serviceMonitor, pyrra.prometheusUrl, pyrraOperator.serviceMonitor

@sebastiangaiser (Contributor) commented Sep 10, 2024

> Maybe we should just split the configuration between the operator and Pyrra?
> i.e. pyrra.serviceMonitor, pyrra.prometheusUrl, pyrraOperator.serviceMonitor

I think this is a good idea. But when doing this, we could also split them up into two Deployments. On the other hand, that might unnecessarily expand the scope of the original problem and should be done in another PR. What do you think?

@fstr (Contributor, Author) commented Sep 10, 2024

I can split the ServiceMonitor as part of this PR. This will already allow separate configuration and jobNames. In a follow-up PR it can be changed to use two deployments.

@fstr force-pushed the fstr/add-operator-metrics-port branch from 1bc2716 to 8e958eb on October 2, 2024 10:07
@fstr force-pushed the fstr/add-operator-metrics-port branch from 8e958eb to b6bc47f on October 2, 2024 10:08
@fstr (Contributor, Author) commented Oct 2, 2024

I finally came back to this and split the ServiceMonitor into one for the operator and one for the server. I moved the config properties under serviceMonitorOperator for now to keep the values file flat.

If you want to move forward with the idea of splitting the operator and server into separate Deployments, which I think is the right way to go, then most properties of the values file can be duplicated and moved under server: and operator: respectively.
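
As a rough sketch of where this could land, the values file might then look something like this (everything except the serviceMonitorOperator key is an assumption about the eventual layout):

serviceMonitor:
  enabled: true
  jobLabel: pyrra-server
serviceMonitorOperator:
  enabled: true
  jobLabel: pyrra-operator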

Review threads (resolved):
charts/pyrra/templates/deployment.yaml
charts/pyrra/templates/service.yaml
@@ -20,6 +20,8 @@ additionalLabels: {}
extraApiArgs: []
# -- Extra args for Pyrra's Kubernetes container
extraKubernetesArgs: []
# -- Address to expose operator metrics
operatorMetricsAddress: ":8080"
A Contributor commented on this hunk:
I would use either operatorMetricsAddress or operatorMetricsPort, as they have to align. If you want, you can make this overridable: something like, if operatorMetricsPort is "", then use {{ include "pyrra.operatorMetricsPort" . }}. This could possibly be done in the helpers.

The Contributor (Author) replied:

Right now, the service port can be configured independently of the container port via .Values.service.operatorMetricsPort. The container port is taken from operatorMetricsAddress, so they are aligned.

The service port and the container port do not have to align.
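
A sketch of that separation in the Service template (the field wiring is illustrative, not the chart's exact template):

ports:
  - name: operator-metrics
    # Service port, set independently via .Values.service.operatorMetricsPort
    port: {{ .Values.service.operatorMetricsPort }}
    # Container port, which follows operatorMetricsAddress (":8080" by default)
    targetPort: 8080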

The Contributor replied:

My idea was to extract operatorMetricsPort from operatorMetricsAddress, but it should also be fine to leave it as it is for now.
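
For reference, a named-template sketch of that extraction, reusing the pyrra.operatorMetricsPort helper name mentioned above (a hypothetical implementation, assuming the address keeps its ":port" form):

{{/* Hypothetical helper: derive the container port from operatorMetricsAddress */}}
{{- define "pyrra.operatorMetricsPort" -}}
{{- trimPrefix ":" .Values.operatorMetricsAddress -}}
{{- end -}}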

@sebastiangaiser (Contributor) commented:

Thank you for pushing this, @fstr. I added two small nits; can you please have a look?

@sebastiangaiser (Contributor) left a review:

LGTM

@sebastiangaiser (Contributor) commented:

@rlex what do you think?

@rlex (Owner) commented Oct 9, 2024

So far so good, I think, but it's probably worth bumping it to 1.0.0 since it's a pretty major change.
Also, CI is now failing because the CRDs aren't present during install.

@fstr (Contributor, Author) commented Oct 14, 2024

As we've also discussed splitting the Deployments and, subsequently, the Services, should we hold off on the bump to 1.0.0 until that is done?

Then we would have separated them fully:

Operator: ServiceMonitor -> Service -> Deployment
Server: ServiceMonitor -> Service -> Deployment

@sebastiangaiser (Contributor) commented:

@fstr, can you please fix the docs as flagged in CI?
For me, bumping a minor version should be fine.

@sebastiangaiser (Contributor) commented:

@rlex do you have anything to add?

@sebastiangaiser (Contributor) commented:

@rlex can we get this merged?

@rlex (Owner) commented Nov 14, 2024

Sorry for the delay! Will merge as soon as CI passes.

@rlex merged commit a754d64 into rlex:master on Nov 14, 2024.
5 checks passed.