feat!: add operator-metrics port #171

Merged
6 commits merged into rlex:master from fstr/add-operator-metrics-port on Nov 14, 2024

Conversation

@fstr (Contributor) commented Sep 2, 2024

Description

Expose the Pyrra Kubernetes operator container's metrics on port 8080. With these metrics, we also get kubebuilder metrics such as controller_runtime_reconcile_errors_total, on which we can build an alert.

The alert can optionally be enabled. Due to the operator's reconciliation loop interval, we use a 20-minute range on the rate function, as it is long enough to keep the rate from dropping to 0 and causing a flapping alert while the operator is not reconciling (and therefore not reporting a reconciliation error). The alert will resolve slightly delayed because of this, but that is much better than having no alert at all.
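
For illustration, a minimal sketch of such a rule (the group name, severity label, and annotation below are assumptions, not the chart's exact rendered output; the alert name and 20m range are from this PR):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pyrra
spec:
  groups:
    - name: pyrra-operator
      rules:
        - alert: PyrraReconciliationError
          # The 20m range keeps the rate from dropping to 0 between
          # reconciliation attempts, which would make the alert flap.
          expr: rate(controller_runtime_reconcile_errors_total[20m]) > 0
          labels:
            severity: warning
          annotations:
            description: The Pyrra operator is failing to reconcile SLO objects.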

This feature is especially useful, as the Pyrra WebUI currently breaks and shows nothing if a broken SLO has been applied to your Kubernetes cluster.

How can this be tested

Enable the PyrraReconciliationError alert via prometheusRule.enabled: true and deploy a broken/invalid SLO. Even if the ValidatingWebhook is active, the following SLO will be accepted but won't reconcile, as the errors.metric is not a valid vector selector due to the or expression:

---
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: broken-slo-test
  namespace: monitoring
spec:
  description: ""
  alerting:
    absent: false
  indicator:
    ratio:
      errors:
        # This (or clause) is not supported and will lead to an error
        metric: prometheus_notifications_errors_total{job="prometheus-k8s"} or up{job="prometheus-k8s"} == 0
      total:
        metric: prometheus_notifications_sent_total{job="prometheus-k8s"}
  target: "99"
  window: 4w
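
After applying the manifest, you can confirm the operator is reporting errors by scraping its new metrics port directly. A sketch, assuming the manifest above is saved as broken-slo.yaml; the deployment name pyrra is a placeholder, adjust it to your release:

kubectl apply -f broken-slo.yaml
kubectl -n monitoring port-forward deploy/pyrra 8080
curl -s localhost:8080/metrics | grep controller_runtime_reconcile_errors_total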

BREAKING CHANGE: This is a breaking change, as both containers now expose the default Go runtime metrics. Users who built monitoring on the existing metrics must now separate them by container label.
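
For example, a query that previously matched a single set of Go runtime series may now need a container matcher (the job and container label values below are illustrative, not the chart's actual labels):

go_goroutines{job="pyrra"}                      # before: matched one container's Go metrics
go_goroutines{job="pyrra", container="pyrra"}   # after: exclude the operator's duplicate series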

@rlex (Owner) commented Sep 2, 2024

Might be worth bumping the version in Chart.yaml (the major one?) and re-running helm-docs.

Review threads (outdated, resolved):
charts/pyrra/templates/prometheusrule.yaml
charts/pyrra/templates/servicemonitor.yaml
charts/pyrra/templates/servicemonitor.yaml
@fstr (Contributor, Author) commented Sep 9, 2024

What are your thoughts on the "shared" ServiceMonitor? We now have a single ServiceMonitor for the pyrra-server and the pyrra-operator, since both containers run as part of a single pod. I think it's acceptable, but the jobLabel is pyrra-server, which no longer fully matches, as it now also covers the pyrra-operator metrics.

Changing the jobLabel to something generic like pyrra would be another backwards-incompatible change. Introducing a second ServiceMonitor just for the operator would allow it to have jobLabel: pyrra-operator.

@rlex (Owner) commented Sep 10, 2024

Maybe we should just split the configuration between the operator and Pyrra?

i.e. pyrra.serviceMonitor, pyrra.prometheusUrl, pyrraOperator.serviceMonitor

@sebastiangaiser (Contributor) commented Sep 10, 2024

> Maybe we should just split the configuration between the operator and Pyrra?
> i.e. pyrra.serviceMonitor, pyrra.prometheusUrl, pyrraOperator.serviceMonitor

I think this is a good idea. But when doing this, we could also split them up into two Deployments. On the other hand, that might unnecessarily expand the scope of the original problem and should be done in another PR. What do you think?

@fstr (Contributor, Author) commented Sep 10, 2024

I can split the ServiceMonitor as part of this PR. This will already allow separate configuration and jobNames. In a follow-up PR it can be changed to use two deployments.

@fstr force-pushed the fstr/add-operator-metrics-port branch from 1bc2716 to 8e958eb on October 2, 2024 10:07
@fstr force-pushed the fstr/add-operator-metrics-port branch from 8e958eb to b6bc47f on October 2, 2024 10:08
@fstr (Contributor, Author) commented Oct 2, 2024

I finally came back to this and split the ServiceMonitor into one for the operator and one for the server. I moved the config properties under serviceMonitorOperator for now to keep the values file flat.

If you want to move forward with the idea of splitting the operator and server into separate Deployments, which I think is the right way to go, then most properties of the values file can be duplicated and moved under server: and operator: respectively.
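
As a rough sketch of where this could land, the values file might then look something like this (everything except the serviceMonitorOperator key is an assumption about the eventual layout):

serviceMonitor:
  enabled: true
  jobLabel: pyrra-server
serviceMonitorOperator:
  enabled: true
  jobLabel: pyrra-operator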

Review threads (resolved):
charts/pyrra/templates/deployment.yaml
charts/pyrra/templates/service.yaml
@@ -20,6 +20,8 @@ additionalLabels: {}
extraApiArgs: []
# -- Extra args for Pyrra's Kubernetes container
extraKubernetesArgs: []
# -- Address to expose operator metrics
operatorMetricsAddress: ":8080"
A Contributor commented on this hunk:
I would use either operatorMetricsAddress or operatorMetricsPort, as they have to align. If you want, you can make this overridable: something like, if operatorMetricsPort is "", then use {{ include "pyrra.operatorMetricsPort" . }}. This could possibly be done in the helpers.

The Contributor (Author) replied:

Right now, the service port can be configured independently of the container port via .Values.service.operatorMetricsPort. The container port is taken from operatorMetricsAddress, so they are aligned.

The service port and the container port do not have to align.
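
A sketch of that separation in the Service template (the field wiring is illustrative, not the chart's exact template):

ports:
  - name: operator-metrics
    # Service port, set independently via .Values.service.operatorMetricsPort
    port: {{ .Values.service.operatorMetricsPort }}
    # Container port, which follows operatorMetricsAddress (":8080" by default)
    targetPort: 8080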

The Contributor replied:

My idea was to extract operatorMetricsPort from operatorMetricsAddress, but it should also be fine to leave it as it is for now.
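
For reference, a named-template sketch of that extraction, reusing the pyrra.operatorMetricsPort helper name mentioned above (a hypothetical implementation, assuming the address keeps its ":port" form):

{{/* Hypothetical helper: derive the container port from operatorMetricsAddress */}}
{{- define "pyrra.operatorMetricsPort" -}}
{{- trimPrefix ":" .Values.operatorMetricsAddress -}}
{{- end -}}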

@sebastiangaiser (Contributor) commented:

Thank you for pushing this, @fstr. I added two small nits; can you please have a look?

@sebastiangaiser (Contributor) left a review:

LGTM

@sebastiangaiser (Contributor) commented:

@rlex what do you think?

@rlex (Owner) commented Oct 9, 2024

So far so good, I think, but it's probably worth bumping it to 1.0.0 since it's a pretty major change.
Also, CI is now failing because the CRDs aren't present during install.

@fstr (Contributor, Author) commented Oct 14, 2024

As we've also discussed splitting the Deployments and, subsequently, the Services, should we hold off on the bump to 1.0.0 until that is done?

Then we would have separated them fully:

Operator: ServiceMonitor -> Service -> Deployment
Server: ServiceMonitor -> Service -> Deployment

@sebastiangaiser (Contributor) commented:

@fstr, can you please fix the docs as flagged in CI?
For me, bumping a minor version should be fine.

@sebastiangaiser (Contributor) commented:

@rlex do you have anything to add?

@sebastiangaiser (Contributor) commented:

@rlex can we get this merged?

@rlex (Owner) commented Nov 14, 2024

Sorry for the delay! Will merge as soon as CI passes.

@rlex merged commit a754d64 into rlex:master on Nov 14, 2024.
5 checks passed.