
Service monitor #501

Merged · 6 commits merged into master from serviceMonitor on Nov 2, 2022

Conversation

@nicolasochem (Contributor) commented Oct 28, 2022

Add a service monitor to target the new metrics endpoint of v14.

Also remove the old "metrics" container since it is deprecated and replaced by the native endpoint.

Also add a ServiceMonitor for convenience. We already have a service monitor in the pyrometer chart; I'm adding a similar one here. This saves the user from writing their own monitor and removes a step from the upcoming guide.
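
For orientation, the object being added is of the ServiceMonitor kind from the prometheus-operator. A minimal sketch is below; the names, labels, and namespace are placeholders rather than the chart's actual rendered output, and the essential part is that the endpoint refers to a named "metrics" port on the service.

    # Sketch of a ServiceMonitor (placeholder names/labels, not the chart's output).
    # The prometheus-operator selects Services by label and scrapes the Service
    # port named "metrics" at /metrics.
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: tezos-node          # placeholder
      namespace: tezos          # placeholder
    spec:
      selector:
        matchLabels:
          app: tezos-node       # placeholder; must match the Service's labels
      endpoints:
        - port: metrics         # the Service port *name*
          path: /metrics
          interval: 15s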

@nicolasochem marked this pull request as ready for review on October 31, 2022 18:24
@orcutt989 (Contributor) left a comment:

Looks great! Declutter FTW.

Comment on lines +161 to +162
- containerPort: 9932
  name: metrics

Contributor

{{- if .Values.serviceMonitor.enabled }} ?
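
For clarity, the suggestion amounts to guarding the port declaration behind the values flag, roughly as sketched here (the surrounding template context is assumed, not the chart's actual file):

    # Only declare the metrics container port when the ServiceMonitor is enabled.
    {{- if .Values.serviceMonitor.enabled }}
    - containerPort: 9932
      name: metrics
    {{- end }}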

Comment on lines +25 to +26
- port: 9732
  name: p2p

Contributor

I'm curious, why wasn't this needed before?

Contributor Author

The metrics endpoint needs to be here so the ServiceMonitor can pick up the metrics. While debugging this, I noticed there was no endpoint associated with the p2p port either, so I added it. It never mattered, because:

  • internally we target pods directly
  • when we want external p2p access we deploy a different service anyway, of type LoadBalancer

I guess this service was pretty useless?

Contributor

The service is a headless service. K8s then allows you to target each pod individually instead of creating a clusterIP that would load balance across the pods.

I wonder: now that some ports are defined but the RPC port isn't, would that break? I'm not sure I get why we need to set ports here at all, since we can target the pod directly and use the port number anyway.
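
For reference, "headless" just means a Service with clusterIP: None; a minimal sketch of the shape being discussed, with placeholder names:

    # Sketch of a headless Service (placeholder names). With clusterIP: None no
    # virtual IP is allocated and DNS returns the individual pod IPs, so each
    # pod stays addressable as <pod-name>.<service-name>.
    apiVersion: v1
    kind: Service
    metadata:
      name: archive-node
    spec:
      clusterIP: None
      selector:
        app: archive-node       # placeholder pod selector
      ports:
        - port: 9732
          name: p2p
        - port: 9932
          name: metrics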

Contributor Author

I just tried mkchain with 2 bakers and 1 node, and nodes are peering with each other and everything is working. So, it didn't break. I'm not sure why p2p is here though. Maybe safe to remove.

RPC is served from a different service.

It makes sense to have a p2p service targeting several pods, because it's fine to load balance the p2p connections. But, I did forget that the purpose of this (headless) service was to be able to target pods individually.

Knowing that, maybe it would have been better to leave it alone, and set the metrics port on the rpc service.

I'm pretty sure I did this for a reason. Probably the service monitor would not work if the port was not named on the service.
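
That matches how the prometheus-operator resolves scrape targets: a ServiceMonitor endpoint normally points at a Service port by name via its port field (there is also a targetPort variant, but the named-port form is the usual one), so an unnamed port gives it nothing to refer to. A sketch, with surrounding fields omitted:

    # ServiceMonitor endpoint referring to the Service port *name*:
    endpoints:
      - port: metrics       # <- must match `name: metrics` under the Service's ports
        path: /metrics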

Contributor

An example where we target a specific pod's RPC endpoint is indexers:

# rpc_url: http://archive-node-0.archive-node:8732

So we need to know that we can still target any port we want. If we can, I don't know why we need to specify any ports now (9732 and the metrics port), as long as the STS pod containers are exposing the port.

It appears to me the tezos-service-monitor is load balancing across all nodes. Does this allow us to monitor each node individually instead of conflating each node's metric data together?

Contributor Author

[Screenshot from 2022-10-28 17-38-58]

It's labeling per pod.

Contributor

So every 15s it sends a request to /metrics and each pod will eventually be hit. What if there are many, many pods? They won't have their metrics updated for a while.

Contributor

> So we need to know that we can still target any port we want. If we can, I don't know why we need to specify any ports now (9732 and the metrics port), as long as the STS pod containers are exposing the port.

Once we deploy this release, we can probably test the RPC of a specific node just by sshing into another one. That is probably sufficient. And then we can also remove the ports from the headless svc later, just so it isn't confusing why we are exposing some ports on the svc and not others. Not a big deal.

Contributor Author

I don't think that's what is going on. I am aware of PodMonitor, but I've never used it; ServiceMonitor is what I have been using.

See these heavily thumbed-up comments: prometheus-operator/prometheus-operator#3119 (comment)
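
For contrast, the PodMonitor alternative mentioned here selects pods directly by label and scrapes a named container port, skipping the Service entirely. A rough sketch with placeholder names:

    # Rough PodMonitor sketch (placeholder names/labels); it scrapes the
    # container port named "metrics" on the matching pods directly.
    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: tezos-node          # placeholder
    spec:
      selector:
        matchLabels:
          app: tezos-node       # placeholder pod label
      podMetricsEndpoints:
        - port: metrics         # named container port
          path: /metrics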

Resolved review threads: charts/tezos/templates/static.yaml, charts/tezos/values.yaml, charts/tezos/templates/_containers.tpl
@nicolasochem (Contributor Author) commented Nov 1, 2022

@harryttd answering all of your remaining comments at once:

I have enabled the metrics port by default on tezos-k8s, by choice.

I have also ensured that the default Octez config in tezos-k8s exposes its Prometheus port.

It's better this way than making it dependent on ServiceMonitor being enabled in the values: you might want to scrape these ports with something other than the prometheus operator (for example, a manually provisioned Prometheus); then you would need no service monitor, but you would still need the services to be aware of these ports.

So then we would have metricsExposed: true separate from serviceMonitor: enabled: true, but I feel that it's simpler to have it enabled all the time. It's not a security concern (it stays within the cluster, and you can't kill the node with such metrics requests, in theory).

A future NetworkPolicy in our chart will tighten security appropriately, I think.
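
(To make that concrete: such a policy could, for instance, restrict which pods may reach the nodes' metrics port. The sketch below is purely speculative and not part of this PR; all labels are placeholders.)

    # Speculative sketch only (not in this PR): allow ingress to the metrics
    # port solely from pods labeled as the Prometheus scraper.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-metrics-scrape          # placeholder
    spec:
      podSelector:
        matchLabels:
          app: tezos-node                 # placeholder node pod label
      policyTypes: ["Ingress"]
      ingress:
        - from:
            - podSelector:
                matchLabels:
                  app.kubernetes.io/name: prometheus   # placeholder
          ports:
            - protocol: TCP
              port: 9932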

@nicolasochem merged commit 6bde524 into master on Nov 2, 2022
@harryttd deleted the serviceMonitor branch on March 8, 2023