
[WIP] add node metrics controller #701

Closed
wants to merge 7 commits

Conversation

@cjerad (Contributor) commented Sep 23, 2021

1. Issue, if available:
#612 "Emit Metrics for Karpenter"

This PR does not fully resolve the issue. More changes will be needed.

2. Description of changes:
Add a controller that watches node events and updates counters across multiple dimensions, e.g. provisioner, readiness, and instance type.

3. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@netlify netlify bot commented Sep 23, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: fd532e0

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/614df149f1d159000714b234

@bwagner5 (Contributor) previously approved these changes Sep 23, 2021 and left a comment:

Looks really good, and very readable!

Resolved review threads (now outdated):

  • pkg/controllers/metrics/node/counter/counter.go (two threads)
  • cmd/controller/main.go
@ellistarn (Contributor) commented:
Would love to review this. Can you give me until Friday?

@cjerad (Contributor, Author) commented Sep 24, 2021

> Would love to review this. Can you give me until Friday?

Sure

node := &v1.Node{}
if err := c.KubeClient.Get(ctx, req.NamespacedName, node); err != nil {
    if errors.IsNotFound(err) {
        // Pass `nil` to UpdateCount rather than a "zeroed" node.
Review comment (Contributor):

The typical reconciler pattern is to return reconcile.Result{} (terminate), since it only occurs in a race condition where the object is deleted after the watch event arrives but before the loop is executed.

I can potentially see some value in tracking this case, but doing so with node = nil seems awkward and confusing to folks who might be looking at prometheus.

    }
}
counter.UpdateCount(
    logging.WithLogger(ctx, logging.FromContext(ctx).Named("Count")),
Review comment (Contributor):

Why redecorate the logger with a new name? Simpler to just counter.Update(ctx, node)?

For(&v1.Node{}).
WithOptions(
    controller.Options{
        RateLimiter: workqueue.NewMaxOfRateLimiter(
Review comment (Contributor):

I know Karpenter defines this in other controllers, but I'd just leave this nil and rely on the default backoff retry settings. Not a bad idea to keep the concurrency, though: controller.Options{MaxConcurrentReconciles: 4}.
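Concretely, the suggestion amounts to dropping the custom rate limiter and keeping only bounded concurrency. A configuration sketch, assuming controller-runtime's builder API (`m` and `c` here are hypothetical names for the manager and controller from the surrounding code):

```go
// Leaving RateLimiter unset gives controller-runtime's default
// exponential-backoff retry behavior; concurrency stays bounded.
controllerruntime.NewControllerManagedBy(m).
    For(&v1.Node{}).
    WithOptions(controller.Options{MaxConcurrentReconciles: 4}).
    Complete(c)
```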

[]string{
    metrics.ProvisionerLabel,
    metricArchLabel,
    metricConditionDiskPressureLabel,
Review comment (Contributor):

Consider leaving out DiskPressure, MemoryPressure, and PIDPressure, since they're out of scope of something Karpenter can help with. I'd keep these metrics focused on Karpenter's domain.

Namespace: metrics.KarpenterNamespace,
Subsystem: "cluster",
Name: "node_count",
Help: "Count of cluster nodes. Broken out by topology and status.",
Review comment (Contributor):

Nit: "cluster nodes" is a bit redundant. Consider something like "Nodes by topology and status".

nodeCountGaugeVec = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Namespace: metrics.KarpenterNamespace,
        Subsystem: "cluster",
Review comment (Contributor):

Thoughts on calling this capacity? Conceptually:

  • "Karpenter Capacity Metrics" = "How good of a job is Karpenter doing managing my capacity?"
  • "Karpenter Controller Metrics" = "How healthy is the Karpenter process itself?"

karpenterapi "github.com/awslabs/karpenter/pkg/apis/provisioning/v1alpha4"
"github.com/awslabs/karpenter/pkg/metrics"
"github.com/prometheus/client_golang/prometheus"
coreapi "k8s.io/api/core/v1"
Review comment (Contributor):

Kubernetes golang conventions recommend calling this v1 (or corev1 if there's a conflict).

"reflect"
"strings"

karpenterapi "github.com/awslabs/karpenter/pkg/apis/provisioning/v1alpha4"
Review comment (Contributor):

Imports are recommended to be of the form v1alpha4 (or provisioningv1alpha4, walking up the chain, if there's a conflict). Typically you'd only alias an import if there was some sort of conflict. This gives developers a common convention and avoids alias sprawl for the same concept. It does mean that package names (and hierarchy) are critical, but it's a good sign of project organization if this convention yields English-readable names. FWIW, I think you've done excellently re: package naming in this PR.

// UpdateCount updates the emitted metric based on the node's current status relative to the
// past status. If the data for `node` cannot be populated then `nil` should be passed as the
// argument.
func UpdateCount(ctx context.Context, name types.NamespacedName, node *coreapi.Node) {
Review comment (Contributor):

Consider counter.Update() instead of counter.UpdateCount(), which stutters. You can also get the name from the node object itself (node.Name; every Kubernetes object has one). In this case, nodes are a cluster-global resource, so a namespace for nodes will never make sense.

func UpdateCount(ctx context.Context, name types.NamespacedName, node *coreapi.Node) {
    currLabels := getLabels(node)
    pastLabels, isKnown := prometheusLabelsFor[name]
    switch {
Review comment (Contributor):

I see now why you're passing node as nil if not found.

This is a bit tricky, since there's no guarantee that your controller will catch the delete event (event delivery is not guaranteed; an event could arrive five times or never). The controller pattern is best effort and employs a combination of intervals (e.g. the resync interval) and watches.

It feels a bit dirty, but I might just statelessly recompute everything on some interval. Even at 10k-node scale, this should be pretty quick. If you're unfamiliar with the client.Client, all objects are kept in a cache and synced with incoming events (a.k.a. ListWatch), so it's cheap to call client.List(nodes).
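A simplified, self-contained sketch of that stateless recompute. The Node struct and its fields are stand-ins for the real corev1.Node and its labels; in the controller, the node slice would come from client.List and the counts would be written into the GaugeVec after a Reset:

```go
package main

import "fmt"

// Node is a stand-in for corev1.Node, reduced to the dimensions we count.
type Node struct {
	Provisioner string
	Zone        string
	Ready       bool
}

// labelKey identifies one gauge series.
type labelKey struct {
	Provisioner string
	Zone        string
	Ready       bool
}

// recount rebuilds every gauge value from the full node list, so no
// per-node state has to survive missed or duplicated watch events.
func recount(nodes []Node) map[labelKey]int {
	counts := map[labelKey]int{}
	for _, n := range nodes {
		counts[labelKey{n.Provisioner, n.Zone, n.Ready}]++
	}
	return counts
}

func main() {
	counts := recount([]Node{
		{"default", "us-west-2a", true},
		{"default", "us-west-2a", true},
		{"default", "us-west-2b", false},
	})
	fmt.Println(counts[labelKey{"default", "us-west-2a", true}]) // 2
}
```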

prometheusLabelsFor[name] = currLabels

if err := decrementNodeCount(pastLabels); err != nil {
    logging.FromContext(ctx).Warnf("Failed to decrement previous count for updated node [labels=%s]: error=%s", pastLabels, err.Error())
Review comment (Contributor):

I'd prefer we continue to follow the convention of Errorf instead of Warnf.

if node == nil {
    return labels
}

Review comment (Contributor):

Consider just inlining all of this:

labels := prometheus.Labels{
    key: value,
    ...
}

prometheusLabelsFor[name] = currLabels

if err := decrementNodeCount(pastLabels); err != nil {
    logging.FromContext(ctx).Warnf("Failed to decrement previous count for updated node [labels=%s]: error=%s", pastLabels, err.Error())
Review comment (Contributor):

I was testing this code out and I think there's a false assumption in here: node properties like arch, os, and instance type may not be set.

I'm getting floods of this message:

karpenter-controller-9f4878f7b-8rt9b manager 2021-09-27T21:54:04.263Z	WARN	controller.NodeMetrics.Count	Failed to decrement previous count for updated node [labels=map[arch: instancetype: os: provisioner:default region: zone:]]: error=inconsistent label cardinality: expected 10 label values but got 6 in prometheus.Labels{"arch":"", "instancetype":"", "os":"", "provisioner":"default", "region":"", "zone":""}	{"commit": "5655c94"}
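The error above comes from passing a Prometheus label map that is missing some of the declared label keys: a GaugeVec requires a value (even an empty string) for every declared label. A hypothetical normalizeLabels helper illustrates one way to guarantee that invariant:

```go
package main

import "fmt"

// normalizeLabels guarantees one entry per declared label key, so a
// GaugeVec lookup never fails with "inconsistent label cardinality"
// when node properties such as arch or instance type are unset.
func normalizeLabels(declared []string, labels map[string]string) map[string]string {
	out := make(map[string]string, len(declared))
	for _, key := range declared {
		out[key] = labels[key] // absent keys become ""
	}
	return out
}

func main() {
	declared := []string{"provisioner", "zone", "arch", "instancetype"}
	got := normalizeLabels(declared, map[string]string{"provisioner": "default"})
	fmt.Println(len(got), got["arch"] == "") // 4 true
}
```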

@bwagner5 bwagner5 changed the title add node metrics controller [WIP] add node metrics controller Sep 29, 2021
@cjerad (Contributor, Author) commented Oct 5, 2021

There are some gaps in this approach. Closing this PR and will open a new PR with the revised approach.

@cjerad cjerad closed this Oct 5, 2021
@cjerad cjerad deleted the metrics-node-count branch October 5, 2021 18:58
@cjerad cjerad mentioned this pull request Oct 20, 2021