
add pod metrics controller #744

Merged: 2 commits merged into aws:main from metrics-pod-controller on Nov 10, 2021

Conversation

@cjerad cjerad (Contributor) commented Oct 13, 2021

1. Issue, if available:
#612 "Emit Metrics for Karpenter"

This PR does not fully resolve the issue. More changes will be needed.

2. Description of changes:
Add a controller that updates pod count metrics across multiple dimensions, e.g. provisioner, phase, and zone.

3. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@cjerad cjerad requested a review from bwagner5 October 13, 2021 19:16
netlify bot commented Oct 13, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: a41d432

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/618ad414296e4900076343e7

@cjerad cjerad marked this pull request as ready for review October 13, 2021 19:49
cmd/controller/main.go (outdated review thread, resolved)
metricLabelProvisioner: provisioner,
metricLabelZone: zone,
}
errors = append(errors, publishCount(runningPodCountByProvisionerZone, metricLabels, countByZone[zone]))
Contributor:

check out multierr.Append instead of managing the slice of errors yourself.

Contributor Author (@cjerad):

In the test cluster len(errors) > 1500. Thoughts on pre-allocation vs repeated multierr.Append()?

Contributor:

Hopefully they wouldn't all error. The typical pattern I've seen used across the k8s space (and in this project) is:

if err := publishCount(...); err != nil {
  errs = multierr.Append(errs, err)
}
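For reference, a minimal sketch of the zone loop with go.uber.org/multierr instead of an error slice, reusing the variable names visible in the diff above (the prometheus and multierr packages are assumed to be imported):

// Sketch only: accumulate publishCount failures into a single error value.
var errs error
for zone := range countByZone {
  metricLabels := prometheus.Labels{
    metricLabelProvisioner: provisioner,
    metricLabelZone:        zone,
  }
  if err := publishCount(runningPodCountByProvisionerZone, metricLabels, countByZone[zone]); err != nil {
    errs = multierr.Append(errs, err)
  }
}
return errs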

)

var (
knownValuesForNodeLabels = v1alpha4.WellKnownLabels
Contributor:

Why alias this? Isn't it cleaner to just use it as is?

Contributor Author (@cjerad):

To improve clarity and avoid package name "noise" -- in this case it doesn't provide helpful context.

zoneValues := knownValuesForNodeLabels[nodeLabelZone]
// vs
zoneValues := v1alpha4.WellKnownLabels[nodeLabelZone]

Contributor:

Perhaps it's my own familiarity with the k8s ecosystem that causes v1alpha4.WellKnownLabels to read very clearly to me. I'd lean toward the latter, since it avoids introducing another concept to be aware of, but I'll leave it to you.

}

podList := v1.PodList{}
withNodeName := client.MatchingFields{"spec.nodeName": node.Name}
Contributor:

any reason to not inline this?
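For reference, a minimal sketch of the inlined form; the c.KubeClient receiver and the return values are assumptions about the surrounding method, not taken from the PR:

podList := v1.PodList{}
// Inline the field selector rather than binding it to withNodeName first.
if err := c.KubeClient.List(ctx, &podList, client.MatchingFields{"spec.nodeName": node.Name}); err != nil {
  return nil, err
}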

pkg/controllers/metrics/pod/controller.go (outdated review thread, resolved)
return controllerruntime.
NewControllerManagedBy(m).
Named(controllerName).
For(&v1alpha4.Provisioner{}, builder.WithPredicates(
Contributor:

I'd drop all of these predicates and just use the default. I don't see the harm in running this on an update.
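For reference, a minimal sketch of the same registration relying on the default predicates, reusing the identifiers visible in the diff; the Complete(c) call is an assumption about how the builder is finished:

// Sketch only: drop the custom predicates and use controller-runtime's default event filtering.
return controllerruntime.
  NewControllerManagedBy(m).
  Named(controllerName).
  For(&v1alpha4.Provisioner{}).
  Complete(c)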

pkg/controllers/metrics/pod/controller.go (five more outdated review threads, resolved)
}

// 2. Update pod counts associated with the provisioner.
podsByZone, err := c.podsByZone(ctx, &provisioner)
Contributor:

Curious what the design options are here (this comment was hidden by a later refactor and is reposted in full further down):

return reconcile.Result{Requeue: true}, err
}

// The provisioner has been deleted. Reset all the associated counts to zero.
Contributor:

Isn't this the same issue we ran into w/ the previous PR? Thoughts on just letting the data expire rather than trying to zero it out?

Contributor:

I'd feel better zeroing out here. Letting the data expire seems pretty inaccurate. Is there a default expire duration for the metric data or is it configurable?

But it's fairly easy to zero it out on this clear signal that the provisioner has been deleted, might as well, right?

Contributor:

If we did want some sort of deletion assurance, we could use a finalizer. Maybe overkill though.
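For reference, a minimal sketch of the zero-out path; knownLabelSetsFor is a hypothetical helper standing in for however the controller enumerates the label combinations it has published for the deleted provisioner:

// Sketch only: drop every series published for the deleted provisioner.
for _, labels := range knownLabelSetsFor(provisionerName) {
  // Delete removes the series from the exporter entirely; an alternative is
  // calling With(labels).Set(0) to keep the series reporting zero instead.
  runningPodCountByProvisionerZone.Delete(labels)
}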

pkg/controllers/metrics/constants.go (outdated review thread, resolved)

func publishCount(gaugeVec *prometheus.GaugeVec, labels prometheus.Labels, count int) error {
gauge, err := gaugeVec.GetMetricWith(labels)
if err == nil {
Contributor:

Generally in Go, the happy path stays at the left edge with minimal indentation. So the err would be returned in a branch statement rather than at the end of the func. I'm guessing you did this to save the extra return nil, but I think it's better to stick with the convention here.

Contributor:

This is known as the "short-circuiting pattern":

if statement: exit
if statement: exit
success
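For reference, a minimal sketch of the same helper with the conventional early return, matching the signature shown in the diff (github.com/prometheus/client_golang/prometheus assumed imported):

func publishCount(gaugeVec *prometheus.GaugeVec, labels prometheus.Labels, count int) error {
  gauge, err := gaugeVec.GetMetricWith(labels)
  if err != nil {
    return err
  }
  gauge.Set(float64(count))
  return nil
}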

pkg/controllers/metrics/controller.go (outdated review thread, resolved)
requeueInterval = 10 * time.Second

metricNamespace = metrics.KarpenterNamespace
metricSubsystemCapacity = "capacity"
Contributor @ellistarn commented Oct 19, 2021:

I'd call this variable nodeSubsystem to avoid stuttering with the package name. Also, is it cleaner to write this as capacity -> node?

Contributor Author (@cjerad):

Also, is it cleaner to write this as capacity -> node

Please clarify. Do you mean you want nodeSubsystem = "capacity -> node"? If so, 1) that string is not a valid subsystem name for Prometheus 2) the name "capacity" was suggested in a previous PR here

Contributor:

Sorry for the thrash. It occurred to me that node might be a more accurate name, but it sounds like prometheus has this name reserved?
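For context, the subsystem is joined into the metric's full name as namespace_subsystem_name, and Prometheus metric names must match [a-zA-Z_:][a-zA-Z0-9_:]*, so a value like "capacity -> node" would be rejected at registration. A minimal sketch using the constants from this diff; the exampleGauge variable, metric name, and help text are illustrative, not what the PR declares:

// Sketch only: shows how the subsystem is embedded in the metric name.
var exampleGauge = prometheus.NewGaugeVec(
  prometheus.GaugeOpts{
    Namespace: metricNamespace,         // e.g. "karpenter"
    Subsystem: metricSubsystemCapacity, // "capacity" -> full name karpenter_capacity_node_count
    Name:      "node_count",
    Help:      "Illustrative gauge demonstrating namespace_subsystem_name composition.",
  },
  []string{metricLabelProvisioner, metricLabelZone},
)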

metricLabelInstanceType = "instancetype"
metricLabelOS = "os"
metricLabelPhase = "phase"
metricLabelProvisioner = metrics.ProvisionerLabel
Contributor:

similarly why do we need to alias this?

Contributor Author (@cjerad):

The name "label" is a bit overloaded in this package: there are labels on Nodes and labels on metrics. The aliases clarify which object each label name applies to.

metricLabelProvisioner = metrics.ProvisionerLabel
metricLabelZone = "zone"

nodeLabelArch = v1.LabelArchStable
Contributor:

likewise.

Contributor Author (@cjerad):

See response to previous comment.

}

func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
loggerName := fmt.Sprintf("%s.provisioner/%s", strings.ToLower(controllerName), req.Name)
Contributor:

consider inlining to avoid creation of an extra variable/concept/line.

zoneValues := knownValuesForNodeLabels[nodeLabelZone]
countByZone := make(map[string]int, len(zoneValues))

for zone, pods := range podsByZone {
Contributor @ellistarn commented Oct 20, 2021:

Reposting this comment since it got hidden in the refactor:

Curious what the design options are here:

  • pods w/ zone data
  • pods w/o zone data
  • other instance attributes (arch/os/etc)

I wonder if we should follow https://en.wikipedia.org/wiki/KISS_principle and just include phase.

Contributor Author (@cjerad):

The divisions were made to minimize metric cardinality.

@cjerad cjerad force-pushed the metrics-pod-controller branch from 0dfed97 to 9d34d21 on November 9, 2021 15:31
Contributor @bwagner5 left a comment:

Looks good, just one comment to remove OS metrics for now.

pkg/controllers/metrics/pods.go (review thread, resolved)
pkg/controllers/metrics/nodes.go (outdated review thread, resolved)
Contributor @bwagner5 left a comment:

/lgtm


var nodeLabelProvisioner = v1alpha5.ProvisionerNameLabelKey

func publishCount(gaugeVec *prometheus.GaugeVec, labels prometheus.Labels, count int) error {
Contributor @ellistarn commented Nov 10, 2021:

optional: Given that the labels are deterministic, we could use https://github.com/prometheus/client_golang/blob/v1.11.0/prometheus/counter.go#L256

gaugeVec.With(labels).Set(float64(count))
which would collapse this helper.
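For reference, GetMetricWith returns an error on mismatched labels while With panics instead, which is what makes the collapse reasonable when the label sets are deterministic. A minimal sketch of a call site under that approach, reusing names from the diffs above:

// Sketch only: publish the count directly, without the publishCount helper.
runningPodCountByProvisionerZone.With(prometheus.Labels{
  metricLabelProvisioner: provisioner,
  metricLabelZone:        zone,
}).Set(float64(countByZone[zone]))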

@cjerad cjerad merged commit 9427519 into aws:main Nov 10, 2021
@cjerad cjerad deleted the metrics-pod-controller branch November 10, 2021 15:02