
[WIP] add node metrics controller #701

Closed
wants to merge 7 commits

Conversation

@cjerad (Contributor) commented Sep 23, 2021

1. Issue, if available:
#612 "Emit Metrics for Karpenter"

This PR does not fully resolve the issue. More changes will be needed.

2. Description of changes:
Add a controller that watches node events and updates counters across multiple dimensions, e.g. provisioner, readiness, and instance type.

3. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@netlify netlify bot commented Sep 23, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: fd532e0

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/614df149f1d159000714b234

@bwagner5 (Contributor) previously approved these changes Sep 23, 2021 and left a comment:

Looks really good, and very readable!

Resolved review threads (now outdated):

  • pkg/controllers/metrics/node/counter/counter.go (two threads)
  • cmd/controller/main.go
@ellistarn (Contributor) commented:
Would love to review this. Can you give me until Friday?

@cjerad (Contributor, Author) commented Sep 24, 2021

> Would love to review this. Can you give me until Friday?

Sure

node := &v1.Node{}
if err := c.KubeClient.Get(ctx, req.NamespacedName, node); err != nil {
    if errors.IsNotFound(err) {
        // Pass `nil` to UpdateCount rather than a "zeroed" node.
Review comment (Contributor):

The typical reconciler pattern is to return reconcile.Result{} (terminate), since it only occurs in a race condition where the object is deleted after the watch event arrives but before the loop is executed.

I can potentially see some value in tracking this case, but doing so with node = nil seems awkward and confusing to folks who might be looking at prometheus.

    }
}
counter.UpdateCount(
    logging.WithLogger(ctx, logging.FromContext(ctx).Named("Count")),
Review comment (Contributor):

Why redecorate the logger with a new name? Simpler to just counter.Update(ctx, node)?

For(&v1.Node{}).
WithOptions(
    controller.Options{
        RateLimiter: workqueue.NewMaxOfRateLimiter(
Review comment (Contributor):

I know Karpenter defines this in other controllers, but I'd just leave this nil and rely on the default backoff retry settings. Not a bad idea to keep the concurrency, though: controller.Options{MaxConcurrentReconciles: 4}.
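Concretely, the suggestion amounts to dropping the custom rate limiter and keeping only bounded concurrency. A configuration sketch, assuming controller-runtime's builder API (`m` and `c` here are hypothetical names for the manager and controller from the surrounding code):

```go
// Leaving RateLimiter unset gives controller-runtime's default
// exponential-backoff retry behavior; concurrency stays bounded.
controllerruntime.NewControllerManagedBy(m).
    For(&v1.Node{}).
    WithOptions(controller.Options{MaxConcurrentReconciles: 4}).
    Complete(c)
```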

[]string{
    metrics.ProvisionerLabel,
    metricArchLabel,
    metricConditionDiskPressureLabel,
Review comment (Contributor):

Consider leaving out DiskPressure, MemoryPressure, and PIDPressure, since they're out of scope of something Karpenter can help with. I'd keep these metrics focused on Karpenter's domain.

Namespace: metrics.KarpenterNamespace,
Subsystem: "cluster",
Name: "node_count",
Help: "Count of cluster nodes. Broken out by topology and status.",
Review comment (Contributor):

Nit: "cluster nodes" is a bit redundant. Consider something like "Nodes by topology and status".

nodeCountGaugeVec = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Namespace: metrics.KarpenterNamespace,
        Subsystem: "cluster",
Review comment (Contributor):

Thoughts on calling this capacity? Conceptually:

  • "Karpenter Capacity Metrics" = "How good of a job is Karpenter doing managing my capacity?"
  • "Karpenter Controller Metrics" = "How healthy is the Karpenter process itself?"

karpenterapi "github.com/awslabs/karpenter/pkg/apis/provisioning/v1alpha4"
"github.com/awslabs/karpenter/pkg/metrics"
"github.com/prometheus/client_golang/prometheus"
coreapi "k8s.io/api/core/v1"
Review comment (Contributor):

Kubernetes golang conventions recommend calling this v1 (or corev1 if there's a conflict).

"reflect"
"strings"

karpenterapi "github.com/awslabs/karpenter/pkg/apis/provisioning/v1alpha4"
Review comment (Contributor):

Imports are recommended to be of the form v1alpha4 (or provisioningv1alpha4, walking up the chain, if there's a conflict). Typically you'd only alias an import if there was some sort of conflict. This gives developers a common convention and avoids alias sprawl for the same concept. It does mean that package names (and hierarchy) are critical, but it's a good sign of project organization if this convention yields English-readable names. FWIW, I think you've done excellently re: package naming in this PR.

// UpdateCount updates the emitted metric based on the node's current status relative to the
// past status. If the data for `node` cannot be populated then `nil` should be passed as the
// argument.
func UpdateCount(ctx context.Context, name types.NamespacedName, node *coreapi.Node) {
Review comment (Contributor):

Consider counter.Update() instead of counter.UpdateCount(), which stutters. You can also get the name from the node object itself (node.Name; every Kubernetes object has one). In this case, nodes are a cluster-global resource, so a namespace for nodes will never make sense.

func UpdateCount(ctx context.Context, name types.NamespacedName, node *coreapi.Node) {
    currLabels := getLabels(node)
    pastLabels, isKnown := prometheusLabelsFor[name]
    switch {
Review comment (Contributor):

I see now why you're passing node as nil if not found.

This is a bit tricky, since there's no guarantee that your controller will catch the delete event (event delivery is not guaranteed; an event could arrive five times or never). The controller pattern is best effort and employs a combination of intervals (e.g. the resync interval) and watches.

It feels a bit dirty, but I might just statelessly recompute everything on some interval. Even at 10k-node scale, this should be pretty quick. If you're unfamiliar with the client.Client, all objects are kept in a cache and synced with incoming events (a.k.a. ListWatch), so it's cheap to call client.List(nodes).
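A simplified, self-contained sketch of that stateless recompute. The Node struct and its fields are stand-ins for the real corev1.Node and its labels; in the controller, the node slice would come from client.List and the counts would be written into the GaugeVec after a Reset:

```go
package main

import "fmt"

// Node is a stand-in for corev1.Node, reduced to the dimensions we count.
type Node struct {
	Provisioner string
	Zone        string
	Ready       bool
}

// labelKey identifies one gauge series.
type labelKey struct {
	Provisioner string
	Zone        string
	Ready       bool
}

// recount rebuilds every gauge value from the full node list, so no
// per-node state has to survive missed or duplicated watch events.
func recount(nodes []Node) map[labelKey]int {
	counts := map[labelKey]int{}
	for _, n := range nodes {
		counts[labelKey{n.Provisioner, n.Zone, n.Ready}]++
	}
	return counts
}

func main() {
	counts := recount([]Node{
		{"default", "us-west-2a", true},
		{"default", "us-west-2a", true},
		{"default", "us-west-2b", false},
	})
	fmt.Println(counts[labelKey{"default", "us-west-2a", true}]) // 2
}
```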

prometheusLabelsFor[name] = currLabels

if err := decrementNodeCount(pastLabels); err != nil {
    logging.FromContext(ctx).Warnf("Failed to decrement previous count for updated node [labels=%s]: error=%s", pastLabels, err.Error())
Review comment (Contributor):

I'd prefer we continue to follow the convention of Errorf instead of Warnf.

if node == nil {
    return labels
}

Review comment (Contributor):

Consider just inlining all of this:

labels := prometheus.Labels{
    key: value,
    ...
}

prometheusLabelsFor[name] = currLabels

if err := decrementNodeCount(pastLabels); err != nil {
    logging.FromContext(ctx).Warnf("Failed to decrement previous count for updated node [labels=%s]: error=%s", pastLabels, err.Error())
Review comment (Contributor):

I was testing this code out and I think there's a false assumption in here: node properties like arch, os, and instance type may not be set.

I'm getting floods of this message:

karpenter-controller-9f4878f7b-8rt9b manager 2021-09-27T21:54:04.263Z	WARN	controller.NodeMetrics.Count	Failed to decrement previous count for updated node [labels=map[arch: instancetype: os: provisioner:default region: zone:]]: error=inconsistent label cardinality: expected 10 label values but got 6 in prometheus.Labels{"arch":"", "instancetype":"", "os":"", "provisioner":"default", "region":"", "zone":""}	{"commit": "5655c94"}
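The error above comes from passing a Prometheus label map that is missing some of the declared label keys: a GaugeVec requires a value (even an empty string) for every declared label. A hypothetical normalizeLabels helper illustrates one way to guarantee that invariant:

```go
package main

import "fmt"

// normalizeLabels guarantees one entry per declared label key, so a
// GaugeVec lookup never fails with "inconsistent label cardinality"
// when node properties such as arch or instance type are unset.
func normalizeLabels(declared []string, labels map[string]string) map[string]string {
	out := make(map[string]string, len(declared))
	for _, key := range declared {
		out[key] = labels[key] // absent keys become ""
	}
	return out
}

func main() {
	declared := []string{"provisioner", "zone", "arch", "instancetype"}
	got := normalizeLabels(declared, map[string]string{"provisioner": "default"})
	fmt.Println(len(got), got["arch"] == "") // 4 true
}
```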

@bwagner5 bwagner5 changed the title add node metrics controller [WIP] add node metrics controller Sep 29, 2021
@cjerad (Contributor, Author) commented Oct 5, 2021

There are some gaps in this approach. Closing this PR and will open a new PR with the revised approach.

@cjerad cjerad closed this Oct 5, 2021
@cjerad cjerad deleted the metrics-node-count branch October 5, 2021 18:58
@cjerad cjerad mentioned this pull request Oct 20, 2021