add allocation bind duration metric #677
✔️ Deploy Preview for karpenter-docs-prod canceled. 🔨 Explore the source changes: 6743667 🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/6140e8244df8c80008a51c45 |
I'm curious how many metrics we get from client-go for free: https://github.com/kubernetes/client-go/blob/master/tools/metrics/metrics.go. Maybe there's a way to wire this up?
type Binder struct {
	KubeClient client.Client
	CoreV1Client corev1.CoreV1Interface
type Binder interface {
Curious why you promoted this to an interface? Do we expect multiple implementations?
It facilitates using the decorator pattern, with the metric logic living in a separate decorator type.
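As a rough illustration of the decorator approach described here, the sketch below wraps one `Binder` in another that times each call. It is not the PR's code: the `Bind` signature, `plainBinder`, `timedBinder`, `DecorateWithTiming`, the `observe` callback, and the `"success"` label value are all hypothetical stand-ins, and a plain callback replaces the Prometheus histogram so the example needs only the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Binder is a hypothetical single-method interface standing in for the PR's Binder.
type Binder interface {
	Bind(node string, pods ...string) error
}

// plainBinder is an illustrative implementation with no metrics logic.
type plainBinder struct{}

func (plainBinder) Bind(node string, pods ...string) error {
	if len(pods) == 0 {
		return errors.New("no pods to bind")
	}
	return nil
}

// timedBinder decorates another Binder and reports each call's result
// label and duration in seconds to the observe callback.
type timedBinder struct {
	next    Binder
	observe func(result string, seconds float64)
}

func (t timedBinder) Bind(node string, pods ...string) error {
	start := time.Now()
	err := t.next.Bind(node, pods...)
	result := "success" // assumed label value; the diff only shows "error"
	if err != nil {
		result = "error"
	}
	t.observe(result, time.Since(start).Seconds())
	return err // the wrapped call's error is passed through unchanged
}

// DecorateWithTiming wraps binder; callers keep depending only on the interface.
func DecorateWithTiming(binder Binder, observe func(string, float64)) Binder {
	return timedBinder{next: binder, observe: observe}
}

func main() {
	b := DecorateWithTiming(plainBinder{}, func(result string, seconds float64) {
		fmt.Printf("bind result=%s duration=%fs\n", result, seconds)
	})
	_ = b.Bind("node-a", "pod-1")
	_ = b.Bind("node-a")
}
```

Because callers only see `Binder`, the timing wrapper can be added or removed at construction time without touching the binding logic.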
I worry a bit about this pattern for structs that have more methods. You could use struct embedding to achieve a similar mechanism without requiring full method overloading.
type DecoratedBinder struct {
	Binder // embedded struct, all methods are promoted to DecoratedBinder
}

func (b *Binder) foo() {}
func (b *Binder) bar() {}

func (d *DecoratedBinder) foo() {
	// decorate before
	d.Binder.foo()
	// decorate after
}
In this case, bar is not overloaded, so anyone calling decoratedBinder.bar()
will get the undecorated implementation.
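The embedding behaviour described above can be checked with a small runnable sketch. The method bodies and the extra `both` method are assumptions added for illustration. `both` shows one limitation worth keeping in mind: methods on the embedded type that call `foo` internally still reach the undecorated version.

```go
package main

import "fmt"

// Binder is a stand-in concrete type with illustrative method bodies.
type Binder struct{}

func (b *Binder) foo() string { return "foo" }
func (b *Binder) bar() string { return "bar" }

// both calls foo through the Binder receiver, so it never sees a decorator.
func (b *Binder) both() string { return b.foo() + "+" + b.bar() }

// DecoratedBinder embeds Binder; foo, bar, and both are promoted to it.
type DecoratedBinder struct {
	Binder
}

// foo shadows the promoted Binder.foo and wraps it.
func (d *DecoratedBinder) foo() string {
	return "decorated(" + d.Binder.foo() + ")"
}

func main() {
	d := &DecoratedBinder{}
	fmt.Println(d.foo())  // decorated(foo)
	fmt.Println(d.bar())  // bar
	fmt.Println(d.both()) // foo+bar
}
```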
func DecorateBinderMetrics(binder Binder) Binder {
	bindTimeHistogramVec := prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "karpenter",
Thoughts on building a /pkg/metrics component with helpers to encapsulate the buckets, Name, Subsystem, etc?
package metrics

type HistogramOptions struct {
	Namespace string // defaults to karpenter
	Buckets   []float64
	...
}

func NewHistogram(options HistogramOptions) {
	// default options if some not set (e.g. buckets)
}
Perhaps after more metrics have been implemented and patterns have been identified. Defining constants for Namespace and Subsystem seems likely, but not necessary in this commit as each appears only once in the code.
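For reference, the defaulting idea proposed in this thread might look roughly like the following. This is a sketch only, kept free of the Prometheus client so it runs standalone. The bucket values, the `Subsystem` and `Name` fields, the `withDefaults` and `fullName` helpers, and the example metric name are all assumptions, not part of the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultNamespace and defaultBuckets are assumed defaults for illustration only.
const defaultNamespace = "karpenter"

func defaultBuckets() []float64 {
	return []float64{0.5, 1, 5, 10, 30, 60}
}

// HistogramOptions is a hypothetical options struct along the lines suggested above.
type HistogramOptions struct {
	Namespace string
	Subsystem string
	Name      string
	Buckets   []float64
}

// withDefaults fills in any option the caller left unset.
func withDefaults(o HistogramOptions) HistogramOptions {
	if o.Namespace == "" {
		o.Namespace = defaultNamespace
	}
	if len(o.Buckets) == 0 {
		o.Buckets = defaultBuckets()
	}
	return o
}

// fullName joins the non-empty name parts with underscores.
func fullName(o HistogramOptions) string {
	var parts []string
	for _, p := range []string{o.Namespace, o.Subsystem, o.Name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	// The subsystem and metric name here are made up for the example.
	o := withDefaults(HistogramOptions{Subsystem: "allocation", Name: "bind_duration_seconds"})
	fmt.Println(fullName(o), o.Buckets)
}
```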
	result = "error"
}

observer, promErr := b.bindTimeHistogramVec.GetMetricWith(prometheus.Labels{metricLabelResult: result})
Under what scenarios can we get an error? My understanding is that this is all static memory allocation.
Underlying implementations may change over time. Since the API may return an error, it seems advisable to treat it as though it could fail, even if the current implementation makes that very unlikely.
Another question is whether a failure in the metrics reporting is considered a recoverable error or whether karpenter should panic? This implementation assumes such a failure is not a cause to panic.
I would agree that a metric emission failure is not a cause to panic.
I think it's fine to log an error and continue.
const metricLabelResult = "result"

func DecorateBinderMetrics(binder Binder) Binder {
I wonder if (in the future) we could abstract this decoration even further and minimally wrap the code as we need it with something generic.
I'm imagining an interface like
// inside binder
var err error
metrics.RecordLatency(metrics.LatencyOptions{Name: "bind"}, func() {
	err = b.Bind()
})
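One way the generic wrapper imagined above could be fleshed out is sketched below. It is a variation rather than the reviewer's exact shape: here the wrapped function returns its error instead of assigning to a captured variable. `RecordLatency`, `LatencyOptions`, the `Recorder` callback, and the `"success"` label value are all hypothetical, and the recorder is a plain function so the example has no Prometheus dependency.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// LatencyOptions is a hypothetical options struct naming the measured operation.
type LatencyOptions struct {
	Name string
}

// Recorder receives one observation; in real code this might feed a histogram.
type Recorder func(name, result string, seconds float64)

// errBindFailed is an illustrative error used to exercise the failure path.
var errBindFailed = errors.New("bind failed")

// RecordLatency times fn, reports the outcome to record, and returns fn's error.
func RecordLatency(opts LatencyOptions, record Recorder, fn func() error) error {
	start := time.Now()
	err := fn()
	result := "success" // assumed label value
	if err != nil {
		result = "error"
	}
	record(opts.Name, result, time.Since(start).Seconds())
	return err
}

func main() {
	print := func(name, result string, seconds float64) {
		fmt.Printf("%s result=%s duration=%fs\n", name, result, seconds)
	}
	_ = RecordLatency(LatencyOptions{Name: "bind"}, print, func() error { return nil })
	err := RecordLatency(LatencyOptions{Name: "bind"}, print, func() error { return errBindFailed })
	fmt.Println("returned error:", err)
}
```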
logging.FromContext(ctx).Warnf(
	"Failed to record bind duration metric [%s=%s, duration=%f]: error=%w",
	metricLabelResult,
	result,
I'm a bit wary of the code complexity of tracking everything this error is going to log. Especially since we will already bubble up the errors in the normal karpenter controller logic to print something along the lines of (failed to X).
Should the karpenter controller logic treat a failure to record the metric as a failure of the bind operation?
I don't think metric emission errors should bubble up to the controller interface on bind. I think the warn provides enough visibility if this were to fail, but it's not necessarily critical to the actual bind operation (and it's unlikely to occur).
If it was errorable, I agree, but I'm unsure whether or not recording a metric should be able to fail.
Taking a quick peek at the implementation, it looks like the only way it can fail is if the labels can't be hashed. Of course, this implementation could change, but from how I've seen prometheus metrics used across k8s projects, I expect the intent is for these metrics to be used as statics.
For example, we could do something like the following. If the implementation does change in an unexpected way, we'll fail fast at process startup.
var (
	bindLatencyHistogramVec = prometheus.NewHistogramVec( ... )
	successfulBindLatencyHistogram = runtimeutil.Must(bindLatencyHistogramVec.GetMetricWith( ... ))
	failedBindLatencyHistogram = runtimeutil.Must(bindLatencyHistogramVec.GetMetricWith( ... ))
)

func bind(... ) {
	...
	successfulBindLatencyHistogram.Observe(...)
}
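The fail-fast idea can be sketched with a small generic helper. Everything here is hypothetical: `Must` is written from scratch (it is not claimed to match any existing `runtimeutil` helper), and `lookup` merely stands in for a call such as `GetMetricWith` that returns a value and an error. The sketch assumes Go 1.18 or later for generics.

```go
package main

import (
	"errors"
	"fmt"
)

// Must is a hypothetical helper: it returns v, or panics if err is non-nil.
func Must[T any](v T, err error) T {
	if err != nil {
		panic(err)
	}
	return v
}

// lookup stands in for a (value, error) call such as GetMetricWith.
func lookup(labels map[string]string) (string, error) {
	r, ok := labels["result"]
	if !ok {
		return "", errors.New(`missing label "result"`)
	}
	return "bind_duration{result=" + r + "}", nil
}

// Package-level variables are initialized before main runs, so a failing
// lookup here would panic at process startup rather than on the bind path.
var (
	successSeries = Must(lookup(map[string]string{"result": "success"}))
	failedSeries  = Must(lookup(map[string]string{"result": "error"}))
)

func main() {
	fmt.Println(successSeries)
	fmt.Println(failedSeries)
}
```

This only works when the label values are known up front; with labels chosen at request time, the lookup has to happen per call and its error handled there.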
Thinking further on this, I wonder if we should be including Provisioner labels (name/namespace) into these metrics, which would make my above points moot (since the labels are dynamic).
	bindTimeHistogramVec *prometheus.HistogramVec
}

const metricLabelResult = "result"
Any reason this isn't inlined?
An old habit of avoiding bare keys in map lookups; misspellings cause runtime problems. When using a constant, misspellings cause compile-time errors.
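A small sketch of the difference being described, using a plain map in place of `prometheus.Labels`. The `resultOf` helper and the misspelled key are made up for illustration.

```go
package main

import "fmt"

const metricLabelResult = "result"

// resultOf reads the label through the constant, so a typo in the
// identifier would be rejected by the compiler.
func resultOf(labels map[string]string) string {
	return labels[metricLabelResult]
}

func main() {
	labels := map[string]string{metricLabelResult: "success"}

	fmt.Println(resultOf(labels))

	// A misspelled bare key still compiles and silently yields the zero value.
	v, ok := labels["reslut"]
	fmt.Printf("%q %v\n", v, ok)

	// By contrast, labels[metricLabelReslut] would fail to compile
	// because the identifier is undefined.
}
```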
const metricLabelResult = "result"

func DecorateBinderMetrics(binder Binder) Binder {
	bindTimeHistogramVec := prometheus.NewHistogramVec(
Thoughts on putting this whole thing in a var that can be accessed at package scope?
e.g. https://github.com/prometheus/prometheus/blob/main/discovery/kubernetes/client_metrics.go
Is there an advantage to doing so?
Client-go instruments requests for all API calls made, including /pods/bind (https://github.com/kubernetes/client-go/blob/master/tools/metrics/metrics.go). This means we should have a bunch of metrics for bind latency (on individual bind pod requests).
That name conflicts with our "bind" concept which means "create the node, bind all the pods". Now that we're exposing the name of this class to customers as a metric, I think it might make sense to rename the concept slightly to avoid confusing customers. I'm struggling a bit with exactly what this name should be. Perhaps something like forge_packing, apply_packing.
lgtm
1. Issue, if available:
#612 "Emit Metrics for Karpenter"
This PR does not fully resolve the issue. More changes will be needed.
2. Description of changes:
Add a histogram metric to record the duration of bind operations.
3. Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.