add allocation scheduling duration metric #680
Conversation
✔️ Deploy Preview for karpenter-docs-prod canceled.
🔨 Explore the source changes: a9d69d2
🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/6143a77622822d0008f7865c
result = "error" | ||
} | ||
|
||
provisionerName := provisioner.ObjectMeta.Name |
minor: I'd probably lean inlining this to reduce lines
A separate reference is used to ensure the same value used in the query (line 193) will also be logged (line 200) if the query fails.
You can assume that provisioner.ObjectMeta.Name is immutable. In Kubernetes, spec/metadata are considered immutable at reconcile time, while status is mutable.
The scenario of concern is that the provisioner name might be fetched from elsewhere. Hypothetically, if a new field were added, e.g. provisioner.spec.metricDisplayName, then, if the lookup is inlined, whoever updates this code would need to be sure to update both locations to keep the log message in sync.
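For illustration, a minimal sketch of the pattern under discussion (the histogram wiring, label key, and import paths here are assumptions, and provisioner.spec.metricDisplayName above is hypothetical): binding the name to one local variable guarantees the metric lookup and the failure log always see the same value.

```go
package scheduling

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
	"knative.dev/pkg/logging"
)

// observeForProvisioner is illustrative only; vec stands in for the PR's
// scheduleTimeHistogramVec.
func observeForProvisioner(ctx context.Context, vec *prometheus.HistogramVec, provisionerName string, seconds float64) {
	// The same provisionerName feeds both the label lookup and the log line,
	// so a future change to where the name comes from is made in one place.
	observer, err := vec.GetMetricWith(prometheus.Labels{"provisioner": provisionerName})
	if err != nil {
		logging.FromContext(ctx).Warnf("Failed to look up metric for provisioner %s: %s", provisionerName, err.Error())
		return
	}
	observer.Observe(seconds)
}
```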
@@ -68,7 +68,7 @@ func NewController(kubeClient client.Client, coreV1Client corev1.CoreV1Interface
 	Filter: &Filter{KubeClient: kubeClient},
 	Binder: DecorateBinderMetrics(NewBinder(kubeClient, coreV1Client)),
 	Batcher: NewBatcher(maxBatchWindow, batchIdleTimeout),
-	Scheduler: scheduling.NewScheduler(cloudProvider, kubeClient),
+	Scheduler: scheduling.DecorateSchedulerMetrics(scheduling.NewScheduler(cloudProvider, kubeClient)),
thoughts on encapsulating this inside NewScheduler so the test doesn't need to be aware it's decorated?
That's a good point, I agree that this may not be the best place to apply the metrics decoration. However, I would prefer pushing the metrics decoration up to the caller of allocation.NewController() rather than down into scheduling.NewScheduler().
It depends whether or not you think the scheduler should innately emit metrics (i.e., the decorator is a convenience impl detail) or whether something should be decorating it. I'd lean the former, as I don't see multiple scheduler implementations in our future. This does feel somewhat like go/java philosophies colliding.
Nonblocking, though.
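For illustration, a standalone sketch of the suggestion above using stand-in types (only the Scheduler/Solve shape mirrors the PR): if the constructor returns the already-decorated value, callers and tests only ever see the Scheduler interface. The author's preferred alternative keeps NewScheduler plain and applies DecorateSchedulerMetrics at the allocation.NewController call site, as the diff above shows.

```go
package main

import "fmt"

// Scheduler stands in for the real scheduling.Scheduler interface.
type Scheduler interface {
	Solve() error
}

// scheduler stands in for the concrete, undecorated implementation.
type scheduler struct{}

func (s *scheduler) Solve() error { return nil }

// metricsDecorator stands in for schedulerMetricsDecorator.
type metricsDecorator struct{ inner Scheduler }

func (d *metricsDecorator) Solve() error {
	fmt.Println("time Solve and record the histogram here")
	return d.inner.Solve()
}

// NewScheduler hands back the decorated scheduler, so the controller (and its
// tests) never need to know the decorator exists.
func NewScheduler() Scheduler {
	return &metricsDecorator{inner: &scheduler{}}
}

func main() {
	_ = NewScheduler().Solve()
}
```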
@@ -147,3 +156,56 @@ func IsSchedulable(pod *v1.Pod, node *v1.Node) bool {
 	// TODO, support node affinity
 	return true
 }
+
+type schedulerMetricsDecorator struct {
+	scheduler Scheduler
Optional:
type metricsDecoratedScheduler struct {
Scheduler // embedded
...
}
Tried it. The Go compiler won't allow embedding an interface within a struct. The compiler will allow embedding the scheduler struct, but that had the effect of coupling metricsDecoratedScheduler to scheduler, rather than allowing metricsDecoratedScheduler to work with any Scheduler.
I think of the metrics decorator as a subclass of scheduler (which is ~= to embedding in golang, though not quite the same). I tend to see the go philosophy as leaning away from things like interfaces, setters, getters, and instead towards duck typing and direct variable access. I also expect that we won't have multiple scheduler implementations, but I'm fine either way.
schedules, scheduleErr := s.scheduler.Solve(ctx, provisioner, pods)
durationSeconds := time.Since(startTime).Seconds()

result := "success"
I'm noticing that controller-runtime separates success and error cases into separate metrics https://github.com/kubernetes-sigs/controller-runtime/blob/v0.10.0/pkg/internal/controller/metrics/metrics.go#L37. Thoughts on following the same pattern?
I think using labels better encapsulates the metric information. An analogy in code would be passing around a struct vs passing around multiple related but individual variables.
I'm not deep on conventions here. Do you have some examples from similar kubernetes projects?
https://github.com/kubernetes-sigs/controller-runtime/blob/v0.10.0/pkg/internal/controller/metrics/metrics.go#L30-L33 breaks out the metric by result classification via a label.
Great thanks!
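For illustration, a sketch of the label-based shape agreed on here; the metric name and label keys are assumptions, not the PR's actual identifiers.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// One histogram carries both outcomes; the success/error split lives in a
// "result" label rather than in two separate metrics.
var scheduleDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "schedule_duration_seconds_example", // hypothetical name
		Help: "Duration of scheduling operations, labeled by outcome.",
	},
	[]string{"provisioner", "result"}, // assumed label keys
)

func record(provisionerName, result string, seconds float64) {
	scheduleDuration.WithLabelValues(provisionerName, result).Observe(seconds)
}

func main() {
	prometheus.MustRegister(scheduleDuration)
	record("default", "success", 0.42)
	record("default", "error", 1.30)
}
```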
promErr,
)
} else {
observer.Observe(durationSeconds)
Thoughts on adding a count metric, too? Or can we get this by counting the number of latency metrics?
The histogram also emits "<metric_name>_count" and "<metric_name>_sum" metrics with the count of observations and the sum of all observations, respectively.
awesome!
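For illustration, a small sketch of where the count comes from without a separate counter (the metric name is hypothetical): the histogram itself exposes _bucket, _sum, and _count series.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

func main() {
	h := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "schedule_duration_seconds_example", // hypothetical name
		Help: "Example scheduling duration histogram.",
	})
	h.Observe(0.25)
	h.Observe(1.5)

	// Dump the histogram's current state; the count and sum come for free.
	var m dto.Metric
	_ = h.Write(&m)
	fmt.Printf("count=%d sum=%.2f\n", m.Histogram.GetSampleCount(), m.Histogram.GetSampleSum())
	// Output: count=2 sum=1.75
}
```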
return &schedulerMetricsDecorator{scheduler: scheduler, scheduleTimeHistogramVec: scheduleTimeHistogramVec}
}

func (s *schedulerMetricsDecorator) Solve(ctx context.Context, provisioner *v1alpha3.Provisioner, pods []*v1.Pod) ([]*Schedule, error) {
Given the similarity of all of these histograms, I think we could likely achieve a lot of code-deduplication. Feel free to punt for a future issue.
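For illustration, one possible shape for that de-duplication, sketched with assumed names (observeDuration and the "result" label key are not from the PR): a single helper that times any operation, classifies the outcome, and records into whichever HistogramVec it is given.

```go
package metricsutil

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// observeDuration runs op, classifies the outcome as success/error, and
// records the elapsed time into the supplied HistogramVec.
func observeDuration(vec *prometheus.HistogramVec, labels prometheus.Labels, op func() error) error {
	start := time.Now()
	err := op()

	result := "success"
	if err != nil {
		result = "error"
	}
	labels["result"] = result // assumed label key

	if observer, lookupErr := vec.GetMetricWith(labels); lookupErr == nil {
		observer.Observe(time.Since(start).Seconds())
	}
	return err
}
```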
metrics.ResultLabel: result,
})
if promErr != nil {
logging.FromContext(ctx).Warnf(
Minor: It just struck me that our ctx already decorates the log line with the provisioner name controller.allocation.provisioner/default, so you could get away with leaving it out here.
})
if promErr != nil {
logging.FromContext(ctx).Warnf(
"Failed to record scheduling duration metric [%s=%s, %s=%s, duration=%f]: error=%w",
%w is intended for error chaining. When logging an error, you should use something like fmt.Errorf("%s", error.Error())
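For illustration, a short sketch of the distinction being drawn: %w only has meaning inside fmt.Errorf (error wrapping for errors.Is / errors.As); when rendering an error into a log message, %s or %v is the verb to use.

```go
package main

import (
	"errors"
	"fmt"
)

var errLookup = errors.New("metric lookup failed")

func main() {
	// Wrapping: %w keeps errLookup in the chain so callers can unwrap it.
	wrapped := fmt.Errorf("recording scheduling duration: %w", errLookup)
	fmt.Println(errors.Is(wrapped, errLookup)) // true

	// Logging: render the error as text; %w is not understood by Printf-style formatting.
	fmt.Printf("Failed to record scheduling duration metric: %s\n", wrapped.Error())
}
```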
result = "error" | ||
} | ||
|
||
provisionerName := provisioner.ObjectMeta.Name |
Not to overly focus on this, but I just realized that you could print out the labels in the log line to avoid the case you were talking about. It's a map, so it should print [key1: value, key2: value2]
labels := prometheus.Labels{...}
logging.FromContext(ctx).Warnf("%s", labels)
* Fix error format specifier in log message
1. Issue, if available:
#612 "Emit Metrics for Karpenter"
This PR does not fully resolve the issue. More changes will be needed.
2. Description of changes:
Add a histogram metric to record the duration of scheduling operations.
3. Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.