
add allocation scheduling duration metric #680

Merged
merged 3 commits into aws:main on Sep 16, 2021

Conversation

cjerad (Contributor) commented Sep 16, 2021

1. Issue, if available:
#612 "Emit Metrics for Karpenter"

This PR does not fully resolve the issue. More changes will be needed.

2. Description of changes:
Add a histogram metric to record the duration of scheduling operations (a rough sketch of the shape this takes follows below).

3. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
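For a sense of the shape this takes, here is a minimal, self-contained sketch of a scheduling-duration histogram; the metric name, namespace, labels, and buckets are illustrative assumptions, not the PR's exact code:

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// schedulingDuration sketches the histogram this PR introduces; the PR's
// actual name, labels, and buckets may differ.
var schedulingDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "karpenter",                              // assumed
		Name:      "allocation_scheduling_duration_seconds", // assumed
		Help:      "Duration of scheduling operations, in seconds.",
		Buckets:   prometheus.DefBuckets,
	},
	[]string{"provisioner", "result"},
)

func init() {
	prometheus.MustRegister(schedulingDuration)
}

// ObserveScheduling records the duration of one scheduling operation.
func ObserveScheduling(provisioner string, err error, start time.Time) {
	result := "success"
	if err != nil {
		result = "error"
	}
	schedulingDuration.WithLabelValues(provisioner, result).Observe(time.Since(start).Seconds())
}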

@cjerad cjerad requested a review from ellistarn September 16, 2021 15:01
@netlify netlify bot commented Sep 16, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: a9d69d2

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/6143a77622822d0008f7865c

result = "error"
}

provisionerName := provisioner.ObjectMeta.Name
ellistarn (Contributor):

minor: I'd probably lean toward inlining this to reduce lines.

cjerad (Contributor Author):

A separate reference is used to ensure the same value used in the query (line 193) will also be logged (line 200) if the query fails.

ellistarn (Contributor):

You can assume that provisioner.ObjectMeta.Name is immutable. In kubernetes, spec/metadata are considered immutable at reconcile time, while status is mutable.

cjerad (Contributor Author):

The scenario of concern is that the provisioner name might be fetched from elsewhere. Hypothetically, if a new field were added, e.g. provisioner.spec.metricDisplayName, and the lookup were inlined, whoever updates this code would need to remember to update both locations to keep the log message in sync.
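To make the trade-off concrete, a hedged sketch of the non-inlined form under discussion; the label keys, function shape, and log wording are simplified assumptions:

package scheduling

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// observeDuration binds the provisioner name to one local, so the value
// used in the metric lookup is guaranteed to be the value logged if the
// lookup fails.
func observeDuration(hv *prometheus.HistogramVec, provisionerName, result string, seconds float64) {
	observer, promErr := hv.GetMetricWith(prometheus.Labels{
		"provisioner": provisionerName,
		"result":      result,
	})
	if promErr != nil {
		log.Printf("failed to record scheduling duration for provisioner %q: %v", provisionerName, promErr)
		return
	}
	observer.Observe(seconds)
}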

@@ -68,7 +68,7 @@ func NewController(kubeClient client.Client, coreV1Client corev1.CoreV1Interface
Filter: &Filter{KubeClient: kubeClient},
Binder: DecorateBinderMetrics(NewBinder(kubeClient, coreV1Client)),
Batcher: NewBatcher(maxBatchWindow, batchIdleTimeout),
Scheduler: scheduling.NewScheduler(cloudProvider, kubeClient),
Scheduler: scheduling.DecorateSchedulerMetrics(scheduling.NewScheduler(cloudProvider, kubeClient)),
ellistarn (Contributor):

thoughts on encapsulating this inside NewScheduler so the test doesn't need to be aware it's decorated?

cjerad (Contributor Author):

That's a good point, I agree that this may not be the best place to apply the metrics decoration. However, I would prefer pushing the metrics decoration up to the caller of allocation.NewController() rather than down into scheduling.NewScheduler().

ellistarn (Contributor):

It depends on whether you think the scheduler should innately emit metrics (i.e., the decorator is a convenience implementation detail) or whether something external should be decorating it. I'd lean toward the former, as I don't see multiple scheduler implementations in our future. This does feel somewhat like Go/Java philosophies colliding.

Nonblocking, though.
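For context on the two placements being debated, a minimal self-contained sketch of the decorator shape; the Solve signature and constructor are simplified here (the real Solve also takes a provisioner and pods, and the real decorator builds its own histogram):

package scheduling

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Scheduler is simplified; the real Solve also takes a provisioner and pods.
type Scheduler interface {
	Solve(ctx context.Context) error
}

type schedulerMetricsDecorator struct {
	scheduler Scheduler
	observer  prometheus.Observer
}

// DecorateSchedulerMetrics wraps any Scheduler; callers opt in, which is
// the placement this PR uses. Moving this call inside NewScheduler would
// make the timing an internal detail instead.
func DecorateSchedulerMetrics(s Scheduler, o prometheus.Observer) Scheduler {
	return &schedulerMetricsDecorator{scheduler: s, observer: o}
}

// Solve delegates to the wrapped scheduler and records how long it took.
func (d *schedulerMetricsDecorator) Solve(ctx context.Context) error {
	start := time.Now()
	err := d.scheduler.Solve(ctx)
	d.observer.Observe(time.Since(start).Seconds())
	return err
}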

@@ -147,3 +156,56 @@ func IsSchedulable(pod *v1.Pod, node *v1.Node) bool {
// TODO, support node affinity
return true
}

type schedulerMetricsDecorator struct {
scheduler Scheduler
ellistarn (Contributor):

Optional:

type metricsDecoratedScheduler struct {
	Scheduler // embedded
	...
}

cjerad (Contributor Author):

Tried it. The Go compiler won't allow embedding an interface within a struct.

The compiler will allow embedding the scheduler struct, but that had the effect of coupling metricsDecoratedScheduler to the concrete scheduler, rather than allowing metricsDecoratedScheduler to work with any Scheduler.

ellistarn (Contributor):

I think of the metrics decorator as a subclass of scheduler (which is roughly equivalent to embedding in Go, though not quite the same). I tend to see the Go philosophy as leaning away from things like interfaces, setters, and getters, and instead toward duck typing and direct variable access. I also expect that we won't have multiple scheduler implementations, but I'm fine either way.

schedules, scheduleErr := s.scheduler.Solve(ctx, provisioner, pods)
durationSeconds := time.Since(startTime).Seconds()

result := "success"
ellistarn (Contributor):

I'm noticing that controller-runtime separates success and error cases into separate metrics: https://github.com/kubernetes-sigs/controller-runtime/blob/v0.10.0/pkg/internal/controller/metrics/metrics.go#L37. Thoughts on following the same pattern?
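To make the question concrete, a sketch of the two shapes being compared; all names here are illustrative, and the linked file shows controller-runtime's exact metrics:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Shape A (roughly the controller-runtime convention linked above): an
// unlabeled duration histogram plus a separate errors counter.
var (
	scheduleSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "schedule_duration_seconds", // illustrative name
		Help: "Duration of scheduling operations.",
	})
	scheduleErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "schedule_errors_total", // illustrative name
		Help: "Number of failed scheduling operations.",
	})
)

// Shape B (this PR): a single histogram with the outcome as a label.
var scheduleSecondsByResult = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "schedule_duration_seconds_labeled", // illustrative name
	Help: "Duration of scheduling operations, partitioned by result.",
}, []string{"result"})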

cjerad (Contributor Author):

I think using labels better encapsulates the metric information. An analogy in code would be passing around a struct vs passing around multiple related but individual variables.

ellistarn (Contributor), Sep 16, 2021:

I'm not deep on conventions here. Do you have some examples from similar kubernetes projects?

cjerad (Contributor Author), Sep 16, 2021:

ellistarn (Contributor):

Great, thanks!

promErr,
)
} else {
observer.Observe(durationSeconds)
ellistarn (Contributor):

Thoughts on adding a count metric, too? Or can we get this by counting the number of latency metrics?

cjerad (Contributor Author):

The histogram also emits "<metric_name>_count" and "<metric_name>_sum" series with the count of observations and the sum of all observed values, respectively.

ellistarn (Contributor):

awesome!

return &schedulerMetricsDecorator{scheduler: scheduler, scheduleTimeHistogramVec: scheduleTimeHistogramVec}
}

func (s *schedulerMetricsDecorator) Solve(ctx context.Context, provisioner *v1alpha3.Provisioner, pods []*v1.Pod) ([]*Schedule, error) {
ellistarn (Contributor):

Given the similarity of all of these histograms, I think we could achieve a lot of code deduplication. Feel free to punt this to a future issue.

metrics.ResultLabel: result,
})
if promErr != nil {
logging.FromContext(ctx).Warnf(
ellistarn (Contributor):

Minor: it just struck me that our ctx already decorates the log line with the provisioner name (controller.allocation.provisioner/default), so you could get away with leaving it out here.

})
if promErr != nil {
logging.FromContext(ctx).Warnf(
"Failed to record scheduling duration metric [%s=%s, %s=%s, duration=%f]: error=%w",
ellistarn (Contributor):

%w is intended for error chaining and only has meaning inside fmt.Errorf. When logging an error, use a plain printing verb instead, e.g. something like Warnf("...: %s", err.Error()).
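A minimal illustration of the distinction, using only the standard library (the PR itself logs through a zap-based logger):

package main

import (
	"errors"
	"fmt"
)

func main() {
	base := errors.New("query failed")

	// %w is only meaningful inside fmt.Errorf: it wraps the error so
	// errors.Is / errors.As can unwrap it later.
	wrapped := fmt.Errorf("recording metric: %w", base)
	fmt.Println(errors.Is(wrapped, base)) // true

	// In a plain printing/logging call, %w is not a valid verb and
	// renders as %!w(...); use %s or %v instead.
	fmt.Printf("Failed to record metric: %s\n", base)
}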

result = "error"
}

provisionerName := provisioner.ObjectMeta.Name
ellistarn (Contributor):

Not to overly focus on this, but I just realized that you could print the labels map in the log line to avoid the case you were talking about. It's a map, so it should print as map[key1:value1 key2:value2]:

labels := prometheus.Labels{...}
logging.FromContext(ctx).Warnf("%s", labels)

* Fix error format specifier in log message
@cjerad cjerad merged commit 18dc533 into aws:main Sep 16, 2021
@cjerad cjerad deleted the allocation-schedule-time-metric branch September 16, 2021 20:37