Limit "provisioner does not exist" logging and fix startup reconciliation bug #517

bwagner5 · 2021-07-16T18:24:39Z

Issue, if available:
N/A

Description of changes:

Limits the number of times the error is logged when a provisioner is not found.
Fixes a bug where pods that are already pending when karpenter starts up could be reconciled against any provisioner (a race in the previous behavior). The matchesProvisioner func did not handle the default provisioner correctly.
Moves the default provisioner to a reference in the apis pkg.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Previous Log:

karpenter-controller-86cf989bcd-jkqfm manager 2021-07-16T15:41:13.188Z	ERROR	Retrieving provisioner, create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name
karpenter-controller-86cf989bcd-jkqfm manager 2021-07-16T15:41:13.200Z	ERROR	Retrieving provisioner, create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name
karpenter-controller-86cf989bcd-jkqfm manager 2021-07-16T15:41:13.568Z	ERROR	Retrieving provisioner, create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name
karpenter-controller-86cf989bcd-jkqfm manager 2021-07-16T15:41:13.568Z	ERROR	Retrieving provisioner, create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name
karpenter-controller-86cf989bcd-jkqfm manager 2021-07-16T15:41:13.571Z	ERROR	Retrieving provisioner, create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name
karpenter-controller-86cf989bcd-jkqfm manager 2021-07-16T15:41:13.571Z	ERROR	Retrieving provisioner, create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name
... 
... # number of pods + 1

New Log (50 pod scale-up w/ no provisioner - prints 2 times, one per pod batch + 1):

karpenter-controller-6769b9899b-cr99p manager 2021-07-16T18:18:56.455Z	ERROR	No provisioner found. Create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name
karpenter-controller-6769b9899b-cr99p manager 2021-07-16T18:18:59.455Z	ERROR	No provisioner found. Create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name

New Log (1000 pod scale-up w/ no provisioner - prints 7 times, one per pod batch + 1):

karpenter-controller-6769b9899b-r66pp manager 2021-07-16T18:13:35.542Z	ERROR	No provisioner found. Create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name
karpenter-controller-6769b9899b-r66pp manager 2021-07-16T18:13:45.542Z	ERROR	No provisioner found. Create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name
karpenter-controller-6769b9899b-r66pp manager 2021-07-16T18:13:56.542Z	ERROR	No provisioner found. Create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name
karpenter-controller-6769b9899b-r66pp manager 2021-07-16T18:14:07.542Z	ERROR	No provisioner found. Create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name
karpenter-controller-6769b9899b-r66pp manager 2021-07-16T18:14:17.542Z	ERROR	No provisioner found. Create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name
karpenter-controller-6769b9899b-r66pp manager 2021-07-16T18:14:22.541Z	ERROR	No provisioner found. Create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name
karpenter-controller-6769b9899b-r66pp manager 2021-07-16T18:14:24.542Z	ERROR	No provisioner found. Create a default provisioner, or specify an alternative using the nodeSelector karpenter.sh/provisioner-name

Why is it "per pod batch +1" you ask?

It comes from the way the batcher and the work queue interact. Karpenter watches for pod updates and, as they come in, adds the associated provisioner to the reconcile request work queue, so a Deployment scale-up enqueues the same provisioner many times. Reconcile waits for a batch before starting, and then reconciles cluster state for the provisioner in the reconcile request. Because Reconcile(..) starts almost immediately (depending on the number of unique provisioners ahead in the queue and the number of configured reconcile workers, which is currently 4 for karpenter), more pods are added to the work queue after the reconcile has already been triggered. The work queue implementation can't know whether that reconcile already covered the cluster state behind those later requests, so it dequeues the provisioner one more time after the batching ends. No provisioner requests are enqueued after that extra reconcile kicks off, so nothing further is dequeued.
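The "+1" can be reproduced with a toy model of a de-duplicating work queue. This is only a sketch: the `queue` type and the `simulate` helper below are hypothetical and are not karpenter's code. The one assumption it encodes is that a key added while it is being processed is remembered and re-queued once processing finishes, which is the behavior the explanation above relies on.

```go
package main

import "fmt"

// queue is a toy, single-threaded model of a de-duplicating work queue.
// It is a hypothetical illustration, not the implementation karpenter uses.
type queue struct {
	pending    []string        // keys waiting to be handed to a worker
	dirty      map[string]bool // keys added since they were last handed out
	processing map[string]bool // keys a worker is currently reconciling
}

func newQueue() *queue {
	return &queue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// Add enqueues key unless it is already waiting. A key that is currently
// being processed is only marked dirty; Done re-queues it later.
func (q *queue) Add(key string) {
	if q.dirty[key] {
		return
	}
	q.dirty[key] = true
	if q.processing[key] {
		return
	}
	q.pending = append(q.pending, key)
}

// Get hands the next key to a worker, or returns false if nothing is waiting.
func (q *queue) Get() (string, bool) {
	if len(q.pending) == 0 {
		return "", false
	}
	key := q.pending[0]
	q.pending = q.pending[1:]
	q.processing[key] = true
	delete(q.dirty, key)
	return key, true
}

// Done marks key as finished and re-queues it if it was added again meanwhile.
func (q *queue) Done(key string) {
	delete(q.processing, key)
	if q.dirty[key] {
		q.pending = append(q.pending, key)
	}
}

// simulate counts reconciles for one batch of pods that all map to the same
// provisioner key. early pods are enqueued before the first reconcile starts;
// the rest arrive while that reconcile is still waiting on the batch.
func simulate(pods, early int) int {
	q := newQueue()
	const key = "" // stands in for the empty provisioner request
	for i := 0; i < early; i++ {
		q.Add(key)
	}
	late := pods - early
	reconciles := 0
	for {
		k, ok := q.Get()
		if !ok {
			return reconciles
		}
		reconciles++
		for ; late > 0; late-- {
			q.Add(k) // arrives mid-reconcile, so it is only marked dirty
		}
		q.Done(k)
	}
}

func main() {
	fmt.Println(simulate(50, 1))  // prints 2: one batch, plus one extra reconcile
	fmt.Println(simulate(50, 50)) // prints 1: nothing arrived mid-reconcile
}
```

In this model the extra reconcile appears only when at least one request arrives while the first reconcile is in flight, which is the normal case for a scale-up because Reconcile starts almost immediately.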

ellistarn · 2021-07-16T18:41:21Z

pkg/controllers/allocation/controller.go

+		if errors.IsNotFound(err) {
+			// Queue and batch a reconcile request for a non-existent, empty provisioner
+			// This will reduce the number of repeated error messages about a provisioner not existing
+			c.Batcher.Add(&v1alpha3.Provisioner{})


This feels like a nasty hack, but I don't have a better suggestion.

ellistarn · 2021-07-16T18:46:23Z

pkg/controllers/allocation/filter.go

 		return nil
 	}
-	if name == provisioner.Name {
+	if !ok && v1alpha3.DefaultProvisioner.Name == provisioner.Name {


If we invert the predicate, it reads a bit more smoothly:

if ok && provisioner.Name == name
if !ok && provisioner.Name == v1alpha3.DefaultProvisioner.Name

If we could somehow assume that provisioner.Name was always populated, we could even remove the ok bit.

I'm not sure it's worth assuming that. Maybe we could, but it feels like something that could be forgotten at some point. IMO the ok check isn't too dirty and we should probably just leave it.
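For reference, the two-branch predicate discussed in this thread can be sketched as a standalone function. This is a minimal illustration, not the code in filter.go: the bool return type, the plain map argument, and the `DefaultProvisionerName` value of "default" are assumptions made here, and only the `karpenter.sh/provisioner-name` key is taken from the log messages in the description.

```go
package main

import "fmt"

// ProvisionerNameLabelKey is the nodeSelector key quoted in the log messages.
const ProvisionerNameLabelKey = "karpenter.sh/provisioner-name"

// DefaultProvisionerName is an assumed value used only for this illustration.
const DefaultProvisionerName = "default"

// matchesProvisioner reports whether a pod with the given nodeSelector should
// be handled by the provisioner called provisionerName. The signature is
// hypothetical; the real function in the PR may differ.
func matchesProvisioner(nodeSelector map[string]string, provisionerName string) bool {
	name, ok := nodeSelector[ProvisionerNameLabelKey]
	if ok {
		// The pod names a provisioner explicitly: only that one matches.
		return name == provisionerName
	}
	// The pod names no provisioner: only the default provisioner matches.
	return provisionerName == DefaultProvisionerName
}

func main() {
	explicit := map[string]string{ProvisionerNameLabelKey: "gpu"}
	none := map[string]string{}

	fmt.Println(matchesProvisioner(explicit, "gpu"))                  // true
	fmt.Println(matchesProvisioner(explicit, DefaultProvisionerName)) // false
	fmt.Println(matchesProvisioner(none, DefaultProvisionerName))     // true
	fmt.Println(matchesProvisioner(none, "gpu"))                      // false
}
```

Keeping the ok check explicit means a pod without the selector can only ever match the default provisioner, even if some provisioner ends up with an empty name.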

ellistarn

Nice job going red!

limit no provisioner logging and fix startup reconciliation bug

096a837

bwagner5 requested a review from ellistarn July 16, 2021 18:24

ellistarn reviewed Jul 16, 2021

pkg/controllers/allocation/controller.go (outdated)

ellistarn reviewed Jul 16, 2021

pkg/controllers/allocation/controller.go (outdated)

bwagner5 changed the title ~~limit no provisioner logging and fix startup reconciliation bug~~ [WIP] limit no provisioner logging and fix startup reconciliation bug Jul 16, 2021

ellistarn reviewed Jul 16, 2021

pkg/controllers/allocation/controller.go (outdated)

ellistarn reviewed Jul 16, 2021

pkg/controllers/allocation/controller.go (outdated)

bwagner5 added 2 commits July 16, 2021 17:16

fix tests and address pr comments

8c7cb46

consistent double quotes

2e3f4d7

bwagner5 changed the title ~~[WIP] limit no provisioner logging and fix startup reconciliation bug~~ Limit "provisioner does not exist" logging and fix startup reconciliation bug Jul 16, 2021

bwagner5 requested a review from ellistarn July 16, 2021 22:31

ellistarn reviewed Jul 19, 2021

pkg/controllers/allocation/controller.go

bwagner5 added 2 commits July 19, 2021 10:06

get rid of two provisioner return

ca670cf

refactor provisioner fetching

ca038a3

bwagner5 requested a review from ellistarn July 19, 2021 16:32

ellistarn approved these changes Jul 19, 2021

ellistarn merged commit d02f133 into aws:main Jul 19, 2021

bwagner5 deleted the clean-allocation-logs branch July 19, 2021 17:13
