
Patch logging #1261

Closed
wants to merge 2 commits into from

Conversation

alekc
Contributor

@alekc alekc commented Feb 2, 2022

1. Issue, if available:
Related to #1166 (it won't solve it, but it will help with future debugging).

2. Description of changes:
At the moment, there is a "black hole" in the eviction routine. As long as Karpenter thinks there are evictable pods, it will return at

	if !drained {
		return reconcile.Result{Requeue: true}, nil
	}

without emitting any log, thus leaving the human operator in the dark as to why nothing is happening.

This change adds a debug logging line covering this scenario; in case of such a loop, the operator will at least see exactly what is stopping the drain and termination process and will be able to investigate or intervene.
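
For illustration only (the PR's actual diff appears in the review below), a debug line in that spot could look roughly like this; the pods variable holding the remaining evictable pods is an assumed name:

	if !drained {
		// Assumed sketch: say why we are requeueing instead of returning silently.
		logging.FromContext(ctx).Debugf("Requeueing: node not yet drained, %d pods still awaiting eviction", len(pods))
		return reconcile.Result{Requeue: true}, nil
	}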

Also: fixed a variable name colliding with the pod name.

3. How was this change tested?
On a local cluster.

4. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@netlify

netlify bot commented Feb 2, 2022

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: 77ac5e5

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/61fba33be591b40007a2787e

logging.FromContext(ctx).Debugf("Unable to drain node, pod %s has do-not-evict annotation", pod.Name)
for _, nodePod := range pods {
if val := nodePod.Annotations[v1alpha5.DoNotEvictPodAnnotationKey]; val == "true" {
logging.FromContext(ctx).Debugf("Unable to drain node, nodePod %s has do-not-evict annotation", nodePod.Name)
Contributor

We don't want to expose the variable name nodePod in the log line.

Contributor

@ellistarn ellistarn Feb 2, 2022


I also worry a bit about the amount of logging this will cause. Right now, we don't log on a per-pod basis for normal operational behavior. We do log at the pod level for errors. I worry a bit that this will make scale-down very noisy. I wonder if we need to introduce a trace level or perhaps only log in some scenarios.

Alternatively, users can simply deal with noisier debug logs. This may be resolvable by pushing some logs up to Info level.

Contributor


Poking around at our debug logs, I think I'm probably fine with this, given the options.

Contributor Author


I blame GoLand, which is eager to rename variable names even when they are just strings :/ Fixed. Also, these log lines were already there; they were not added.

@@ -124,6 +124,7 @@ func (t *Terminator) getEvictablePods(pods []*v1.Pod) []*v1.Pod {
		if pod.IsOwnedByNode(p) {
			continue
		}
		logging.FromContext(ctx).Debugf("pod %s marked as evictable", p.Name)
Contributor


Instead of logging here, can we log where we actually send the eviction requests?

Contributor Author


Yes, but which pods should we be logging in that case?
Given this block of code:

func (t *Terminator) evict(pods []*v1.Pod) {
	// 1. Prioritize noncritical pods https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown
	critical := []*v1.Pod{}
	nonCritical := []*v1.Pod{}
	for _, p := range pods {
		if !p.DeletionTimestamp.IsZero() {
			continue
		}
		if p.Spec.PriorityClassName == "system-cluster-critical" || p.Spec.PriorityClassName == "system-node-critical" {
			critical = append(critical, p)
		} else {
			nonCritical = append(nonCritical, p)
		}
	}
	// 2. Evict critical pods if all noncritical are evicted
	if len(nonCritical) == 0 {
		t.EvictionQueue.Add(critical)
	} else {
		t.EvictionQueue.Add(nonCritical)
	}
}

We can either:

  • log all pods regardless of the if conditions
  • log only the pods which are scheduled for actual deletion (which would require additional loops to fetch them; a sketch of this option follows below)

As a reminder, the scope of this patch is to get a clear picture of what's preventing us from destroying the node (well, one of the main reasons).
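
A minimal sketch of the second option, based on the evict() shape quoted above (the ctx parameter on evict is an assumption, not part of the PR):

	func (t *Terminator) evict(ctx context.Context, pods []*v1.Pod) {
		// Split pods exactly as above, then log only the group actually handed to the queue.
		critical := []*v1.Pod{}
		nonCritical := []*v1.Pod{}
		for _, p := range pods {
			if !p.DeletionTimestamp.IsZero() {
				continue
			}
			if p.Spec.PriorityClassName == "system-cluster-critical" || p.Spec.PriorityClassName == "system-node-critical" {
				critical = append(critical, p)
			} else {
				nonCritical = append(nonCritical, p)
			}
		}
		// Evict critical pods only once all noncritical pods are gone.
		toEvict := nonCritical
		if len(nonCritical) == 0 {
			toEvict = critical
		}
		for _, p := range toEvict {
			logging.FromContext(ctx).Debugf("Requesting eviction of pod %s/%s", p.Namespace, p.Name)
		}
		t.EvictionQueue.Add(toEvict)
	}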

Contributor


What do you think about adding the log line as the first line inside t.EvictionQueue.Add(ctx, pod)? This will make sure we don't double-count (which I think the current implementation does).

Do you mind adding the logging output that this produces to the description of the PR?

Contributor

@ellistarn ellistarn Feb 3, 2022


Sorry for the bad advice; it looks like we already have related logs:

karpenter-controller-556bddb76-ng7tt manager 2022-02-03T19:32:22.166Z	DEBUG	controller	Evicted pod default/inflate-68785b889b-9s89h	{"commit": "955ed8c"}

Contributor Author


@ellistarn the log line you've mentioned is coming from here:

	if e.evict(ctx, nn) {
		logging.FromContext(ctx).Debugf("Evicted pod %s", nn.String())
		e.RateLimitingInterface.Forget(nn)
		e.Set.Remove(nn)
		e.RateLimitingInterface.Done(nn)
		continue
	}

so we would only see this log if:

  1. we are actually able to remove the pod, and
  2. the queue is working.

Adding a log in queue.Add works for me, something like:

	func (e *EvictionQueue) Add(ctx context.Context, pods []*v1.Pod) { // ctx added here so the logger is available
		for _, pod := range pods {
			logging.FromContext(ctx).Debugf("adding pod %s to the eviction queue", pod.Name)
			if nn := client.ObjectKeyFromObject(pod); !e.Set.Contains(nn) {
				e.Set.Add(nn)
				e.RateLimitingInterface.Add(nn)
			}
		}
	}

As for the amount of logging, these are stdout logs produced only when the appropriate log level is set, so the amount of noise during normal operation should not be an issue?

@ellistarn
Contributor

I worry a bit about the spam of logging on each loop, as this will make the logs really noisy, even in the happy case.

	if !drained {
		return reconcile.Result{Requeue: true}, nil
	}

If we had a consistent reproduction, we could add any verbose logging (or even a debugger) we wanted and see the specific behavior that way.

@ellistarn
Contributor

I played with this a bit and saw:

karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.116Z	INFO	controller.termination	Cordoned node	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.117Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-2rhqw for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.117Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-q5lbg for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.117Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-h95r4 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.117Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-dwzvk for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.117Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-sgmk2 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.118Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-q5lbg for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.118Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-h95r4 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.118Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-dwzvk for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.118Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-2rhqw for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.118Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-sgmk2 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.135Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-q5lbg for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.135Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-h95r4 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.135Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-dwzvk for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.135Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-sgmk2 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.137Z	DEBUG	controller	Evicted pod default/inflate-68785b889b-2rhqw	{"commit": "4c1fb8a"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.157Z	DEBUG	controller	Evicted pod default/inflate-68785b889b-q5lbg	{"commit": "4c1fb8a"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.180Z	DEBUG	controller	Evicted pod default/inflate-68785b889b-h95r4	{"commit": "4c1fb8a"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.204Z	DEBUG	controller	Evicted pod default/inflate-68785b889b-dwzvk	{"commit": "4c1fb8a"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.217Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-sgmk2 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.229Z	DEBUG	controller	Evicted pod default/inflate-68785b889b-sgmk2	{"commit": "4c1fb8a"}

It's odd to me that we're enqueuing multiple times.
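
A minimal sketch of how the enqueue log could be emitted only on the first insertion, reusing the Set.Contains dedup from the Add snippet above (the ctx parameter is still an assumption):

	func (e *EvictionQueue) Add(ctx context.Context, pods []*v1.Pod) {
		for _, pod := range pods {
			if nn := client.ObjectKeyFromObject(pod); !e.Set.Contains(nn) {
				// Log only pods that are newly tracked, so repeated reconcile loops stay quiet.
				logging.FromContext(ctx).Debugf("Enqueuing %s/%s for eviction", pod.Namespace, pod.Name)
				e.Set.Add(nn)
				e.RateLimitingInterface.Add(nn)
			}
		}
	}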

@ellistarn
Contributor

@alekc, I'm not comfortable adding the debug logs since it looks like the controller idempotently (for good reason) continues to add evictable pods to the queue on every loop. This adds a huge amount of noise to the logs, which, even at debug level, will muddy the waters beyond what I'm comfortable with. I'm thinking about alternatives to increase the visibility here.

@alekc
Contributor Author

alekc commented Feb 9, 2022

@ellistarn fair enough, I am open to ideas, as long as we can close this black hole (Karpenter taking no action for whatever reason and the operator not having any feedback).

@ellistarn
Contributor

Is it possible that this could be deduced by looking at kubectl get events and looking for eviction messages? You should be able to see which pod isn't being evicted that way.

@alekc
Contributor Author

alekc commented Feb 10, 2022

Hmm, is this something introduced recently? I am not sure I have ever seen an eviction message for an individual pod in the events.

@ellistarn
Contributor

Hmm, is this something introduced recently? I am not sure I have ever seen an eviction message for an individual pod in the events.

Perhaps not. I was looking at 0s Normal Killing pod/inflate-b8fc9fdcc-fjpff Stopping container inflate.

@ellistarn
Contributor

Closing in favor of #1432

@ellistarn ellistarn closed this Mar 4, 2022