
Patch logging #1261

Closed
wants to merge 2 commits into from

Conversation

alekc
Contributor

@alekc alekc commented Feb 2, 2022

1. Issue, if available:
Related to #1166 (it won't solve it, but it will help with future debugging).

2. Description of changes:
At the moment, there is a "black hole" in the eviction routine. As long as Karpenter thinks there are evictable pods, it will return at

	if !drained {
		return reconcile.Result{Requeue: true}, nil
	}

without emitting any log, thus leaving the human operator in the dark as to why nothing is happening.

This change adds a debug logging line covering this scenario; in case of such a loop, the operator will at least see exactly what is stopping the drain and termination process and will be able to investigate or intervene.
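
For illustration only (the PR's actual diff appears in the review below), a debug line in that spot could look roughly like this; the pods variable holding the remaining evictable pods is an assumed name:

	if !drained {
		// Assumed sketch: say why we are requeueing instead of returning silently.
		logging.FromContext(ctx).Debugf("Requeueing: node not yet drained, %d pods still awaiting eviction", len(pods))
		return reconcile.Result{Requeue: true}, nil
	}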

Also: fixed a variable name colliding with the pod name.

3. How was this change tested?
On a local cluster.

4. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@netlify

netlify bot commented Feb 2, 2022

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: 77ac5e5

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/61fba33be591b40007a2787e

logging.FromContext(ctx).Debugf("Unable to drain node, pod %s has do-not-evict annotation", pod.Name)
for _, nodePod := range pods {
if val := nodePod.Annotations[v1alpha5.DoNotEvictPodAnnotationKey]; val == "true" {
logging.FromContext(ctx).Debugf("Unable to drain node, nodePod %s has do-not-evict annotation", nodePod.Name)
Contributor

We don't want to expose the variable name nodePod in the log line.

Contributor

@ellistarn ellistarn Feb 2, 2022


I also worry a bit about the amount of logging this will cause. Right now, we don't log on a per-pod basis for normal operational behavior. We do log at the pod level for errors. I worry a bit that this will make scale-down very noisy. I wonder if we need to introduce a trace level or perhaps only log in some scenarios.

Alternatively, users can simply deal with noisier debug logs. This may be resolvable by pushing some logs up to Info level.

Contributor


Poking around at our debug logs, I think I'm probably fine with this, given the options.

Contributor Author


I blame GoLand, which is eager to rename variable names even when they are just strings :/ Fixed. Also, these log lines were already there; they were not added.

@@ -124,6 +124,7 @@ func (t *Terminator) getEvictablePods(pods []*v1.Pod) []*v1.Pod {
		if pod.IsOwnedByNode(p) {
			continue
		}
		logging.FromContext(ctx).Debugf("pod %s marked as evictable", p.Name)
Contributor


Instead of logging here, can we log where we actually send the eviction requests?

Contributor Author


Yes, but which pods should we be logging in that case?
Given this block of code:

func (t *Terminator) evict(pods []*v1.Pod) {
	// 1. Prioritize noncritical pods https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown
	critical := []*v1.Pod{}
	nonCritical := []*v1.Pod{}
	for _, p := range pods {
		if !p.DeletionTimestamp.IsZero() {
			continue
		}
		if p.Spec.PriorityClassName == "system-cluster-critical" || p.Spec.PriorityClassName == "system-node-critical" {
			critical = append(critical, p)
		} else {
			nonCritical = append(nonCritical, p)
		}
	}
	// 2. Evict critical pods if all noncritical are evicted
	if len(nonCritical) == 0 {
		t.EvictionQueue.Add(critical)
	} else {
		t.EvictionQueue.Add(nonCritical)
	}
}

We can either:

  • log all pods regardless of the if conditions
  • log only the pods which are scheduled for actual deletion (which would require additional loops to fetch them; a sketch of this option follows below)

As a reminder, the scope of this patch is to get a clear picture of what's preventing us from destroying the node (well, one of the main reasons).
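
A minimal sketch of the second option, based on the evict() shape quoted above (the ctx parameter on evict is an assumption, not part of the PR):

	func (t *Terminator) evict(ctx context.Context, pods []*v1.Pod) {
		// Split pods exactly as above, then log only the group actually handed to the queue.
		critical := []*v1.Pod{}
		nonCritical := []*v1.Pod{}
		for _, p := range pods {
			if !p.DeletionTimestamp.IsZero() {
				continue
			}
			if p.Spec.PriorityClassName == "system-cluster-critical" || p.Spec.PriorityClassName == "system-node-critical" {
				critical = append(critical, p)
			} else {
				nonCritical = append(nonCritical, p)
			}
		}
		// Evict critical pods only once all noncritical pods are gone.
		toEvict := nonCritical
		if len(nonCritical) == 0 {
			toEvict = critical
		}
		for _, p := range toEvict {
			logging.FromContext(ctx).Debugf("Requesting eviction of pod %s/%s", p.Namespace, p.Name)
		}
		t.EvictionQueue.Add(toEvict)
	}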

Contributor


What do you think about adding the log line as the first line inside t.EvictionQueue.Add(ctx, pod)? This will make sure we don't double-count (which I think the current implementation does).

Do you mind adding the logging output that this produces to the description of the PR?

Contributor

@ellistarn ellistarn Feb 3, 2022


Sorry for the bad advice; it looks like we already have related logs:

karpenter-controller-556bddb76-ng7tt manager 2022-02-03T19:32:22.166Z	DEBUG	controller	Evicted pod default/inflate-68785b889b-9s89h	{"commit": "955ed8c"}

Contributor Author


@ellistarn the log line you've mentioned is coming from here:

	if e.evict(ctx, nn) {
		logging.FromContext(ctx).Debugf("Evicted pod %s", nn.String())
		e.RateLimitingInterface.Forget(nn)
		e.Set.Remove(nn)
		e.RateLimitingInterface.Done(nn)
		continue
	}

so we would only see this log if:

  1. we are actually able to remove the pod, and
  2. the queue is working.

Adding a log in queue.Add works for me, something like:

	func (e *EvictionQueue) Add(ctx context.Context, pods []*v1.Pod) { // ctx added here so the logger is available
		for _, pod := range pods {
			logging.FromContext(ctx).Debugf("adding pod %s to the eviction queue", pod.Name)
			if nn := client.ObjectKeyFromObject(pod); !e.Set.Contains(nn) {
				e.Set.Add(nn)
				e.RateLimitingInterface.Add(nn)
			}
		}
	}

As for the amount of logging, these are stdout logs produced only when the appropriate log level is set, so the amount of noise during normal operation should not be an issue?

@ellistarn
Contributor

I worry a bit about the spam of logging on each loop, as this will make the logs really noisy, even in the happy case.

	if !drained {
		return reconcile.Result{Requeue: true}, nil
	}

If we had a consistent reproduction, we could add any verbose logging (or even a debugger) we wanted and see the specific behavior that way.

@ellistarn
Contributor

I played with this a bit and saw:

karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.116Z	INFO	controller.termination	Cordoned node	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.117Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-2rhqw for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.117Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-q5lbg for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.117Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-h95r4 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.117Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-dwzvk for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.117Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-sgmk2 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.118Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-q5lbg for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.118Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-h95r4 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.118Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-dwzvk for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.118Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-2rhqw for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.118Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-sgmk2 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.135Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-q5lbg for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.135Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-h95r4 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.135Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-dwzvk for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.135Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-sgmk2 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.137Z	DEBUG	controller	Evicted pod default/inflate-68785b889b-2rhqw	{"commit": "4c1fb8a"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.157Z	DEBUG	controller	Evicted pod default/inflate-68785b889b-q5lbg	{"commit": "4c1fb8a"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.180Z	DEBUG	controller	Evicted pod default/inflate-68785b889b-h95r4	{"commit": "4c1fb8a"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.204Z	DEBUG	controller	Evicted pod default/inflate-68785b889b-dwzvk	{"commit": "4c1fb8a"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.217Z	DEBUG	controller.termination	Enqueuing default/inflate-68785b889b-sgmk2 for eviction	{"commit": "4c1fb8a", "node": "ip-192-168-82-251.us-west-2.compute.internal"}
karpenter-controller-5d979f9b9f-5lc9h manager 2022-02-08T19:07:04.229Z	DEBUG	controller	Evicted pod default/inflate-68785b889b-sgmk2	{"commit": "4c1fb8a"}

It's odd to me that we're enqueuing multiple times.
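
A minimal sketch of how the enqueue log could be emitted only on the first insertion, reusing the Set.Contains dedup from the Add snippet above (the ctx parameter is still an assumption):

	func (e *EvictionQueue) Add(ctx context.Context, pods []*v1.Pod) {
		for _, pod := range pods {
			if nn := client.ObjectKeyFromObject(pod); !e.Set.Contains(nn) {
				// Log only pods that are newly tracked, so repeated reconcile loops stay quiet.
				logging.FromContext(ctx).Debugf("Enqueuing %s/%s for eviction", pod.Namespace, pod.Name)
				e.Set.Add(nn)
				e.RateLimitingInterface.Add(nn)
			}
		}
	}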

@ellistarn
Contributor

@alekc, I'm not comfortable adding the debug logs since it looks like the controller idempotently (for good reason) continues to add evictable pods to the queue on every loop. This adds a huge amount of noise to the logs, which, even at debug level, will muddy the waters beyond what I'm comfortable with. I'm thinking about alternatives to increase the visibility here.

@alekc
Contributor Author

alekc commented Feb 9, 2022

@ellistarn fair enough, I am open to ideas, as long as we can close this black hole (Karpenter taking no action for whatever reason and the operator not having any feedback).

@ellistarn
Contributor

Is it possible that this could be deduced by looking at kubectl get events and looking for eviction messages? You should be able to see which pod isn't being evicted that way.

@alekc
Contributor Author

alekc commented Feb 10, 2022

Hmm, is this something introduced recently? I am not sure I have ever seen an eviction message for an individual pod in the events.

@ellistarn
Contributor

Hmm, is this something introduced recently? I am not sure I have ever seen an eviction message for an individual pod in the events.

Perhaps not. I was looking at 0s Normal Killing pod/inflate-b8fc9fdcc-fjpff Stopping container inflate.

@ellistarn
Contributor

Closing in favor of #1432

@ellistarn ellistarn closed this Mar 4, 2022