Patch logging #1261
Conversation
logging.FromContext(ctx).Debugf("Unable to drain node, pod %s has do-not-evict annotation", pod.Name)
for _, nodePod := range pods {
	if val := nodePod.Annotations[v1alpha5.DoNotEvictPodAnnotationKey]; val == "true" {
		logging.FromContext(ctx).Debugf("Unable to drain node, nodePod %s has do-not-evict annotation", nodePod.Name)
We don't want to expose the variable name nodePod in the log line.
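For example, a minimal sketch of the suggested wording, keeping the renamed loop variable but not leaking its name into the message:
logging.FromContext(ctx).Debugf("Unable to drain node, pod %s has do-not-evict annotation", nodePod.Name)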
I also worry a bit about the amount of logging this will cause. Right now, we don't log on a per-pod basis for normal operational behavior. We do log at the pod level for errors. I worry a bit that this will make scale-down very noisy. I wonder if we need to introduce a trace level or perhaps only log in some scenarios.
Alternatively, users can simply deal with noisier debug logs. This may be resolvable by pushing some logs up to Info level.
Poking around at our debug logs, I think I'm probably fine with this, given the options.
I blame GoLand, eager to rename variable names even though they are just strings :/ Fixed. Also, these log lines were already there; they were not added by this PR.
@@ -124,6 +124,7 @@ func (t *Terminator) getEvictablePods(pods []*v1.Pod) []*v1.Pod {
	if pod.IsOwnedByNode(p) {
		continue
	}
	logging.FromContext(ctx).Debugf("pod %s marked as evictable", p.Name)
Instead of logging here, can we log where we actually send the eviction requests?
Yes, but which pods should we be logging in that case?
Given this block of code:
func (t *Terminator) evict(pods []*v1.Pod) {
	// 1. Prioritize noncritical pods https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown
	critical := []*v1.Pod{}
	nonCritical := []*v1.Pod{}
	for _, p := range pods {
		if !p.DeletionTimestamp.IsZero() {
			continue
		}
		if p.Spec.PriorityClassName == "system-cluster-critical" || p.Spec.PriorityClassName == "system-node-critical" {
			critical = append(critical, p)
		} else {
			nonCritical = append(nonCritical, p)
		}
	}
	// 2. Evict critical pods if all noncritical are evicted
	if len(nonCritical) == 0 {
		t.EvictionQueue.Add(critical)
	} else {
		t.EvictionQueue.Add(nonCritical)
	}
}
We can either:
- log all pods, regardless of the if conditions, or
- log only the pods that are actually scheduled for deletion (which would require additional loops to collect them; see the sketch below).
Keep in mind that the scope of this patch is to get a clear picture of what's preventing us from destroying the node (well, one of the main reasons at least).
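For illustration, a rough sketch of the second option, logging only the batch that is actually handed to the eviction queue. This is hypothetical, not part of this PR, and it assumes evict receives a context so logging.FromContext can be used:
func (t *Terminator) evict(ctx context.Context, pods []*v1.Pod) {
	// 1. Prioritize noncritical pods, exactly as in the snippet above
	critical := []*v1.Pod{}
	nonCritical := []*v1.Pod{}
	for _, p := range pods {
		if !p.DeletionTimestamp.IsZero() {
			continue
		}
		if p.Spec.PriorityClassName == "system-cluster-critical" || p.Spec.PriorityClassName == "system-node-critical" {
			critical = append(critical, p)
		} else {
			nonCritical = append(nonCritical, p)
		}
	}
	// 2. Pick the batch that will actually be enqueued, then log exactly those pods
	toEvict := nonCritical
	if len(nonCritical) == 0 {
		toEvict = critical
	}
	for _, p := range toEvict {
		logging.FromContext(ctx).Debugf("Enqueuing pod %s for eviction", p.Name)
	}
	t.EvictionQueue.Add(toEvict)
}
Selecting the batch first keeps the extra work to a single loop and logs only pods that are really sent for eviction.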
What do you think about adding the log line as the first line inside t.EvictionQueue.Add(ctx, pod)? This will make sure we don't double count (which I think the current impl does).
Do you mind adding the logging that this produces in the description of the PR?
Sorry for the bad advice -- it looks like we already have related logs:
karpenter-controller-556bddb76-ng7tt manager 2022-02-03T19:32:22.166Z DEBUG controller Evicted pod default/inflate-68785b889b-9s89h {"commit": "955ed8c"}
@ellistarn the log line you've mentioned is coming from here:
if e.evict(ctx, nn) {
	logging.FromContext(ctx).Debugf("Evicted pod %s", nn.String())
	e.RateLimitingInterface.Forget(nn)
	e.Set.Remove(nn)
	e.RateLimitingInterface.Done(nn)
	continue
}
so we would only see this log line if:
- we are actually able to remove the pod, and
- the queue is working.
Adding a log line in EvictionQueue.Add works for me, something like:
func (e *EvictionQueue) Add(ctx context.Context, pods []*v1.Pod) {
	for _, pod := range pods {
		logging.FromContext(ctx).Debugf("Adding pod %s to the eviction queue", pod.Name)
		if nn := client.ObjectKeyFromObject(pod); !e.Set.Contains(nn) {
			e.Set.Add(nn)
			e.RateLimitingInterface.Add(nn)
		}
	}
}
As for the amount of logging: these are stdout logs produced only when we set the appropriate log level, so the amount of noise during normal operation should not be an issue?
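To illustrate that point, a minimal standalone sketch using go.uber.org/zap (which logging.FromContext resolves to in this code base; the pod name and level wiring below are illustrative, not from this PR):
package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func main() {
	cfg := zap.NewProductionConfig()
	cfg.Level = zap.NewAtomicLevelAt(zapcore.InfoLevel)
	logger, _ := cfg.Build()
	sugar := logger.Sugar()

	// At Info level, per-pod Debugf lines are dropped entirely.
	sugar.Debugf("Adding pod %s to the eviction queue", "inflate-68785b889b-9s89h")

	// Raising the level at runtime (the AtomicLevel is shared with the built logger)
	// makes the same call visible.
	cfg.Level.SetLevel(zapcore.DebugLevel)
	sugar.Debugf("Adding pod %s to the eviction queue", "inflate-68785b889b-9s89h")
}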
I worry a bit about the spam of logging on each loop, as this will make the logs really noisy, even in the happy case.
If we had a consistent reproduction, we could add any verbose logging (or even a debugger) we wanted and see the specific behavior that way.
I played with this a bit and saw:
Odd to me that we're enqueuing multiple times.
@alekc, I'm not comfortable adding the debug logs since it looks like the controller idempotently (for good reason) continues to add evictable pods to the queue on every loop. This adds a huge amount of noise to the logs, which even at debug level, will muddy the waters beyond what I'm comfortable with. I'm thinking about alternatives to increase the visibility here.
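One possible middle ground, sketched here as a hypothetical variant rather than something this PR implements, is to log only when a pod is newly added to the de-duplicating set, so repeated reconcile loops stay quiet:
func (e *EvictionQueue) Add(ctx context.Context, pods []*v1.Pod) {
	for _, pod := range pods {
		nn := client.ObjectKeyFromObject(pod)
		if e.Set.Contains(nn) {
			// Already queued by a previous loop; skip both the log line and the re-add.
			continue
		}
		logging.FromContext(ctx).Debugf("Adding pod %s to the eviction queue", nn.String())
		e.Set.Add(nn)
		e.RateLimitingInterface.Add(nn)
	}
}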
@ellistarn fair enough, I am open to ideas, as long as we can close this black hole (Karpenter taking no action for some reason and the operator not getting any feedback).
Is it possible that this could be deduced by looking at
Uhm, is it something introduced recently? I am not sure that I ever saw an eviction message for an individual pod inside events.
Perhaps not. I was looking at
Closing in favor of #1432
1. Issue, if available:
Related to #1166 (won't solve it, but will help with future debugging)
2. Description of changes:
At the moment, there is a "black hole" in the eviction routine. As long as Karpenter thinks there are evictable pods, it will return without any log, leaving the human operator in the dark as to why nothing is happening.
This change adds a debug log line covering this scenario (in case of such a loop, the operator will at least see what exactly is stopping the draining and termination process and will be able to investigate/intervene).
Also: fixed a variable name colliding with the pod name.
3. How was this change tested?
On a local cluster.
4. Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.