-
Notifications
You must be signed in to change notification settings - Fork 983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add logging for unexpected pod eviction errors #1432
Conversation
✔️ Deploy Preview for karpenter-docs-prod canceled. 🔨 Explore the source changes: 5130600 🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/6221220474fc180008a970c1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Good work
@@ -103,6 +103,7 @@ func (e *EvictionQueue) evict(ctx context.Context, nn types.NamespacedName) bool | |||
return true | |||
} | |||
if err != nil { | |||
logging.FromContext(ctx).Debugf("Could not evict pod %s due to an unexpected error: %s", nn.String(), err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't print the error string cleanly. We can also simplify this message.
logging.FromContext(ctx).Debugf("Could not evict pod %s due to an unexpected error: %s", nn.String(), err) | |
logging.FromContext(ctx).Debugf("Could not evict pod %s, %s", nn.String(), err.Error()) |
While we're in here, can we move the .
from the other log lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both nn.String(), err.Error()
can be replaced with nn, err
since they have .String they will both turn into strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated commit with this change.
@@ -103,6 +103,7 @@ func (e *EvictionQueue) evict(ctx context.Context, nn types.NamespacedName) bool | |||
return true | |||
} | |||
if err != nil { | |||
logging.FromContext(ctx).Debugf("Could not evict pod %s due to an unexpected error: %s", nn.String(), err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be ERROR level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless this is massively noisy. It only happens during misconfiguration, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the difference between lines 95 and 99
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the 429 violation of PDB be turned up to an info log as well? It may save a round of "turn on debug logs" when troubleshooting why a node won't drain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to me they all should be Error level. "Error" logs will show up with the default log config, I believe.
One note, we do retry eviction failures, so we will see multiple messages in the log file for each pod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 429 seems like a different sort of transient condition to me. It may resolve itself if another pod becomes ready. It doesn't seem right for a log to have repeated error messages when everything is possibly working as expected.
I agree that the multiple PDBs definitely seems like an error level condition. The user has to do something to resolve it.
In some ways the logs/events are the "UI" for Karpenter for users. Maybe a doc defining this would be helpful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 429 seems like a different sort of transient condition to me. It may resolve itself if another pod becomes ready. It doesn't seem right for a log to have repeated error messages when everything is possibly working as expected.
This is exactly the reason. We want to avoid spamming INFO logs for normal operation. We're pretty careful about what meets the bar for info. I expect most customers will want to run with DEBUG on and then filter with different tooling. Perhaps we should make DEBUG the default log level in our helm chart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A 429 should just be debug log IMO. If you run into this error message once, you'll run into it at least 10 times per pod due to the fast exponential backoff, which would make the logs pretty noisy.
A 500 I would think should be info level as well. An unexpected error would make sense to be info level as well.
@@ -103,6 +103,7 @@ func (e *EvictionQueue) evict(ctx context.Context, nn types.NamespacedName) bool | |||
return true | |||
} | |||
if err != nil { | |||
logging.FromContext(ctx).Debugf("Could not evict pod %s due to an unexpected error: %s", nn.String(), err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both nn.String(), err.Error()
can be replaced with nn, err
since they have .String they will both turn into strings.
ab0d4cf
to
cd8a26f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
1. Issue, if available:
#1414
2. Description of changes:
The eviction API is supposed to only return 429 or 500 status codes when an error is encountered while evicting a pod (per k8s documentation). Today, Karpenter only writes a debug log when a 500 or 429 status code is returned from the eviction API, but doesn't account for the case where an unexpected error/status code is received. This change adds a catch-all so Karpenter will also log unexpected errors from the eviction API.
Edit: This change also introduces structured logging for log emitted by eviction attempts. Also, logs the returned error messages reported by the eviction API for consistency, instead of using custom error messages.
Related to this change, we identified what looks like a bug in the eviction API. In the case where multiple PDBs are found for a pod, the error which is returned correlates to an InternalServerError (500), but the "reason" field excluded. We believe this is the root cause of the problem identified in #1414 . We will be pursuing an upstream fix, but this "catch-all" is also worth implementing to account for unexpected errors.
3. How was this change tested?
(edit) Result:
4. Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.