linkerd-network-validator fails forever if a pod is scheduled before linkerd-cni #11325
Comments
I've also noticed this with the new linkerd-cni chart version. Since I've been running on v1.2.0, the only notable change for me was removing hostNetwork. My theory is that running in hostNetwork mode gives linkerd-cni a significant timing advantage because it skips any lower-level CNI init (e.g. IPAM from vpc-cni, Cilium, etc.), which marginally reduces the chances that other pods are placed before linkerd-cni is able to do its thing. @msuszko-vertex can you try updating the linkerd-cni DaemonSet and adding hostNetwork back? The relevant change is here.
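(Not part of the original comment: for anyone who wants to try the suggestion above, a minimal sketch of re-enabling hostNetwork on an installed DaemonSet, assuming it is named `linkerd-cni` in the `linkerd-cni` namespace; adjust names to match your installation.)

```sh
# Sketch only: re-enable hostNetwork on the linkerd-cni DaemonSet in place.
# Assumes the DaemonSet is named "linkerd-cni" in the "linkerd-cni" namespace.
kubectl -n linkerd-cni patch daemonset linkerd-cni \
  --type merge \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'

# Verify the rollout picked up the change.
kubectl -n linkerd-cni rollout status daemonset/linkerd-cni
```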
@msuszko-vertex thanks for bringing this up, first of all. I'd like to say it's trivial to fix, but unfortunately the CNI landscape is a bit messy at times, especially when it comes to installing plugins after a cluster has been provisioned. The specification itself does not offer an API or any guidance on how to do it after a cluster has been started.

We introduced the validator in order to have a fail-fast guarantee when pods cannot be configured properly. Even if the init container is taken out, pods are still going to be scheduled before the CNI plugin installer has had time to set everything up. #8120 has a bunch of background information if you want to go deeper. Without the validator, pods will come up but they will not function properly, which makes it even more painful to investigate.

The validator returns a predictable error code (I think it's ERRNO 77); you can write a controller that automatically restarts pods whose validator fails with this code. We're tracking automatic restart in #11073. Hope that all makes sense :)
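(Not part of the original comment: a rough sketch of the "restart pods whose validator failed" idea using kubectl and jq. The init container name and exit code 77 are assumptions taken from the comment above.)

```sh
# Sketch: list pods whose linkerd-network-validator init container last exited
# with code 77, so they can be restarted by a small controller or cron job.
# Exit code 77 is an assumption ("I think it's ERRNO 77" in the comment above).
kubectl get pods -A -o json \
  | jq -r '.items[]
      | select(any(.status.initContainerStatuses[]?;
          .name == "linkerd-network-validator"
          and ((.lastState.terminated.exitCode // .state.terminated.exitCode) == 77)))
      | .metadata.namespace + "/" + .metadata.name'

# To restart one of the listed pods:
# kubectl delete pod -n <namespace> <pod>
```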
Closing this in favor of #11073, because this is currently a known problem that needs a new feature to resolve (a controller-like utility to restart the pods). We would eagerly welcome contributions to help get this fixed.
@relu Setting
@DavidMcLaughlin I used node taints to prevent pods from being scheduled before linkerd-cni initializes.
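(Not part of the original comment: the node-taint approach mentioned above typically looks something like the sketch below. The taint key is made up for illustration and is not something linkerd defines.)

```sh
# Sketch of a "startup taint" workaround; the taint key is illustrative only.
#
# 1. Register new nodes with a taint so ordinary pods cannot land on them yet,
#    e.g. via kubelet --register-with-taints or your node group configuration:
#      example.com/cni-not-ready=true:NoExecute
#
# 2. Add a matching toleration to the linkerd-cni DaemonSet so it still runs:
#      tolerations:
#      - key: example.com/cni-not-ready
#        operator: Exists
#
# 3. Remove the taint once linkerd-cni is ready on the node (from a small
#    controller, a job, or node bootstrap tooling):
kubectl taint nodes "$NODE_NAME" example.com/cni-not-ready:NoExecute-
```

This mirrors the startup-taint pattern some other CNIs use to keep workloads off a node until their agent is ready.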
What is the issue?
If a meshed pod is started on a fresh node before linkerd-cni, it never passes linkerd-network-validator successfully.
After the pod is deleted and restarted on the same node, it passes linkerd-network-validator and runs without issues.
How can it be reproduced?
Required: AWS access, kubectl 1.27, eksctl
Shell script to reproduce the problem:
cluster.yaml:
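(The original shell script and cluster.yaml were not captured above. As a rough, hypothetical sketch of how such a reproduction could be wired together, assuming eksctl, kubectl, and the linkerd CLI, with illustrative cluster and nodegroup names:)

```sh
# Hypothetical reproduction sketch; not the reporter's original script.
# Assumes AWS credentials plus eksctl, kubectl and the linkerd CLI.

# 1. Create a small EKS cluster from cluster.yaml (a one-node managed
#    nodegroup makes the scheduling race easy to hit).
eksctl create cluster -f cluster.yaml

# 2. Install linkerd-cni and a CNI-enabled control plane.
linkerd install-cni | kubectl apply -f -
linkerd install --crds | kubectl apply -f -
linkerd install --linkerd-cni-enabled | kubectl apply -f -

# 3. Add a fresh node and immediately force meshed pods to reschedule so one
#    can land on the new node before linkerd-cni is ready there; its
#    linkerd-network-validator init container then keeps failing.
eksctl scale nodegroup --cluster <cluster-name> --name <nodegroup-name> --nodes 2
kubectl rollout restart deploy <some-meshed-deployment>
```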
Logs, error output, etc
linkerd-cni log:
Output of `linkerd check -o short`:
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
maybe