
linkerd-network-validator fails forever if a pod is scheduled before linkerd-cni #11325

Closed
msuszko-vertex opened this issue Sep 1, 2023 · 5 comments

@msuszko-vertex

What is the issue?

If a meshed pod is started on a fresh node before linkerd-cni, it never passes linkerd-network-validator successfully.
After the pod is deleted and restarted on the same node, it passes linkerd-network-validator and runs without issues.

How can it be reproduced?

Required: AWS access, kubectl 1.27, eksctl

Shell script to reproduce the problem:

eksctl create cluster --config-file=cluster.yaml

kubectl version
kubectl get nodes
linkerd check --pre

linkerd install --crds | kubectl apply -f - --server-side
linkerd install-cni | kubectl apply -f - --server-side
linkerd install --linkerd-cni-enabled | kubectl apply -f - --server-side

echo Waiting for 60s...
sleep 60

kubectl get pods -n linkerd-cni
echo Linkerd-CNI pods are running

kubectl get pods -n linkerd
echo Linkerd Pods are stuck in Init:CrashLoopBackOff, press Enter to continue
read _x

echo Restarting Linkerd Pods
kubectl rollout restart deployment -n linkerd linkerd-destination linkerd-identity linkerd-proxy-injector

echo Waiting for 60s...
sleep 60

kubectl get pods -n linkerd
echo Linkerd Pods are running

linkerd check

cluster.yaml:

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: recreate-bug
  region: eu-west-1
  version: '1.27'

nodeGroups:
  - name: recreate-bug-br
    instanceType: t3.medium
    desiredCapacity: 3
    amiFamily: Bottlerocket

Logs, error output, etc

2023-09-01T11:58:08.210789Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2023-09-01T11:58:08.210831Z DEBUG linkerd_network_validator: token="Sod6TBxvuWUv7h1FUOcJCB9brdOlcG2Wn8GamThf0HxaSN7RCip9ZDi3DU4WCHV\n"
2023-09-01T11:58:08.210841Z  INFO linkerd_network_validator: Connecting to 1.1.1.1:20001
2023-09-01T11:58:18.211887Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s

linkerd-cni log:

[2023-08-31 15:10:11] Wrote linkerd CNI binaries to /host/opt/cni/bin
[2023-08-31 15:10:11] Installing CNI configuration for /host/etc/cni/net.d/10-aws.conflist
[2023-08-31 15:10:11] Using CNI config template from CNI_NETWORK_CONFIG environment variable.
      "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
      "k8s_api_root": "https://172.20.0.1:__KUBERNETES_SERVICE_PORT__",
[2023-08-31 15:10:11] CNI config: {
  "name": "linkerd-cni",
  "type": "linkerd-cni",
  "log_level": "info",
  "policy": {
      "type": "k8s",
      "k8s_api_root": "https://172.20.0.1:443",
      "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
      "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
  },
  "linkerd": {
    "incoming-proxy-port": 4143,
    "outgoing-proxy-port": 4140,
    "proxy-uid": 2102,
    "ports-to-redirect": [],
    "inbound-ports-to-ignore": ["4191","4190"],
    "simulate": false,
    "use-wait-flag": true
  }
}
[2023-08-31 15:10:11] Created CNI config /host/etc/cni/net.d/10-aws.conflist
Setting up watches.
Watches established.

output of linkerd check -o short

❯ linkerd check -o short
linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
    certificate will expire on 2023-09-02T06:16:57Z
    see https://linkerd.io/2.14/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
    certificate will expire on 2023-09-02T06:16:56Z
    see https://linkerd.io/2.14/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
‼ policy-validator cert is valid for at least 60 days
    certificate will expire on 2023-09-02T06:16:57Z
    see https://linkerd.io/2.14/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints

linkerd-ha-checks
-----------------
‼ pod injection disabled on kube-system
    kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if injector webhook failure policy is Fail
    see https://linkerd.io/2.14/checks/#l5d-injection-disabled for hints

linkerd-viz
-----------
‼ tap API server cert is valid for at least 60 days
    certificate will expire on 2023-09-02T06:16:59Z
    see https://linkerd.io/2.14/checks/#l5d-tap-cert-not-expiring-soon for hints

Status check results are √

Environment

  • Kubernetes version: v1.27.4-eks-2d98532
  • Cluster Environment: AWS EKS
  • Host OS: Bottlerocket OS 1.14.2
  • Linkerd version: 2.14.0
  • Linkerd-CNI version: 1.2.0

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

maybe

@relu
Contributor

relu commented Sep 6, 2023

I've also noticed this with the new linkerd-cni chart version. I've been running v1.2.0, and the only notable change for me was the removal of hostNetwork for the linkerd-cni pods, which is odd, as I wouldn't have expected it to have such a side effect. Once I downgraded the linkerd-cni chart back to 30.8.3, everything worked correctly again.

My theory is that running in hostNetwork mode gives linkerd-cni a significant timing advantage because it skips any lower-level CNI init (e.g. IPAM from vpc-cni, Cilium, etc.), which reduces the chance that other pods are placed before linkerd-cni has had a chance to do its thing.

@msuszko-vertex can you try updating the linkerd-cni DaemonSet to add back hostNetwork: true and see if that helps?

The relevant change is here.
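
For anyone who wants to try this quickly, here is a minimal sketch, assuming the default DaemonSet name and namespace (linkerd-cni in the linkerd-cni namespace); on a Helm install you would normally set the equivalent chart value instead so the change survives upgrades:

# One-off patch to re-enable hostNetwork on the linkerd-cni DaemonSet (illustrative only).
kubectl patch daemonset linkerd-cni -n linkerd-cni --type merge \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'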

@mateiidavid
Member

@msuszko-vertex thanks for bringing this up, first of all. I'd like to say it's trivial to fix, but unfortunately the CNI landscape is a bit messy at times, especially when it comes to installing plugins after a cluster has been provisioned. The CNI specification itself does not offer an API or any guidance on how to do this once a cluster has started.

We introduced the validator in order to have a fail-fast guarantee when pods cannot be configured properly. Even if the init container were taken out, pods would still be scheduled before the CNI plugin installer has had time to set everything up. Issue #8120 has a bunch of background information if you want to go deeper.

Without the validator, pods will come up but will not function properly, which makes it even more painful to investigate. The validator returns a predictable error code (I think it's ERRNO 77); you can write a controller that automatically restarts pods whose validator fails with this code. We're tracking automatic restarts in #11073.

Hope that all makes sense :)
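
As a rough sketch of that restart idea (not the feature tracked in #11073), the one-off loop below deletes pods whose linkerd-network-validator init container last exited with code 77; both the container name and the exit code are assumptions based on the comment above, so verify them against your cluster before automating anything:

# List pods whose network-validator init container last terminated with exit
# code 77 (assumed above) and delete them so their owning controller recreates them.
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(.status.initContainerStatuses[]?
          | .name == "linkerd-network-validator"
            and .lastState.terminated.exitCode == 77)
      | "\(.metadata.namespace)/\(.metadata.name)"' \
  | while IFS=/ read -r ns pod; do
      kubectl delete pod -n "$ns" "$pod"
    done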

@DavidMcLaughlin
Contributor

Closing this in favor of #11073, since this is a known problem that needs a new feature to resolve (a controller-like utility to restart the pods). We would eagerly welcome contributions to help get this fixed.

@msuszko-vertex
Author

@relu Setting hostNetwork: true in the linkerd-cni DaemonSet doesn't fix the issue. Pods with linkerd-proxy injected that are started before linkerd-cni remain stuck failing on the linkerd-network-validator initContainer.

@msuszko-vertex
Author

@DavidMcLaughlin I used node taints to prevent pods from being scheduled before linkerd-cni initializes.
Nodes are created with a linkerd.io/cni-not-ready: NoSchedule taint, which linkerd-cni removes once it's ready.
This works fine, but I'm still trying to figure out why, after another CNI plugin (VPC CNI) is restarted, pods are unable to pass linkerd-network-validator until linkerd-cni is restarted. That might be unrelated to this issue.
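
To make the taint workaround concrete, a minimal sketch using kubectl; the taint key mirrors the one named above. In practice the taint is applied at node-creation time (e.g. in the nodeGroup config), the linkerd-cni DaemonSet needs a matching toleration, and the removal step would be driven by whatever detects that the plugin has finished installing:

# Keep ordinary pods off the node until the CNI plugin is installed.
kubectl taint nodes <node-name> linkerd.io/cni-not-ready=true:NoSchedule

# Once linkerd-cni has written its config on the node, remove the taint
# (the trailing "-" deletes it).
kubectl taint nodes <node-name> linkerd.io/cni-not-ready-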

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 5, 2023