When linkerd-network-validator catches missing iptables config, pod is left in a failure state #11073

Closed
alpeb opened this issue Jun 29, 2023 · 12 comments · Fixed by #11699, linkerd/linkerd2-proxy-init#306 or #11922

Comments

@alpeb
Member

alpeb commented Jun 29, 2023

linkerd/linkerd2-proxy-init#242 fixed the CNI boot ordering issue, where linkerd-cni installed its config before any other CNI plugin had a chance to install its own config for setting up the cluster's networking.

However, there remains another scenario to address. Right after the other CNI plugins deploy their configs, the scheduler starts placing pods on the node, and sometimes doesn't leave enough time for linkerd-cni to drop its own config. The injected linkerd-network-validator catches the missing iptables config and fails the pod, but that doesn't signal the kubelet to recreate the pod, so a manual restart is required. We have reproduced this issue in GKE with its default CNI plugin when the node pool is scaled.

@rootik

rootik commented Jul 6, 2023

The bug is reproducible on AKS, especially when the node pool has reached its capacity and there is a queue of unscheduled pods. When we scale out the node pool, most of the pods scheduled on the freshly added nodes get stuck with the linkerd-network-validator container in an endless Init:CrashLoopBackOff. When we kill those pods after some time, they start normally.
A temporary solution we use is to kill those pods with a shell-operator script.

@rootik

rootik commented Jul 6, 2023

A question: could linkerd CNI label the nodes it runs on as schedulable after full initialization, so we can use a nodeSelector as an indicator that pods are safe to be scheduled?
e.g. linkerd.io/cni-ready=true
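
For illustration, a workload could then gate itself on that label with a nodeSelector, along these lines (a sketch only; linkerd.io/cni-ready is the proposed label, not something linkerd-cni sets today):

  # proposed/hypothetical label, set by linkerd-cni once its config is in place
  spec:
    nodeSelector:
      linkerd.io/cni-ready: "true"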

@alpeb
Member Author

alpeb commented Jul 10, 2023

Thanks for letting us know this is an issue on AKS as well, and for the pointer to shell-operator; I wasn't aware of that project :-)
As for having linkerd-cni label its nodes, that sounds like a nice workaround. It could possibly be implemented as a postStart hook on the linkerd-cni DaemonSet. We'll keep this in mind when we prioritize this issue :-)
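
A rough sketch of what that hook could look like on the install-cni container (assumptions: kubectl is available in the image, the ServiceAccount may patch nodes, and linkerd.io/cni-ready is the hypothetical label proposed above; note that postStart fires when the container starts, not once the CNI config has actually been written, so it's only an approximation):

  # hypothetical sketch, not the actual linkerd-cni chart
  containers:
  - name: install-cni
    env:
    - name: MY_NODE
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - -c
          # requires kubectl in the image and RBAC to patch nodes
          - kubectl label node "$MY_NODE" linkerd.io/cni-ready=true --overwrite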

@rootik

rootik commented Jul 12, 2023

Thanks for looking at this.
shell-operator is not actually the right tool for this, as it doesn't watch for Warning-type events.
We've ended up with a little PowerShell script which runs in a pod alongside a kubectl proxy container, watches events filtered by fieldSelector type=Warning,reason=BackOff,regarding.kind=Pod, and deletes pods whose linkerd-network-validator container is stuck in the Init:CrashLoopBackOff state.
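
For reference, roughly the same filter can be previewed with kubectl against the core events API (our script watches events.k8s.io/v1, where the field is regarding.kind rather than involvedObject.kind):

  kubectl get events -A --watch \
    --field-selector type=Warning,reason=BackOff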

@risingspiral risingspiral added this to the stable-2.13.x-patches milestone Jul 14, 2023
@rootik

rootik commented Aug 4, 2023

For anyone stuck on this issue, the workaround below should unblock you.

Chart.yaml

---
apiVersion: v2
name: event-processor
version: 1.0.1
description: "A Helm chart to install event-processor"

templates/config-map.yaml

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-processor
  namespace: devops
data:
  {{- (.Files.Glob "scripts/*").AsConfig | nindent 2 }}

templates/deployments.yaml

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-processor
  namespace: devops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: event-processor
  template:
    metadata:
      labels:
        app: event-processor
      annotations:
        configHash: {{ .Files.Get "scripts/event-processor.ps1" | sha256sum }}
    spec:
      serviceAccountName: event-processor
      containers:
        - name: event-processor
          image: mcr.microsoft.com/powershell:latest
          imagePullPolicy: IfNotPresent
          command:
          - pwsh
          args:
          - /scripts/event-processor.ps1
          env:
          - name: DEBUG
            value: 'true'
          resources:
            limits:
              memory: 150Mi
            requests:
              memory: 100Mi
          volumeMounts:
          - name: scripts
            mountPath: /scripts
        - name: kube-proxy
          image: bitnami/kubectl:latest
          command:
          - kubectl
          args:
          - proxy
          - --port=8080
          - --keepalive=10s
          resources:
            limits:
              memory: 32Mi
            requests:
              memory: 20Mi
      volumes:
        - name: scripts
          configMap:
            name: event-processor
            defaultMode: 0755

templates/rbac.yaml

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: event-processor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: event-processor
rules:
- apiGroups:
  - ""
  - events.k8s.io
  resources:
  - events
  verbs:
  - get
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: event-processor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: event-processor
subjects:
  - kind: ServiceAccount
    name: event-processor
    namespace: devops

scripts/event-processor.ps1

$ErrorActionPreference = 'Continue'
function kubeApi() {
  param(
    [string]$Api,
    [string]$Kind,
    [string]$Name,
    [string]$Namespace,
    [string]$Action,
    [string]$Body
  )
  # core resources (pods) live under /api/v1; named API groups live under /apis
  if ($Kind -eq 'pods') { $apiSegment = 'api' }
  else { $apiSegment = 'apis' }
  $uriRequest = [System.UriBuilder]"http://127.0.0.1:8080/$apiSegment/$Api/namespaces/$namespace/$kind/$name"
  if ($Action -eq 'patch') {
    $contentType = @{
      'Content-Type' = 'application/strategic-merge-patch+json'
    }
  }

  $statusCode = 200
  $res = try { Invoke-RestMethod -Method $action -Headers $contentType -Body $Body -Uri $uriRequest.Uri } catch { $statusCode = $_.Exception.Response.StatusCode.value__ }
  if ($env:DEBUG -eq 'true') {
    Write-Host $Action.ToUpper() $uriRequest.Uri '-' $statusCode -ForegroundColor Blue
  }
  Write-Output @{ res = $res; statusCode = $statusCode }
}

Start-Sleep -Seconds 10

Add-Type -AssemblyName System.Web

$queryString= [System.Web.HttpUtility]::ParseQueryString([String]::Empty)
$queryString.Add('watch', 'true')
$queryString.Add('sendInitialEvents', 'false')
$queryString.Add('fieldSelector', 'type=Warning,reason=BackOff,regarding.kind=Pod')

$uriRequest = [System.UriBuilder]'http://127.0.0.1:8080/apis/events.k8s.io/v1/events'

$uriRequest.Query = $queryString.ToString()

while ($true) {
  try {
    $request = [System.Net.WebRequest]::Create($uriRequest.Uri.OriginalString)
    $request.KeepAlive = $true

    $resp = $request.GetResponse()

    $reqstream = $resp.GetResponseStream()

    $sr = New-Object System.IO.StreamReader $reqstream

    while (!$sr.EndOfStream) {

      $line = $sr.ReadLine();

      $k8sEvent = (ConvertFrom-Json $line -Depth 10).object

      if ($k8sEvent) {
        $timestamp = $k8sEvent.metadata.creationTimestamp
        $pod = $k8sEvent.regarding.name
        $namespace = $k8sEvent.regarding.namespace
        $reason = $k8sEvent.reason
        $note = $k8sEvent.note
        $container = $k8sEvent.regarding.fieldPath -replace '.+\{(.+)\}', '$1'
        $action = 'skip'
        $result = 0

        if ($container -eq 'linkerd-network-validator') {
          # pod with linkerd CNI ordering issue
          # just kill it with fire
          $resource = kubeApi -Kind pods -Name $pod -Namespace $namespace -Action delete -Api v1
          if ($resource.statusCode -eq 200) {
            $action = 'delete'
            $resource = $resource.res.name
            $result = 200
          }
          else {
            # capture the API status code before overwriting $resource with a display name
            $result = $resource.statusCode
            $resource = "pod/$pod"
          }
          $kind = 'Pod'

        }
        [ordered]@{
          timestamp    = $timestamp;
          pod          = $pod;
          namespace    = $namespace;
          reason       = $reason;
          note         = $note;
          container    = $container;
          action       = $action;
          result       = $result
        } | ConvertTo-Json -Compress
      }
    }
    Write-Host '*** Stream closed ***'
  }
  finally {
    # the reader may not exist if the request failed before a stream was opened
    if ($sr) { $sr.Close() }
  }
}

This can be deployed with helm upgrade --install event-processor -n devops .

@steve-gray
Contributor

We're seeing this issue frequently when using spot/preemptible nodes on Oracle Cloud too. It's a bit of a painful one for us at the moment.

@mateiidavid
Member

I had an interesting conversation with someone on Slack. There may be an alternative to creating a controller that restarts failed pods.

Cilium (and perhaps other CNI distributions) relies on node taints to block pods from executing: nodes are registered with a taint, and once the plugin has finished initializing, an operator removes the taint. Our DaemonSet would be capable of similar behaviour. Provided this doesn't result in multiple distributions writing to the same node at the same time, it might be an easier and cleaner solution than directly descheduling pods. Added some reference literature at the bottom.

https://docs.cilium.io/en/stable/installation/taints/

@michalschott

michalschott commented Oct 11, 2023

⬆️ Same here on EKS with spot instances and Cilium CNI instead of the default aws-vpc-cni.

@michalschott

michalschott commented Oct 12, 2023

The workaround that worked for me instead is:

  • update cluster role with
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["patch"]
  • add sidecar container:
      - name: kubectl
        image: alpine/k8s:1.25.14
        command: ["/bin/sh", "-c"]
        args:
          - echo "Removing taint";
            /usr/bin/kubectl taint node $MY_NODE linkerd-cni=NotReady:NoSchedule-;
            echo "Sleeping";
            while true; do sleep 30; done;
        env:
        - name: MY_NODE
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - kill -15 1; sleep 15s
        resources:
          limits:
            memory: "32Mi"
          requests:
            cpu: "1m"
            memory: "32Mi"
        securityContext:
          readOnlyRootFilesystem: true
          privileged: false

Remember to configure your nodes/nodegroups so that nodes are registered with the linkerd-cni=NotReady:NoSchedule taint.
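
For example, on self-managed nodes the taint can be applied at registration time via the kubelet flag (managed node groups usually expose an equivalent taint setting):

  --register-with-taints=linkerd-cni=NotReady:NoSchedule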

@rootik

rootik commented Oct 12, 2023

@michalschott would it be better to make the install-cni container an init container? That would guarantee that your container starts only after the linkerd CNI config is installed, and the container wouldn't need to sleep for 30 seconds.
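
A minimal sketch of that ordering, assuming install-cni can run to completion as an init container (in the stock chart it stays running to watch for config changes, so this may not apply as-is; image tags here are illustrative):

  initContainers:
  - name: install-cni
    image: cr.l5d.io/linkerd/cni-plugin:v1.2.0   # illustrative tag
    # ... usual install-cni configuration ...
  containers:
  - name: remove-taint
    image: alpine/k8s:1.25.14
    command: ["/bin/sh", "-c"]
    args:
      - /usr/bin/kubectl taint node "$MY_NODE" linkerd-cni=NotReady:NoSchedule- || true;
        while true; do sleep 3600; done
    env:
    - name: MY_NODE
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName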

@michalschott

@rootik yes, that's also an option, but I was looking for as small a kustomize patch as possible.

@alpeb alpeb added the bug label Oct 12, 2023
@steve-gray
Contributor

That's very interesting. What cluster role did you edit @michalschott, and is that change/patch made to a particular DaemonSet? Just want to make sure we can mirror this correctly, as we're seeing the issue with spot instances and it's quite painful.

alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this issue Dec 5, 2023
Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire proper network
config because `linkerd-cni` and/or the cluster's network CNI haven't
fully started. They are left in a permanent crash loop and once CNI is
ready, they need to be restarted externally, which is what this
controller does.

This controller "`linkerd-reinitialize-pods`" watches over events on
pods in the current node, which have been injected but are in a
terminated state and whose `linkerd-network-validator` container exited
with code 95, and proceeds to evict them so they can restart with a
proper network config.

The controller is to be deployed as an additional container in the
`linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#xxx).

## TO-DOs

- Figure why `/metrics` is returning a 404 (should show process metrics)
- Integration test
alpeb added a commit that referenced this issue Dec 5, 2023
Followup to linkerd/linkerd2-proxy-init#306
Fixes #11073

This adds the `reinitialize-pods` container to the `linkerd-cni`
DaemonSet, along with its config in `values.yaml`.

Also the `linkerd-cni`'s version is bumped, to contain the new binary
for this controller.

## TO-DOs

- Integration test
alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this issue Jan 2, 2024
Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire proper network config because `linkerd-cni` and/or the cluster's network CNI haven't fully started. They are left in a permanent crash loop and once CNI is ready, they need to be restarted externally, which is what this controller does.

This controller "`linkerd-cni-repair-controller`" watches over events on pods in the current node, which have been injected but are in a terminated state and whose `linkerd-network-validator` container exited with code 95, and proceeds to delete them so they can restart with a proper network config.

The controller is to be deployed as an additional container in the `linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#11699).

This exposes two custom counter metrics: `linkerd_cni_repair_controller_queue_overflow` (in the spirit of the destination controller's `endpoint_updates_queue_overflow`) and `linkerd_cni_repair_controller_deleted`
mateiidavid added a commit that referenced this issue Jan 12, 2024
This edge release introduces a number of different fixes and improvements. More
notably, it introduces a new `cni-repair-controller` binary to the CNI plugin
image. The controller will automatically restart pods that have not received
their iptables configuration.

* Removed shortnames from Tap API resources to avoid colliding with existing
  Kubernetes resources ([#11816]; fixes [#11784])
* Introduced a new ExternalWorkload CRD to support upcoming mesh expansion
  feature ([#11805])
* Changed `MeshTLSAuthentication` resource validation to allow SPIFFE URI
  identities ([#11882])
* Introduced a new `cni-repair-controller` to the `linkerd-cni` DaemonSet to
  automatically restart misconfigured pods that are missing iptables rules
  ([#11699]; fixes [#11073])
* Fixed a `"duplicate metrics"` warning in the multicluster service-mirror
  component ([#11875]; fixes [#11839])
* Added metric labels and weights to `linkerd diagnostics endpoints` json
  output ([#11889])
* Changed how `Server` updates are handled in the destination service. The
  change will ensure that during a cluster resync, consumers won't be
  overloaded by redundant updates ([#11907])
* Changed `linkerd install` error output to add a newline when a Kubernetes
  client cannot be successfully initialised ([#11917])

[#11816]: #11816
[#11784]: #11784
[#11805]: #11805
[#11882]: #11882
[#11699]: #11699
[#11073]: #11073
[#11875]: #11875
[#11839]: #11839
[#11889]: #11889
[#11907]: #11907
[#11917]: #11917

Signed-off-by: Matei David <[email protected]>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 2, 2024