When linkerd-network-validator catches missing iptables config, pod is left in a failure state #11073
Comments
The bug is reproducible on AKS, especially when the node pool has reached its capacity and there is a queue of unscheduled pods. When we scale out the node pool, most of the pods scheduled on the freshly added nodes were stuck.
A question: could linkerd CNI label the nodes it runs on as schedulable after full initialization, so we can use that label?
Thanks for letting us know this is an issue in AKS as well, and for the pointer to shell-operator; I wasn't aware of that project :-)
Thanks for looking at this.
Whoever is stuck with the issue, this will unblock them:

Chart.yaml
---
apiVersion: v2
name: event-processor
version: 1.0.1
description: "A Helm chart to install event-processor"

templates/config-map.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: event-processor
  namespace: devops
data:
  {{- (.Files.Glob "scripts/*").AsConfig | nindent 2 }}

templates/deployments.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-processor
  namespace: devops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: event-processor
  template:
    metadata:
      labels:
        app: event-processor
      annotations:
        configHash: {{ .Files.Get "scripts/event-processor.ps1" | sha256sum }}
    spec:
      serviceAccountName: event-processor
      containers:
        - name: event-processor
          image: mcr.microsoft.com/powershell:latest
          imagePullPolicy: IfNotPresent
          command:
            - pwsh
          args:
            - /scripts/event-processor.ps1
          env:
            - name: DEBUG
              value: 'true'
          resources:
            limits:
              memory: 150Mi
            requests:
              memory: 100Mi
          volumeMounts:
            - name: scripts
              mountPath: /scripts
        - name: kube-proxy
          image: bitnami/kubectl:latest
          command:
            - kubectl
          args:
            - proxy
            - --port=8080
            - --keepalive=10s
          resources:
            limits:
              memory: 32Mi
            requests:
              memory: 20Mi
      volumes:
        - name: scripts
          configMap:
            name: event-processor
            defaultMode: 0755

templates/rbac.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: event-processor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: event-processor
rules:
  - apiGroups:
      - "*"
      - events.k8s.io
    resources:
      - events
    verbs:
      - get
      - watch
      - list
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
      - list
      - delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: event-processor
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: event-processor
subjects:
  - kind: ServiceAccount
    name: event-processor
    namespace: devops

scripts/event-processor.ps1
$ErrorActionPreference = 'Continue'
function kubeApi() {
    # Minimal helper around the local `kubectl proxy` endpoint (the kube-proxy
    # sidecar in the deployment) for get/delete/patch calls against the API.
    param(
        [string]$Api,
        [string]$Kind,
        [string]$Name,
        [string]$Namespace,
        [string]$Action,
        [string]$Body
    )
    # Core resources (pods) live under /api, everything else under /apis.
    if ($Kind -eq 'pods') { $apiSegment = 'api' }
    else { $apiSegment = 'apis' }
    $uriRequest = [System.UriBuilder]"http://127.0.0.1:8080/$apiSegment/$Api/namespaces/$Namespace/$Kind/$Name"
    if ($Action -eq 'patch') {
        $contentType = @{
            'Content-Type' = 'application/strategic-merge-patch+json'
        }
    }
    $statusCode = 200
    $res = try { Invoke-RestMethod -Method $Action -Headers $contentType -Body $Body -Uri $uriRequest.Uri } catch { $statusCode = $_.Exception.Response.StatusCode.value__ }
    if ($env:DEBUG -eq 'true') {
        Write-Host $Action.ToUpper() $uriRequest.Uri '-' $statusCode -ForegroundColor Blue
    }
    Write-Output @{ res = $res; statusCode = $statusCode }
}

# Give the kubectl-proxy sidecar time to come up before starting the watch.
Start-Sleep -Seconds 10

Add-Type -AssemblyName System.Web
$queryString = [System.Web.HttpUtility]::ParseQueryString([String]::Empty)
$queryString.Add('watch', 'true')
$queryString.Add('sendInitialEvents', 'false')
# Only watch Warning/BackOff events regarding pods.
$queryString.Add('fieldSelector', 'type=Warning,reason=BackOff,regarding.kind=Pod')
$uriRequest = [System.UriBuilder]'http://127.0.0.1:8080/apis/events.k8s.io/v1/events'
$uriRequest.Query = $queryString.ToString()

while ($true) {
    try {
        # Open a long-lived watch on the events stream and read it line by line.
        $request = [System.Net.WebRequest]::Create($uriRequest.Uri.OriginalString)
        $request.KeepAlive = $true
        $resp = $request.GetResponse()
        $reqstream = $resp.GetResponseStream()
        $sr = New-Object System.IO.StreamReader $reqstream
        while (!$sr.EndOfStream) {
            $line = $sr.ReadLine()
            $k8sEvent = (ConvertFrom-Json $line -Depth 10).object
            if ($k8sEvent) {
                $timestamp = $k8sEvent.metadata.creationTimestamp
                $pod = $k8sEvent.regarding.name
                $namespace = $k8sEvent.regarding.namespace
                $reason = $k8sEvent.reason
                $note = $k8sEvent.note
                $container = $k8sEvent.regarding.fieldPath -replace '.+\{(.+)\}', '$1'
                $action = 'skip'
                $result = 0
                if ($container -eq 'linkerd-network-validator') {
                    # Pod hit the linkerd CNI ordering issue -- just kill it with
                    # fire so its controller recreates it with a working network.
                    $resource = kubeApi -Kind pods -Name $pod -Namespace $namespace -Action delete -Api v1
                    if ($resource.statusCode -eq 200) {
                        $action = 'delete'
                        # The DELETE call returns the Pod object itself.
                        $resource = $resource.res.metadata.name
                        $result = 200
                    }
                    else {
                        # Record the status code before overwriting $resource.
                        $result = $resource.statusCode
                        $resource = "pod/$pod"
                    }
                    $kind = 'Pod'
                }
                # Emit a one-line JSON log entry for every processed event.
                [ordered]@{
                    timestamp = $timestamp;
                    pod = $pod;
                    namespace = $namespace;
                    reason = $reason;
                    note = $note;
                    container = $container;
                    action = $action;
                    result = $result
                } | ConvertTo-Json -Compress
            }
        }
        Write-Host '*** Stream closed ***'
    }
    finally {
        if ($sr) { $sr.Close() }
    }
}

This can be deployed with Helm.
We're seeing this issue frequently when using spot/preemptible nodes on Oracle Cloud too. It's a bit of a painful one for us at the moment.
I had an interesting conversation with someone on Slack. There may be an alternative to creating a controller that restarts failed pods. Cilium (and perhaps other CNI distributions) relies on node taints to block pods from executing: nodes get registered with a taint, and when the plugin has finished executing, an operator removes the taint. Our DaemonSet would be capable of similar behaviour. Provided this doesn't result in multiple distributions writing to the same node at the same time, it might be an easier and cleaner solution than directly descheduling pods. Added some reference literature at the bottom.
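For illustration, here is a minimal sketch of that taint-based pattern. The taint key `example.linkerd.io/cni-not-ready` is invented for this sketch (Cilium's documented equivalent is `node.cilium.io/agent-not-ready`), and the image tag is a placeholder; this is not how linkerd-cni works today, only a sketch of how it could.

```yaml
# Hypothetical startup taint -- the key below is made up for this sketch.
# Nodes would join the cluster already tainted, e.g. via the kubelet flag
# --register-with-taints=example.linkerd.io/cni-not-ready=true:NoSchedule
# or the cloud provider's node pool settings.
apiVersion: v1
kind: Node
metadata:
  name: worker-1
spec:
  taints:
    - key: example.linkerd.io/cni-not-ready
      value: "true"
      effect: NoSchedule   # ordinary pods stay Pending until the taint is removed
---
# The CNI DaemonSet tolerates the taint so it can still run and install its
# config; once initialized, it (or an operator) would remove the taint from
# the node, letting application pods schedule with networking in place.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: linkerd-cni
  namespace: linkerd-cni
spec:
  selector:
    matchLabels:
      k8s-app: linkerd-cni
  template:
    metadata:
      labels:
        k8s-app: linkerd-cni
    spec:
      tolerations:
        - key: example.linkerd.io/cni-not-ready
          operator: Exists
      containers:
        - name: install-cni
          image: cr.l5d.io/linkerd/cni-plugin:v1.2.0   # illustrative tag
```

Most managed node pools (AKS, EKS, GKE) can be configured to register nodes with such a taint, which would avoid the race without any pod-deleting controller.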
Seeing the same on EKS with spot instances and the Cilium CNI instead of the default aws-vpc-cni one.
The workaround which worked for me instead is:
Remember to configure your nodes/nodegroups so nodes are registered with the taint.
@michalschott would it be better to make the install-cni container an init container? That would guarantee that your container starts after the linkerd CNI is installed, and the container won't need to sleep for 30 seconds.
@rootik yes, that's also an option, but I was looking for as small a kustomize patch as possible there.
That's very interesting. What cluster role did you edit, @michalschott, and is that change/patch being made to a particular daemonset? Just want to make sure we can correctly mirror this, as we're seeing the issue with spot instances and it's quite painful.
Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire a proper network config because `linkerd-cni` and/or the cluster's network CNI haven't fully started. They are left in a permanent crash loop and, once the CNI is ready, they need to be restarted externally, which is what this controller does.

This controller, `linkerd-reinitialize-pods`, watches events on pods in the current node which have been injected but are in a terminated state and whose `linkerd-network-validator` container exited with code 95, and proceeds to evict them so they can restart with a proper network config.

The controller is to be deployed as an additional container in the `linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#xxx).

## TO-DOs

- Figure out why `/metrics` is returning a 404 (should show process metrics)
- Integration test
Followup to linkerd/linkerd2-proxy-init#306

Fixes #11073

This adds the `reinitialize-pods` container to the `linkerd-cni` DaemonSet, along with its config in `values.yaml`. Also, the `linkerd-cni` version is bumped to contain the new binary for this controller.

## TO-DOs

- Integration test
Fixes linkerd/linkerd2#11073

This fixes the issue of injected pods that cannot acquire a proper network config because `linkerd-cni` and/or the cluster's network CNI haven't fully started. They are left in a permanent crash loop and, once the CNI is ready, they need to be restarted externally, which is what this controller does.

This controller, `linkerd-cni-repair-controller`, watches events on pods in the current node which have been injected but are in a terminated state and whose `linkerd-network-validator` container exited with code 95, and proceeds to delete them so they can restart with a proper network config.

The controller is to be deployed as an additional container in the `linkerd-cni` DaemonSet (addressed in linkerd/linkerd2#11699).

This exposes two custom counter metrics: `linkerd_cni_repair_controller_queue_overflow` (in the spirit of the destination controller's `endpoint_updates_queue_overflow`) and `linkerd_cni_repair_controller_deleted`.
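For anyone wanting to try this once it ships, a rough sketch of what enabling it through the linkerd2-cni chart might look like. The PR only says the controller is an extra container in the `linkerd-cni` DaemonSet configured via `values.yaml`; the `repairController` key names below are assumptions, so check the chart's actual values before using them.

```yaml
# Hypothetical values.yaml override for the linkerd2-cni chart -- key names assumed.
repairController:
  enabled: true        # add the repair container to the linkerd-cni DaemonSet
  logLevel: info
  resources:
    memory:
      request: 20Mi
      limit: 50Mi
```

Such an override would be applied with an ordinary `helm upgrade` of the linkerd2-cni release.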
This edge release introduces a number of fixes and improvements. Most notably, it introduces a new `cni-repair-controller` binary to the CNI plugin image. The controller automatically restarts pods that have not received their iptables configuration.

* Removed shortnames from Tap API resources to avoid colliding with existing Kubernetes resources ([#11816]; fixes [#11784])
* Introduced a new ExternalWorkload CRD to support the upcoming mesh expansion feature ([#11805])
* Changed `MeshTLSAuthentication` resource validation to allow SPIFFE URI identities ([#11882])
* Introduced a new `cni-repair-controller` to the `linkerd-cni` DaemonSet to automatically restart misconfigured pods that are missing iptables rules ([#11699]; fixes [#11073])
* Fixed a `"duplicate metrics"` warning in the multicluster service-mirror component ([#11875]; fixes [#11839])
* Added metric labels and weights to `linkerd diagnostics endpoints` JSON output ([#11889])
* Changed how `Server` updates are handled in the destination service, ensuring that during a cluster resync consumers won't be overloaded by redundant updates ([#11907])
* Changed `linkerd install` error output to add a newline when a Kubernetes client cannot be successfully initialised ([#11917])

[#11816]: #11816
[#11784]: #11784
[#11805]: #11805
[#11882]: #11882
[#11699]: #11699
[#11073]: #11073
[#11875]: #11875
[#11839]: #11839
[#11889]: #11889
[#11907]: #11907
[#11917]: #11917

Signed-off-by: Matei David <[email protected]>
linkerd/linkerd2-proxy-init#242 fixed the CNI boot ordering issue, where linkerd-cni installed its config before any other CNI plugin had a chance to install its own config for setting up the cluster's networking.
However, there remains another scenario to address. Right after the other CNI plugins deploy their configs, the scheduler starts scheduling pods, and sometimes it doesn't leave time for linkerd-cni to drop its own config. The injected linkerd-network-validator catches that and fails the pod, but that failure doesn't signal the scheduler to restart the pod, so a manual restart is required. We have reproduced this issue in GKE with its default CNI plugin when the node pool is scaled.