Safeguard against outdated /dev/disk/by-id/ symlinks that can lead Pod to mount the wrong volume #1224
Comments
We are seeing a similar issue on our end, but we are not sure it's related to the cloud-init bug, given the cloud-init version on our nodes. In our case the node root volume is mounted instead of the actual volume on pod restart. It has already caused at least two production outages and we had no luck replicating it on our development environments.

Edit: I think our issue is more closely related to #1166
We hit the same issue on our side a lot of times. We believe the root cause for this issue is fixed by kubernetes/kubernetes#100183 (also backported to release-1.21 and present in K8s 1.21.9+). Hence, I would recommend upgrading to 1.21.9+. So far we haven't hit this issue since we upgraded to 1.21.10.
Thanks for the info, that shed a lot of light on our issue! We are at the mercy of AWS with regard to Kubernetes patch upgrades. Now waiting until AWS EKS rolls out a patched version.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

/remove-lifecycle stale

The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

/remove-lifecycle rotten

The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

/remove-lifecycle rotten
/sig storage
/kind bug
What happened?
For nvme volumes the aws-ebs-csi-driver relies on the `/dev/disk/by-id/` symlink to be able to determine the nvme device attached for the given volume id (see `aws-ebs-csi-driver/pkg/driver/node_linux.go`, lines 83 to 88 in 0de2586).
This symlink is updated by udev rules that react to kernel attach/detach events.
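For illustration only, here is a minimal sketch of this kind of lookup; it is not the actual code from `node_linux.go`, and the `nvme-Amazon_Elastic_Block_Store_` link prefix and the helper name `findNVMeDevicePath` are assumptions of this sketch:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// devDiskByIDDir and the link prefix are assumptions for this sketch; on EBS
// Nitro instances udev typically creates links such as
// /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0123456789abcdef0.
const (
	devDiskByIDDir = "/dev/disk/by-id"
	ebsLinkPrefix  = "nvme-Amazon_Elastic_Block_Store_"
)

// findNVMeDevicePath resolves an EBS volume ID (e.g. "vol-0123456789abcdef0")
// to the NVMe device node the by-id symlink currently points to
// (e.g. "/dev/nvme3n1").
func findNVMeDevicePath(volumeID string) (string, error) {
	// The serial embedded in the link name is the volume ID without the dash.
	serial := strings.Replace(volumeID, "-", "", 1)
	link := filepath.Join(devDiskByIDDir, ebsLinkPrefix+serial)

	// EvalSymlinks follows the udev-managed symlink to the real device node.
	// If udev has not processed the attach event yet (or processed it late),
	// this link may be missing or may still point at a stale device.
	device, err := filepath.EvalSymlinks(link)
	if err != nil {
		return "", fmt.Errorf("resolving %s: %w", link, err)
	}
	return device, nil
}

func main() {
	device, err := findNVMeDevicePath("vol-0123456789abcdef0")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("device:", device)
}
```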
We also know that during Pod restarts volumes can be detached and quickly reattached at another location (i.e. `/dev/nvme3n1` now, but after a detach/attach cycle it would be `/dev/nvme4n1`).

A known cloud-init bug causes udev rules to be processed with a huge delay. To demonstrate the impact of the corresponding bug, let's compare the `udevadm monitor -s block` output on a "healthy" and an "affected" Node:

"healthy" Node

"affected" Node
In such a case, aws-ebs-csi-driver can use an outdated `/dev/disk/by-id/` symlink (one that, for example, points to `/dev/nvme3n1` for vol-1 while vol-1 is already attached to `/dev/nvme4n1`) and consequently mount the wrong nvme device into the Pod.

What you expected to happen?
A safeguarding mechanism to exist in aws-ebs-csi-driver. We assume that the device's (`/dev/nvme4n1`) creation timestamp reflects the attachment time. The driver can ensure that the `/dev/disk/by-id/` symlink timestamp is greater than the device's timestamp; otherwise it would mean that the device was attached AFTER the symlink was created (by udev).
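A minimal sketch of the suggested check, assuming file modification times can serve as a proxy for creation/attachment times (the function name and error handling are illustrative, not the driver's actual implementation):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// verifySymlinkIsNewerThanDevice sketches the suggested safeguard: if the
// /dev/disk/by-id/... symlink was created (by udev) BEFORE the device node it
// points to, the link may still describe a previous attachment and must not
// be trusted. Modification time is used as an approximation of creation time,
// which is an assumption of this sketch.
func verifySymlinkIsNewerThanDevice(link string) (string, error) {
	// Lstat returns information about the symlink itself, not its target.
	linkInfo, err := os.Lstat(link)
	if err != nil {
		return "", fmt.Errorf("stat symlink %s: %w", link, err)
	}

	device, err := filepath.EvalSymlinks(link)
	if err != nil {
		return "", fmt.Errorf("resolving %s: %w", link, err)
	}

	// Stat on the resolved path describes the device node.
	deviceInfo, err := os.Stat(device)
	if err != nil {
		return "", fmt.Errorf("stat device %s: %w", device, err)
	}

	// If the device node is newer than the symlink, the volume was attached
	// AFTER udev created the link, i.e. the link is potentially outdated.
	if deviceInfo.ModTime().After(linkInfo.ModTime()) {
		return "", fmt.Errorf("symlink %s is older than device %s, refusing to mount", link, device)
	}
	return device, nil
}

func main() {
	device, err := verifySymlinkIsNewerThanDevice(
		"/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0123456789abcdef0")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("safe to use device:", device)
}
```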
How to reproduce it (as minimally and precisely as possible)?

Create a single Node cluster with a Linux distro that is affected by the cloud-init bug.
Create 4 StatefulSets and 1 pause Deployment with 30 replicas (the pause Deployment is needed to trigger the cloud-init bug).
$ k create deploy pause --image busybox --replicas 30 -- sh -c "sleep 100d"
Roll out the 4 StatefulSets and the Deployment:
$ k rollout restart deploy pause; k rollout restart sts app1 app2 app3 app4
Make sure that Pods mount the wrong PV:
The following command should list the volumes in increasing order by their size (i.e. `app1-0` has to mount the volume with size `1Gi`, etc.). Make sure that in some cases the order is wrong and the Pod mounts the wrong PV:

In this case we see Pod `app1-0` mounts the volume of `app2-0`.
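One additional way to check which EBS volume actually backs a given NVMe device (useful when deciding whether a Pod got the wrong volume) is to compare the device's NVMe serial with the expected volume ID; on EBS the serial is the volume ID without the dash. The sysfs path used below (`/sys/block/<dev>/device/serial`) is an assumption of this sketch and may differ across kernel versions:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// volumeIDForDevice reads the NVMe serial of a block device (e.g. "nvme3n1")
// from sysfs. For EBS volumes the serial is the volume ID without the dash,
// e.g. "vol0123456789abcdef0". The sysfs location is an assumption of this
// sketch and may vary by kernel.
func volumeIDForDevice(dev string) (string, error) {
	raw, err := os.ReadFile(filepath.Join("/sys/block", dev, "device", "serial"))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(raw)), nil
}

func main() {
	// Example: check whether nvme3n1 really is vol-0123456789abcdef0.
	expected := strings.Replace("vol-0123456789abcdef0", "-", "", 1)

	serial, err := volumeIDForDevice("nvme3n1")
	if err != nil {
		fmt.Println("could not read serial:", err)
		return
	}
	if serial == expected {
		fmt.Println("device belongs to the expected volume")
	} else {
		fmt.Printf("mismatch: device reports %q, expected %q\n", serial, expected)
	}
}
```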
.Environment
- Kubernetes version (use `kubectl version`): v1.21.10
- Driver version: v1.5.0
Credits to @dguendisch for all of the investigations and the safeguarding suggestion!