
Safeguard against outdated /dev/disk/by-id/ symlinks that can lead Pod to mount the wrong volume #1224

Closed
ialidzhikov opened this issue May 2, 2022 · 13 comments · Fixed by #1878
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@ialidzhikov
Contributor

ialidzhikov commented May 2, 2022

/sig storage
/kind bug

What happened?

For NVMe volumes the aws-ebs-csi-driver relies on the /dev/disk/by-id/ symlink to determine which NVMe device is attached for a given volume ID.

// AWS recommends identifying devices by volume ID
// (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html),
// so find the nvme device path using volume ID. This is the magic name on
// which AWS presents NVME devices under /dev/disk/by-id/. For example,
// vol-0fab1d5e3f72a5e23 creates a symlink at
// /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0fab1d5e3f72a5e23

This symlink is updated by udev rules that react on kernel attach/detach events.
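For illustration, here is a rough Go sketch (not the driver's actual code; the function name and error handling are made up) of how such a lookup can work: drop the dash from the volume ID, build the /dev/disk/by-id/ name shown above, and resolve the symlink to the device node:

    package main

    import (
        "fmt"
        "path/filepath"
        "strings"
    )

    // nvmeDevicePath builds the /dev/disk/by-id/ name described above
    // (volume ID with the dash dropped) and resolves the symlink to the
    // actual NVMe device node, e.g. /dev/nvme3n1.
    func nvmeDevicePath(volumeID string) (string, error) {
        link := filepath.Join("/dev/disk/by-id",
            "nvme-Amazon_Elastic_Block_Store_"+strings.ReplaceAll(volumeID, "-", ""))
        // EvalSymlinks follows the udev-managed symlink; if udev lags behind,
        // this can still resolve to a device the volume is no longer attached to.
        return filepath.EvalSymlinks(link)
    }

    func main() {
        dev, err := nvmeDevicePath("vol-0fab1d5e3f72a5e23")
        if err != nil {
            fmt.Println("lookup failed:", err)
            return
        }
        fmt.Println("resolved device:", dev)
    }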

We also know that during Pod restarts volumes can be detached and quickly reattached at another location (e.g. /dev/nvme3n1 now, but /dev/nvme4n1 after a detach/attach cycle).

A known cloud-init bug causes udev rules to be processed with a huge delay. To demonstrate its impact, let's compare the udevadm monitor -s block output on a "healthy" and an "affected" Node:

  • "healthy" Node

    $ udevadm monitor -s block
    monitor will print the received events for:
    UDEV - the event which udev sends out after rule processing
    KERNEL - the kernel uevent
    
    KERNEL[735.827384] change   /devices/pci0000:00/0000:00:1d.0/nvme/nvme3/nvme3n1 (block)
    UDEV  [735.850150] change   /devices/pci0000:00/0000:00:1d.0/nvme/nvme3/nvme3n1 (block)
    KERNEL[737.628955] change   /devices/pci0000:00/0000:00:1c.0/nvme/nvme4/nvme4n1 (block)
    KERNEL[737.657828] change   /devices/pci0000:00/0000:00:1e.0/nvme/nvme2/nvme2n1 (block)
    UDEV  [737.681569] change   /devices/pci0000:00/0000:00:1c.0/nvme/nvme4/nvme4n1 (block)
    UDEV  [737.695555] change   /devices/pci0000:00/0000:00:1e.0/nvme/nvme2/nvme2n1 (block)
    KERNEL[738.219222] change   /devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1 (block)
    UDEV  [738.246151] change   /devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1 (block)
    
  • "affected" Node

    $ udevadm monitor -s block
    monitor will print the received events for:
    UDEV - the event which udev sends out after rule processing
    KERNEL - the kernel uevent
    
    KERNEL[1078.716035] change   /devices/pci0000:00/0000:00:1c.0/nvme/nvme4/nvme4n1 (block)
    KERNEL[1078.920058] change   /devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1 (block)
    KERNEL[1079.209415] change   /devices/pci0000:00/0000:00:1e.0/nvme/nvme2/nvme2n1 (block)
    KERNEL[1088.412335] change   /devices/pci0000:00/0000:00:1d.0/nvme/nvme3/nvme3n1 (block)
    UDEV  [1122.943263] change   /devices/pci0000:00/0000:00:1c.0/nvme/nvme4/nvme4n1 (block)
    UDEV  [1122.982514] change   /devices/pci0000:00/0000:00:1f.0/nvme/nvme1/nvme1n1 (block)
    UDEV  [1123.093665] change   /devices/pci0000:00/0000:00:1e.0/nvme/nvme2/nvme2n1 (block)
    UDEV  [1148.605915] change   /devices/pci0000:00/0000:00:1d.0/nvme/nvme3/nvme3n1 (block)
    

In such a case, the aws-ebs-csi-driver can use an outdated /dev/disk/by-id/ symlink (one that, for example, still points to /dev/nvme3n1 for vol-1 although vol-1 is already attached as /dev/nvme4n1) and consequently mount the wrong NVMe device into the Pod.

What you expected to happen?

A safeguarding mechanism should exist in the aws-ebs-csi-driver. We assume that the device node's (e.g. /dev/nvme4n1) creation timestamp reflects the attachment time. The driver can then ensure that the /dev/disk/by-id/ symlink's timestamp is newer than the device's timestamp; otherwise the device was attached AFTER the symlink was (re)created by udev, i.e. the symlink may be outdated.
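A minimal Go sketch of such a check (not the driver's implementation; the function name is illustrative, and it assumes, as described above, that the device node's modification time reflects the attach time). os.Lstat inspects the symlink itself, while os.Stat follows it to the device node:

    package main

    import (
        "fmt"
        "os"
    )

    // checkSymlinkFreshness returns an error when the /dev/disk/by-id/ symlink
    // is older than the device node it points to, i.e. the device was attached
    // after udev last (re)created the symlink, so the symlink may be outdated.
    func checkSymlinkFreshness(link string) error {
        linkInfo, err := os.Lstat(link) // the symlink itself
        if err != nil {
            return err
        }
        devInfo, err := os.Stat(link) // follows the symlink to the device node
        if err != nil {
            return err
        }
        if linkInfo.ModTime().Before(devInfo.ModTime()) {
            return fmt.Errorf("symlink %s is older than the device node it points to; it may be outdated", link)
        }
        return nil
    }

    func main() {
        link := "/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0fab1d5e3f72a5e23"
        if err := checkSymlinkFreshness(link); err != nil {
            fmt.Println(err)
        }
    }

With such a check in place the driver could fail and retry the node operation instead of mounting a potentially wrong device.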

How to reproduce it (as minimally and precisely as possible)?

  1. Create a single Node cluster with a Linux distro that is affected by the cloud-init bug.

  2. Create 4 StatefulSets and 1 pause Deployment with 30 replicas (the pause Deployment is needed to trigger the cloud-init bug).

    Manifests:
    apiVersion: v1
    kind: Service
    metadata:
      name: app1
      labels:
        app: app1
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app1
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app1
    spec:
      serviceName: app1
      replicas: 1
      selector:
        matchLabels:
          app: app1
      template:
        metadata:
          labels:
            app: app1
        spec:
          containers:
            - name: app1
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app1
                mountPath: /data
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app1
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 1Gi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: app2
      labels:
        app: app2
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app2
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app2
    spec:
      serviceName: app2
      replicas: 1
      selector:
        matchLabels:
          app: app2
      template:
        metadata:
          labels:
            app: app2
        spec:
          containers:
            - name: app2
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app2
                mountPath: /data
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app2
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 2Gi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: app3
      labels:
        app: app3
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app3
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app3
    spec:
      serviceName: app3
      replicas: 1
      selector:
        matchLabels:
          app: app3
      template:
        metadata:
          labels:
            app: app3
        spec:
          containers:
            - name: app3
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app3
                mountPath: /data
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app3
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 3Gi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: app4
      labels:
        app: app4
    spec:
      ports:
      - port: 80
        name: web
      clusterIP: None
      selector:
        app: app4
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app4
    spec:
      serviceName: app4
      replicas: 1
      selector:
        matchLabels:
          app: app4
      template:
        metadata:
          labels:
            app: app4
        spec:
          containers:
            - name: app4
              image: centos
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo $HOSTNAME $(date -u) >> /data/out.txt; sleep 5; done"]
              volumeMounts:
              - name: persistent-storage-app4
                mountPath: /data
      volumeClaimTemplates:
      - metadata:
          name: persistent-storage-app4
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 4Gi
    $ k create deploy pause --image busybox --replicas 30 -- sh -c "sleep 100d"
  3. Restart the 4 StatefulSets and the Deployment:

    $ k rollout restart deploy pause; k rollout restart sts app1 app2 app3 app4
  4. Check whether Pods mount the wrong PV:

    The following command should list the volumes in increasing order of size (i.e. app1-0 has to mount the 1Gi volume, etc.):

    for pod in app1-0 app2-0 app3-0 app4-0; do k exec $pod -- df -h /data; done

    On an affected Node the order can be wrong and a Pod mounts the wrong PV:

    $ for pod in app1-0 app2-0 app3-0 app4-0; do k exec $pod -- df -h /data; done
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/nvme3n1    2.0G  6.1M  1.9G   1% /data
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/nvme1n1    976M  2.6M  958M   1% /data
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/nvme4n1    2.9G  9.1M  2.9G   1% /data
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/nvme2n1    3.9G   17M  3.8G   1% /data
    

    In this case we see that Pod app1-0 mounts the volume of app2-0:

    $ k exec app1-0 -- tail /data/out.txt
    app2-0 Mon Apr 4 13:12:42 UTC 2022
    app2-0 Mon Apr 4 13:12:47 UTC 2022
    app2-0 Mon Apr 4 13:12:52 UTC 2022
    app2-0 Mon Apr 4 13:12:57 UTC 2022
    app2-0 Mon Apr 4 13:13:02 UTC 2022
    app2-0 Mon Apr 4 13:13:07 UTC 2022
    app2-0 Mon Apr 4 13:13:12 UTC 2022
    app2-0 Mon Apr 4 13:13:17 UTC 2022
    app1-0 Mon Apr 4 13:13:56 UTC 2022
    app1-0 Mon Apr 4 13:14:01 UTC 2022
    app1-0 Mon Apr 4 13:14:06 UTC 2022
    app1-0 Mon Apr 4 13:14:11 UTC 2022
    app1-0 Mon Apr 4 13:14:16 UTC 2022
    app1-0 Mon Apr 4 13:14:21 UTC 2022
    app1-0 Mon Apr 4 13:14:26 UTC 2022
    

Environment

  • Kubernetes version (use kubectl version): v1.21.10
  • Driver version: v1.5.0

Credits to @dguendisch for all of the investigations and the safeguarding suggestion!

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. kind/bug Categorizes issue or PR as related to a bug. labels May 2, 2022
@RicardsRikmanis

RicardsRikmanis commented May 23, 2022

We are seeing a similar issue on our end, but we are not sure it is related to the cloud-init bug, since our nodes run cloud-init version 19.3-45.amzn2.

In our case the node's root volume is mounted instead of the actual volume on Pod restart.

It has already caused at least two production outages, and we have had no luck replicating it in our development environments.

We are using c5d.18xlarge nodes on EKS v1.21 with EBS CSI driver v1.6.1.

Edit:

I think our issue is more closely related to #1166

@ialidzhikov
Contributor Author

ialidzhikov commented May 25, 2022

In our case node root volume is mounted instead of the actual volume on pod restart.

We hit the same issue on our side many times. We believe its root cause is fixed by kubernetes/kubernetes#100183 (also backported to release-1.21 and present in K8s 1.21.9+). Hence, I would recommend upgrading to 1.21.9+. We have not hit this issue since we upgraded to 1.21.10.
Our monitoring colleagues also implemented alerting for PVCs affected by this issue (the actual volume size not matching the PVC size in the spec); it helps us get notified right away when the issue occurs.

@RicardsRikmanis

RicardsRikmanis commented May 26, 2022

Thanks for the info, that shed a lot of light on our issue!

We are at the mercy of AWS with regard to Kubernetes patch upgrades. We are now waiting until AWS EKS rolls out 1.21.10/eks.7 and will see if it also helps us.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 24, 2022
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 22, 2022
@ialidzhikov
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 22, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 20, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 22, 2023
@ialidzhikov
Contributor Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 22, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 20, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 20, 2023
@ialidzhikov
Contributor Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 4, 2023