
Crash loops in azurefile-csi and azuredisk-csi (Error: open \\.\\pipe\\csi-proxy-filesystem-v1beta1: The system cannot find the file specified) #2568

Comments

@adoprog commented Sep 28, 2021

What happened:

After the release https://github.com/Azure/AKS/releases/tag/2021-09-16 rolled out to the region (East US) where our AKS cluster is deployed, Windows nodes fail to run azurefile-csi and azuredisk-csi; both enter a crash loop with a similar error: "Error: open \.\pipe\csi-proxy-filesystem-v1beta1: The system cannot find the file specified."

(screenshot: azure_error)

How to reproduce it (as minimally and precisely as possible):

Have a cluster in one of the affected regions and force the autoscaler to create a new Windows node.

Anything else we need to know?:

Environment: 1.21.2 (previously upgraded from 1.1x, not brand new 1.21.2)

  • Kubernetes version (use kubectl version): 1.21.2
  • Size of cluster (how many worker nodes are in the cluster?): 5+
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.): Windows containers that use PVs.
  • Others:
@ghost added the triage label Sep 28, 2021
@ghost commented Sep 28, 2021

Hi adoprog, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, check whether it's covered in the AKS Troubleshooting guides or AKS Diagnostics.
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@andyzhangx (Contributor)

@adoprog can you upgrade the VMSS model on the VMSS page in the Azure portal? Creating a new Windows node pool could also mitigate this issue. Thanks.
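
For reference, roughly equivalent CLI steps would look like this (a sketch only; the resource group, scale set, and node pool names are placeholders to substitute with your own):

    # apply the latest VMSS model to all instances of the Windows scale set in the node resource group
    az vmss update-instances --resource-group <node-resource-group> --name <windows-vmss-name> --instance-ids "*"
    # or add a fresh Windows node pool to the cluster
    az aks nodepool add --resource-group <resource-group> --cluster-name <cluster-name> --name win2 --os-type Windows --node-count 1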

@adoprog (Author) commented Sep 28, 2021

Yep, tried that; it did not help. With the help of a support engineer we found out that many clusters are missing a property called "enableCSIProxy" in the cluster definition. I suppose the clusters that were upgraded from 1.18 and earlier are the ones that don't have it.

Unfortunately, this property is not exposed in the UI or the Azure CLI, so we can't "enable" it on the problematic clusters.
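
You can at least inspect the current setting from the CLI, though (a sketch; cluster and resource group names are placeholders, and the exact field casing in the output may differ):

    # dump the Windows profile of the managed cluster and look for the enableCSIProxy field
    az aks show --resource-group <resource-group> --name <cluster-name> --query windowsProfile -o json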

@andyzhangx (Contributor)

Does creating a new Windows node pool work? @adoprog

@adoprog (Author) commented Sep 28, 2021

Rechecking it now, will report back in 10 mins

@adoprog (Author) commented Sep 28, 2021

Creating a new node pool does not help. The property is not there, and the services crash just like in the old pool.

@ZeroMagic

@adoprog Which region is your cluster in?

@adoprog (Author) commented Sep 28, 2021

East US

@adoprog (Author) commented Sep 28, 2021

Australia East - same issues

@andyzhangx (Contributor) commented Sep 28, 2021

Please file an Azure ticket; we already fixed one cluster in the East US region. The upgrade fix is already being rolled out and should reach all regions within one more day. For Windows clusters that were upgraded to v1.21 without the upgrade fix, we need to manually work around this issue in our backend.

@cailyoung

We've opened ticket 2109280060001140 for Australia East.

@adoprog (Author) commented Sep 29, 2021

After the fix, both services run better but still crash randomly (the East US cluster is usable, the Australia East one not really). We also have the ticket open and have informed the support team.

@andyzhangx (Contributor)

The Windows nodes should use a VM size with at least 4 CPU cores, otherwise the driver pod can crash.
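
A quick way to check the CPU capacity of the Windows nodes (a sketch; it assumes the standard kubernetes.io/os node label):

    kubectl get nodes -l kubernetes.io/os=windows -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu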

@adoprog (Author) commented Sep 29, 2021

We tried D8S_v3 and D16S_v3 for Windows nodes; it fails on both.
csi-node-driver-registrar:v2.3.0 exits with no exception in the log

@andyzhangx (Contributor)

We tried D8S_v3 and D16S_v3 for Windows nodes, it fails on both. csi-node-driver-registrar:v2.3.0 exits with no exception in a log

@adoprog could you provide the output of kubectl describe po csi-azuredisk-node-win-xxx -n kube-system if it failed?

@adoprog (Author) commented Sep 29, 2021

describe.txt

Sure, I've attached the output.

@adoprog (Author) commented Sep 29, 2021

The errors on the pods are similar to the one below. Not sure whether attaching crashes the driver or a driver crash causes this error.

MountVolume.MountDevice failed for volume "pvc-e91eaa3c-82cc-4ebb-bba6-95613aa556dd" : rpc error: code = Internal desc = could not format "22"(lun: "2"), and mount it at "\var\lib\kubelet\plugins\kubernetes.io\csi\pv\pvc-e91eaa3c-82cc-4ebb-bba6-95613aa556dd\globalmount"

@andyzhangx (Contributor) commented Sep 29, 2021

@adoprog could you run:

  • kubectl edit ds csi-azuredisk-node-win -n kube-system
  • kubectl edit ds csi-azurefile-node-win -n kube-system

and change timeoutSeconds: 15 on the livenessProbe of the csi-node-driver-registrar container; that will work around the issue, thanks (please change timeoutSeconds: 15 only for the csi-node-driver-registrar container):

        image: mcr.microsoft.com/oss/kubernetes-csi/csi-node-driver-registrar:v2.3.0
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - cmd
              - /c
              - del /f C:\registration\disk.csi.azure.com-reg.sock C:\csi\disk.csi.azure.com\csi.sock
        livenessProbe:
          exec:
            command:
            - /csi-node-driver-registrar.exe
            - --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
            - --mode=kubelet-registration-probe
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 15
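
As a non-interactive alternative to kubectl edit, a strategic-merge patch along these lines should apply the same change (a sketch only; the container name node-driver-registrar is an assumption, so verify it with kubectl get ds csi-azuredisk-node-win -n kube-system -o yaml before patching):

    # bump the registrar livenessProbe timeout on the azuredisk Windows daemonset
    kubectl patch ds csi-azuredisk-node-win -n kube-system --type strategic \
      -p '{"spec":{"template":{"spec":{"containers":[{"name":"node-driver-registrar","livenessProbe":{"timeoutSeconds":15}}]}}}}'
    # repeat for the azurefile Windows daemonset
    kubectl patch ds csi-azurefile-node-win -n kube-system --type strategic \
      -p '{"spec":{"template":{"spec":{"containers":[{"name":"node-driver-registrar","livenessProbe":{"timeoutSeconds":15}}]}}}}'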

@adoprog (Author) commented Sep 29, 2021

The services are no longer crashing after the workaround, but the volumes (at least existing ones; haven't tested new ones yet) are not attaching and show this error:

MountVolume.MountDevice failed for volume "pvc-8646cb0b-78cf-4383-86df-1b969523bd74" : rpc error: code = Internal desc = could not format "13"(lun: "0"), and mount it at "\var\lib\kubelet\plugins\kubernetes.io\csi\pv\pvc-8646cb0b-78cf-4383-86df-1b969523bd74\globalmount"

@andyzhangx (Contributor)

The services are not crashing after the workaround, but the volumes (at least existing ones, haven't tested new ones yet) are not attaching and show error:

MountVolume.MountDevice failed for volume "pvc-8646cb0b-78cf-4383-86df-1b969523bd74" : rpc error: code = Internal desc = could not format "13"(lun: "0"), and mount it at "\var\lib\kubelet\plugins\kubernetes.io\csi\pv\pvc-8646cb0b-78cf-4383-86df-1b969523bd74\globalmount"

@adoprog please try deleting the pod; that will trigger the attach & mount process again, thanks.

@andyzhangx reopened this Sep 29, 2021
@adoprog (Author) commented Sep 29, 2021

Tried that, did not help, same error occurs.

@andyzhangx (Contributor)

Tried that, did not help, same error occurs.

@adoprog does a new pod work? You could cordon that node, delete the pod, and let it reschedule to another node; if the mount still doesn't work, please provide the node driver logs following https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/docs/csi-debug.md#case2-volume-mountunmount-failed
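
A minimal sketch of that sequence (node, pod, and namespace names are placeholders):

    # keep new pods off the affected node
    kubectl cordon <node-name>
    # delete the stuck pod so it gets rescheduled onto another node
    kubectl delete pod <pod-name> -n <namespace>
    # confirm where the replacement pod landed
    kubectl get pods -n <namespace> -o wide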

@adoprog (Author) commented Sep 29, 2021

New service (i.e. new pod, new PV) worked. Log file attached.
csi-azuredisk-node.log

@andyzhangx (Contributor)

New service (i.e. new pod, new PV) worked. Log file attached. csi-azuredisk-node.log

@adoprog I think it's related to the abnormal state the driver was in a few moments ago. Try cordoning the node and deleting the problematic pod; the pod will then be scheduled to a new node, so it is effectively a new pod, thanks.

@adoprog (Author) commented Sep 29, 2021

Cordoned the node, killed the pod, still the same error when attaching (on a brand new node):

Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[kube-api-access-vrtsm data]: timed out waiting for the condition Warning FailedMount 47s (x8 over 3m12s) kubelet, akswincsi000003 MountVolume.MountDevice failed for volume "pvc-8646cb0b-78cf-4383-86df-1b969523bd74" : rpc error: code = Internal desc = could not format "18"(lun: "0" ), and mount it at "\var\lib\kubelet\plugins\kubernetes.io\csi\pv\pvc-8646cb0b-78cf-4383-86df-1b969523bd74\globalmount"

@adoprog (Author) commented Sep 29, 2021

I also noticed that old and new PVs have a different tag format in Azure:

Old: (screenshot: tags_old)

New: (screenshot: tags_new)

@andyzhangx (Contributor)

The tag format changed in Azure Disk CSI driver v1.7.0: kubernetes-sigs/azuredisk-csi-driver#1009
As for the disk format error, it seems those disks are broken; could you create new disks?

@adoprog (Author) commented Sep 29, 2021

Not really; we have hundreds of such instances, each with disks that often contain important data.

@adoprog (Author) commented Sep 29, 2021

I tried to attach one of the disks to a VM in the same subscription and it worked and showed the data...

@andyzhangx (Contributor)

I tried to attach on of the disks to a VM in the same subscription and it worked, showed the data...

@adoprog can you run the PowerShell command Get-Partition inside that Windows VM and paste the output? Thanks.

@adoprog (Author) commented Sep 29, 2021

(screenshot: Get-Partition output)

@andyzhangx (Contributor)

@adoprog thanks. I found this is a breaking change brought in by the upstream csi-proxy project, and I have already worked out a PR: kubernetes-csi/csi-proxy#175. As for how to mitigate this issue soon on AKS, I think we need to switch back to the csi-proxy beta interface on AKS first.

@andyzhangx (Contributor) commented Sep 29, 2021

@adoprog how can I create & format a disk with the IFS type? How was that disk created? If it was created by the Azure disk driver, what's the k8s version? I would like to repro this issue, thanks.

@zhiweiv commented Sep 30, 2021

There is no workaround and we have to wait, right? Some of the CSI pods began to crash without any action on our part.

# kubectl logs csi-azurefile-node-win-6xg4p -n kube-system -c azurefile
I0930 02:02:19.890199   11536 safe_mounter_windows.go:279] failed to connect to csi-proxy v1 with error: open \\.\\pipe\\csi-proxy-filesystem-v1: The system cannot find the file specified., will try with v1Beta
E0930 02:02:19.890199   11536 safe_mounter_windows.go:289] failed to connect to csi-proxy v1beta with error: open \\.\\pipe\\csi-proxy-filesystem-v1beta1: The system cannot find the file specified.
F0930 02:02:19.890199   11536 azurefile.go:243] Failed to get safe mounter. Error: open \\.\\pipe\\csi-proxy-filesystem-v1beta1: The system cannot find the file specified.
C:\Users\zwliu>kubectl -n kube-system -o wide get pods | findstr csi-azurefile-node-win
csi-azurefile-node-win-4jmmp          3/3     Running            0          9d      10.5.4.194   akswin100000b                    <none>           <none>
csi-azurefile-node-win-6xg4p          1/3     CrashLoopBackOff   11         5m48s   10.5.2.171   akswinjob000000                  <none>           <none>
csi-azurefile-node-win-7bwmx          3/3     Running            0          9d      10.5.5.52    akswin1000005                    <none>           <none>
csi-azurefile-node-win-85fht          3/3     Running            0          9d      10.5.4.176   akswin100000a                    <none>           <none>
csi-azurefile-node-win-86k7h          1/3     CrashLoopBackOff   319        9h      10.5.5.144   akswin1000008                    <none>           <none>
csi-azurefile-node-win-bpcjh          3/3     Running            0          9d      10.5.5.121   akswin1000007                    <none>           <none>
csi-azurefile-node-win-jkvd8          3/3     Running            0          9d      10.5.5.88    akswin1000006                    <none>           <none>
csi-azurefile-node-win-k49hx          2/3     CrashLoopBackOff   317        9h      10.5.5.173   akswin1000009                    <none>           <none>
csi-azurefile-node-win-khcd9          3/3     Running            3          46h     10.5.1.136   akswinjob000002                  <none>           <none>
csi-azurefile-node-win-l8zwl          3/3     Running            0          8d      10.5.0.21    akswin100000c                    <none>           <none>
csi-azurefile-node-win-mcbj7          3/3     Running            0          9d      10.5.5.19    akswin1000004                    <none>           <none>
csi-azurefile-node-win-mslqf          3/3     Running            0          9d      10.5.4.214   akswin1000002                    <none>           <none>
csi-azurefile-node-win-t42hx          3/3     Running            0          9d      10.5.4.255   akswin1000003                    <none>           <none>
csi-azurefile-node-win-vpf9h          3/3     Running            0          9d      10.5.4.122   akswin1000001                    <none>           <none>
csi-azurefile-node-win-wlksg          3/3     Running            0          9d      10.5.4.46    akswin1000000                    <none>           <none>

@andyzhangx (Contributor)

@zhiweiv try this workaround: #2568 (comment)

@zhiweiv commented Sep 30, 2021

It doesn't work for me; we are using E2_V4 for Windows nodes.

C:\Users\zwliu>kubectl -n kube-system get pods | findstr csi-azurefile-node-win
csi-azurefile-node-win-4jmmp          3/3     Running            0          9d
csi-azurefile-node-win-7bwmx          3/3     Running            0          9d
csi-azurefile-node-win-85fht          3/3     Running            0          9d
csi-azurefile-node-win-9ptd5          2/3     CrashLoopBackOff   13         8m54s
csi-azurefile-node-win-bg4t5          1/3     CrashLoopBackOff   15         8m49s
csi-azurefile-node-win-bpcjh          3/3     Running            0          9d
csi-azurefile-node-win-bz9fq          1/3     CrashLoopBackOff   11         5m5s
csi-azurefile-node-win-jkvd8          3/3     Running            0          9d
csi-azurefile-node-win-khcd9          3/3     Running            3          46h
csi-azurefile-node-win-l8zwl          3/3     Running            0          8d
csi-azurefile-node-win-llrml          1/3     CrashLoopBackOff   13         8m56s
csi-azurefile-node-win-mcbj7          3/3     Running            0          9d
csi-azurefile-node-win-mslqf          3/3     Running            0          9d
csi-azurefile-node-win-t42hx          3/3     Running            0          9d
csi-azurefile-node-win-vpf9h          3/3     Running            0          9d

@andyzhangx (Contributor)

@zhiweiv the kubelet response on your E2_V4 node may be too slow; can you try an E4_V4 node?

@zhiweiv commented Sep 30, 2021

Hmm, will 4 cores be the minimum requirement for CSI going forward, or is this just a temporary mitigation?

@andyzhangx (Contributor)

Hmm, will the 4 cores be the least requirement for csi afterward? or just a temporary mitigation?

@zhiweiv if you want to run in production, I think 4 cores should be the minimum. Try 4 cores first; the automatic fix is already on the way.

@zhiweiv commented Sep 30, 2021

Our production clusters are using 4 cores; for testing clusters we prefer 2 cores to save cost. I will try 4 cores first.

@adoprog (Author) commented Sep 30, 2021

@andyzhangx All the volumes were created with Kubernetes, the same 1.21.2 version as before. The only difference, I believe, is that they were created before the 2021-09-16 release was applied and the new node pool was created.

@zhiweiv commented Sep 30, 2021

@andyzhangx
Still crashing on newly created 4-core VMs.

C:\Users\zwliu>kubectl -n kube-system get pods -o wide | findstr csi | findstr akswin2
csi-azuredisk-node-win-2g9hl          1/3     CrashLoopBackOff   26         36m     10.5.2.131   akswin2000000  
csi-azuredisk-node-win-n5j99          1/3     CrashLoopBackOff   26         37m     10.5.2.152   akswin2000001  
csi-azurefile-node-win-l7cx4          1/3     CrashLoopBackOff   27         37m     10.5.3.8     akswin2000001  
csi-azurefile-node-win-z9dh9          1/3     CrashLoopBackOff   27         36m     10.5.1.165   akswin2000000  
C:\Users\zwliu>kubectl describe node akswin2000000
Name:               akswin2000000
CreationTimestamp:  Thu, 30 Sep 2021 13:40:46 +0800
Capacity:
  cpu:                4
  ephemeral-storage:  261629948Ki
  memory:             33553972Ki
  pods:               50

@andyzhangx (Contributor)


@adoprog the problematic volumes were created before k8s 1.21, right?

@adoprog (Author) commented Sep 30, 2021

In some cases the volumes were created in k8s 1.21.

I tried it on a test 1.21 cluster yesterday:

  1. Created an instance with volume
  2. Stopped the pods
  3. Created new node pool (old node pool was also 1.21 but probably older image version)
  4. Started the pods
  5. An error occurs; the instance can't start due to a volume attachment failure

@andyzhangx (Contributor)

@andyzhangx Still crash on newly created 4 cores VMs.

C:\Users\zwliu>kubectl -n kube-system get pods -o wide | findstr csi | findstr akswin2
csi-azuredisk-node-win-2g9hl          1/3     CrashLoopBackOff   26         36m     10.5.2.131   akswin2000000  
csi-azuredisk-node-win-n5j99          1/3     CrashLoopBackOff   26         37m     10.5.2.152   akswin2000001  
csi-azurefile-node-win-l7cx4          1/3     CrashLoopBackOff   27         37m     10.5.3.8     akswin2000001  
csi-azurefile-node-win-z9dh9          1/3     CrashLoopBackOff   27         36m     10.5.1.165   akswin2000000  
C:\Users\zwliu>kubectl describe node akswin2000000
Name:               akswin2000000
CreationTimestamp:  Thu, 30 Sep 2021 13:40:46 +0800
Capacity:
  cpu:                4
  ephemeral-storage:  261629948Ki
  memory:             33553972Ki
  pods:               50

@zhiweiv sorry, my fault, your issue is a different one. Please file an Azure ticket and ask our support to add "enableCSIProxy": true to the managed cluster properties, since there was an upgrade scenario we missed (already fixed in the current release):

        "windowsProfile": {
            "adminUsername": "azureuser",
            "enableCSIProxy": true
        },

@andyzhangx (Contributor) commented Sep 30, 2021

In some cases the volumes were created in k8s 1.21.

I tried it on test 1.21 cluster yesterday:

  1. Created an instance with volume
  2. Stopped the pods
  3. Created new node pool (old node pool was also 1.21 but probably older image version)
  4. Started the pods
  5. An error occurs, the instance can't start due to volume attachment failure

@adoprog sorry for the delay. We finally figured out that the disk attach failure was caused by the last mitigation: we mistakenly used the csi-proxy v1.0.0 binary for your clusters (while all other AKS clusters are using v0.2.2, which does not have the compatibility issue). We will correct the csi-proxy config for all your clusters, and then our support will ask you to do a VMSS upgrade. After the VMSS upgrade, it should work.

And thanks for your patience. We found that csi-proxy v1.0.0 has a severe disk type compatibility issue, so AKS will abandon csi-proxy v1.0.0 and upgrade directly from v0.2.2 to the csi-proxy v1.1.0 release next time.

@andyzhangx (Contributor) commented Oct 14, 2021

Closing this issue since the liveness timeout fix has been rolled out to all regions. Users don't need to do anything; existing clusters will adopt this change automatically.

@ghost locked as resolved and limited conversation to collaborators Nov 17, 2021