
Crash loops in azurefile-csi and azuredisk-csi (Error: open \\.\\pipe\\csi-proxy-filesystem-v1beta1: The system cannot find the file specified) #2568

Comments

@adoprog commented Sep 28, 2021

What happened:

After the release https://github.com/Azure/AKS/releases/tag/2021-09-16 rolled out to the region (East US) where our AKS cluster is deployed, Windows nodes fail to run azurefile-csi and azuredisk-csi; both enter a crash loop with a similar error: "Error: open \.\pipe\csi-proxy-filesystem-v1beta1: The system cannot find the file specified."

(screenshot: azure_error)

How to reproduce it (as minimally and precisely as possible):

Have a cluster in one of the affected regions and force the autoscaler to create a new Windows node.

Anything else we need to know?:

Environment: 1.21.2 (previously upgraded from 1.1x, not brand new 1.21.2)

  • Kubernetes version (use kubectl version): 1.21.2
  • Size of cluster (how many worker nodes are in the cluster?): 5+
  • General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.): Windows containers that use PVs.
  • Others:
@ghost added the triage label Sep 28, 2021
@ghost commented Sep 28, 2021

Hi adoprog, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, check whether it's covered in the AKS Troubleshooting guides or AKS Diagnostics.
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@andyzhangx (Contributor)

@adoprog can you upgrade the VMSS model on the VMSS page in the Azure portal? Creating a new Windows node pool could also mitigate this issue. Thanks.
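
For reference, roughly equivalent CLI steps would look like this (a sketch only; the resource group, scale set, and node pool names are placeholders to substitute with your own):

    # apply the latest VMSS model to all instances of the Windows scale set in the node resource group
    az vmss update-instances --resource-group <node-resource-group> --name <windows-vmss-name> --instance-ids "*"
    # or add a fresh Windows node pool to the cluster
    az aks nodepool add --resource-group <resource-group> --cluster-name <cluster-name> --name win2 --os-type Windows --node-count 1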

@adoprog (Author) commented Sep 28, 2021

Yep, tried that; it did not help. With the help of a support engineer we found out that many clusters are missing a property called "enableCSIProxy" in the cluster definition. I suppose the clusters that were upgraded from 1.18 and earlier are the ones that don't have it.

Unfortunately, this property is not exposed in the UI or the Azure CLI, so we can't "enable" it on the problematic clusters.
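
You can at least inspect the current setting from the CLI, though (a sketch; cluster and resource group names are placeholders, and the exact field casing in the output may differ):

    # dump the Windows profile of the managed cluster and look for the enableCSIProxy field
    az aks show --resource-group <resource-group> --name <cluster-name> --query windowsProfile -o json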

@andyzhangx (Contributor)

Does creating a new Windows node pool work? @adoprog

@adoprog (Author) commented Sep 28, 2021

Rechecking it now, will report back in 10 mins

@adoprog (Author) commented Sep 28, 2021

Creating a new node pool does not help. The property is not there, and the services crash just like in the old pool.

@ZeroMagic

@adoprog Which region is your cluster in?

@adoprog (Author) commented Sep 28, 2021

East US

@adoprog (Author) commented Sep 28, 2021

Australia East - same issues

@andyzhangx (Contributor) commented Sep 28, 2021

Please file an Azure ticket; we already fixed one cluster in the East US region. The upgrade fix is already being rolled out and should reach all regions within one more day. For Windows clusters that were upgraded to v1.21 without the upgrade fix, we need to manually work around this issue in our backend.

@cailyoung

We've opened ticket 2109280060001140 for Australia East.

@adoprog (Author) commented Sep 29, 2021

After the fix, both services run better but still crash randomly (the East US cluster is usable, the Australia East one not really). We also have the ticket open and have informed the support team.

@andyzhangx (Contributor)

The Windows nodes should use a VM size with at least 4 CPU cores, otherwise the driver pod can crash.
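
A quick way to check the CPU capacity of the Windows nodes (a sketch; it assumes the standard kubernetes.io/os node label):

    kubectl get nodes -l kubernetes.io/os=windows -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu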

@adoprog (Author) commented Sep 29, 2021

We tried D8S_v3 and D16S_v3 for Windows nodes; it fails on both.
csi-node-driver-registrar:v2.3.0 exits with no exception in the log

@andyzhangx (Contributor)

We tried D8S_v3 and D16S_v3 for Windows nodes, it fails on both. csi-node-driver-registrar:v2.3.0 exits with no exception in a log

@adoprog could you provide the output of kubectl describe po csi-azuredisk-node-win-xxx -n kube-system if it failed?

@adoprog (Author) commented Sep 29, 2021

describe.txt

Sure, I've attached the output.

@adoprog (Author) commented Sep 29, 2021

The errors on the pods are similar to the one below. Not sure whether attaching crashes the driver or a driver crash causes this error.

MountVolume.MountDevice failed for volume "pvc-e91eaa3c-82cc-4ebb-bba6-95613aa556dd" : rpc error: code = Internal desc = could not format "22"(lun: "2"), and mount it at "\var\lib\kubelet\plugins\kubernetes.io\csi\pv\pvc-e91eaa3c-82cc-4ebb-bba6-95613aa556dd\globalmount"

@andyzhangx (Contributor) commented Sep 29, 2021

@adoprog could you run:

  • kubectl edit ds csi-azuredisk-node-win -n kube-system
  • kubectl edit ds csi-azurefile-node-win -n kube-system

and change timeoutSeconds: 15 on the livenessProbe of the csi-node-driver-registrar container; that will work around the issue, thanks (please change timeoutSeconds: 15 only for the csi-node-driver-registrar container):

        image: mcr.microsoft.com/oss/kubernetes-csi/csi-node-driver-registrar:v2.3.0
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - cmd
              - /c
              - del /f C:\registration\disk.csi.azure.com-reg.sock C:\csi\disk.csi.azure.com\csi.sock
        livenessProbe:
          exec:
            command:
            - /csi-node-driver-registrar.exe
            - --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
            - --mode=kubelet-registration-probe
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 15
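
As a non-interactive alternative to kubectl edit, a strategic-merge patch along these lines should apply the same change (a sketch only; the container name node-driver-registrar is an assumption, so verify it with kubectl get ds csi-azuredisk-node-win -n kube-system -o yaml before patching):

    # bump the registrar livenessProbe timeout on the azuredisk Windows daemonset
    kubectl patch ds csi-azuredisk-node-win -n kube-system --type strategic \
      -p '{"spec":{"template":{"spec":{"containers":[{"name":"node-driver-registrar","livenessProbe":{"timeoutSeconds":15}}]}}}}'
    # repeat for the azurefile Windows daemonset
    kubectl patch ds csi-azurefile-node-win -n kube-system --type strategic \
      -p '{"spec":{"template":{"spec":{"containers":[{"name":"node-driver-registrar","livenessProbe":{"timeoutSeconds":15}}]}}}}'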

@adoprog (Author) commented Sep 29, 2021

The services are no longer crashing after the workaround, but the volumes (at least existing ones; haven't tested new ones yet) are not attaching and show this error:

MountVolume.MountDevice failed for volume "pvc-8646cb0b-78cf-4383-86df-1b969523bd74" : rpc error: code = Internal desc = could not format "13"(lun: "0"), and mount it at "\var\lib\kubelet\plugins\kubernetes.io\csi\pv\pvc-8646cb0b-78cf-4383-86df-1b969523bd74\globalmount"

@andyzhangx (Contributor)

The services are not crashing after the workaround, but the volumes (at least existing ones, haven't tested new ones yet) are not attaching and show error:

MountVolume.MountDevice failed for volume "pvc-8646cb0b-78cf-4383-86df-1b969523bd74" : rpc error: code = Internal desc = could not format "13"(lun: "0"), and mount it at "\var\lib\kubelet\plugins\kubernetes.io\csi\pv\pvc-8646cb0b-78cf-4383-86df-1b969523bd74\globalmount"

@adoprog please try deleting the pod; that will trigger the attach & mount process again, thanks.

@andyzhangx reopened this Sep 29, 2021
@adoprog (Author) commented Sep 29, 2021

Tried that, did not help, same error occurs.

@andyzhangx (Contributor)

Tried that, did not help, same error occurs.

@adoprog does a new pod work? You could cordon that node, delete the pod, and let it reschedule to another node; if the mount still doesn't work, please provide the node driver logs following https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/docs/csi-debug.md#case2-volume-mountunmount-failed
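
A minimal sketch of that sequence (node, pod, and namespace names are placeholders):

    # keep new pods off the affected node
    kubectl cordon <node-name>
    # delete the stuck pod so it gets rescheduled onto another node
    kubectl delete pod <pod-name> -n <namespace>
    # confirm where the replacement pod landed
    kubectl get pods -n <namespace> -o wide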

@adoprog (Author) commented Sep 29, 2021

New service (i.e. new pod, new PV) worked. Log file attached.
csi-azuredisk-node.log

@andyzhangx (Contributor)

New service (i.e. new pod, new PV) worked. Log file attached. csi-azuredisk-node.log

@adoprog I think it's related to the abnormal state the driver was in a few moments ago. Try cordoning the node and deleting the problematic pod; the pod will then be scheduled to a new node, so it is effectively a new pod, thanks.

@adoprog (Author) commented Sep 29, 2021

Cordoned the node, killed the pod, still the same error when attaching (on a brand new node):

Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[kube-api-access-vrtsm data]: timed out waiting for the condition Warning FailedMount 47s (x8 over 3m12s) kubelet, akswincsi000003 MountVolume.MountDevice failed for volume "pvc-8646cb0b-78cf-4383-86df-1b969523bd74" : rpc error: code = Internal desc = could not format "18"(lun: "0" ), and mount it at "\var\lib\kubelet\plugins\kubernetes.io\csi\pv\pvc-8646cb0b-78cf-4383-86df-1b969523bd74\globalmount"

@adoprog (Author) commented Sep 29, 2021

I also noticed that old and new PVs have a different tag format in Azure:

Old: (screenshot: tags_old)

New: (screenshot: tags_new)

@andyzhangx (Contributor)

The tag format changed in Azure Disk CSI driver v1.7.0: kubernetes-sigs/azuredisk-csi-driver#1009
As for the disk format error, it seems those disks are broken; could you create new disks?

@adoprog (Author) commented Sep 29, 2021

Not really; we have hundreds of such instances, each with disks that often contain important data.

@adoprog (Author) commented Sep 29, 2021

I tried to attach one of the disks to a VM in the same subscription and it worked and showed the data...

@andyzhangx (Contributor)

I tried to attach on of the disks to a VM in the same subscription and it worked, showed the data...

@adoprog can you run the PowerShell command Get-Partition inside that Windows VM and paste the output? Thanks.

@adoprog (Author) commented Sep 29, 2021

(screenshot: Get-Partition output)

@andyzhangx (Contributor)

@adoprog thanks. I found this is a breaking change brought in by the upstream csi-proxy project, and I have already worked out a PR: kubernetes-csi/csi-proxy#175. As for how to mitigate this issue soon on AKS, I think we need to switch back to the csi-proxy beta interface on AKS first.

@andyzhangx (Contributor) commented Sep 29, 2021

@adoprog how can I create & format a disk with the IFS type? How was that disk created? If it was created by the Azure disk driver, what's the k8s version? I would like to repro this issue, thanks.

@zhiweiv commented Sep 30, 2021

There is no workaround and we have to wait, right? Some of the CSI pods began to crash without any action on our part.

# kubectl logs csi-azurefile-node-win-6xg4p -n kube-system -c azurefile
I0930 02:02:19.890199   11536 safe_mounter_windows.go:279] failed to connect to csi-proxy v1 with error: open \\.\\pipe\\csi-proxy-filesystem-v1: The system cannot find the file specified., will try with v1Beta
E0930 02:02:19.890199   11536 safe_mounter_windows.go:289] failed to connect to csi-proxy v1beta with error: open \\.\\pipe\\csi-proxy-filesystem-v1beta1: The system cannot find the file specified.
F0930 02:02:19.890199   11536 azurefile.go:243] Failed to get safe mounter. Error: open \\.\\pipe\\csi-proxy-filesystem-v1beta1: The system cannot find the file specified.
C:\Users\zwliu>kubectl -n kube-system -o wide get pods | findstr csi-azurefile-node-win
csi-azurefile-node-win-4jmmp          3/3     Running            0          9d      10.5.4.194   akswin100000b                    <none>           <none>
csi-azurefile-node-win-6xg4p          1/3     CrashLoopBackOff   11         5m48s   10.5.2.171   akswinjob000000                  <none>           <none>
csi-azurefile-node-win-7bwmx          3/3     Running            0          9d      10.5.5.52    akswin1000005                    <none>           <none>
csi-azurefile-node-win-85fht          3/3     Running            0          9d      10.5.4.176   akswin100000a                    <none>           <none>
csi-azurefile-node-win-86k7h          1/3     CrashLoopBackOff   319        9h      10.5.5.144   akswin1000008                    <none>           <none>
csi-azurefile-node-win-bpcjh          3/3     Running            0          9d      10.5.5.121   akswin1000007                    <none>           <none>
csi-azurefile-node-win-jkvd8          3/3     Running            0          9d      10.5.5.88    akswin1000006                    <none>           <none>
csi-azurefile-node-win-k49hx          2/3     CrashLoopBackOff   317        9h      10.5.5.173   akswin1000009                    <none>           <none>
csi-azurefile-node-win-khcd9          3/3     Running            3          46h     10.5.1.136   akswinjob000002                  <none>           <none>
csi-azurefile-node-win-l8zwl          3/3     Running            0          8d      10.5.0.21    akswin100000c                    <none>           <none>
csi-azurefile-node-win-mcbj7          3/3     Running            0          9d      10.5.5.19    akswin1000004                    <none>           <none>
csi-azurefile-node-win-mslqf          3/3     Running            0          9d      10.5.4.214   akswin1000002                    <none>           <none>
csi-azurefile-node-win-t42hx          3/3     Running            0          9d      10.5.4.255   akswin1000003                    <none>           <none>
csi-azurefile-node-win-vpf9h          3/3     Running            0          9d      10.5.4.122   akswin1000001                    <none>           <none>
csi-azurefile-node-win-wlksg          3/3     Running            0          9d      10.5.4.46    akswin1000000                    <none>           <none>

@andyzhangx (Contributor)

@zhiweiv try this workaround: #2568 (comment)

@zhiweiv commented Sep 30, 2021

It doesn't work for me; we are using E2_V4 for Windows nodes.

C:\Users\zwliu>kubectl -n kube-system get pods | findstr csi-azurefile-node-win
csi-azurefile-node-win-4jmmp          3/3     Running            0          9d
csi-azurefile-node-win-7bwmx          3/3     Running            0          9d
csi-azurefile-node-win-85fht          3/3     Running            0          9d
csi-azurefile-node-win-9ptd5          2/3     CrashLoopBackOff   13         8m54s
csi-azurefile-node-win-bg4t5          1/3     CrashLoopBackOff   15         8m49s
csi-azurefile-node-win-bpcjh          3/3     Running            0          9d
csi-azurefile-node-win-bz9fq          1/3     CrashLoopBackOff   11         5m5s
csi-azurefile-node-win-jkvd8          3/3     Running            0          9d
csi-azurefile-node-win-khcd9          3/3     Running            3          46h
csi-azurefile-node-win-l8zwl          3/3     Running            0          8d
csi-azurefile-node-win-llrml          1/3     CrashLoopBackOff   13         8m56s
csi-azurefile-node-win-mcbj7          3/3     Running            0          9d
csi-azurefile-node-win-mslqf          3/3     Running            0          9d
csi-azurefile-node-win-t42hx          3/3     Running            0          9d
csi-azurefile-node-win-vpf9h          3/3     Running            0          9d

@andyzhangx (Contributor)

@zhiweiv the kubelet response on your E2_V4 node may be too slow; can you try an E4_V4 node?

@zhiweiv commented Sep 30, 2021

Hmm, will 4 cores be the minimum requirement for CSI going forward, or is this just a temporary mitigation?

@andyzhangx (Contributor)

Hmm, will the 4 cores be the least requirement for csi afterward? or just a temporary mitigation?

@zhiweiv if you want to run in production, I think 4 cores should be the minimum. Try 4 cores first; the automatic fix is already on the way.

@zhiweiv commented Sep 30, 2021

Our production clusters are using 4 cores; for testing clusters we prefer 2 cores to save cost. I will try 4 cores first.

@adoprog (Author) commented Sep 30, 2021

@andyzhangx All the volumes were created with Kubernetes, the same 1.21.2 version as before. The only difference, I believe, is that they were created before the 2021-09-16 release was applied and the new node pool was created.

@zhiweiv commented Sep 30, 2021

@andyzhangx
Still crashing on newly created 4-core VMs.

C:\Users\zwliu>kubectl -n kube-system get pods -o wide | findstr csi | findstr akswin2
csi-azuredisk-node-win-2g9hl          1/3     CrashLoopBackOff   26         36m     10.5.2.131   akswin2000000  
csi-azuredisk-node-win-n5j99          1/3     CrashLoopBackOff   26         37m     10.5.2.152   akswin2000001  
csi-azurefile-node-win-l7cx4          1/3     CrashLoopBackOff   27         37m     10.5.3.8     akswin2000001  
csi-azurefile-node-win-z9dh9          1/3     CrashLoopBackOff   27         36m     10.5.1.165   akswin2000000  
C:\Users\zwliu>kubectl describe node akswin2000000
Name:               akswin2000000
CreationTimestamp:  Thu, 30 Sep 2021 13:40:46 +0800
Capacity:
  cpu:                4
  ephemeral-storage:  261629948Ki
  memory:             33553972Ki
  pods:               50

@andyzhangx (Contributor)


@adoprog the problematic volumes were created before k8s 1.21, right?

@adoprog (Author) commented Sep 30, 2021

In some cases the volumes were created in k8s 1.21.

I tried it on a test 1.21 cluster yesterday:

  1. Created an instance with volume
  2. Stopped the pods
  3. Created new node pool (old node pool was also 1.21 but probably older image version)
  4. Started the pods
  5. An error occurs; the instance can't start due to a volume attachment failure

@andyzhangx (Contributor)

@andyzhangx Still crash on newly created 4 cores VMs.

C:\Users\zwliu>kubectl -n kube-system get pods -o wide | findstr csi | findstr akswin2
csi-azuredisk-node-win-2g9hl          1/3     CrashLoopBackOff   26         36m     10.5.2.131   akswin2000000  
csi-azuredisk-node-win-n5j99          1/3     CrashLoopBackOff   26         37m     10.5.2.152   akswin2000001  
csi-azurefile-node-win-l7cx4          1/3     CrashLoopBackOff   27         37m     10.5.3.8     akswin2000001  
csi-azurefile-node-win-z9dh9          1/3     CrashLoopBackOff   27         36m     10.5.1.165   akswin2000000  
C:\Users\zwliu>kubectl describe node akswin2000000
Name:               akswin2000000
CreationTimestamp:  Thu, 30 Sep 2021 13:40:46 +0800
Capacity:
  cpu:                4
  ephemeral-storage:  261629948Ki
  memory:             33553972Ki
  pods:               50

@zhiweiv sorry, my fault, your issue is a different one. Please file an Azure ticket and ask our support to add "enableCSIProxy": true to the managed cluster properties, since there was an upgrade scenario we missed (already fixed in the current release):

        "windowsProfile": {
            "adminUsername": "azureuser",
            "enableCSIProxy": true
        },

@andyzhangx (Contributor) commented Sep 30, 2021

In some cases the volumes were created in k8s 1.21.

I tried it on test 1.21 cluster yesterday:

  1. Created an instance with volume
  2. Stopped the pods
  3. Created new node pool (old node pool was also 1.21 but probably older image version)
  4. Started the pods
  5. An error occurs, the instance can't start due to volume attachment failure

@adoprog sorry for the delay. We finally figured out that the disk attach failure was caused by the last mitigation: we mistakenly used the csi-proxy v1.0.0 binary for your clusters (while all other AKS clusters are using v0.2.2, which does not have the compatibility issue). We will correct the csi-proxy config for all your clusters, and then our support will ask you to do a VMSS upgrade. After the VMSS upgrade, it should work.

And thanks for your patience. We found that csi-proxy v1.0.0 has a severe disk type compatibility issue, so AKS will abandon csi-proxy v1.0.0 and upgrade directly from v0.2.2 to the csi-proxy v1.1.0 release next time.

@andyzhangx (Contributor) commented Oct 14, 2021

Closing this issue since the liveness timeout fix has been rolled out to all regions. Users don't need to do anything; existing clusters will adopt this change automatically.

@ghost locked as resolved and limited conversation to collaborators Nov 17, 2021