Crash loops in azurefile-csi and azuredisk-csi (Error: open \\.\pipe\csi-proxy-filesystem-v1beta1: The system cannot find the file specified) #2568
Hi adoprog, AKS bot here 👋 I might be just a bot, but I'm told my suggestions are normally quite good, as such:
@adoprog can you
Yep, tried that - did not help. With the help of a support engineer we've found out that many clusters are missing a property called "enableCSIProxy" in the cluster definition. I suppose the clusters that were upgraded from 1.18 and earlier are the ones that don't have it. Unfortunately, this property is not exposed in the UI or Azure CLI, so we can't "enable" it on problematic clusters.
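One way to check whether that property is present on a given cluster is to read the raw ManagedCluster resource with az rest, since the portal and az aks show may not surface it. A minimal sketch, with hypothetical subscription, resource group, and cluster names, and an API version that may need adjusting:

    # Fetch the raw ManagedCluster resource and print its windowsProfile,
    # which is where "enableCSIProxy" would appear if it is set.
    az rest --method get \
      --url "https://management.azure.com/subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.ContainerService/managedClusters/my-aks-cluster?api-version=2021-07-01" \
      --query "properties.windowsProfile"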
Does creating a new Windows node pool work? @adoprog
Rechecking it now, will report back in 10 mins
Creating a new node pool does not help. The property is not there and the services crash just like in the old pool.
@adoprog Which region is your cluster in?
East US
Australia East - same issues
Pls file an Azure ticket; we already fixed one cluster in the East US region. The upgrade fix has already been rolled out; within one more day it will be rolled out to all regions. For those Windows clusters that were upgraded to v1.21 without the upgrade fix, we need to manually work around this issue in our backend.
We've opened a ticket.
After the fix, both services run better but still crash randomly (the East US cluster is usable, the Australia East one - not really). We also have the ticket open and have informed the support team.
The Windows node should have a VM size with at least 4 CPU cores, otherwise the driver pod will crash.
We tried D8S_v3 and D16S_v3 for Windows nodes, it fails on both.
@adoprog could you provide
Sure, I've attached the output.
The errors on the pods are similar to the one below. Not sure if the attach failure crashes the driver or the driver crash causes this error. MountVolume.MountDevice failed for volume "pvc-e91eaa3c-82cc-4ebb-bba6-95613aa556dd" : rpc error: code = Internal desc = could not format "22"(lun: "2"), and mount it at "\var\lib\kubelet\plugins\kubernetes.io\csi\pv\pvc-e91eaa3c-82cc-4ebb-bba6-95613aa556dd\globalmount"
@adoprog could you change the node-driver-registrar container in the Windows daemonset as below:
    image: mcr.microsoft.com/oss/kubernetes-csi/csi-node-driver-registrar:v2.3.0
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
            - cmd
            - /c
            - del /f C:\registration\disk.csi.azure.com-reg.sock C:\csi\disk.csi.azure.com\csi.sock
    livenessProbe:
      exec:
        command:
          - /csi-node-driver-registrar.exe
          - --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
          - --mode=kubelet-registration-probe
      failureThreshold: 3
      initialDelaySeconds: 30
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 15
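A minimal sketch of applying such an edit on a live cluster, assuming the managed Windows daemonset is named csi-azuredisk-node-win in kube-system (the name may differ, and AKS may eventually reconcile changes to managed components):

    # Open the Windows node daemonset and adjust the csi-node-driver-registrar
    # container as described above.
    kubectl edit daemonset csi-azuredisk-node-win -n kube-system

    # Confirm the Windows driver pods restart cleanly with the new spec.
    kubectl get pods -n kube-system -o wide | grep csi-azuredisk-node-win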
The services are not crashing after the workaround, but the volumes (at least the existing ones, haven't tested new ones yet) are not attaching and show the error: MountVolume.MountDevice failed for volume "pvc-8646cb0b-78cf-4383-86df-1b969523bd74" : rpc error: code = Internal desc = could not format "13"(lun: "0"), and mount it at "\var\lib\kubelet\plugins\kubernetes.io\csi\pv\pvc-8646cb0b-78cf-4383-86df-1b969523bd74\globalmount"
@adoprog pls try deleting the pod, that would trigger the attach & mount process again, thanks.
Tried that, did not help, the same error occurs.
@adoprog does a new pod work? You may cordon that node, delete the pod, and make it reschedule to another node; if the mount still does not work, try providing node driver logs per https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/docs/csi-debug.md#case2-volume-mountunmount-failed
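For reference, a sketch of the log collection that debug guide asks for; the pod name below is hypothetical, and the driver container is assumed to be named azuredisk:

    # Find the azuredisk CSI node pod running on the affected Windows node.
    kubectl get pods -n kube-system -o wide | grep csi-azuredisk-node-win

    # Dump the driver container logs from that pod for the support case.
    kubectl logs csi-azuredisk-node-win-xxxxx -c azuredisk -n kube-system > csi-azuredisk-node.log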
The new service (i.e. new pod, new PV) worked. Log file attached.
@adoprog I think it's related to the abnormal state the driver was in a few moments ago; try cordoning the node, deleting the pod in question, and letting the pod be scheduled to a new node, so it's like a new pod, thanks.
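A minimal sketch of that cordon-and-reschedule step, using hypothetical node and pod names:

    # Mark the affected Windows node unschedulable so the pod lands elsewhere.
    kubectl cordon akswin000002

    # Delete the stuck pod; its controller recreates it on another node,
    # which retries the attach & mount from scratch.
    kubectl delete pod my-app-0 -n my-namespace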
Cordoned the node, killed the pod, still the same error when attaching (on a brand new node): Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[kube-api-access-vrtsm data]: timed out waiting for the condition Warning FailedMount 47s (x8 over 3m12s) kubelet, akswincsi000003 MountVolume.MountDevice failed for volume "pvc-8646cb0b-78cf-4383-86df-1b969523bd74" : rpc error: code = Internal desc = could not format "18"(lun: "0" ), and mount it at "\var\lib\kubelet\plugins\kubernetes.io\csi\pv\pvc-8646cb0b-78cf-4383-86df-1b969523bd74\globalmount"
The tag format changed starting with Azure Disk CSI driver v1.7.0: kubernetes-sigs/azuredisk-csi-driver#1009
Not really, we have hundreds of such instances, each comes with disks, often with important data.
I tried to attach one of the disks to a VM in the same subscription and it worked, showed the data...
@adoprog can you run the Powershell command
@adoprog thanks. I found this is a breaking change brought in from the upstream csi-proxy project; I have already worked out a PR: kubernetes-csi/csi-proxy#175. Regarding how to mitigate this issue soon on AKS, I think we need to switch back to the csi-proxy beta interface on AKS first.
@adoprog how can I create & format a disk with
There is no workaround and we have to wait, right? Some of the CSI pods began to crash without any action on our part.
@zhiweiv
It does not work for me, we are using E2_V4 for Windows nodes.
@zhiweiv the kubelet response on your E2_V4 node may be too slow, can you try an E4_V4 node?
Hmm, will 4 cores be the minimum requirement for CSI afterward, or just a temporary mitigation?
@zhiweiv If you want to run production, I think 4 cores should be the minimum. Try 4 cores first, the automatic fix is already on the way.
Our production clusters are using 4 cores; for testing clusters we prefer 2 cores to save cost. I will try 4 cores first.
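A hedged sketch of adding a 4-core Windows node pool for such a test (resource group, cluster, and pool names are hypothetical):

    # Add a Windows node pool backed by a 4-core VM size (Standard_D4s_v3 as an example).
    az aks nodepool add \
      --resource-group my-rg \
      --cluster-name my-aks-cluster \
      --name npwin4 \
      --os-type Windows \
      --node-vm-size Standard_D4s_v3 \
      --node-count 1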
@andyzhangx All the volumes were created with K8s, the same 1.21.2 version as before. The only difference, I believe, is that they were created before the release 2021-09-16 was applied and the new node pool was created.
@andyzhangx
@adoprog the volumes in question were created before k8s version 1.21, right?
In some cases the volumes were created on k8s 1.21. I tried it on a test 1.21 cluster yesterday:
@zhiweiv sorry, my fault, your issue is a different one; pls file an Azure ticket and ask our support to add:
  "windowsProfile": {
    "adminUsername": "azureuser",
    "enableCSIProxy": true
  },
@adoprog sorry for the delay. We finally figured out that the disk attach failure was caused by the last mitigation: we mistakenly used the csi-proxy v1.0.0 binary for your clusters (while all other AKS clusters are using v0.2.2, which does not have the compatibility issue). We will correct the csi-proxy config for all your clusters, and then our support will ask you to do a VMSS upgrade. After the VMSS upgrade, it should work. And thanks for the patience; we found csi-proxy v1.0.0 has a severe disk type compatibility issue, so AKS will abandon csi-proxy v1.0.0 and upgrade from v0.2.2 directly to the csi-proxy v1.1.0 release next time.
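For context, a VMSS upgrade here typically means bringing the node scale set instances up to the latest (corrected) model; a hedged sketch with hypothetical node resource group and scale set names:

    # Find the Windows VM scale set in the cluster's node resource group.
    az vmss list --resource-group MC_my-rg_my-aks-cluster_eastus -o table

    # Upgrade all instances to the latest scale set model, i.e. the corrected csi-proxy config.
    az vmss update-instances \
      --resource-group MC_my-rg_my-aks-cluster_eastus \
      --name akswin000000 \
      --instance-ids "*"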
Closing this issue since the liveness timeout fix has been rolled out to all regions; users don't need to do anything, existing clusters will adopt this change automatically.
What happened:
After the release https://github.com/Azure/AKS/releases/tag/2021-09-16 was rolled out to the region (East US) where our AKS cluster is deployed, Windows nodes fail to run azurefile-csi and azuredisk-csi; both enter a crash loop with a similar exception: "Error: open \\.\pipe\csi-proxy-filesystem-v1beta1: The system cannot find the file specified."
How to reproduce it (as minimally and precisely as possible):
Have a cluster in one of the affected regions and force the autoscaler to create a new Windows node, e.g. by scaling up the Windows node pool as sketched below.
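A sketch of one way to get a freshly provisioned Windows node (pool and cluster names are hypothetical; if the cluster autoscaler manages the pool, trigger it by scheduling additional Windows pods instead of scaling manually):

    # Scale the Windows node pool up by one node to provision a fresh node.
    az aks nodepool scale \
      --resource-group my-rg \
      --cluster-name my-aks-cluster \
      --name npwin \
      --node-count 2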
Anything else we need to know?:
Environment: 1.21.2 (previously upgraded from 1.1x, not a brand new 1.21.2)
Kubernetes version (use kubectl version): 1.21.2