
restart csi-blobfuse-node daemonset would make current blobfuse mount unavailable #115

Closed
andyzhangx opened this issue Feb 24, 2020 · 20 comments · Fixed by #117
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature.

Comments

@andyzhangx
Member

andyzhangx commented Feb 24, 2020

What happened:

  1. Install the blobfuse CSI driver and run the nginx-blobfuse example pod.
  2. kubectl delete po csi-blobfuse-node-8ttf5 -n kube-system makes the current blobfuse mount inaccessible.
  • Workaround: delete the current nginx-blobfuse pod and create a new nginx-blobfuse pod (see the sketch after the output below).
$ kubectl exec -it nginx-blobfuse bash
root@nginx-blobfuse:/# df -h
df: /mnt/blobfuse: Transport endpoint is not connected
Filesystem      Size  Used Avail Use% Mounted on
overlay          29G   15G   15G  50% /
tmpfs            64M     0   64M   0% /dev
tmpfs           3.4G     0  3.4G   0% /sys/fs/cgroup
/dev/sda1        29G   15G   15G  50% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs           3.4G   12K  3.4G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           3.4G     0  3.4G   0% /proc/acpi
tmpfs           3.4G     0  3.4G   0% /proc/scsi
tmpfs           3.4G     0  3.4G   0% /sys/firmware


$ mount | grep blobfuse
blobfuse on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_read=131072)
blobfuse on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_read=131072)
blobfuse on /var/lib/kubelet/pods/f5f56d79-553e-416d-a852-4ef8224e6422/volumes/kubernetes.io~csi/pvc-0433847e-03fd-422f-b053-5534510eb338/mount type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_read=131072)
blobfuse on /var/lib/kubelet/pods/f5f56d79-553e-416d-a852-4ef8224e6422/volumes/kubernetes.io~csi/pvc-0433847e-03fd-422f-b053-5534510eb338/mount type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_read=131072)
azureuser@k8s-agentpool-10150444-0:~$ sudo ls /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount
ls: cannot access '/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount': Transport endpoint is not connected
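
A minimal sketch of the workaround above, assuming the example pod was created from a standalone manifest (the file name nginx-blobfuse.yaml is hypothetical; use whatever manifest created the pod):

# delete the app pod whose blobfuse mount is broken; the PVC/PV stay intact
$ kubectl delete pod nginx-blobfuse

# recreate the pod; the blobfuse volume is mounted again when it starts
# (nginx-blobfuse.yaml is a hypothetical file name for the example pod manifest)
$ kubectl apply -f nginx-blobfuse.yaml

# verify the mount is reachable again
$ kubectl exec -it nginx-blobfuse -- df -h /mnt/blobfuse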

What you expected to happen:

How to reproduce it:

Anything else we need to know?:

Environment:

  • CSI Driver version: v0.5.0
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@andyzhangx andyzhangx added the kind/bug Categorizes issue or PR as related to a bug. label Feb 24, 2020
@andyzhangx
Member Author

When the staged volume is broken, the CSI driver cannot recover.
kubelet logs:

Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.353894   48920 reconciler.go:269] operationExecutor.MountVolume started for volume "pvc-0433847e-03fd-422f-b053-5534510eb338" (UniqueName: "kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06")
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.353949   48920 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-ttwfc" (UniqueName: "kubernetes.io/secret/0f8982ac-9651-43fe-bee6-e8e783ba1a06-default-token-ttwfc") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06")
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.354253   48920 operation_generator.go:552] MountVolume.WaitForAttach entering for volume "pvc-0433847e-03fd-422f-b053-5534510eb338" (UniqueName: "kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06") DevicePath "csi-1c8cdbdb5514092a520ae07d667e8228f15dfa7cdd11a4a6c4ed10e03508a3c9"
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.357269   48920 operation_generator.go:561] MountVolume.WaitForAttach succeeded for volume "pvc-0433847e-03fd-422f-b053-5534510eb338" (UniqueName: "kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06") DevicePath "csi-1c8cdbdb5514092a520ae07d667e8228f15dfa7cdd11a4a6c4ed10e03508a3c9"
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: E0224 14:29:50.357409   48920 csi_mounter.go:414] kubernetes.io/csi: isDirMounted IsLikelyNotMountPoint test failed for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount]
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: E0224 14:29:50.357427   48920 csi_attacher.go:233] kubernetes.io/csi: attacher.MountDevice failed while checking mount status for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount]
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: E0224 14:29:50.357496   48920 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338\"" failed. No retries permitted until 2020-02-24 14:29:50.857470539 +0000 UTC m=+173143.830105231 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-0433847e-03fd-422f-b053-5534510eb338\" (UniqueName: \"kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338\") pod \"deployment-blobfuse-85bddbd75d-mtssv\" (UID: \"0f8982ac-9651-43fe-bee6-e8e783ba1a06\") : stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount: transport endpoint is not connected"
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.454335   48920 reconciler.go:269] operationExecutor.MountVolume started for volume "default-token-ttwfc" (UniqueName: "kubernetes.io/secret/0f8982ac-9651-43fe-bee6-e8e783ba1a06-default-token-ttwfc") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06")
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.486586   48920 operation_generator.go:648] MountVolume.SetUp succeeded for volume "default-token-ttwfc" (UniqueName: "kubernetes.io/secret/0f8982ac-9651-43fe-bee6-e8e783ba1a06-default-token-ttwfc") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06")
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.958453   48920 reconciler.go:269] operationExecutor.MountVolume started for volume "pvc-0433847e-03fd-422f-b053-5534510eb338" (UniqueName: "kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06")
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.958773   48920 operation_generator.go:552] MountVolume.WaitForAttach entering for volume "pvc-0433847e-03fd-422f-b053-5534510eb338" (UniqueName: "kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06") DevicePath "csi-1c8cdbdb5514092a520ae07d667e8228f15dfa7cdd11a4a6c4ed10e03508a3c9"

@ZeroMagic
Member

ZeroMagic commented Feb 24, 2020

It seems that there is a similar problem in azuredisk-csi-driver. After deleting the csi-node pod, the following error appears when creating a new nginx-azuredisk pod.

Events:
  Type     Reason                  Age               From                                        Message
  ----     ------                  ----              ----                                        -------
  Normal   Scheduled               <unknown>         default-scheduler                           Successfully assigned default/nginx-azuredisk to aks-agentpool-42669436-vmss000000
  Normal   SuccessfulAttachVolume  71s               attachdetach-controller                     AttachVolume.Attach succeeded for volume "pvc-a4ce5bf5-9fa2-444e-b60c-9a290c69d6bb"
  Warning  FailedMount             24s               kubelet, aks-agentpool-42669436-vmss000000  Unable to attach or mount volumes: unmounted volumes=[azuredisk01], unattached volumes=[azuredisk01 default-token-mhfm5]: timed out waiting for the condition
  Warning  FailedMount             4s (x6 over 20s)  kubelet, aks-agentpool-42669436-vmss000000  MountVolume.MountDevice failed for volume "pvc-a4ce5bf5-9fa2-444e-b60c-9a290c69d6bb" : rpc error: code = InvalidArgument desc = lun not provided

@andyzhangx
Member Author

andyzhangx commented Feb 24, 2020

The fuse driver issue is related to kubernetes/kubernetes#70013.
I am not sure about the azure disk driver issue; it should not break in the same way as the fuse driver.

@andyzhangx
Member Author

Events:
  Type     Reason       Age                From                               Message
  ----     ------       ----               ----                               -------
  Normal   Scheduled    2m9s               default-scheduler                  Successfully assigned default/deployment-blobfuse-85bddbd75d-mtssv to k8s-agentpool-10150444-0
  Warning  FailedMount  6s                 kubelet, k8s-agentpool-10150444-0  Unable to attach or mount volumes: unmounted volumes=[blobfuse], unattached volumes=[default-token-ttwfc blobfuse]: timed out waiting for the condition
  Warning  FailedMount  1s (x9 over 2m9s)  kubelet, k8s-agentpool-10150444-0  MountVolume.MountDevice failed for volume "pvc-0433847e-03fd-422f-b053-5534510eb338" : stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount: transport endpoint is not connected

@andyzhangx
Member Author

Reopening this issue; it now depends on kubernetes/kubernetes#88569.

@andyzhangx andyzhangx reopened this Feb 26, 2020
@andyzhangx
Member Author

We also need to investigate the other two CSI drivers: after restarting the driver daemonset, does the original mount point still work? The same fix may apply. A sketch of the restart test follows.
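
A minimal sketch of that restart test, assuming the node daemonset name used in this issue (csi-blobfuse-node in kube-system); adjust the names for the azure disk/file drivers:

# restart the CSI node daemonset (kubectl rollout restart supports daemonsets since v1.15)
$ kubectl rollout restart daemonset/csi-blobfuse-node -n kube-system
$ kubectl rollout status daemonset/csi-blobfuse-node -n kube-system

# check whether an existing mount inside an app pod still works after the restart
$ kubectl exec -it nginx-blobfuse -- ls /mnt/blobfuse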

@andyzhangx
Member Author

It seems that there is a similar problem in azuredisk-csi-driver. After deleting the csi-node pod, the following error appears when creating a new nginx-azuredisk pod.

Events:
  Type     Reason                  Age               From                                        Message
  ----     ------                  ----              ----                                        -------
  Normal   Scheduled               <unknown>         default-scheduler                           Successfully assigned default/nginx-azuredisk to aks-agentpool-42669436-vmss000000
  Normal   SuccessfulAttachVolume  71s               attachdetach-controller                     AttachVolume.Attach succeeded for volume "pvc-a4ce5bf5-9fa2-444e-b60c-9a290c69d6bb"
  Warning  FailedMount             24s               kubelet, aks-agentpool-42669436-vmss000000  Unable to attach or mount volumes: unmounted volumes=[azuredisk01], unattached volumes=[azuredisk01 default-token-mhfm5]: timed out waiting for the condition
  Warning  FailedMount             4s (x6 over 20s)  kubelet, aks-agentpool-42669436-vmss000000  MountVolume.MountDevice failed for volume "pvc-a4ce5bf5-9fa2-444e-b60c-9a290c69d6bb" : rpc error: code = InvalidArgument desc = lun not provided

@ZeroMagic could you repro this issue? I have run the azure disk CSI driver daemonset restart test and did not find any issue.

@ZeroMagic
Member

ZeroMagic commented Feb 29, 2020

I tried it again, and this time it was the same as you: everything was normal. Maybe there was some kind of invalid operation last time.

@andyzhangx
Member Author

Update:
I have tried the azure file and azure disk CSI drivers; restarting the driver daemonset does not make the original mount unavailable, so this issue only applies to the fuse driver.

@andyzhangx
Member Author

I tried it again, and this time it was the same as you: everything was normal. Maybe there was some kind of invalid operation last time.

@ZeroMagic I think it could be due to this commit:
kubernetes-sigs/azuredisk-csi-driver@d193671

There is a field name change from devicePath to LUN: the old driver uses devicePath, and when you switch to the new driver, it uses LUN.

@andyzhangx
Member Author

kubernetes/kubernetes#88569 was merged into k8s v1.18.0 and is also being cherry-picked to k8s v1.15, 1.16, and 1.17.

@andyzhangx
Member Author

andyzhangx commented Mar 10, 2020

Update:
kubernetes/kubernetes#88569 is merged in 1.15.11, 1.16.8, 1.17.4, and 1.18.0.
The blobfuse mount is re-established after the original app pod is restarted.

@andyzhangx
Member Author

This issue is actually not fixed: restarting the blob driver daemonset would still make the current blobfuse mount unavailable. The workaround is to delete the pod with the blobfuse mount; the remount then works with the fix in kubernetes/kubernetes#88569. To permanently fix this issue, we should add a new proxy (running as a host process) to mount blobfuse outside of the driver daemonset (like csi-proxy on Windows).

Another workaround is to not use blobfuse mount at all and use the NFS protocol instead. In the long term, we may recommend users use the NFS protocol on Linux, so we don't need to implement blobfuse-proxy. A sketch of the NFS option is below.
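
For reference, a minimal sketch of the NFS alternative, assuming a storage class for the blob CSI driver with the protocol parameter set to nfs (parameter names depend on the driver version; check the driver docs rather than treating this as the exact API):

# hypothetical storage class that mounts Azure Blob storage over NFS instead of blobfuse
$ cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: blob-nfs
provisioner: blob.csi.azure.com
parameters:
  protocol: nfs   # assumption: NFS protocol support in the blob CSI driver
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOF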

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 19, 2021
@andyzhangx andyzhangx removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 19, 2021
@andyzhangx andyzhangx added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 22, 2021
@boddumanohar boddumanohar mentioned this issue Mar 1, 2021
@rhummelmose
Contributor

I get this issue when upgrading a cluster. Rebooting the nodes afterwards seems to resolve the issue.

@rhummelmose
Contributor

Too soon: it stopped working again. I think it probably didn't mount successfully.

@andyzhangx
Member Author

andyzhangx commented May 6, 2021

Using blobfuse-proxy can mitigate this issue:

  • Install blobfuse-proxy on Debian-based agent nodes (the daemonset below also installs the latest blobfuse version):
    kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/blob-csi-driver/master/deploy/blobfuse-proxy/blobfuse-proxy.yaml
  • Install the blobfuse driver with the node.enableBlobfuseProxy=true setting:
    helm repo add blob-csi-driver https://raw.githubusercontent.com/kubernetes-sigs/blob-csi-driver/master/charts
    helm install blob-csi-driver blob-csi-driver/blob-csi-driver --namespace kube-system --version v1.6.0 --set node.enableBlobfuseProxy=true

@andyzhangx
Member Author

Please try with blobfuse-proxy: https://github.com/kubernetes-sigs/blob-csi-driver/tree/master/deploy/blobfuse-proxy. It is the default setting from v1.6.0; blobfuse-proxy keeps the blobfuse mount available after a driver restart. A rough verification sketch is below.
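
A rough way to check this on a node, assuming blobfuse-proxy runs as a systemd service on the agent node and that the node daemonset is named csi-blob-node in v1.6.0+ (both names are assumptions; older releases used csi-blobfuse-node):

# on the agent node: confirm the proxy service is running (unit name is an assumption)
$ sudo systemctl status blobfuse-proxy

# from the cluster: restart the driver daemonset and confirm an existing mount survives
$ kubectl rollout restart daemonset/csi-blob-node -n kube-system
$ kubectl exec -it nginx-blobfuse -- df -h /mnt/blobfuse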

@andyzhangx
Member Author

btw, restarting blobfuse-proxy on the agent node would invalidate all blobfuse mounts. However, blobfuse-proxy should be a very stable service and should not be restarted under normal conditions.
