
restart csi-blobfuse-node daemonset would make current blobfuse mount unavailable #115

Closed
andyzhangx opened this issue Feb 24, 2020 · 20 comments · Fixed by #117
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature.

Comments

@andyzhangx
Member

andyzhangx commented Feb 24, 2020

What happened:

  1. Install the blobfuse CSI driver and run the nginx-blobfuse example pod.
  2. kubectl delete po csi-blobfuse-node-8ttf5 -n kube-system makes the current blobfuse mount inaccessible.
  • Workaround: delete the current nginx-blobfuse pod and create a new nginx-blobfuse pod (see the sketch after the output below).
$ kubectl exec -it nginx-blobfuse bash
root@nginx-blobfuse:/# df -h
df: /mnt/blobfuse: Transport endpoint is not connected
Filesystem      Size  Used Avail Use% Mounted on
overlay          29G   15G   15G  50% /
tmpfs            64M     0   64M   0% /dev
tmpfs           3.4G     0  3.4G   0% /sys/fs/cgroup
/dev/sda1        29G   15G   15G  50% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs           3.4G   12K  3.4G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs           3.4G     0  3.4G   0% /proc/acpi
tmpfs           3.4G     0  3.4G   0% /proc/scsi
tmpfs           3.4G     0  3.4G   0% /sys/firmware


$ mount | grep blobfuse
blobfuse on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_read=131072)
blobfuse on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_read=131072)
blobfuse on /var/lib/kubelet/pods/f5f56d79-553e-416d-a852-4ef8224e6422/volumes/kubernetes.io~csi/pvc-0433847e-03fd-422f-b053-5534510eb338/mount type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_read=131072)
blobfuse on /var/lib/kubelet/pods/f5f56d79-553e-416d-a852-4ef8224e6422/volumes/kubernetes.io~csi/pvc-0433847e-03fd-422f-b053-5534510eb338/mount type fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_read=131072)
azureuser@k8s-agentpool-10150444-0:~$ sudo ls /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount
ls: cannot access '/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount': Transport endpoint is not connected
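
A minimal sketch of the workaround above, assuming the example pod was created from a standalone manifest (the file name nginx-blobfuse.yaml is hypothetical; use whatever manifest created the pod):

# delete the app pod whose blobfuse mount is broken; the PVC/PV stay intact
$ kubectl delete pod nginx-blobfuse

# recreate the pod; the blobfuse volume is mounted again when it starts
# (nginx-blobfuse.yaml is a hypothetical file name for the example pod manifest)
$ kubectl apply -f nginx-blobfuse.yaml

# verify the mount is reachable again
$ kubectl exec -it nginx-blobfuse -- df -h /mnt/blobfuse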

What you expected to happen:

How to reproduce it:

Anything else we need to know?:

Environment:

  • CSI Driver version: v0.5.0
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@andyzhangx andyzhangx added the kind/bug Categorizes issue or PR as related to a bug. label Feb 24, 2020
@andyzhangx
Member Author

When the staged volume is broken, the CSI driver cannot recover.
kubelet logs:

Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.353894   48920 reconciler.go:269] operationExecutor.MountVolume started for volume "pvc-0433847e-03fd-422f-b053-5534510eb338" (UniqueName: "kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06")
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.353949   48920 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-ttwfc" (UniqueName: "kubernetes.io/secret/0f8982ac-9651-43fe-bee6-e8e783ba1a06-default-token-ttwfc") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06")
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.354253   48920 operation_generator.go:552] MountVolume.WaitForAttach entering for volume "pvc-0433847e-03fd-422f-b053-5534510eb338" (UniqueName: "kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06") DevicePath "csi-1c8cdbdb5514092a520ae07d667e8228f15dfa7cdd11a4a6c4ed10e03508a3c9"
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.357269   48920 operation_generator.go:561] MountVolume.WaitForAttach succeeded for volume "pvc-0433847e-03fd-422f-b053-5534510eb338" (UniqueName: "kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06") DevicePath "csi-1c8cdbdb5514092a520ae07d667e8228f15dfa7cdd11a4a6c4ed10e03508a3c9"
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: E0224 14:29:50.357409   48920 csi_mounter.go:414] kubernetes.io/csi: isDirMounted IsLikelyNotMountPoint test failed for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount]
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: E0224 14:29:50.357427   48920 csi_attacher.go:233] kubernetes.io/csi: attacher.MountDevice failed while checking mount status for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount]
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: E0224 14:29:50.357496   48920 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338\"" failed. No retries permitted until 2020-02-24 14:29:50.857470539 +0000 UTC m=+173143.830105231 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-0433847e-03fd-422f-b053-5534510eb338\" (UniqueName: \"kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338\") pod \"deployment-blobfuse-85bddbd75d-mtssv\" (UID: \"0f8982ac-9651-43fe-bee6-e8e783ba1a06\") : stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount: transport endpoint is not connected"
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.454335   48920 reconciler.go:269] operationExecutor.MountVolume started for volume "default-token-ttwfc" (UniqueName: "kubernetes.io/secret/0f8982ac-9651-43fe-bee6-e8e783ba1a06-default-token-ttwfc") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06")
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.486586   48920 operation_generator.go:648] MountVolume.SetUp succeeded for volume "default-token-ttwfc" (UniqueName: "kubernetes.io/secret/0f8982ac-9651-43fe-bee6-e8e783ba1a06-default-token-ttwfc") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06")
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.958453   48920 reconciler.go:269] operationExecutor.MountVolume started for volume "pvc-0433847e-03fd-422f-b053-5534510eb338" (UniqueName: "kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06")
Feb 24 14:29:50 k8s-agentpool-10150444-0 kubelet[48920]: I0224 14:29:50.958773   48920 operation_generator.go:552] MountVolume.WaitForAttach entering for volume "pvc-0433847e-03fd-422f-b053-5534510eb338" (UniqueName: "kubernetes.io/csi/blobfuse.csi.azure.com^andy-1180alpha5#fuse9e6bc1063ad742c8a12#pvc-0433847e-03fd-422f-b053-5534510eb338") pod "deployment-blobfuse-85bddbd75d-mtssv" (UID: "0f8982ac-9651-43fe-bee6-e8e783ba1a06") DevicePath "csi-1c8cdbdb5514092a520ae07d667e8228f15dfa7cdd11a4a6c4ed10e03508a3c9"

@ZeroMagic
Member

ZeroMagic commented Feb 24, 2020

It seems that there is a similar problem in azuredisk-csi-driver. After deleting the csi-node pod, the following error appears when creating a new nginx-azuredisk pod.

Events:
  Type     Reason                  Age               From                                        Message
  ----     ------                  ----              ----                                        -------
  Normal   Scheduled               <unknown>         default-scheduler                           Successfully assigned default/nginx-azuredisk to aks-agentpool-42669436-vmss000000
  Normal   SuccessfulAttachVolume  71s               attachdetach-controller                     AttachVolume.Attach succeeded for volume "pvc-a4ce5bf5-9fa2-444e-b60c-9a290c69d6bb"
  Warning  FailedMount             24s               kubelet, aks-agentpool-42669436-vmss000000  Unable to attach or mount volumes: unmounted volumes=[azuredisk01], unattached volumes=[azuredisk01 default-token-mhfm5]: timed out waiting for the condition
  Warning  FailedMount             4s (x6 over 20s)  kubelet, aks-agentpool-42669436-vmss000000  MountVolume.MountDevice failed for volume "pvc-a4ce5bf5-9fa2-444e-b60c-9a290c69d6bb" : rpc error: code = InvalidArgument desc = lun not provided

@andyzhangx
Member Author

andyzhangx commented Feb 24, 2020

The fuse driver issue is related to kubernetes/kubernetes#70013.
I am not sure about the azure disk driver issue; it should not break in the same way as the fuse driver.

@andyzhangx
Member Author

Events:
  Type     Reason       Age                From                               Message
  ----     ------       ----               ----                               -------
  Normal   Scheduled    2m9s               default-scheduler                  Successfully assigned default/deployment-blobfuse-85bddbd75d-mtssv to k8s-agentpool-10150444-0
  Warning  FailedMount  6s                 kubelet, k8s-agentpool-10150444-0  Unable to attach or mount volumes: unmounted volumes=[blobfuse], unattached volumes=[default-token-ttwfc blobfuse]: timed out waiting for the condition
  Warning  FailedMount  1s (x9 over 2m9s)  kubelet, k8s-agentpool-10150444-0  MountVolume.MountDevice failed for volume "pvc-0433847e-03fd-422f-b053-5534510eb338" : stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-0433847e-03fd-422f-b053-5534510eb338/globalmount: transport endpoint is not connected

@andyzhangx
Member Author

Reopening this issue; it now depends on kubernetes/kubernetes#88569.

@andyzhangx andyzhangx reopened this Feb 26, 2020
@andyzhangx
Member Author

We also need to investigate the other two CSI drivers: after restarting the driver daemonset, does the original mount point still work? The same fix may apply. A sketch of the restart test follows.
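
A minimal sketch of that restart test, assuming the node daemonset name used in this issue (csi-blobfuse-node in kube-system); adjust the names for the azure disk/file drivers:

# restart the CSI node daemonset (kubectl rollout restart supports daemonsets since v1.15)
$ kubectl rollout restart daemonset/csi-blobfuse-node -n kube-system
$ kubectl rollout status daemonset/csi-blobfuse-node -n kube-system

# check whether an existing mount inside an app pod still works after the restart
$ kubectl exec -it nginx-blobfuse -- ls /mnt/blobfuse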

@andyzhangx
Member Author

It seems that there is a similar problem in azuredisk-csi-driver. After deleting the csi-node pod, the following error appears when creating a new nginx-azuredisk pod.

Events:
  Type     Reason                  Age               From                                        Message
  ----     ------                  ----              ----                                        -------
  Normal   Scheduled               <unknown>         default-scheduler                           Successfully assigned default/nginx-azuredisk to aks-agentpool-42669436-vmss000000
  Normal   SuccessfulAttachVolume  71s               attachdetach-controller                     AttachVolume.Attach succeeded for volume "pvc-a4ce5bf5-9fa2-444e-b60c-9a290c69d6bb"
  Warning  FailedMount             24s               kubelet, aks-agentpool-42669436-vmss000000  Unable to attach or mount volumes: unmounted volumes=[azuredisk01], unattached volumes=[azuredisk01 default-token-mhfm5]: timed out waiting for the condition
  Warning  FailedMount             4s (x6 over 20s)  kubelet, aks-agentpool-42669436-vmss000000  MountVolume.MountDevice failed for volume "pvc-a4ce5bf5-9fa2-444e-b60c-9a290c69d6bb" : rpc error: code = InvalidArgument desc = lun not provided

@ZeroMagic could you repro this issue? I have run the azure disk CSI driver daemonset restart test and did not find any issue.

@ZeroMagic
Member

ZeroMagic commented Feb 29, 2020

I tried it again, and this time it was the same as you: everything was normal. Maybe there was some kind of invalid operation last time.

@andyzhangx
Member Author

Update:
I have tried the azure file and azure disk CSI drivers; restarting the driver daemonset does not make the original mount unavailable, so this issue only applies to the fuse driver.

@andyzhangx
Member Author

I tried it again, and this time it was the same as you: everything was normal. Maybe there was some kind of invalid operation last time.

@ZeroMagic I think it could be due to this commit:
kubernetes-sigs/azuredisk-csi-driver@d193671

There is a field name change from devicePath to LUN: the old driver uses devicePath, and when you switch to the new driver, it uses LUN.

@andyzhangx
Member Author

kubernetes/kubernetes#88569 was merged into k8s v1.18.0 and is also being cherry-picked to k8s v1.15, 1.16, and 1.17.

@andyzhangx
Member Author

andyzhangx commented Mar 10, 2020

Update:
kubernetes/kubernetes#88569 is merged in 1.15.11, 1.16.8, 1.17.4, and 1.18.0.
The blobfuse mount is re-established after the original app pod is restarted.

@andyzhangx
Member Author

This issue is actually not fixed: restarting the blob driver daemonset would still make the current blobfuse mount unavailable. The workaround is to delete the pod with the blobfuse mount; the remount then works with the fix in kubernetes/kubernetes#88569. To permanently fix this issue, we should add a new proxy (running as a host process) to mount blobfuse outside of the driver daemonset (like csi-proxy on Windows).

Another workaround is to not use blobfuse mount at all and use the NFS protocol instead. In the long term, we may recommend users use the NFS protocol on Linux, so we don't need to implement blobfuse-proxy. A sketch of the NFS option is below.
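
For reference, a minimal sketch of the NFS alternative, assuming a storage class for the blob CSI driver with the protocol parameter set to nfs (parameter names depend on the driver version; check the driver docs rather than treating this as the exact API):

# hypothetical storage class that mounts Azure Blob storage over NFS instead of blobfuse
$ cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: blob-nfs
provisioner: blob.csi.azure.com
parameters:
  protocol: nfs   # assumption: NFS protocol support in the blob CSI driver
reclaimPolicy: Delete
volumeBindingMode: Immediate
EOF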

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 19, 2021
@andyzhangx andyzhangx removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 19, 2021
@andyzhangx andyzhangx added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 22, 2021
@boddumanohar boddumanohar mentioned this issue Mar 1, 2021
@rhummelmose
Contributor

I get this issue when upgrading a cluster. Rebooting the nodes afterwards seems to resolve the issue.

@rhummelmose
Contributor

Too soon: it stopped working again. I think it probably didn't mount successfully.

@andyzhangx
Member Author

andyzhangx commented May 6, 2021

Using blobfuse-proxy can mitigate this issue:

  • Install blobfuse-proxy on Debian-based agent nodes (the daemonset below also installs the latest blobfuse version):
    kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/blob-csi-driver/master/deploy/blobfuse-proxy/blobfuse-proxy.yaml
  • Install the blobfuse driver with the node.enableBlobfuseProxy=true setting:
    helm repo add blob-csi-driver https://raw.githubusercontent.com/kubernetes-sigs/blob-csi-driver/master/charts
    helm install blob-csi-driver blob-csi-driver/blob-csi-driver --namespace kube-system --version v1.6.0 --set node.enableBlobfuseProxy=true

@andyzhangx
Member Author

Please try with blobfuse-proxy: https://github.com/kubernetes-sigs/blob-csi-driver/tree/master/deploy/blobfuse-proxy. It is the default setting from v1.6.0; blobfuse-proxy keeps the blobfuse mount available after a driver restart. A rough verification sketch is below.
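
A rough way to check this on a node, assuming blobfuse-proxy runs as a systemd service on the agent node and that the node daemonset is named csi-blob-node in v1.6.0+ (both names are assumptions; older releases used csi-blobfuse-node):

# on the agent node: confirm the proxy service is running (unit name is an assumption)
$ sudo systemctl status blobfuse-proxy

# from the cluster: restart the driver daemonset and confirm an existing mount survives
$ kubectl rollout restart daemonset/csi-blob-node -n kube-system
$ kubectl exec -it nginx-blobfuse -- df -h /mnt/blobfuse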

@andyzhangx
Member Author

btw, restarting blobfuse-proxy on the agent node would invalidate all blobfuse mounts. However, blobfuse-proxy should be a very stable service and should not be restarted under normal conditions.
