
No easy way how to update CSI driver that uses fuse #70013

Closed
jsafrane opened this issue Oct 19, 2018 · 31 comments · Fixed by #88569
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@jsafrane
Member

We recommend using a DaemonSet to run CSI drivers on each node. If a driver runs a fuse daemon, it is almost impossible to update it: killing the driver pod kills the fuse daemons too, which kills all mounts and can corrupt application data.

We need a documented and supported way to update such CSI drivers. The update process can be manual, and the code can live somewhere else; we just need it to be documented and supported so people don't lose data.

/sig storage
@msau42 @davidz627 @saad-ali @pohly @vladimirvivien @verult @lpabon @jingxu97 @gnufied

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Oct 19, 2018
@lpabon
Contributor

lpabon commented Oct 25, 2018

@jsafrane In my opinion, that is the responsibility of the driver. We could certainly provide guidelines, but we shouldn't provide a solution, since we cannot control their release or update processes.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2019
@davidz627
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 24, 2019
@fredkan
Member

fredkan commented Mar 25, 2019

I expect there to be a common solution for this issue.

One workaround: https://github.com/AliyunContainerService/csi-plugin/blob/master/docs/oss-upgrade.md

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 23, 2019
@farcaller

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 23, 2019
@jim3ma

jim3ma commented Jul 17, 2019

Available solution from the Alibaba Cloud OSS plugin:
https://github.com/AliyunContainerService/csi-plugin/blob/master/docs/oss-upgrade.md
This solution runs the fuse processes in the node's root namespaces and cgroup, so the fuse processes are not killed when the daemonset pod is upgraded.

Another solution: buddy daemonset pods (two daemonset pods working as buddies). When an upgrade is needed, the buddy pod takes over all fds from the original pod and then serves the CSI service. This solution requires some code changes in the fuse drivers.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 16, 2019
@jim3ma

jim3ma commented Oct 23, 2019

/remove-lifecycle stale

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2020
@davidz627
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2020
@andyzhangx
Member

andyzhangx commented Feb 24, 2020

I hit a similar issue; making the fuse driver standalone is one solution.
Another problem is how to recover when the mount paths are already broken after the fuse CSI driver restarts: when the fuse driver is broken (sometimes because the CSI driver pod restarted), the staged volume path is broken, and the following code returns an error directly:

```go
mounted, err := isDirMounted(c.plugin, deviceMountPath)
if err != nil {
	klog.Error(log("attacher.MountDevice failed while checking mount status for dir [%s]", deviceMountPath))
	return err
}
```

Is it possible to return an error only when it's not IsCorruptedMnt, like flexvolume does?

```go
if pathErr != nil && !mount.IsCorruptedMnt(pathErr) {
	return fmt.Errorf("Error checking path: %v", pathErr)
}
```

In that case, even if the fuse driver is broken, the fuse CSI driver could recover after the fuse daemon comes back, provided it has remount logic.
WDYT? I could make the code change if that's an acceptable workaround.

@andyzhangx
Member

I worked out a PR (#88569) to mitigate this issue when the fuse driver on the node is restarted (the mount point is corrupted). Could someone take a look? Thanks.

@andyzhangx
Member

/open
Keep the issue open, since PR #88569 is only a mitigation.

@Ark-kun

Ark-kun commented Oct 31, 2020

/reopen

@k8s-ci-robot
Contributor

@Ark-kun: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andyzhangx
Member

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Oct 31, 2020
@k8s-ci-robot
Contributor

@andyzhangx: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

@jsafrane: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Oct 31, 2020
@jim3ma

jim3ma commented Oct 31, 2020

Available solution from the Alibaba Cloud OSS plugin:
https://github.com/AliyunContainerService/csi-plugin/blob/master/docs/oss-upgrade.md
This solution runs the fuse processes in the node's root namespaces and cgroup, so the fuse processes are not killed when the daemonset pod is upgraded.

Another solution: buddy daemonset pods (two daemonset pods working as buddies). When an upgrade is needed, the buddy pod takes over all fds from the original pod and then serves the CSI service. This solution requires some code changes in the fuse drivers.

We implemented a solution: transferring fds within the CSI pods during a surge rolling update.

@andyzhangx
Member

> Available solution from the Alibaba Cloud OSS plugin:
> https://github.com/AliyunContainerService/csi-plugin/blob/master/docs/oss-upgrade.md
> This solution runs the fuse processes in the node's root namespaces and cgroup, so the fuse processes are not killed when the daemonset pod is upgraded.
> Another solution: buddy daemonset pods (two daemonset pods working as buddies). When an upgrade is needed, the buddy pod takes over all fds from the original pod and then serves the CSI service. This solution requires some code changes in the fuse drivers.
>
> We implemented a solution: transferring fds within the CSI pods during a surge rolling update.

@jim3ma do you have the details of the solution? Thanks.

@jim3ma

jim3ma commented Oct 31, 2020

> Available solution from the Alibaba Cloud OSS plugin:
> https://github.com/AliyunContainerService/csi-plugin/blob/master/docs/oss-upgrade.md
> This solution runs the fuse processes in the node's root namespaces and cgroup, so the fuse processes are not killed when the daemonset pod is upgraded.
> Another solution: buddy daemonset pods (two daemonset pods working as buddies). When an upgrade is needed, the buddy pod takes over all fds from the original pod and then serves the CSI service. This solution requires some code changes in the fuse drivers.
>
> We implemented a solution: transferring fds within the CSI pods during a surge rolling update.
>
> @jim3ma do you have the details of the solution? Thanks.

Currently this solution is used at Ant Group; we will open source it in Dragonfly Image Service some time later.

Another solution: the CSI container starts the fuse session in another container on the host, not in the pod; then we can update the CSI pod while keeping the fuse session containers. This is like the solution from the Alibaba Cloud OSS plugin, but within our control.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 28, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xhejtman

> Available solution from the Alibaba Cloud OSS plugin:
> https://github.com/AliyunContainerService/csi-plugin/blob/master/docs/oss-upgrade.md
> This solution runs the fuse processes in the node's root namespaces and cgroup, so the fuse processes are not killed when the daemonset pod is upgraded.
> Another solution: buddy daemonset pods (two daemonset pods working as buddies). When an upgrade is needed, the buddy pod takes over all fds from the original pod and then serves the CSI service. This solution requires some code changes in the fuse drivers.
>
> We implemented a solution: transferring fds within the CSI pods during a surge rolling update.
>
> @jim3ma do you have the details of the solution? Thanks.
>
> Currently this solution is used at Ant Group; we will open source it in Dragonfly Image Service some time later.
>
> Another solution: the CSI container starts the fuse session in another container on the host, not in the pod; then we can update the CSI pod while keeping the fuse session containers. This is like the solution from the Alibaba Cloud OSS plugin, but within our control.

Did you open source it already?

@andyzhangx
Member

@xhejtman The Azure Blob CSI driver uses a similar fuse proxy solution, and it's open source; check the details here: https://github.com/kubernetes-sigs/blob-csi-driver/tree/master/deploy/blobfuse-proxy

@xhejtman

Thanks. I am also interested in a truly restartable fuse driver; at least now I understand what @jim3ma is talking about: tracking everything fuse needs for a restart.

@jim3ma

jim3ma commented Mar 31, 2022

> Available solution from the Alibaba Cloud OSS plugin:
> https://github.com/AliyunContainerService/csi-plugin/blob/master/docs/oss-upgrade.md
> This solution runs the fuse processes in the node's root namespaces and cgroup, so the fuse processes are not killed when the daemonset pod is upgraded.
> Another solution: buddy daemonset pods (two daemonset pods working as buddies). When an upgrade is needed, the buddy pod takes over all fds from the original pod and then serves the CSI service. This solution requires some code changes in the fuse drivers.
>
> We implemented a solution: transferring fds within the CSI pods during a surge rolling update.
>
> @jim3ma do you have the details of the solution? Thanks.
>
> Currently this solution is used at Ant Group; we will open source it in Dragonfly Image Service some time later.
> Another solution: the CSI container starts the fuse session in another container on the host, not in the pod; then we can update the CSI pod while keeping the fuse session containers. This is like the solution from the Alibaba Cloud OSS plugin, but within our control.
>
> Did you open source it already?

We are trying to merge some code into the upstream fuse driver so that this solution handles some corner cases well, and someday we will open source it as a real project.

@ofek

ofek commented May 14, 2022

@jim3ma Is it open source now?
