Document upgrade procedure for CSI nodeplugins #703

Closed
ShyamsundarR opened this issue Nov 4, 2019 · 17 comments · Fixed by #770
Labels
bug (Something isn't working), component/cephfs (Issues related to CephFS)

Comments

@ShyamsundarR
Contributor

Upgrading CSI nodeplugins, specifically when using CephFS FUSE or rbd-nbd as the mounter, causes existing mounts to become stale or unreachable (usually surfacing as connection timeout errors).

This is due to losing the mount processes running within the CSI nodeplugin pods.

We need documented steps to ensure upgrades are smooth, even when upgrading to minor image versions, for bug fixes.

@travisn
Member

travisn commented Nov 4, 2019

Upgrading the CSI driver will cause all the mounts to become stale? This seems like a blocker for upgrades. What's the workaround to keep a mount available during any upgrade? You have to failover all pods on one node, then upgrade its csi driver?

@dillaman

dillaman commented Nov 4, 2019

> You have to failover all pods on one node, then upgrade its csi driver?

Yes (when using the referenced backend drivers). We are working on a way to preserve the rbd-nbd state post-upgrade so that it can recover.

@ShyamsundarR
Contributor Author

Also, for CephFS-FUSE driver we do have the feature to preserve mounts on the system, post restart via the --mountcachedir option. That needs to be tested better though.

Madhu-1 added the bug label Nov 5, 2019
ajarr added the component/cephfs label Nov 6, 2019
@ShyamsundarR
Contributor Author

> Also, for CephFS-FUSE driver we do have the feature to preserve mounts on the system, post restart via the --mountcachedir option. That needs to be tested better though.

The stated feature and options do not work for CephFS. The reasons are as follows:

  • When a volume is staged and published on a node, it typically gets the following mounts (with Kubernetes as the CO):

    • mount output from the nodeplugin or on the host:
      ceph-fuse on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e2f04422-9786-4ad5-8cd0-49f8b8ee9b66/globalmount type fuse.ceph-fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)

      ceph-fuse on /var/lib/kubelet/pods/c290345c-6cae-4904-9d63-e707cec7fb1f/volumes/kubernetes.io~csi/pvc-e2f04422-9786-4ad5-8cd0-49f8b8ee9b66/mount type fuse.ceph-fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)

  • The pod that runs on the node gets a runc configuration like so:
    {"destination":"/pvc-cephfs-mnt","type":"bind","source":"/var/lib/kubelet/pods/c290345c-6cae-4904-9d63-e707cec7fb1f/volumes/kubernetes.io~csi/pvc-e2f04422-9786-4ad5-8cd0-49f8b8ee9b66/mount","options":["rbind","rprivate"]}

  • The above ensures that the publish path is further bind mounted within the pod namespace as required, and hence ends up within the pod as the following mount:

    • mount output: ceph-fuse on /pvc-cephfs-mnt type fuse.ceph-fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
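
As a rough illustration of the mount layout listed above, the sketch below parses `mount`-style output and groups the `fuse.ceph-fuse` entries by PVC, labelling each as a stage or publish path. The function names, the path heuristics, and the PVC regular expression are assumptions made for this example and are not part of ceph-csi.

```python
import re
from collections import defaultdict

# Matches lines in the `mount` output format quoted above:
#   <source> on <target> type <fstype> (<options>)
MOUNT_LINE = re.compile(
    r"^(?P<source>\S+) on (?P<target>\S+) type (?P<fstype>\S+) \((?P<options>[^)]*)\)$"
)
# Assumption: PV names look like "pvc-<uuid>", as in the output above.
PVC_NAME = re.compile(r"pvc-[0-9a-f-]{36}")


def classify(target):
    """Guess whether a mount target is a CSI stage or publish path (heuristic)."""
    if target.endswith("/globalmount"):
        return "stage"
    if "/volumes/kubernetes.io~csi/" in target:
        return "publish"
    return "other"


def ceph_fuse_mounts(mount_output):
    """Group fuse.ceph-fuse mounts by PVC name as (kind, target) pairs."""
    by_volume = defaultdict(list)
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if not match or match["fstype"] != "fuse.ceph-fuse":
            continue
        pvc = PVC_NAME.search(match["target"])
        key = pvc.group(0) if pvc else "unknown"
        by_volume[key].append((classify(match["target"]), match["target"]))
    return dict(by_volume)
```

Feeding it the output of `mount` from a node would list, per volume, the paths that the CSI nodeplugin manages. The bind mount inside the pod's own mount namespace is not visible from the host listing, which is the point made in the following paragraphs.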

IOW, the pod gets its own bind mount, and at this point it does not really matter if we lose the stage and publish mount points. What matters is that the pod also depends on the specific instance of ceph-fuse that backs the mount point.

The feature to restore the bind mounts in CephFS is confined to the CSI stage and publish paths. Hence on a restart, when a new ceph-fuse instance is mounted to the stage path and subsequently bind mounted to the publish path, these two paths become healthy. The pod's bind mount, however, is never refreshed: that is out of the control of CSI, and the CO (Kubernetes in this case) has no reason to refresh that path due to nodeplugin restarts (also, runc is already running the container, so even if it were required it would not be possible).

As a result, when using CephFS at present, even setting the --mountcachedir option to a non-emptyDir local data stash does not provide the required recoverability semantics when the CSI nodeplugin is restarted.
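
The reasoning above can be captured in a toy model. This is only a sketch, not ceph-csi code: the `Node` class, its method names, and the idea of tagging each mount with an integer daemon instance ID are all assumptions introduced for illustration.

```python
import itertools


class Node:
    """Toy model: each mount point records which ceph-fuse instance backs it."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.live_daemon = None  # ID of the ceph-fuse instance currently running
        self.mounts = {}         # mount name -> ID of the instance backing it

    def start_daemon_and_mount_csi_paths(self):
        # A fresh ceph-fuse instance backs the stage path; the publish path
        # is a bind mount of the stage path, so it shares the same instance.
        self.live_daemon = next(self._ids)
        self.mounts["stage"] = self.live_daemon
        self.mounts["publish"] = self.live_daemon

    def start_pod(self):
        # The runtime bind-mounts the publish path into the pod once,
        # at container start, and never revisits it.
        self.mounts["pod"] = self.mounts["publish"]

    def restart_nodeplugin(self):
        # The old ceph-fuse dies with the nodeplugin pod; the restore
        # feature only remounts the CSI stage and publish paths.
        self.live_daemon = None
        self.start_daemon_and_mount_csi_paths()

    def healthy(self, name):
        return self.live_daemon is not None and self.mounts.get(name) == self.live_daemon


if __name__ == "__main__":
    node = Node()
    node.start_daemon_and_mount_csi_paths()
    node.start_pod()
    node.restart_nodeplugin()
    for name in ("stage", "publish", "pod"):
        print(name, "healthy" if node.healthy(name) else "stale")
```

In this model the stage and publish paths come back healthy after the restart, while the pod's mount still points at the instance that no longer exists.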

@ajarr
Contributor

ajarr commented Nov 22, 2019

@batrick @joscollin ^^
We need to figure out what we need to do here.

@batrick
Member

batrick commented Nov 22, 2019

It is ironic that the container movement partially started out of a desire to avoid dependency hell with shared libraries and other system files and yet here we are resolving dependencies between pods/infrastructure.

@ShyamsundarR it's not clear to me if this is a result of dysfunction in the CSI "nodeplugin" interface or what. What is an example of a shared file system that is supposed to survive this upgrade process? The bind mount will surely become stale as soon as the file system proxy (FUSE in this case) is restarted?

@dillaman

@batrick The CephFS kernel driver will survive (as will krbd). The issue arises when we start mixing userspace daemons that are backing kernel file systems / block devices. This upgrade issue is only applicable for the ceph-fuse and rbd-nbd userspace tools since those daemons are run within the CSI node plugin pod, so when that pod gets upgraded, it results in those daemons getting killed. We are moving rbd-nbd to its own pod in the future to better manage its lifecycle outside of the CSI driver, but it sounds like if the ceph-fuse daemon is killed, there is no way to recover a mount(?) even if you could restart the daemon.
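
The distinction drawn in this comment can be written down as a small lookup. This is a sketch under the assumptions stated in the thread (kernel clients survive, userspace daemons running inside the nodeplugin pod do not); the enum and function names are hypothetical.

```python
from enum import Enum


class Mounter(Enum):
    CEPHFS_KERNEL = "cephfs kernel driver"
    KRBD = "krbd"
    CEPH_FUSE = "ceph-fuse"
    RBD_NBD = "rbd-nbd"


# Mounters whose backing daemon runs as a userspace process inside the
# CSI nodeplugin pod and is therefore killed when that pod is replaced.
USERSPACE_IN_NODEPLUGIN = {Mounter.CEPH_FUSE, Mounter.RBD_NBD}


def survives_nodeplugin_upgrade(mounter):
    """True if existing mounts keep working across a nodeplugin pod restart."""
    return mounter not in USERSPACE_IN_NODEPLUGIN


def node_needs_drain(mounters_in_use):
    """True if any mounter in use on the node would lose its mounts."""
    return any(not survives_nodeplugin_upgrade(m) for m in mounters_in_use)
```

For example, a node that only uses the kernel CephFS client and krbd would not need draining under this model, while a single ceph-fuse volume would be enough to require it.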

@batrick
Member

batrick commented Nov 22, 2019

> @batrick The CephFS kernel driver will survive (as will krbd). The issue arises when we start mixing userspace daemons that are backing kernel file systems / block devices.

Right.

> This upgrade issue is only applicable for the ceph-fuse and rbd-nbd userspace tools since those daemons are run within the CSI node plugin pod, so when that pod gets upgraded, it results in those daemons getting killed. We are moving rbd-nbd to its own pod in the future to better manage its lifecycle outside of the CSI driver,

I understand moving the rbd-nbd userspace agent to another pod. Is that so you can avoid upgrades for the running application pods?

For ceph-fuse, we would never want to upgrade the client mount while an application pod is using it.

> but it sounds like if the ceph-fuse daemon is killed, there is no way to recover a mount(?) even if you could restart the daemon.

There's no way to recover the mount, no. That is unlikely to ever change.

@dillaman

> For ceph-fuse, we would never want to upgrade the client mount while an application pod is using it.

You can adopt a similar approach. Move ceph-fuse to its own pod, have it run in the foreground (it becomes the container's pid 1) but spawn child processes or threads for each mount. Re-use the Ceph admin-daemon to communicate between this new ceph-fuse pod and the CephFS CSI node plugin (e.g. "ceph --admin-daemon /shared/path/to/the/ceph-fuse-pod.asok mount ...."), and boom goes the dynamite so long as ceph-fuse doesn't crash.

@batrick
Member

batrick commented Nov 23, 2019

> For ceph-fuse, we would never want to upgrade the client mount while an application pod is using it.

> You can adopt a similar approach. Move ceph-fuse to its own pod, have it run in the foreground (it becomes the container's pid 1) but spawn child processes or threads for each mount. Re-use the Ceph admin-daemon to communicate between this new ceph-fuse pod and the CephFS CSI node plugin (e.g. "ceph --admin-daemon /shared/path/to/the/ceph-fuse-pod.asok mount ...."), and boom goes the dynamite so long as ceph-fuse doesn't crash.

I'm not sure why we wouldn't have a ceph-fuse daemon for each pod. Sharing one libcephfs cache for all pods has dubious benefits. Also, a single ceph-fuse means all pods' I/O funnels through the single-threaded FUSE daemon.

@ShyamsundarR
Contributor Author

> @ShyamsundarR it's not clear to me if this is a result of dysfunction in the CSI "nodeplugin" interface or what. What is an example of a shared file system that is supposed to survive this upgrade process? The bind mount will surely become stale as soon as the file system proxy (FUSE in this case) is restarted?

The alternative here is to drain a node of application pods using PVs backed by fuse-cephfs (and as an extension rbd-nbd), before upgrading the CSI nodeplugin. IOW, move pods out of the node before an upgrade of the nodeplugin pod.

In pre-CSI cases, the mount proxy ran on the host/node, hence it was not tied to a pod, and there was no such pod to upgrade in the first place (this may not be true for all storage providers).

If the proxy on the node needed to be upgraded, an upgrade of the node (resulting in application pod drain anyway) was performed.

I am adding @phlogistonjohn and @raghavendra-talur for more commentary on pre-CSI cases where the proxy service, for example gluster fuse client, needed to be upgraded.

@ShyamsundarR
Contributor Author

> For ceph-fuse, we would never want to upgrade the client mount while an application pod is using it.

> You can adopt a similar approach. Move ceph-fuse to its own pod, have it run in the foreground (it becomes the container's pid 1) but spawn child processes or threads for each mount. Re-use the Ceph admin-daemon to communicate between this new ceph-fuse pod and the CephFS CSI node plugin (e.g. "ceph --admin-daemon /shared/path/to/the/ceph-fuse-pod.asok mount ...."), and boom goes the dynamite so long as ceph-fuse doesn't crash.

> I'm not sure why we wouldn't have a ceph-fuse daemon for each pod. Sharing one libcephfs cache for all pods has dubious benefits. Also, a single ceph-fuse means all pods' I/O funnels through the single-threaded FUSE daemon.

Interesting, this may mean we should close this issue as a result, where I was toying with the thought of running a single ceph-fuse for all subvolumes on that node.

@ShyamsundarR
Contributor Author

Also, one downside of this discussion is that we need to pull PR #282 out of the code, as it serves no purpose at present.

@raghavendra-talur

> In pre-CSI cases, the mount proxy ran on the host/node, hence it was not tied to a pod, and there was no such pod to upgrade in the first place (this may not be true for all storage providers).

> If the proxy on the node needed to be upgraded, an upgrade of the node (resulting in application pod drain anyway) was performed.

> I am adding @phlogistonjohn and @raghavendra-talur for more commentary on pre-CSI cases where the proxy service, for example gluster fuse client, needed to be upgraded.

That is right. Even though we did not have pods for client operation, we had client rpms that needed upgrade. We followed the same rules for client rpms that are recommended for the kubelet on the nodes.

I was not able to find any docs specifically for the CSI pods.

@raghavendra-talur

I linked to the rules in the previous comment, but the summary is that the worker nodes are cordoned off and drained before upgrading the kubelet.

Admins prefer to do this when cluster usage is low, and the upgrade of all nodes in the cluster might take days. Hence it is expected that some nodes have a lower-version client and others a higher-version one at any given point.
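
A minimal sketch of that per-node sequence, applied to the nodeplugin instead of the kubelet, might look like the following. It only builds the `kubectl` command lines and does not run them; the function name and the node, pod, and namespace values in the usage example are placeholders, and a real cluster may need extra `kubectl drain` flags depending on the workloads present.

```python
import shlex


def node_upgrade_commands(node, nodeplugin_pod, namespace):
    """Return the kubectl invocations for upgrading one node's nodeplugin pod.

    Nothing is executed here; the caller decides how and when to run them.
    """
    return [
        # drain cordons the node and evicts application pods; DaemonSet
        # pods (including the nodeplugin itself) are left in place.
        ["kubectl", "drain", node, "--ignore-daemonsets"],
        # Deleting the old nodeplugin pod lets the DaemonSet controller
        # recreate it from the updated pod template.
        ["kubectl", "delete", "pod", nodeplugin_pod, "--namespace", namespace],
        # Only uncordon once the replacement nodeplugin pod is running.
        ["kubectl", "uncordon", node],
    ]


if __name__ == "__main__":
    # Placeholder names, not taken from any particular deployment.
    for cmd in node_upgrade_commands("worker-1", "csi-cephfsplugin-abcde", "storage"):
        print(shlex.join(cmd))
```

Because nodes are handled one at a time, a mix of old and new nodeplugin versions across the cluster is expected while the upgrade is in progress.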

Madhu-1 added a commit to Madhu-1/rook that referenced this issue Dec 13, 2019
CSI nodeplugins, specifically when using cephfs FUSE or
rbd-nbd as the mounters, when upgraded, will
cause existing mounts to become stale/unreachable
(usually connection timeout errors).

This is due to losing the mount processes running within the
CSI nodeplugin pods.

This PR updates the DaemonSet update strategy
based on an ENV variable to take care of the above issue,
with some manual steps

Moreinfo: ceph/ceph-csi#703

Resolves: rook#4248

Signed-off-by: Madhu Rajanna <[email protected]>
@ShyamsundarR
Contributor Author

The current strategy (as discussed with @Madhu-1) to address this is as follows:

  • Add an update strategy to the nodeplugin daemonsets to denote them as "OnDelete"
    • This prevents the nodeplugins from being upgraded automatically without user/admin intervention
  • Document the upgrade procedure to ensure that the node is drained prior to restarting the nodeplugin daemonset pod, hence preventing apps from losing access to storage

The above at least ensures that there are no surprises for app pods using the storage on said nodes and the upgrade can be admin controlled as well.
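
For reference, the "OnDelete" setting is the `spec.updateStrategy.type` field of a Kubernetes DaemonSet. The sketch below builds a merge patch for that field and the matching `kubectl patch` command line without running it. The helper names and the DaemonSet and namespace values are placeholders, and this is not necessarily how the Rook change sets the strategy.

```python
import json
import shlex


def on_delete_patch():
    """Merge patch that switches a DaemonSet to the OnDelete update strategy."""
    return {"spec": {"updateStrategy": {"type": "OnDelete"}}}


def patch_command(daemonset, namespace):
    """Build (but do not run) the kubectl command that applies the patch."""
    return [
        "kubectl", "patch", "daemonset", daemonset,
        "--namespace", namespace,
        "--type", "merge",
        "--patch", json.dumps(on_delete_patch()),
    ]


if __name__ == "__main__":
    # Placeholder DaemonSet and namespace names.
    print(shlex.join(patch_command("csi-cephfsplugin", "storage")))
```

With this strategy in place, changing the DaemonSet's pod template does not restart any nodeplugin pod; a pod is only replaced when an admin deletes it, which is what allows the node to be drained first.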

As noted above, @Madhu-1 is working on this in Rook to begin with: rook/rook#4496

@ShyamsundarR
Contributor Author

Here is another community discussion on the topic that is a useful read.

travisn pushed a commit to Madhu-1/rook that referenced this issue Dec 17, 2019
rajatsing pushed a commit to rajatsing/rook that referenced this issue Dec 17, 2019
BlaineEXE pushed a commit to SUSE/rook that referenced this issue Dec 17, 2019
zoetrope pushed a commit to cybozu-go/rook that referenced this issue Dec 26, 2019
mergify bot closed this as completed in #770 Jan 14, 2020
kfyharukz pushed a commit to cybozu-go/rook that referenced this issue Jan 23, 2020
binoue pushed a commit to binoue/rook that referenced this issue Apr 10, 2020