Rotating cloud instances with PVCs in a StatefulSet #181

Open
joekohlsdorf opened this issue Mar 14, 2020 · 31 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@joekohlsdorf

Online you can find many examples (even in the official docs) that show how to use the local-volume-provisioner in combination with PersistentVolumeClaims in a StatefulSet.

All works fine until a node goes away and your cloud provider brings up a new one, be it due to an issue on their side or due to you bringing up some new nodes because you are upgrading K8s.

What happens in this case is that the PVC stays bound to a PV which no longer exists. This prevents the pod in the StatefulSet from coming up until you manually delete the PVC. This makes sense, because there is no way of knowing whether the node was shut down for maintenance and will come back later, or whether it's gone forever.

However, I'd just like the node to be assumed dead, because I'm never going to reboot nodes intentionally; I'll just roll the cluster. If the pod can be scheduled on another node, I know 100% that the node was replaced (due to my affinity settings).
Is there any official way of dealing with this, or any config option I'm overlooking?

I can write a job which takes care of this but surely others must have hit this issue?!

@nerddelphi

@joekohlsdorf I guess we are facing the same issue: #65 (comment)

How do you plan to solve that?

@cofyc
Member

cofyc commented Apr 10, 2020

There is no Kubernetes-official way right now because Kubernetes will not unbind or delete PVCs. It's up to the users to recover from this situation. I have a plan to write a cloud controller to handle this automatically.

When a new node with a different name is created to replace the old node (e.g. auto-scaling group in AWS), PVs belonging to the old node are invalid. PVCs must be deleted, then the scheduler can find feasible PVs on other nodes. By the way, if pods have already been recreated on node deletion and are stuck at pending, they must be recreated to trigger StatefulSet to create PVCs again.

In GKE, this is a little different because the managed instance group recreates the underlying instance but uses the old node name.
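The recovery described above for replaced nodes starts with working out which PVCs are bound to PVs whose node is gone. Here is a minimal sketch of that step, assuming the inputs are the parsed `items` lists from `kubectl get pv -o json` and `kubectl get nodes -o json`. The function names are hypothetical, and the sketch only reports candidates; it deletes nothing.

```python
# Sketch only: operates on plain dicts parsed from kubectl JSON output.
HOSTNAME_KEY = "kubernetes.io/hostname"  # label used in local PV node affinity


def pv_hostnames(pv):
    """Return the hostnames a PV is pinned to through its required node affinity."""
    terms = (
        pv.get("spec", {})
        .get("nodeAffinity", {})
        .get("required", {})
        .get("nodeSelectorTerms", [])
    )
    hosts = set()
    for term in terms:
        for expr in term.get("matchExpressions", []):
            if expr.get("key") == HOSTNAME_KEY and expr.get("operator") == "In":
                hosts.update(expr.get("values", []))
    return hosts


def stale_claims(pvs, nodes):
    """Return sorted (namespace, name) pairs of PVCs bound to PVs on vanished nodes."""
    # Prefer the hostname label; fall back to the node name if the label is absent.
    live = {
        n["metadata"].get("labels", {}).get(HOSTNAME_KEY, n["metadata"]["name"])
        for n in nodes
    }
    stale = []
    for pv in pvs:
        hosts = pv_hostnames(pv)
        claim = pv.get("spec", {}).get("claimRef")
        # Skip PVs without a claim or without hostname affinity (not local PVs).
        if claim and hosts and hosts.isdisjoint(live):
            stale.append((claim["namespace"], claim["name"]))
    return sorted(stale)
```

Each reported pair would then be removed with `kubectl delete pvc -n <namespace> <name>`, followed by deleting any pod that is stuck in Pending so the StatefulSet recreates the claim. On GKE, where the replacement node keeps the old name, this check would find nothing.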

@joekohlsdorf
Author

joekohlsdorf commented Apr 10, 2020

I wrote a janitor which, every 20 seconds, looks for pending pods that have PVCs bound to PVs on dead hosts and removes the PVC if necessary. It then deletes the pending pod to get it scheduled again.

My nodes for this service are static and have labels, so I can be sure that the host isn't just rebooting. I know that my service runs on X nodes, so if I see X nodes online and a PV on a node that doesn't exist, I know that node is dead and not coming back.

If this doesn't happen on GKE maybe some workaround could be found with custom node tags. You could have an ASG for every node so tags would stay the same even if a node dies.
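The janitor's decision logic described above could be sketched roughly as follows. This is not the code from the gist; it is an illustration under stated assumptions: inputs are dicts parsed from `kubectl get ... -o json`, the set of stale claims has been computed elsewhere, "online" is simplified to "registered with the expected label", and every function and parameter name is hypothetical.

```python
def expected_nodes_online(nodes, label_key, label_value, expected_count):
    """True when at least expected_count nodes carry the given label."""
    matching = [
        n for n in nodes
        if n["metadata"].get("labels", {}).get(label_key) == label_value
    ]
    return len(matching) >= expected_count


def pods_to_recreate(pods, stale_claims):
    """Return (namespace, name) of Pending pods that mount a stale PVC.

    stale_claims is an iterable of (namespace, claim_name) pairs.
    """
    stale = set(stale_claims)
    hits = []
    for pod in pods:
        if pod.get("status", {}).get("phase") != "Pending":
            continue
        namespace = pod["metadata"]["namespace"]
        for volume in pod.get("spec", {}).get("volumes", []):
            pvc = volume.get("persistentVolumeClaim")
            if pvc and (namespace, pvc["claimName"]) in stale:
                hits.append((namespace, pod["metadata"]["name"]))
                break  # one stale claim is enough to flag the pod
    return hits
```

A loop would call both functions every 20 seconds and act only when `expected_nodes_online` returns true, which mirrors the "X nodes online" guard: if the pool is not back at full strength, the missing node might still be rebooting.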

@nerddelphi

Thanks for your answers, @cofyc and @joekohlsdorf.

@joekohlsdorf could you share your janitor with us? I'd be glad if you could.

@msau42
Contributor

msau42 commented Apr 10, 2020

@NickrenREN may have written a similar controller in the past

@joekohlsdorf
Author

Well, I would strongly advise against doing what I did, but here is the unedited janitor I hacked up. Please use it only as a reference; I had to get this done in a time crunch.
https://gist.github.com/joekohlsdorf/2658f03b1e1b6194ebe6b61bd8770647

@nerddelphi

Hi, @joekohlsdorf. Thank you for the script.

@NickrenREN
Contributor

NickrenREN commented Apr 13, 2020

There is an issue that is similar to this.
Some others and I proposed introducing NodeFencing to solve this, because it suits both cloud providers and bare metal and the reaction is relatively simple.
But others decided to go with the NodeShutdown taint method; there is an ongoing proposal: kubernetes/enhancements#1116.

We have actually implemented the NodeFencing feature (external controller and agent) in our own production environment.

@nerddelphi

nerddelphi commented Apr 13, 2020

@NickrenREN Are you using this implementation: https://github.com/kvaps/kube-fencing ? If so, what kind of agent do you use to deal with the PV/PVC issues?
My clusters are on GKE.

Thank you.

@NickrenREN
Contributor

@nerddelphi No, we implemented our fencing controller and agent ourselves.
The agent is designed to shut down machines forcefully; the control logic, race-condition handling and cleanup work are done by the controller.

@NickrenREN
Contributor

The design above is for bare metal; for cloud providers, it may be a little different.

@rsoika

rsoika commented Apr 18, 2020

I am sorry to enter this discussion even though I am not a Kubernetes expert like you.
But I have been dealing with this problem for some weeks, and I have also followed this long-running discussion.

I am running a simple Kubernetes cluster with only a few nodes. I guess this is a completely different environment from the ones you discuss here, but let me describe my scenario to give you a different view of the problem:

  • I have set up distributed storage based on Ceph or Longhorn (same behavior for both).
  • I deploy a PostgreSQL database using a persistent volume claim.
  • I kill (for testing) the node on which the database pod is running.
  • Now I run into the problem that Kubernetes gets stuck while restarting the database pod on a new node, because the broken pod does not get detached from the volume.
  • I have to manually delete the volumeattachment to get out of this situation.

I understand all your concerns about the data and what can happen to it if a volume is automatically detached.
But I - as the administrator of my cluster - trust in my Longhorn or Ceph Cluster. And of course, something can always go wrong, but that's my job to secure my data.

From my point of view, it is not Kubernetes' job to interfere in my data management. PLEASE give us a switch with which we can switch off this behavior and get terminating pods detached from a volume.
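The manual step described above (finding and deleting the affected volumeattachment objects) can be narrowed down with a small helper. This is a minimal sketch, assuming inputs parsed from `kubectl get volumeattachments -o json` and `kubectl get nodes -o json`; the function names are hypothetical. It only lists candidates, since force-detaching a volume that is still mounted on a live node risks data corruption.

```python
def node_is_ready(node):
    """True if the node reports a Ready condition with status "True"."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False  # no Ready condition reported at all


def attachments_on_dead_nodes(attachments, nodes):
    """Return names of VolumeAttachments whose node is missing or not Ready."""
    ready = {n["metadata"]["name"] for n in nodes if node_is_ready(n)}
    return sorted(
        va["metadata"]["name"]
        for va in attachments
        if va.get("spec", {}).get("nodeName") not in ready
    )
```

Each reported name could then be passed to `kubectl delete volumeattachment <name>` once the administrator is certain the node is really powered off.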

@NickrenREN
Contributor

NickrenREN commented Apr 20, 2020

@rsoika Thanks for your input.
IIUC, your scenario is a case NodeFencing can solve. If the node is dead (or Unknown), it will be forced to shut down, and we do not expect it to come back. As you described, data management isn't Kubernetes' job, so the reaction is easy: go ahead and detach the volume forcefully.
And of course, if you want to bring your node back, you need to do the cleanup work first (this is also the work of the relevant Kubernetes team).

@rsoika

rsoika commented Apr 20, 2020

@NickrenREN Thanks for your clarification. So there is no self-healing mechanism in Kubernetes for this scenario?

@NickrenREN
Contributor

@rsoika For now, that's right: there is none.

@NickrenREN
Contributor

@rsoika Since progress on the "Node Shutdown Taint" feature is slow, we are considering creating a new proposal and project to open-source the "NodeFencing" solution. It could be another option.

@nerddelphi

@joekohlsdorf Hi.

Are your PVs (bound to the deleted PVC) deleted as well? In my cluster (GKE) they remain in status Released, even after the PVC has been deleted by a janitor and even though the reclaim policy of my StorageClass is Delete.

Are you experiencing that behavior?

I guess I wouldn't be billed for a non-existent local SSD, so I should find a way to delete these Released PVs as well.

@cofyc @NickrenREN Is that behavior normal/expected? Shouldn't the previous PVs be deleted automatically once their PVCs no longer exist?

Thanks.

@cofyc
Member

cofyc commented Apr 21, 2020

If nodes which these PVs belong to do not exist anymore, you need to delete these PVs manually because no local-volume-provisioner can run on these nodes and recycle them.
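Finding the PVs that need this manual deletion can be scripted. Here is a minimal sketch, assuming inputs parsed from `kubectl get pv -o json` and `kubectl get nodes -o json`, and assuming the PVs carry the usual `kubernetes.io/hostname` node affinity of local volumes. The function name is hypothetical and nothing is deleted by the code itself.

```python
HOSTNAME_KEY = "kubernetes.io/hostname"  # label used in local PV node affinity


def released_orphans(pvs, nodes):
    """Return names of Released PVs pinned only to hosts that no longer exist."""
    live = {
        n["metadata"].get("labels", {}).get(HOSTNAME_KEY, n["metadata"]["name"])
        for n in nodes
    }
    orphans = []
    for pv in pvs:
        if pv.get("status", {}).get("phase") != "Released":
            continue
        terms = (
            pv.get("spec", {})
            .get("nodeAffinity", {})
            .get("required", {})
            .get("nodeSelectorTerms", [])
        )
        hosts = {
            value
            for term in terms
            for expr in term.get("matchExpressions", [])
            if expr.get("key") == HOSTNAME_KEY
            for value in expr.get("values", [])
        }
        # Only flag PVs whose every pinned host is gone from the cluster.
        if hosts and hosts.isdisjoint(live):
            orphans.append(pv["metadata"]["name"])
    return sorted(orphans)
```

Each returned name could be removed with `kubectl delete pv <name>`. Released PVs on nodes that still exist are left alone, because the provisioner running there can still recycle them.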

@NickrenREN
Contributor

NickrenREN commented Apr 21, 2020

@nerddelphi For now, the k8s controller will just send Delete events (setting the deletion timestamp), and as @cofyc said, the drivers (or kubelet) on the broken node are down too, so they won't do the cleanup work.
But with the NodeFencing feature, these PVs can be released automatically (forcefully).

@rsoika

rsoika commented Apr 21, 2020

Is the NodeFencing feature officially planned, or is it still only under discussion?
I found these projects that seem to address the problem:

https://github.com/kvaps/kube-fencing
https://github.com/rootfs/node-fencing

@NickrenREN
Contributor

IIRC, NodeFencing was discussed before but we didn't reach an agreement 😓

@nerddelphi

Ok, guys. Thank you!

@rsoika

rsoika commented Apr 21, 2020

@NickrenREN can you share the discussion about the NodeFencing feature? I would like to better understand the background.

@NickrenREN
Contributor

It was originally discussed here: kubernetes/kubernetes#65392
We also discussed it several times offline on Slack.

There are also several KEPs, but they didn't get merged:
kubernetes/community#2763
kubernetes/community#1416

We didn't reach an agreement, and if needed, I'd like to reopen the discussion.

@rsoika

rsoika commented Apr 22, 2020

I can't believe that this is true. I invested so much time to migrate from docker-swarm to Kubernetes. Now I have had to learn that Kubernetes is not the self-healing system it is promoted as everywhere. I think I understand the discussion and the concerns about the pros and cons very well. But I am personally not at a level where I can discuss this in the referenced groups.

It is absolutely strange: it makes no sense to set up a Ceph cluster and connect it to my Kubernetes cluster because of this limitation. I am running a small environment with about 100 pods on 5 virtual nodes hosted by my cloud provider (Hetzner).
I can be sure that if my cloud provider has a problem in one of its data centers (which are spread over different locations in Germany), my applications running on the affected node will be stuck in the terminating state. My customers will call me because they can no longer work. I have to figure out all the affected volumeattachments and delete them manually. This is of course no solution. We are a small company with no 24x7 admin team.

My only hope now is that the Longhorn team will solve this issue in their storage solution without help from the Kubernetes framework.

I can't believe that Kubernetes is only focusing on stateless services...
I am not only talking about databases like Postgres but also about services like Apache Solr for full-text search indexes or the spaCy project for ML services. All these services need a data volume in the end. If you see a way to re-energise this discussion, I would like to support you.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 21, 2020
@cofyc
Member

cofyc commented Jul 21, 2020

/remove-lifecycle stale
/lifecycle fronzen

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 21, 2020
@cofyc
Member

cofyc commented Jul 21, 2020

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jul 21, 2020
@oomichi

oomichi commented Jun 7, 2021

/cc @oomichi

@eduardobr

eduardobr commented Jun 29, 2022

Does this look like a solution that Azure Kubernetes Service has implemented in its own Container Storage Interface (CSI) driver?
https://azure.microsoft.com/da-dk/updates/public-preview-azure-disk-csi-driver-v2-in-aks/

https://github.com/kubernetes-sigs/azuredisk-csi-driver/tree/main_v2

10 participants