Persistent/Durable etcd cluster #1323
Comments
Frequent majority loss MUST be avoided. Or you should NOT use etcd at all. etcd is not designed for handling frequent majority loss. Also when etcd loses majority, etcd operator will try to seed from one existing member's data if possible. Or it will try to recover from existing backups if any. An on-going effort is to support continuous backup. Remember that PV is just another cloud managed backup for etcd. It is not magic. I am fine with adding PV support. It should not be a hard feature to add. PV just makes things a little bit easier to reason about and makes management easier. However, it is not the best way to run etcd. etcd definitely supports domain addresses. And in etcd operator, we use FQDN all over the place. The issue you pointed out has nothing to do with peerURL in update. ListenURL and advertise URL are different things. I do not think you understand it well. |
Do you mind explaining your use case? When people want PV, usually creating a one member etcd deployment in k8s with PV is good enough for them. Clustering might not even be needed. |
I didn't mean frequent as in N times a week, but in one year of production infrastructure this could happen a few times. If my DC has a major problem, or during an upgrade someone shuts down some k8s nodes hosting a majority of the etcd pods, I'd like my services to restart cleanly once everything is active again, without needing to recover from a backup.
That's a good thing, but I haven't noticed this behavior. I'll test this more thoroughly (and look deeper at the code) since the README doesn't say this (is it documented somewhere?)
This will be a great thing. I imagine it will be continuous but asynchronous, right?
D'oh! Looks like I was wrong and misread etcd-io/etcd#6336: I got errors when putting an FQDN in --initial-advertise-peer-urls, but the problem was probably the domain in the listen peer URL... So this will make it easier to implement the second part of my proposal (I updated it) if you can provide persistent domain names in peerURLs.
stolon saves the cluster state (the main information is the current pg primary/master instance) inside etcd. Restoring an etcd cluster from a backup means restoring an old cluster state; if a new master was elected between the backup and the disaster, this will cause problems. I'll try another example: consider the etcd cluster backing the k8s API. What will you do if all 3 nodes restart? I'd personally just wait for them to come back, since the cluster will continue working (you just can't make changes), the etcd members have persistent data, and when they come back the cluster will become functional again (if I haven't permanently destroyed a majority of the nodes). I won't create a new cluster and restore a backup if I can avoid it, since I can end up in ugly situations if I made some changes between the backup and the reboot. I'd like to achieve the same with an etcd cluster inside k8s using etcd-operator.
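The stale-state concern above can be illustrated with a toy model. This is only a sketch: `TinyStore` is a hypothetical in-memory store, not an etcd client, and the `primary` key is an assumption made for illustration rather than stolon's real key layout.

```python
import copy


class TinyStore:
    """Toy key-value store with snapshot/restore; not an etcd client."""

    def __init__(self):
        self.data = {}
        self.revision = 0

    def put(self, key, value):
        # Every write bumps a store-wide revision counter.
        self.revision += 1
        self.data[key] = value

    def snapshot(self):
        # Deep copy so later writes cannot leak into the backup.
        return self.revision, copy.deepcopy(self.data)

    def restore(self, snap):
        revision, data = snap
        self.revision = revision
        self.data = copy.deepcopy(data)


store = TinyStore()
store.put("primary", "pg-0")   # hypothetical key, not stolon's real layout
backup = store.snapshot()      # periodic backup is taken here
store.put("primary", "pg-1")   # a failover happens after the backup
store.restore(backup)          # majority lost, cluster rebuilt from backup
print(store.data["primary"])   # the demoted instance is recorded as primary again
```

Members that keep their data on persistent storage would instead come back with the latest committed state, which is the behavior requested in this issue.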
I can already achieve a multi-member etcd cluster inside k8s without etcd-operator, using one or more statefulsets instead. I was just proposing to make etcd-operator achieve the same goal while keeping all the other great etcd-operator features. Regarding the single-member etcd cluster suggestion: unfortunately, with the current state of k8s, using a single member with a PV means that if the k8s node dies, you wait some minutes for it to be declared dead and for the PV (if using a block-device-based PV like AWS EBS) to be detached from that node and attached to a new one. In addition, if detaching fails for some reason, you'll need to wait longer or do some manual operations (there are many other possible cases depending on how you're deploying your k8s cluster and the underlying storage used for PVs). |
Affinity is the way to solve this.
This is a totally different story. etcd operator does not have to remove the pod immediately. It is about how long we want to wait for the pod to come back. There is no perfect solution for this. After adding local volume support, we can let users configure how long they want to wait.
There are tons of things your stuff does not handle, I assume, or it becomes another etcd operator. The complexity is not really about the initial deployment. It is about the ongoing maintenance, failure detection, and failure handling. For example, how do you add a member to scale up the cluster? How do you back up the cluster to recover from a bad state? Statefulset is not flexible enough to achieve quite a few things easily, and the benefits it brings right now are not significant.
If you want to cycle it faster, you can write a simple monitoring script to do it. k8s will still handle the PV for you. If you do not trust PV, then 3 nodes with PV won't help either. After reading through your opinions and use case, I feel all you want is PV support. I am fine with adding this feature; all we need to do is change the pod restart policy to Always and change emptyDir to PV initially. @sgotti it would be great if you can work on it. |
Right. That's the reason I would like to improve etcd-operator to be able to handle cases where you lose the majority of members without being forced to restore from a backup 😄
Correct me if I'm wrong, but I'm not sure that just adding PV support with the current etcd-operator pod management logic will be enough. That's why my proposal feels a bit invasive: it tries to change the etcd-operator pod management logic to something similar to that of statefulsets. Let me explain: if we keep the current operator logic (even with restartPolicy Always) and node01, which is running etcd-0000, dies or is partitioned for some minutes, the node controller will start pod eviction, marking all its pods for deletion (a deletion that is blocked while node01 is considered partitioned). If node01 comes back or is permanently removed from the cluster (or we force delete the pod etcd-0000 using grace-period=0), the pod will be deleted. So etcd-operator will schedule a new pod (etcd-0004) (to be sure, I just tried this and I see this behavior on k8s 1.7). If we just add the ability to define PVs (I think using something like a PVC template so we can handle dynamic provisioning, multizone, etc.) that will be attached to etcd pods (say etcd-0000 to 0003), then in the above example etcd-0004 will get a new PV (not the one previously attached to etcd-0000). So I don't see the difference or the gain in using a PV this way. With the statefulset logic, instead, you always have pods with fixed names and fixed PVs. If a pod is deleted from a node, the statefulset controller will recreate a pod with the same name and the same PV, which will be scheduled on another node.
As above: when a node dies or is partitioned (for whatever reason), the pod is automatically marked for deletion by the node controller, so etcd-operator currently creates a new pod with a different name. |
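The difference between the two replacement behaviors described in the comment above can be modeled in a few lines. This is a simplified sketch, not code from etcd-operator or the statefulset controller: the `Pod` type, the `pv-<name>` volume naming, and both function names are assumptions made for illustration.

```python
from typing import Dict, NamedTuple


class Pod(NamedTuple):
    node: str
    volume: str


def replace_operator_style(pods: Dict[str, Pod], dead: str,
                           new_node: str, next_id: int) -> Dict[str, Pod]:
    """Counter-based replacement: new pod name, hence a brand-new volume."""
    remaining = {name: pod for name, pod in pods.items() if name != dead}
    new_name = f"etcd-{next_id:04d}"
    # Hypothetical naming: the volume is derived from the new pod name,
    # so the data written by the dead member is not reattached.
    remaining[new_name] = Pod(node=new_node, volume=f"pv-{new_name}")
    return remaining


def replace_statefulset_style(pods: Dict[str, Pod], dead: str,
                              new_node: str) -> Dict[str, Pod]:
    """Fixed-identity replacement: same pod name, same volume, new node."""
    updated = dict(pods)
    updated[dead] = Pod(node=new_node, volume=pods[dead].volume)
    return updated


cluster = {
    "etcd-0000": Pod("node01", "pv-etcd-0000"),
    "etcd-0001": Pod("node02", "pv-etcd-0001"),
    "etcd-0002": Pod("node03", "pv-etcd-0002"),
}
by_operator = replace_operator_style(cluster, "etcd-0000", "node04", next_id=4)
by_statefulset = replace_statefulset_style(cluster, "etcd-0000", "node04")
print(sorted(by_operator))     # etcd-0000 is gone, etcd-0004 appears
print(sorted(by_statefulset))  # the member names are unchanged
```

In the first model the old member's data is orphaned together with its name; in the second the replacement pod reopens the same data directory on a different node.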
I do not think it is as complicated as you described. There are only four cases:
We can start with 1 as I mentioned:
Some membership updates might be involved for 2,3,4, but none of them are complicated I think. |
This is really different from what you originally described. And, yes, I am aware of this. I do not think it changes anything as I described. This is just another case of a "forced" pod movement. |
Yeah, I think these cases will help achieve the goal. I started implementing 1 and the needed changes to the reconcile logic for 2, 3, 4, just to check that it all fits together. I'll open an RFC PR for 1 in the next days (or next week) since I'd like to see if you agree on some choices. |
@sgotti That is great! Thanks! |
xref: #1434 |
Plan:
Let me clarify the four points in the plan:
|
I'm very interested in this feature for more resiliency under extreme outage. I'd rather lose some availability but have everything come back automatically without losing any data than have to load backups manually. I tested the patch in @hongchaodeng's branch merged with master yesterday. All seems good so far. I'm interested in how volumes are handled for pod replacement during the restart/partition/eviction phases described above. Is the milestone 0.8.1 still accurate? |
Note New finding suggests that we can use |
@alexandrem I would definitely encourage you to test those failure scenarios in order to prove PV fixes those issues. If any functionality is missing, please communicate and we are more than glad to merge any fix. I would expect this feature to be a stable release blocker. |
PV helped a bit under those stress tests, but didn't solve it entirely. Another thing that is required while using that PV feature is to change the restartPolicy of the pod members to Always. Otherwise, operator will attempt to replace the pods and obviously we don't have the logic to move volumes around yet. I decided to actually replace the etcd-operator with a statefulset implementation for my specific use case. I still believe that PV is an interesting addition to the operator though. |
Yes. We are aware of this. See #1861 (comment).
What's your specific reason to switch to statefulset? What advantages does statefulset provide? We would like to have issues to track that. |
My principal use case is to host Kubernetes control plane components on a Kubernetes cluster (kubeception-like). We want to offer managed Kubernetes clusters on-demand to different teams inside our company. I have built a solution to do lifecycle operations of Kubernetes cluster resources via an API that are hosted on a global cluster. We don't have a global shared Kubernetes cluster for everyone. We instead want to offer dedicated clusters on-demand, something similar to GKE. One of those master components is obviously the etcd cluster. We need to have persistent and very resilient etcd clusters. We can't afford to lose data; it would be catastrophic for users to have an outage and have all their pod members removed, since they would lose their Kubernetes cluster entirely. Obviously we need backups, but we also need to automate the whole thing as much as possible. I could have hundreds of clusters hosted and cannot afford to manually restore the cluster state for each of them if something goes wrong. A second use case we have is hosting the Quay docker registry on-prem using a PostgreSQL database on Kubernetes. There is a database proxy system via stolon that uses an etcd cluster to do leader election and route to the master database member. If the etcd cluster is unavailable, then no access to the database is possible and this creates a global disruption of service. Lately, we've suffered from a networking outage which has impacted the etcd cluster managed by the operator for both of those use cases. We are currently hosting this in a private OpenStack cloud and there was a bad configuration in the networking layer on the hypervisors that was pushed for a short time. This created a global outage for both Kubernetes and the etcd-operator hosted on top of it. When etcd-operator loses its quorum, then bad things ensue and etcd pod members get deleted after a few minutes. At this point, nothing recovers the cluster automatically.
This was tested on 0.5+ up to 0.7.x. Fortunately, no production Kubernetes clusters were impacted (all of this is very experimental so far), but there was a big outage on the docker registry instance and it required manual operation to recover it from backups. I have found that using a statefulset with PV is more resilient. It would always recover itself automatically following either partial or global networking outages. I think a lot of improvements are needed in the way etcd-operator handles split brain scenarios. Losing the quorum of members translates into the cluster being considered dead, and then member deletions happen. I believe another problem can arise if only the etcd-operator is separated under a network split. I think if it cannot communicate with kube-apiserver then it might fall into a logic loop where it considers the cluster dead, then attempts to issue a few pod delete operations, then when kube-apiserver communication is restored it still proceeds to delete the cluster regardless. I would need to double check this particular case in the code. I believe it would help to introduce configurable strategies to handle split brain scenarios in the operator (new fields in the cluster resource definition). For instance, if members are disrupted, then one strategy could be to not delete pods beyond the quorum size. Akka has implemented and documented those use cases, maybe something we could get inspiration from. My strategy above is essentially what they call "static quorum". https://developer.lightbend.com/docs/akka-commercial-addons/current/split-brain-resolver.html |
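The "do not delete pods beyond the quorum size" strategy suggested above could look roughly like the guard below. This is one possible reading of that sentence, sketched as a pure function: `pods_safe_to_delete` is a hypothetical name, and nothing here reflects existing etcd-operator code or an existing field in the cluster resource definition.

```python
from typing import List


def quorum(cluster_size: int) -> int:
    """Smallest number of members that forms a majority."""
    return cluster_size // 2 + 1


def pods_safe_to_delete(cluster_size: int, unreachable: List[str]) -> List[str]:
    """Return the unreachable pods that may be replaced right now.

    Replacement is allowed only while the reachable members still form a
    quorum. Once a majority is unreachable, nothing is deleted, so the
    members and their data stay in place until they come back.
    """
    reachable = cluster_size - len(unreachable)
    if reachable < quorum(cluster_size):
        return []  # majority lost: wait instead of deleting anything
    return list(unreachable)


print(pods_safe_to_delete(3, ["etcd-0001"]))               # minority down
print(pods_safe_to_delete(3, ["etcd-0001", "etcd-0002"]))  # majority down
```

With a guard of this shape, a global network outage would cost availability for its duration but would not trigger member deletions, which matches the trade-off asked for in the comment.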
I think your use cases fit with what etcd-operator is designed to do. I feel sorry that you encountered troubles and we didn't resolve them in time. Last week we just made a step forward in #1323 (comment). The use cases and issues you described are in progress right now. Please keep following and sharing feedback; I would love to hear all of it.
When losing connection to apiserver, etcd-operator wouldn't do anything. Once connected again, it would compare the current state with the desired state and reconcile.
I don't understand what split brain issues you found. It sounds orthogonal to this issue. Could you open a new issue and describe it in more detail? |
@hongchaodeng @xiang90 just wanted to track this. Is it on the internal roadmap with a timeline? |
@xiang90 wrote 11 months ago:
Did this happen? @hongchaodeng You wrote:
..but this is #1323. Would you remember from Jan, if you intended to refer to some other issue/PR? |
I would like to express interest in seeing etcd w/ durability. cc: @robszumski @atinmu @kshlm |
Can someone provide a brief update on the status? I've been reading through the comments and linked issues and am confused about what state we are in now. It seems the PV changes have been merged in but not released as stable, and the logic for handling all three failure cases doesn't seem to have been finished. I'm trying to get a sense of stability to better assess whether etcd-operator is a good fit for our use case. |
Not really usable in production. It doesn't do much, only deploying etcd but if something goes wrong (and it will), it won't do anything about it. I initially thought that the idea behind having an operator is that it would try to bring it back online. |
This is a simple fix that addresses Case C from https://github.com/coreos/etcd-operator/blob/master/doc/design/persistent_volumes_etcd_data.md It makes the etcd cluster with PVC able to recover from full k8s cluster outage. This fixes coreos#1323 inspired by coreos#1323 (comment)
I'd like to be able to use etcd-operator to achieve a persistent/durable etcd cluster. I'd like to avoid, as much as possible, the need to restore from a backup just because a majority of members die at the same time (this could happen many times, for many unpredictable reasons); a restore should only be needed when I know that my members will never come back.
If I know that they will come back (rescheduled by etcd-operator), I'd prefer losing availability for some time (while waiting for them to come back).
Right now it looks like etcd-operator cannot achieve this. From the README.md:
I did a little analysis on how etcd-operator works in this post https://sgotti.me/post/kubernetes-persistent-etcd/
Right now etcd-operator directly schedules pods (it acts like a custom controller) and uses a k8s emptyDir volume for member data. Every time a pod dies (or is manually deleted), a new replacement pod is created and the old member is removed from the etcd cluster. When a majority of pods die at the same time, the cluster cannot be restored, for two primary reasons:
To fix these points I can see the below solutions, but, probably, this will require a big change in the current etcd-operator architecture:
Use "persistent" etcd member data.
Persistent peer URL addresses.
The cluster peerURLs contain the pod names, so this list changes every time a replacement pod is created (with increasing numbers starting from 0000).
When creating a replacement pod you end up with a pod with a new FQDN. So etcd-operator removes the old etcd member and adds a new one. But this only works when losing a minority of the etcd members. If you lose a majority of the etcd members, the etcd cluster is unquorate and won't accept a member update. Using stable peer URL addresses instead (something like etcd-$i with i from 0 to cluster size-1) avoids this problem.
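A small sketch of the arithmetic behind this point: a membership change needs a majority of the current members, so counter-based names cannot be swapped in once a majority is gone, while stable names need no membership change at all. The helper names below are hypothetical and the naming formats simply follow the examples in this issue (etcd-0000 style versus etcd-$i).

```python
from typing import List


def quorum(cluster_size: int) -> int:
    """Smallest number of members that forms a majority."""
    return cluster_size // 2 + 1


def can_update_membership(cluster_size: int, healthy: int) -> bool:
    """Member add/remove must be committed by a majority of members."""
    return healthy >= quorum(cluster_size)


def replacement_name(prefix: str, counter: int) -> str:
    """Counter-based naming: every replacement pod gets a fresh name."""
    return f"{prefix}-{counter:04d}"


def stable_names(prefix: str, size: int) -> List[str]:
    """Stable naming: etcd-0 .. etcd-(size-1), reused across restarts."""
    return [f"{prefix}-{i}" for i in range(size)]


size = 3
for healthy in range(size + 1):
    print(healthy, "healthy ->", can_update_membership(size, healthy))

print(replacement_name("etcd", 4))  # a name the surviving peers have never seen
print(stable_names("etcd", size))   # the peer list stays the same after restarts
```

For a 3-member cluster the update is possible with 2 or 3 healthy members and impossible with 0 or 1, which is exactly the case where replacement pods with new names cannot be registered.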
One option to use "persistent" network names could be:
About this last point: before etcd-io/etcd#6336 you were able to provide a peer URL containing a domain name. So instead of using "persistent" peer IP addresses you could use "persistent" network names, for example by creating a headless service with an endpoint for every member pod that resolves a fixed member name (like etcd-0, etcd-1, etc.) to the IP of the current pod (as done by a k8s statefulset). But since etcd-io/etcd#6336 a peer URL accepts only an IP address, so the above solution won't work. Another solution would be to define a service with a cluster IP for every member pod, with a label selector pointing to just that pod, and use these service IPs in the cluster peers list, which will never change (except when resizing the cluster). Some possible downsides are that the etcd member packets now need to pass through kube-proxy (but with the default iptables-based kube-proxy the overhead should be negligible) and that pod1 -> service -> pod1 packets require enabling kubelet hairpin mode (see a better explanation in the above post).
If these solutions look impractical or not clean, another solution would be to add a way in etcd to force a peerURL update even when the cluster is unquorate (to be done when all members are stopped?), which etcd-operator would use when trying to replace a majority of pods.