All leases are revoked when the etcd leader is stuck in handling raft Ready due to slow fdatasync or high CPU. #15247
Thanks @aaronjzhang for raising this issue. It looks like a major design issue to me. When the etcdserver is blocked on fdatasync, there is no chance for raft to notify the etcdserver of the leader change. I haven't had time to dive deep into the new feature "raft: support asynchronous storage writes"; my immediate feeling is that it should be able to resolve such an issue. @tbg @nvanbenschoten could you double confirm this? But we also need to resolve it for etcd 3.5, which still depends on the old raft package. Note you can reproduce this issue using the failpoint (see the Procfile).
It could help with being more resilient to disk stalls, but it wasn't designed to do just that, so this is just speculation. In CockroachDB we only use async log appends; log application is still in the same loop, so we don't have a ton of experience there either. Plus, we haven't shipped it. I don't understand exactly what the problem in etcd is that is described above, but it's probably similar to issues we have in CockroachDB when the disk stalls and turns into a grey failure. What we do there is detect disk stalls and fatal the process. We don't try to handle it gracefully.
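For reference, a minimal self-contained sketch of the "detect disk stalls and fatal the process" approach described above (my own illustration, not CockroachDB's or etcd's actual code; all names and thresholds are made up): a background goroutine periodically writes and fsyncs a small probe file and terminates the process if the probe doesn't complete in time.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// startDiskStallDetector periodically writes and fsyncs a small probe file
// under dir, and terminates the process if a probe does not complete within
// maxStall. Hypothetical helper for illustration only.
func startDiskStallDetector(dir string, interval, maxStall time.Duration) {
	go func() {
		probe := filepath.Join(dir, ".disk-stall-probe")
		for range time.Tick(interval) {
			done := make(chan struct{})
			go func() {
				if f, err := os.Create(probe); err == nil {
					f.WriteString("ping")
					f.Sync() // the fsync is what stalls on a bad disk
					f.Close()
				}
				close(done)
			}()
			select {
			case <-done:
			case <-time.After(maxStall):
				// Crash instead of limping along as a stuck (possibly leader) member.
				log.Fatalf("disk stall detected: probe write did not finish within %v", maxStall)
			}
		}
	}()
}

func main() {
	startDiskStallDetector(os.TempDir(), 5*time.Second, 30*time.Second)
	select {} // stand-in for the real server loop
}
```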
Thanks @tbg. This seems to be the key point. Could you be a little more specific about this?
Thank you @tbg! Will take a look later.
#13527 is related.
Thanks @tbg for the input. When there is something wrong with the storage, etcd may get stuck syncing data to the storage files. When running into this situation, the etcd member may not be able to process any requests, and may even become unresponsive when it is being removed from the cluster, e.g. issues/14338. If the blocked member is the leader, it may cause two issues:
I am thinking of proposing to terminate the etcd process when it doesn't finish syncing the data (WAL or boltdb) within a configured max duration (we can define two duration thresholds, one for the WAL and the other for boltDB). When the member starts again, it will not be elected as the leader anymore if the storage issue isn't resolved, so the two issues mentioned above will not happen. Any immediate comments or concerns? If it looks good overall, then I will provide a detailed design doc.
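For concreteness, a minimal sketch of how such a watchdog around the sync calls could look (names, thresholds, and the helper are made up for illustration; this is not the actual etcd code or proposal):

```go
package main

import (
	"log"
	"time"
)

// syncWithDeadline runs syncFn (think: the WAL fdatasync, or the bbolt commit
// fsync) and terminates the process if it does not return within max. The two
// call sites would use separately configured thresholds.
func syncWithDeadline(name string, max time.Duration, syncFn func() error) {
	done := make(chan error, 1)
	go func() { done <- syncFn() }()
	select {
	case err := <-done:
		if err != nil {
			log.Fatalf("%s sync failed: %v", name, err)
		}
	case <-time.After(max):
		// Exiting lets the member restart; while the storage is still bad it
		// cannot win an election, so a stuck node stops acting as leader.
		log.Fatalf("%s sync exceeded %v, terminating", name, max)
	}
}

func main() {
	// Hypothetical thresholds: one for the WAL, one for boltdb.
	walTimeout, backendTimeout := 5*time.Second, 30*time.Second

	syncWithDeadline("wal", walTimeout, func() error {
		time.Sleep(100 * time.Millisecond) // stand-in for the real WAL sync
		return nil
	})
	syncWithDeadline("backend", backendTimeout, func() error {
		time.Sleep(100 * time.Millisecond) // stand-in for the real bbolt commit
		return nil
	})
	log.Println("both syncs completed in time")
}
```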
+1 on the proposal. The issue we saw was that kube-apiserver watches still connected to the old member, which was stuck in disk writes and serving no updates. Terminating the "stuck in disk writing" etcd process is the right thing to do IMO. It is now fulfilled by a nanny agent but could have been done in etcd. For data consistency risks:
On the other hand, the linearizability test already has an equivalent failure injection point.
I don't see any data consistency issue. I will clearly clarify & explain the details in the design doc. Note that it would be a critical issue if we saw any data inconsistency in whatever situation, unless the storage hardware is broken.
+1 for the proposal. Let the old leader step down and raise an alert. The admin/operator can then notice that the cluster is not healthy, even if the old leader process is in D state and waiting for the pid reaper.
+1 for @ahrtr's proposal. I think the root cause is that some part of the lease-related state transition isn't controlled by Raft, so
Thanks all for the feedback. The first draft design doc: https://docs.google.com/document/d/1U9hAcZQp3Y36q_JFiw2VBJXVAo2dK2a-8Rsbqv3GgDo/edit# |
Great proposal. I see there are a lot of comments in the doc around the specificity of this problem. I agree with @serathius though; we should look at this more from a general failure PoV. If it's not disk stalls, it's a slow network. Therefore I do like the split between detection and action (or remediation). Clearly we don't need to implement all of it to fix the issue at hand, but having something more generic in place will allow us to add more over time. Just to summarize some strategies that were proposed here or in the doc:
Stall Detection
Stall Remediation
I thought about another lightweight mitigation for the issue: let LeaseRevokeRequest have a field for the Raft term at which it was issued. This branch has a rough implementation: https://github.com/mitake/etcd/commits/lease-revoke-term With the change, a node (which thinks it's a leader) issues LeaseRevokeRequest including its term information. When other nodes apply the request, they compare the term in the request with their own term. If the term from the request is older, they simply discard the request. I think this can be an ad hoc mitigation for the problem. I briefly tested the above branch with the failpoint testing (#15247 (comment)) and it seems to be working. Pros:
Cons:
I think a similar thing can be done for lease renew by adding term information to LeaseKeepAliveRequest. This approach is a very ad hoc and dirty change, so I'm not sure it's worth trying. I think the watchdog approach will be good for being defensive against other issues caused by slow I/O (but this approach doesn't conflict with the watchdog). If you have comments, please let me know.
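To make the idea concrete, here is a toy, self-contained sketch of the apply-side check (the type and field names are illustrative, not the real etcd proto; a zero term stands in for requests from an older version, which is where the compatibility question comes in):

```go
package main

import "fmt"

// leaseRevokeRequest is an illustrative stand-in for the real proto message,
// extended with the Raft term of the node that issued it.
type leaseRevokeRequest struct {
	LeaseID int64
	Term    uint64 // term of the (self-believed) leader when it issued the revoke
}

// applyLeaseRevoke drops revocations issued at an older term, so a stale
// leader that was stuck in fdatasync cannot revoke leases after a new leader
// has been elected.
func applyLeaseRevoke(currentTerm uint64, req leaseRevokeRequest, revoke func(int64)) {
	if req.Term != 0 && req.Term < currentTerm {
		fmt.Printf("dropping stale lease revoke for lease %d (request term %d < current term %d)\n",
			req.LeaseID, req.Term, currentTerm)
		return
	}
	revoke(req.LeaseID)
}

func main() {
	revoke := func(id int64) { fmt.Printf("lease %d revoked\n", id) }

	applyLeaseRevoke(7, leaseRevokeRequest{LeaseID: 100, Term: 7}, revoke) // applied
	applyLeaseRevoke(7, leaseRevokeRequest{LeaseID: 101, Term: 5}, revoke) // dropped: stale term
	applyLeaseRevoke(7, leaseRevokeRequest{LeaseID: 102, Term: 0}, revoke) // applied: no term (old client)
}
```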
Looking at the scope, this has become a whole new feature for etcd to detect getting stuck on disk.
It isn't a blocker for 3.4.26 and 3.5.9. The big challenge is to sort out how to prevent leases from being mistakenly revoked. I will do a deep dive into it once we have more confidence in bbolt. @mitake's comment above (#15247 (comment)) makes some sense, but there might be some potential compatibility issues. Will dig into it later as well.
Could anyone help to backport #16822 to 3.5 and 3.4? |
Hi, I was looking for some good first issues to start with. I would like to pick this up if no one else is currently working on this. Please let me know. |
/assign @amit-rastogi
Thanks Amit - assigned to you to complete the backports; please let us know if you have any questions 🙏🏻
Hi @ahrtr , sorry for the delay. Our lease's TTL is 15s. Sorry for the confusion, but the etcd server version is actually based on the image: Unluckily, when I use |
Please read etcd-io/website#640
We haven't backported the fix to 3.5 and 3.4 yet. Could you try the main branch?
Thank you @ahrtr for the guidelines. For the main branch, can I follow the steps below to build the images?
Alternatively, after finishing the first step and building the source code, I will get the 'etcd' binary. Then I can replace the old 'etcd' binary in the existing image 'k8s.gcr.io/etcd:3.5.0'. These are the two ways I could use for the main branch. If I am wrong, please correct me. In general, where can we obtain the official image after the fix is backported to 3.5 and 3.4?
Yes, it should work.
etcd uses gcr.io/etcd-development/etcd as a primary container registry, and quay.io/coreos/etcd as secondary. Please refer to https://github.com/etcd-io/etcd/releases
Hi @ahrtr, unfortunately, when executing the build command, it seems that some packages' versions are mismatched. As the following picture shows, the Go file is missing
Hi @ahrtr, the above errors may have come from the old Go version; after I updated the Go version, it appears to work. To test the issue further, I reproduced it with the steps below:
After going through the above 4 steps, the image my/etcd:3.5.11 will be created. Since /bin/sh was removed from the image (#15173), I couldn't build a new image based on my/etcd:3.5.11 with a customized shell script file. So I just copied the 3 files etcd/etcdctl/etcdutl into my old image k8s.gcr.io/etcd:3.5.0, built a new image based on it (suppose the new image name is k8s.gcr.io/etcd:3.5.11), and then started to test the case again. To test the new image, I prepared some test tasks:
To start the test
[Get ETCD Key Log]
[ETCD Server Log]
From the server log, we can see the
PS: I am not sure whether copying only the 3 files etcd/etcdctl/etcdutl is enough to test. The etcd pod's info is as below:
No, it doesn't reproduce the issue. Please refer to |
Does that mean what I raised is another issue?
It isn't an issue. It just complains about the performance of the disk and/or the network communication between peers.
Hi @ahrtr , we can also see the logs:
During this period, all services that depend on the etcd service were restarted. I picked the log from one of these services:
I can understand that it was related to disk performance, but since etcd and all the services are deployed on the same server, there should not be a network issue. This should be a problem caused by high I/O, which led to etcd missing the lease and consequently caused all the services to be restarted.
@ivanvc Please add a change for both 3.4 and 3.5; once it gets merged, we can close this ticket. Thx. Regarding #15247 (comment), I don't see an easy & cheap solution for now. I expect the Livez feature can work around it.
All done. Thanks @ivanvc for backporting the fix to 3.5 and 3.4!
ref #5520, close #8018 Upgrade etcd to v3.4.31 to fix the etcd issue: etcd-io/etcd#15247 Signed-off-by: dongmen <[email protected]>
ref #5520, close #8018 Upgrade etcd to v3.4.31 to fix the etcd issue: etcd-io/etcd#15247 Signed-off-by: husharp <[email protected]> Co-authored-by: husharp <[email protected]> Co-authored-by: lhy1024 <[email protected]>
What happened?
We have 3 etcd nodes running on k8s; the etcd pods use Portworx storage for saving the db. We've found that rebooting the Portworx node causes etcd to get stuck in fdatasync, and if this etcd node is the leader, it causes the 2 lease keep-alive issues below.
Some clients connected to other etcd nodes and sent keep-alives to them. After the old leader stepped down to follower, the new leader handled the lease keep-alives, but the old leader's lessor still considered itself the primary and revoked all leases whose TTLs had expired from its point of view. After the old leader returned from fdatasync, these revoke requests were still sent out when handling the next Ready.
Some clients connected to the old leader etcd node. While the old leader was stuck in fdatasync, the raft layer stepped down to follower, but the lessor was not demoted, so the connected clients could still send keep-alive requests to the old leader. At the same time, the other two etcd nodes elected a new leader and didn't receive the lease keep-alive messages that should have been sent to them (they were still sent to the old leader), so these leases were revoked after their TTLs expired.
Essentially, the above 2 issues arise because the lessor primary is not synced up with the raft leader; there is a big gap between the lessor primary and the raft leader when the etcd server is stuck processing a raft Ready.
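A deliberately simplified toy model of that gap (not etcd code; everything here is illustrative): persistence and soft-state handling share one loop, so while fdatasync is stuck the server cannot observe the leadership change and demote the lessor.

```go
package main

import (
	"fmt"
	"time"
)

// ready is a toy stand-in for a raft Ready: it may carry a leadership change.
type ready struct{ lostLeadership bool }

func main() {
	readyc := make(chan ready)
	lessorIsPrimary := true

	// Toy "raft" goroutine: the node loses leadership shortly after emitting a
	// Ready whose persistence is stuck in a slow fdatasync.
	go func() {
		readyc <- ready{}                     // this Ready will be persisted very slowly
		time.Sleep(50 * time.Millisecond)     // an election happens elsewhere meanwhile
		readyc <- ready{lostLeadership: true} // only consumed after the stall ends
	}()

	// Toy server loop: a single loop handles persistence and soft-state changes,
	// so the lessor demote is delayed by however long fdatasync blocks.
	for i := 0; i < 2; i++ {
		rd := <-readyc
		if i == 0 {
			fmt.Println("persisting entries... (fdatasync stuck)")
			time.Sleep(2 * time.Second) // simulated disk stall
		}
		if rd.lostLeadership && lessorIsPrimary {
			lessorIsPrimary = false
			fmt.Println("lessor demoted (too late: leases may already have been expired elsewhere)")
		} else if lessorIsPrimary {
			fmt.Println("lessor still primary: it keeps treating itself as the lease authority during the stall")
		}
	}
}
```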
What did you expect to happen?
How can we reproduce it (as minimally and precisely as possible)?
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response
Solution and analysis (updated by @ahrtr)