vault HA problem in vault 0.7.3 #3031
Comments
How to reproduce this issue?
Three nodes, like this:
[root@SGDLITVM0819 ~]# vault status -tls-skip-verify
High-Availability Enabled: true
I tried, but it works; I cannot reproduce what you described. Can you provide a step-by-step script? (How did you initialize vault? How do you shut down the active node? What logging output should I expect to see?)
I run into this problem often.
2017/07/19 07:31:14.425524 [ERROR] core: failed to acquire lock: error=etcdserver: requested lease not found
2017-07-19 07:21:09.939587 I | raft: 2046c458caf5773f [logterm: 7, index: 15633, vote: 0] ignored MsgVote from d33e6e92eef6099a [logterm: 7, index: 15633] at term 7: lease is not expired (remaining ticks: 10)
Your etcd cluster is not in good shape. fsync took too long.
I believe it's because vault lost its etcd lease. Looking at the etcd client code (that vault uses), I don't see anything for handling that situation, e.g. code for acquiring a new lease. Due to 64d412e, I believe if you want vault + etcd to be healthy, you must ensure vault and etcd can always talk to each other within 15 seconds (used to be 60 seconds). If that's not possible, due to fsync, or network problems, or for any other reason, you will need to restart vault to acquire a new etcd lease.
I looked into this a bit more, because we've seen the same problem, where a standby vault will eventually end up printing out "core: failed to acquire lock: error=etcdserver: requested lease not found" over and over after the leader goes down. I don't know what caused it to get into that state (network problems? etcd problems? otherwise?) and to some degree, I don't care. I just want things to recover eventually :-) I don't know much about etcd, but it seems to me that EtcdLock's Lock() function needs to check if the session is still valid (i.e. is the channel returned by Done() still open?). If the session isn't valid anymore--because the lease expired, I assume--then a new one needs to be made, instead of continuing to use the old session. If that sounds reasonable, I can provide a patch.
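To make the proposal above concrete, here is a minimal sketch of the "check Done() before locking" idea. It does not use the real etcd client: fakeSession, haLock, and the renew flag are hypothetical stand-ins invented for illustration, and the only assumption carried over from the discussion is that a session exposes a Done() channel that is closed once its lease is gone.

```go
package main

import (
	"errors"
	"sync"
)

// errLeaseNotFound mimics the error text seen in the vault log.
var errLeaseNotFound = errors.New("requested lease not found")

// fakeSession is a hypothetical stand-in for a session bound to a lease.
// Its done channel is closed when the lease expires or is revoked.
type fakeSession struct {
	id   int
	done chan struct{}
	once sync.Once
}

// Done returns a channel that is closed once the session is no longer valid.
func (s *fakeSession) Done() <-chan struct{} { return s.done }

// expire simulates the lease being lost (safe to call more than once).
func (s *fakeSession) expire() { s.once.Do(func() { close(s.done) }) }

// haLock is a hypothetical lock that owns a session.
// renew selects between the reported behavior (false) and the proposed one (true).
type haLock struct {
	mu      sync.Mutex
	nextID  int
	session *fakeSession
	renew   bool
}

func newHALock(renew bool) *haLock {
	l := &haLock{renew: renew}
	l.session = l.newSession()
	return l
}

func (l *haLock) newSession() *fakeSession {
	l.nextID++
	return &fakeSession{id: l.nextID, done: make(chan struct{})}
}

// Lock returns the ID of the session it locked with.
// Before using the session it checks, without blocking, whether Done() is closed.
func (l *haLock) Lock() (int, error) {
	l.mu.Lock()
	defer l.mu.Unlock()

	select {
	case <-l.session.Done():
		if !l.renew {
			// Reported behavior: the dead session is reused, so every call fails.
			return 0, errLeaseNotFound
		}
		// Proposed behavior: replace the dead session with a fresh one.
		l.session = l.newSession()
	default:
		// Session is still alive; keep using it.
	}
	return l.session.id, nil
}
```

With renew set to false, every Lock() call after expire() fails, which matches the repeating log line. With renew set to true, the next Lock() call succeeds on a new session. The actual patch may behave differently in detail, for example by still returning one error before the new session is used.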
My colleague @willmo wonders if there may be a better way to address this. The question is around the Lock interface, specifically what to do if an error is returned from Lock(). The code in core.go, specifically acquireLock and runStandby (which calls acquireLock), assumes that all errors returned from Lock() will be transient; in other words, an error won't mean that the lock is permanently unusable. However, for etcd3, this is currently not the case. The lock returned by LockWith() can become invalid if its associated session/lease expires. Without knowing more about Consul and Zookeeper, it's not immediately clear if a similar problem might exist with those backends as well. That leads us to ask: what should change here? Should core.go change so it calls LockWith() again if Lock() returns an error? Should the etcd3 backend change (which was my proposal above) and the Lock documentation improve to explicitly state the contract? Even if the current contract remains, for the sake of robustness, should we do both?
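As a rough sketch of the other option raised above (the caller asks for a new lock via LockWith() whenever Lock() fails), something like the following could work. The Lock and HABackend interfaces here are simplified, hypothetical versions of the real ones, and acquireWithRecreate, staleBackend, and stubLock are names made up for this example; this is not the code in core.go.

```go
package main

import "errors"

// errLeaseNotFound mimics the error text seen in the vault log.
var errLeaseNotFound = errors.New("requested lease not found")

// Lock is a simplified, hypothetical version of the HA lock interface.
type Lock interface {
	Lock() error
}

// HABackend is a simplified, hypothetical version of the HA backend interface.
type HABackend interface {
	LockWith(key, value string) (Lock, error)
}

// acquireWithRecreate makes up to maxAttempts Lock() calls. After each failure
// it assumes the lock may be permanently unusable and requests a fresh one
// from the backend. It returns the number of the attempt that succeeded.
func acquireWithRecreate(b HABackend, key, value string, maxAttempts int) (int, error) {
	lock, err := b.LockWith(key, value)
	if err != nil {
		return 0, err
	}
	lastErr := errors.New("no lock attempts were made")
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = lock.Lock(); lastErr == nil {
			return attempt, nil
		}
		// Do not trust the old lock after an error: build a new one.
		if lock, err = b.LockWith(key, value); err != nil {
			return 0, err
		}
	}
	return 0, lastErr
}

// stubLock fails forever when broken, like a lock whose lease is gone.
type stubLock struct{ broken bool }

func (l *stubLock) Lock() error {
	if l.broken {
		return errLeaseNotFound
	}
	return nil
}

// staleBackend hands out one permanently broken lock, then working ones.
type staleBackend struct{ handedOut int }

func (b *staleBackend) LockWith(key, value string) (Lock, error) {
	b.handedOut++
	return &stubLock{broken: b.handedOut == 1}, nil
}
```

Against staleBackend, the first attempt fails on the broken lock and the second succeeds on a freshly created one. The trade-off is that every backend pays for a new lock object after any error, even a transient one, which is part of why the contract question matters.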
I am going to take a look.
@xiang90, thanks for taking a look. Just wanted to note how to reproduce this:
After this, the standby that had its lease removed will start logging "requested lease not found" every 10 seconds and never recover. In the real world, my guess for what took the place of step (2) in our case was temporary network problems.
@xiang90, please take a look at my PR. Thanks.
This change makes these errors transient instead of permanent:
[ERROR] core: failed to acquire lock: error=etcdserver: requested lease not found
After this change, there can still be one of these errors when a standby vault that lost its lease tries to become leader, but on the next lock acquisition attempt a new session will be created. With this new session, the standby will be able to become the leader.
Sure!
vault 0.7.3
etcd 3.1.9
API v3
vault cannot change leader to other standby nodes
Error log
2017/07/18 03:28:39.935689 [ERROR] core: failed to acquire lock: error=etcdserver: requested lease not found