After Re-Adding a Lost Node, etcdserver reports "invalid auth token" #9629
Just a quick update: we have several nodes that authenticate using only certificates instead of a specific user/password. Those hosts appear to be unaffected by this authentication problem, so it seems that only the user/password authentication tokens don't work correctly.
@aauren thanks for your report. The problem comes from the stateful nature of the simple token. Simple tokens aren't replicated during membership changes, so they are invalid on the rejoined node. Is it possible for you to try the jwt token: https://github.com/coreos/etcd/blob/master/Documentation/op-guide/configuration.md#--auth-token ? JWT is stateless, so the problem won't happen.
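For reference, a minimal sketch of what that configuration might look like; the key paths below are placeholders, not values from this thread:

```sh
# Hypothetical example: start etcd with stateless JWT tokens instead of simple tokens.
# The RSA key pair paths are placeholders; generate your own key pair first.
etcd --auth-token jwt,pub-key=/etc/etcd/jwt_rsa.pub,priv-key=/etc/etcd/jwt_rsa.key,sign-method=RS256
```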
Thanks for your response @mitake! While JWT is certainly something we can look into in the future, we have already deployed clusters with simple token authentication. If one of them loses its data directory, is there any workaround to allow it to authenticate again with simple token authentication like the other nodes in the cluster do? Also, do you know if this caveat of simple token authentication is documented anywhere?
I hit this problem when upgrading etcd from v3.3.4 to v3.3.11.
This problem hit me as well. JWT is not something my organization supports. We do SSL certs, so that is feasible, but it adds a whole other level of complexity at scale to a simple problem: needing some sort of authentication to the etcd cluster from hundreds or thousands of clients.

With this known issue (auth breaking, permanently, if you replace a node), plus the missing feature of plain auth working over TLS in etcd v3, this really is a landmine in any real enterprise deployment. This needs to be more clearly documented. The whole plain-auth feature needs a giant asterisk: it's really only a feature for testing or small-scale development. "DO try this at home -- ONLY."

Many system designers are adding an NGINX layer on top of etcd v3 just to solve this problem. The gRPC proxy is another good workaround, at least as a TLS termination point, getting you crypto over your WAN links, say, and being in the clear only on the LAN to clients within a data center. Still, this is rapidly becoming "not good enough" in today's tightening security climate.
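For context, a rough sketch of the gRPC-proxy-as-TLS-termination setup mentioned above; the endpoints, listen address, and cert paths are placeholders:

```sh
# Hypothetical example: terminate TLS at the proxy; clients on the LAN
# connect to the local listen address in the clear.
etcd grpc-proxy start \
  --endpoints=https://etcd1.example.com:2379 \
  --listen-addr=127.0.0.1:23790 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/client.crt \
  --key=/etc/etcd/client.key
```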
I disabled auth, then re-enabled auth to solve this problem.
@aauren @zyf0330 @paulcaskey sorry for my late reply, I missed your comments... I'll enhance the docs to describe the problem. @wangsong93's workaround would probably be good for testing purposes, so I'll add it too.
I tried
I had a similar problem; try restarting the re-added node once right after it's up, and it'll work fine then.
My guess is that you experienced a different problem than the one here if restarting or re-adding the node worked for you. As @mitake said originally, the simple auth token state isn't synced during membership changes, so the only way it can be re-synchronized to a node that doesn't have it is by adding the node to the cluster, disabling authentication, setting the user's password again, and then re-enabling authentication. This is the only procedure that I've found to work consistently when this problem is encountered.
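For anyone hitting this, a sketch of that recovery sequence with etcdctl; the user name and root password are placeholders:

```sh
# After the lost node has rejoined the cluster:
etcdctl --user root:ROOT_PASSWORD auth disable   # root credentials required while auth is still on
etcdctl user passwd myuser                        # prompts for the password to set again
etcdctl auth enable                               # turn auth back on
```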
Sorry, my situation turned out not to be this problem. Actually, some watchers with an expired token were causing that log message.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Yes, restarting the re-added node worked.
I was simulating loss of the etcd data directory on one of our etcd test clusters. In this cluster we have HTTPS set up and authentication turned on. Since certs and authentication are required, assume that the following environment variables are present and configured correctly for all of the etcdctl commands below (unless specifically overridden): `ETCDCTL_CERT`, `ETCDCTL_KEY`, `ETCDCTL_USER` (which is set to the root user and password), `ETCDCTL_ENDPOINTS`, and `ETCDCTL_API=3`.
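For concreteness, the setup might look something like this; the paths, endpoints, and root password are placeholders, not values from the report:

```sh
# Hypothetical values illustrating the environment described above.
export ETCDCTL_API=3
export ETCDCTL_CERT=/etc/etcd/client.crt
export ETCDCTL_KEY=/etc/etcd/client.key
export ETCDCTL_USER=root:ROOT_PASSWORD
export ETCDCTL_ENDPOINTS=https://etcd1:2379,https://etcd2:2379,https://etcd3:2379
```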
The steps I was using are as follows:
The service starts correctly and everything looks good in the logs (at first).
`etcdctl endpoint status` shows the following:

However, if I actually try to execute anything against etcd2, I get the following error both from etcdctl and in the logs:
The etcd error logs display a bunch of the following:
Running the above commands against either of the other two nodes in the cluster succeeds and displays the correct results.
It seems to me that when the new node comes up, while it is able to sync data down from the other two etcd members left in the cluster, the authentication credentials somehow aren't being synced correctly, so any request that requires authentication (like the health endpoint or user list) fails to authenticate.
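To illustrate, commands along these lines against the re-added node were the ones failing; the endpoint address is a placeholder:

```sh
# Both of these failed against the rejoined member with "invalid auth token",
# while the same commands succeeded against the other two members.
etcdctl --endpoints=https://etcd2:2379 endpoint health
etcdctl --endpoints=https://etcd2:2379 user list
```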
I've attached a scrubbed log from the offending node.
etcd.err.zip
Let me know if there is any other information that I can provide to help get this figured out/resolved.