After a failed upgrade from Nomad 0.5.4 to 0.5.6 on some of our hosts, Nomad was left broken on those nodes (it doesn't work). So we decided to clean up the Nomad client state dir (we simply removed it from the file system) and relaunch the Nomad agent. But it can't join the working cluster because of the following errors in the log:
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.5:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.2:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.6:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.1:4647: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.31.220:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.4:4647: rpc error: failed to get conn: dial tcp 192.168.30.4:4647: getsockopt: connection refused
Apr 12 00:14:52 monitor1 nomad[3226]: client: registration failure: 7 error(s) occurred:#012#012* RPC failed to server 192.168.30.3:4647: rpc error: failed to get conn: dial tcp 192.168.30.3:4647: getsockopt: connection refused#012* RPC failed to server 192.168.30.5:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.2:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.6:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.1:4647: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.31.220:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.4:4647: rpc error: failed to get conn: dial tcp 192.168.30.4:4647: getsockopt: connection refused
and the node is marked as down, with no chance of returning to the ready state:
root@social:/home/ruslan# nomad node-status
00000000 test vol-h-docker-02 ceph false ready
a50ce082 test server6 ceph false ready
a3e6b08b test monitor1 ceph false down
439a2f5a test graphite ceph false ready
41b521c8 test vol-h-docker-01 ceph false ready
ec475f0a test social ceph false ready
1e6111fb test server2 ceph false ready
As I understand it, due to GH-2277 nodes now have persistent IDs, but secret IDs are not persistent, because they can be cleared by removing the Nomad agent state dir (as in our case). So the Nomad servers think that a buggy node is trying to register (because they remember the persistent node ID) and reject it. Nomad has no command that forces it to forget about down nodes and so gives such a node a chance to re-register (it seems that we just have to wait until Nomad garbage-collects the down nodes, but this takes time). What can we do in this situation?
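To make the failure mode described above concrete, here is a toy model of it. It is only a sketch under stated assumptions: `ToyServer`, `register`, and `gc` are hypothetical names invented for illustration and are not Nomad's actual server code. It only mirrors the behaviour reported in this thread, where a server that remembers a node ID rejects a registration carrying a different secret ID.

```python
# Toy model only: hypothetical names, NOT Nomad's real implementation.
class ToyServer:
    def __init__(self):
        # node ID -> secret ID that the server remembers for that node
        self.secrets = {}

    def register(self, node_id, secret_id):
        known = self.secrets.get(node_id)
        if known is not None and known != secret_id:
            # Same node ID, different secret: refuse, as in the log above.
            return "node secret ID does not match. Not registering node."
        self.secrets[node_id] = secret_id
        return "ok"

    def gc(self, node_id):
        # Forget the node, so a later registration starts from scratch.
        self.secrets.pop(node_id, None)


server = ToyServer()
# The secret values are made up; the node ID prefix is taken from node-status.
first = server.register("a3e6b08b", "secret-before-wipe")
# State dir removed: the node ID stays the same, the secret is regenerated.
second = server.register("a3e6b08b", "secret-after-wipe")
server.gc("a3e6b08b")
third = server.register("a3e6b08b", "secret-after-wipe")
```

In this model the second registration is rejected and the third succeeds, which is why getting the servers to forget the node is the step that matters.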
@tantra35 Yeah that is a bit tricky. What you can do is stop the node and wait for nomad to detect it as dead (30 seconds) and then issue a GC which will clear knowledge of that node from the servers. You can do that as follows:
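One possible way to issue the GC is through the agent's HTTP API. This is a minimal sketch, not necessarily the command the reply had in mind: it assumes the agent listens on the default address `http://127.0.0.1:4646` and that the `PUT /v1/system/gc` endpoint is available in your Nomad version. The helper names `build_gc_request` and `force_gc` are hypothetical.

```python
import urllib.request


def build_gc_request(agent_addr="http://127.0.0.1:4646"):
    """Build (but do not send) a PUT request to the system GC endpoint.

    The default address is an assumption; adjust it for your cluster.
    """
    url = agent_addr.rstrip("/") + "/v1/system/gc"
    return urllib.request.Request(url, method="PUT")


def force_gc(agent_addr="http://127.0.0.1:4646", timeout=10):
    """Send the GC request and return the HTTP status code."""
    request = build_gc_request(agent_addr)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Run this on a host that can reach a Nomad agent, after the stopped node
# has been detected as down:
# force_gc()
```

After the GC has run, restarting the agent on the affected node should let it register again with its new secret ID.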
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.