About two days ago, one of the server nodes in our cluster panicked and exited, and was subsequently restarted. However, since then, some of our periodic jobs have not been working. There are a lot of these lines in the server logs:
[ERR] nomad.periodic: force run of periodic job "consul-snapshot" failed: can't force run non-tracked job consul-snapshot
[ERR] nomad: failed to establish leadership: force run of periodic job "consul-snapshot" failed: can't force run non-tracked job consul-snapshot
As well as these:
[ERR] nomad.periodic: failed to dispatch job "logstash-curator": timed out enqueuing operation
[ERR] nomad.client: alloc update failed: timed out enqueuing operation
(about one of these for every 100 of the above)
What are my options here? Just remove the jobs and reschedule? They have been working for several months at least up until two days ago.
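For reference, the "remove the jobs and reschedule" option would look roughly like the following, assuming the job is defined in a local job file (the file path here is an assumption). Deregistering and resubmitting the job should get it tracked by the periodic dispatcher on the current leader again:

```sh
# Deregister the untracked periodic job
nomad stop consul-snapshot

# Re-register it from its job file (path is an assumption)
nomad run consul-snapshot.nomad

# Verify the job is tracked again and check its next launch time
nomad status consul-snapshot
```

This only clears the symptom; if the untracked state was caused by a bug in leadership transition, it could recur on the next leader election.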
Server logs
These are the last few log lines from the node that crashed. I'm not sure whether they are related or not:
2017/07/11 14:42:12.495355 [ERR] nomad: failed to establish leadership: force run of periodic job "consul-snapshot" failed: can't force run non-tracked job consul-snapshot
2017/07/11 14:42:46.146890 [INFO] fingerprint.consul: consul agent is unavailable
2017/07/11 14:42:46 [WARN] raft: Failed to contact quorum of nodes, stepping down
2017/07/11 14:42:46 [INFO] raft: Node at 192.168.123.154:4647 [Follower] entering Follower state (Leader: "")
2017/07/11 14:42:46.210437 [ERR] nomad.client: Register failed: node is not the leader
2017/07/11 14:42:46.210478 [ERR] client: registration failure: node is not the leader
2017/07/11 14:42:46.210421 [INFO] nomad: cluster leadership lost
2017/07/11 14:42:46 [INFO] raft: aborting pipeline replication to peer {Voter 192.168.123.118:4647 192.168.123.118:4647}
2017/07/11 14:42:46 [INFO] raft: aborting pipeline replication to peer {Voter 192.168.123.116:4647 192.168.123.116:4647}
2017/07/11 14:42:46.213334 [ERR] worker: failed to dequeue evaluation: eval broker disabled
panic: close of closed channel
goroutine 176795521 [running]:
github.com/hashicorp/nomad/nomad.(*PeriodicDispatch).run(0xc42039d3e0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/periodic.go:325 +0x221
created by github.com/hashicorp/nomad/nomad.(*PeriodicDispatch).Start
/opt/gopath/src/github.com/hashicorp/nomad/nomad/periodic.go:171 +0x71
The consul-snapshot job is a periodic, parameterized job. I'm guessing one of the parameterized instances has been failing for longer than two days, since we do not seem to have any snapshots from that consul cluster.
Nomad version
Nomad v0.5.6
Operating system and Environment details
CentOS 7, 3 server nodes