Currently it's possible for a job to be submitted without Nomad successfully creating an evaluation for it. This leaves the job permanently in the `pending` state until an operator notices and manually creates a new evaluation.
An orphaned job can be created if a server crashes, a leader election occurs, or a backup is taken after a job has been committed to Raft but before the corresponding evaluation has been committed. While this should be exceedingly rare, it does happen.
It's especially problematic with periodic jobs, as the only indication of a failure is a log line on the current leader:
nomad.periodic: failed to dispatch job ...
By contrast, a failure when submitting a job through the API returns a similar error to the user, so they get immediate feedback and can resubmit the job.
Solution
Submit the job and its eval in a single Raft log entry. This ensures that both are either fully committed together, or the whole operation fails, leaving neither the job nor a Raft log entry behind.
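To illustrate why a single entry closes the gap, here is a minimal sketch of a toy replicated state machine. The types `Job`, `Eval`, `LogEntry`, and `State` and the helpers `replay` and `OrphanedJobs` are hypothetical simplifications, not Nomad's real structs or FSM. A "crash" is simulated by replaying only a prefix of the committed log: with two separate entries, a prefix can contain the job without its eval; with one combined entry, the state holds either both or neither.

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration only; these are not
// Nomad's real structs.
type Job struct {
	ID string
}

type Eval struct {
	ID    string
	JobID string
}

// LogEntry is one hypothetical Raft log entry. Carrying the job and its
// eval in the same entry means the FSM applies both or neither.
type LogEntry struct {
	Job  *Job
	Eval *Eval
}

// State is a toy stand-in for the server's replicated state store.
type State struct {
	Jobs  map[string]*Job
	Evals map[string]*Eval
}

func NewState() *State {
	return &State{Jobs: map[string]*Job{}, Evals: map[string]*Eval{}}
}

// Apply applies one committed log entry to the state.
func (s *State) Apply(e LogEntry) {
	if e.Job != nil {
		s.Jobs[e.Job.ID] = e.Job
	}
	if e.Eval != nil {
		s.Evals[e.Eval.ID] = e.Eval
	}
}

// OrphanedJobs returns the IDs of jobs that no evaluation references.
func (s *State) OrphanedJobs() []string {
	hasEval := map[string]bool{}
	for _, ev := range s.Evals {
		hasEval[ev.JobID] = true
	}
	var orphans []string
	for id := range s.Jobs {
		if !hasEval[id] {
			orphans = append(orphans, id)
		}
	}
	return orphans
}

// replay applies only the first `committed` entries, simulating a crash
// after that many entries reached Raft.
func replay(log []LogEntry, committed int) *State {
	s := NewState()
	for _, e := range log[:committed] {
		s.Apply(e)
	}
	return s
}

func main() {
	job := &Job{ID: "example"}
	eval := &Eval{ID: "eval-1", JobID: "example"}

	// Two separate entries: a crash after the first leaves an orphan.
	split := []LogEntry{{Job: job}, {Eval: eval}}
	fmt.Println(replay(split, 1).OrphanedJobs()) // [example]

	// One combined entry: the only possible outcomes are "both" or "neither".
	combined := []LogEntry{{Job: job, Eval: eval}}
	fmt.Println(replay(combined, 0).OrphanedJobs()) // []
	fmt.Println(replay(combined, 1).OrphanedJobs()) // []
}
```

The design choice is that atomicity comes from the log itself: Raft commits an entry as a unit, so placing both objects in one entry removes the window between two commits rather than trying to detect and repair orphans afterwards.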
In the case of a leader election during periodic job dispatching, the newly elected leader should notice the missing invocation and create it successfully.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.