failed core jobs should not have follow-ups #8682

Merged Aug 18, 2020 (2 commits)

@tgross commented Aug 17, 2020

Fixes #8658

If an evaluation fails more times than the delivery limit allows, the leader
creates a new eval with the `TriggeredBy` field set to `failed-follow-up`.

Evaluations for core jobs carry the leader's ACL, which is not valid on another
leader after an election. The `failed-follow-up` evals do not have ACLs, so
core job evals that exceed the delivery limit, or that span a leader election,
will never succeed and will be re-enqueued forever. Core job evals should
therefore not be retried with a `failed-follow-up`.
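
The rule described above can be sketched roughly as follows. This is an illustration only, not the code from this patch: the `Evaluation` struct, the `followUpFor` helper, the `"-follow-up"` ID suffix, and the `jobTypeCore` / `jobTypeService` constants are hypothetical stand-ins. Only the `failed-follow-up` trigger value and the `csi-plugin-gc` job ID come from this PR.

```go
package main

import "fmt"

// Hypothetical constants for illustration; they are not taken from this patch.
const (
	jobTypeCore           = "_core"            // assumed marker for core (GC) job evals
	jobTypeService        = "service"          // assumed marker for an ordinary job eval
	triggerFailedFollowUp = "failed-follow-up" // TriggeredBy value named in this PR
)

// Evaluation is a simplified stand-in for a scheduler evaluation.
type Evaluation struct {
	ID          string
	JobID       string
	Type        string
	TriggeredBy string
}

// followUpFor returns the retry eval to enqueue for an eval that has hit the
// delivery limit, or nil when no follow-up should be created.
func followUpFor(failed *Evaluation) *Evaluation {
	// A follow-up for a core job would not carry the leader's ACL, so it
	// could never succeed. Skip it instead of re-enqueueing forever.
	if failed.Type == jobTypeCore {
		return nil
	}
	return &Evaluation{
		ID:          failed.ID + "-follow-up", // illustrative ID scheme only
		JobID:       failed.JobID,
		Type:        failed.Type,
		TriggeredBy: triggerFailedFollowUp,
	}
}

func main() {
	core := &Evaluation{ID: "eval-1", JobID: "csi-plugin-gc", Type: jobTypeCore}
	svc := &Evaluation{ID: "eval-2", JobID: "web", Type: jobTypeService}

	fmt.Println("core job gets follow-up:", followUpFor(core) != nil)
	fmt.Println("service job follow-up trigger:", followUpFor(svc).TriggeredBy)
}
```

In this sketch the core-job check happens before any follow-up is built, so a failed core eval is simply dropped once it is marked as failed.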


Failing test before the patch in 025a613:

nomad-613 2020-08-17T20:27:52.184Z [WARN]  nomad/leader.go:641: nomad: eval reached delivery limit, marking as failed: eval="<Eval "1bf00c66-09a5-a3fb-488a-f30ecede9798" JobID: "csi-plugin-gc" Namespace: "-">"
    core_sched_test.go:2452: 
        	Error Trace:	core_sched_test.go:2452
        	Error:      	Expected nil, but got: <Eval "574cf08f-c0ee-b12b-2ef3-d257380bae8e" JobID: "csi-plugin-gc" Namespace: "-">
        	Test:       	TestCoreScheduler_FailLoop
        	Messages:   	failed core jobs should not result in follow-up. TriggeredBy: failed-follow-up

@tgross marked this pull request as ready for review August 17, 2020 20:48
@tgross requested review from schmichael and notnoop August 17, 2020 20:48
@tgross added this to the 0.12.4 milestone Aug 18, 2020
@tgross merged commit 108a422 into master Aug 18, 2020
@tgross deleted the b-core-job-followup branch August 18, 2020 20:48
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 21, 2022
Linked issue: bug: batch job reap failed: error="rpc error: Permission denied"