Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true #16583

Merged
merged 21 commits into from
Mar 27, 2023

Conversation

Juanadelacuesta
Copy link
Member

@Juanadelacuesta Juanadelacuesta commented Mar 21, 2023

This PR addresses the bug reported on #11052

When a leader change happens, the periodic dispatcher on the new leader starts by re running all periodic jobs by force, without checking if there is an instance of the said job already.
A new check is introduced that skips the job if prohibit_overlap is set and there is already a instance running.

Copy link
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, thanks so much for taking this on. Most on the suggestions are nitpicky around using must and a couple of stylistic suggestions.

The biggest item is that we need to revert the changes made within nomad/periodic.go due to the problem described in my comment.

There also seem to be failures and panics within the CI tests that we should look into and fix before merging.

The PR will also need a changelog entry before we can merge.

nomad/leader.go Outdated Show resolved Hide resolved
nomad/leader.go Outdated
// isNewEvalNeeded checks if the job allows for overlap and if there are already
// instances of the job running in order to determine if a new evaluation needs to
// be created upon periodic dispatcher restore
func (s *Server) isNewEvalNeeded(job *structs.Job) (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to bikeshed too much, but this function is specific to periodic jobs, therefore it might be nice to have a more descriptive name. Otherwise, this seems like a generic job helper function at first glance.

@@ -465,7 +487,15 @@ func TestLeader_PeriodicDispatcher_Restore_Evals(t *testing.T) {
}

// Create an eval for the past launch.
s1.periodicDispatcher.createEval(job, past)
eval, _ := s1.periodicDispatcher.createEval(job, past)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let check this error return using must.NoError to be sure.

nomad/leader_test.go Outdated Show resolved Hide resolved
nomad/leader_test.go Outdated Show resolved Hide resolved
nomad/leader_test.go Outdated Show resolved Hide resolved
nomad/leader_test.go Outdated Show resolved Hide resolved
@@ -278,10 +278,10 @@ func (p *PeriodicDispatch) removeLocked(jobID structs.NamespacedID) error {
// subsequent eval.
func (p *PeriodicDispatch) ForceRun(namespace, jobID string) (*structs.Evaluation, error) {
p.l.Lock()
defer p.l.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to go back to the original style here, as the comment on the createEval function reads "This should not be called with the lock held."

nomad/leader_test.go Outdated Show resolved Hide resolved
nomad/leader_test.go Outdated Show resolved Hide resolved
@jrasell jrasell requested a review from tgross March 22, 2023 13:22
@@ -278,10 +278,10 @@ func (p *PeriodicDispatch) removeLocked(jobID structs.NamespacedID) error {
// subsequent eval.
func (p *PeriodicDispatch) ForceRun(namespace, jobID string) (*structs.Evaluation, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is an existing method name, but I wonder if calling this ForceRun rather than ForceEval is misleading.

nomad/leader.go Outdated Show resolved Hide resolved
@Juanadelacuesta Juanadelacuesta merged commit e773710 into main Mar 27, 2023
@Juanadelacuesta Juanadelacuesta deleted the b-gh-11052 branch March 27, 2023 15:25
Juanadelacuesta added a commit that referenced this pull request Mar 28, 2023
…hibit_overlap is true (#16583)

* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous childre.

* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children.

* style: refactor force run function

* fix: remove defer and inline unlock for speed optimization

* Update nomad/leader.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* style: refactor tests to use must

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* fix: move back from defer to calling unlock before returning.

createEval cant be called with the lock on

* style: refactor test to use must

* added new entry to changelog and update comments

---------

Co-authored-by: James Rasell <[email protected]>
Co-authored-by: James Rasell <[email protected]>
jrasell added a commit that referenced this pull request Mar 28, 2023
…hibit_overlap is true (#16583)

* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous childre.

* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children.

* style: refactor force run function

* fix: remove defer and inline unlock for speed optimization

* Update nomad/leader.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* style: refactor tests to use must

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* fix: move back from defer to calling unlock before returning.

createEval cant be called with the lock on

* style: refactor test to use must

* added new entry to changelog and update comments

---------

Co-authored-by: James Rasell <[email protected]>
Co-authored-by: James Rasell <[email protected]>
jrasell added a commit that referenced this pull request Mar 28, 2023
…hibit_overlap is true (#16583)

* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous childre.

* Multiple instances of a periodic job are run simultaneously, when prohibit_overlap is true
Fixes #11052
When restoring periodic dispatcher, all periodic jobs are forced without checking for previous children.

* style: refactor force run function

* fix: remove defer and inline unlock for speed optimization

* Update nomad/leader.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* style: refactor tests to use must

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* Update nomad/leader_test.go

Co-authored-by: James Rasell <[email protected]>

* fix: move back from defer to calling unlock before returning.

createEval cant be called with the lock on

* style: refactor test to use must

* added new entry to changelog and update comments

---------

Co-authored-by: James Rasell <[email protected]>
Co-authored-by: James Rasell <[email protected]>
jrasell added a commit that referenced this pull request Mar 28, 2023
…hibit_overlap is true (#16583) (#16683)

Co-authored-by: Juana De La Cuesta <[email protected]>
jrasell added a commit that referenced this pull request Mar 28, 2023
…hibit_overlap is true (#16583) (#16682)

Co-authored-by: Juana De La Cuesta <[email protected]>
jrasell added a commit that referenced this pull request Mar 28, 2023
…hibit_overlap is true (#16583) (#16681)

Co-authored-by: James Rasell <[email protected]>
Co-authored-by: James Rasell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.3.x backport to 1.3.x release line backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line hcc/cst Admin - internal
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants