-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-22.2: jobs: clear job claim after execution #93475
release-22.2: jobs: clear job claim after execution #93475
Conversation
Since cockroachdb#89014 the job system reset a job's claim when transitioning it from pause-requested to paused and from cancel-requested to reverting. The job system signals these transitions to the running Resumer by cancelling the job's context and does not wait for the resumer to exit. Once the claim is clear, another node can adopt the job and start running it's OnFailOrCancel callback. As a result, clearing the context makes it more likely that OnFailOrCancel executions will overlap with Resume executions. In general, Jobs need to assume that Resume may still be running while OnFailOrCancel is called. But, making it more likely isn't in our interest. Here, we only clear the lease when we exit the job state machine. This makes it much more likely that OnFailOrCancel doesn't start until Resume has returned. Release note: None
Release note: None
Release note: None
The job system clears the lease asyncronously after cancelation. This lease clearing transaction can cause a restart in the alter changefeed transaction, which will lead to different feature counter counts. Thus, we want to wait for the lease clear. However, the lease clear isn't guaranteed to happen, so we only wait a few seconds for it. Release note: None
The explicit transactions in this test can hit transaction retry errors despite the test conditions all passing. Here, we wrap the transactions we intend to commit in a retry loop using client-side retries. It seems likely that cockroachdb#91563 made transaction retries more likely. Fixes cockroachdb#92001 Release note: None
Previously we only cleared the claim after the state machine returned and only if the status wasn't pause-requested or cancel-requested. This filter on status, however, was unnecessary. The job may still be in the cancel-requested or pause-requested state when we go to clear the claim because the transaction that resulted in the canceled context may not have completed. But, it is still fine to clear the claim. There are 1 of two cases: 1) Either the transaction that cancelled us fails and we are thus still in the state cancel-requested or paused-requested with no claim. This is fine. The adoption loop will adopt the job and move the state to paused or reverting, just with no context to cancel. 2) The transaction succeeds and we are in paused or reverting without a claim set. Just as we wanted. Here we remove the where clause to always clear the claim when we return from the state machine. In the case of (1), when processing the cancel-requested or paused-requested state the second time, we may still want the claim cleared. Here, we make sure it gets cleared even in the case where there is no running job that actually needs to be canceled. Fixes cockroachdb#92112 Release note: None
Not sure what happened on the Github CI generated code check, but it passed on a rerun: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_CheckGeneratedCode/7943816?showRootCauses=false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r4, 3 of 4 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @benbardin, @dt, and @HonoreDB)
// maybeClearLease clears the claim on the given job, provided that | ||
// the current lease matches our liveness Session. | ||
func (r *Registry) maybeClearLease(job *Job, jobErr error) { | ||
if jobErr == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noob q: why don't we clear the lease if jobErr == nil
? Is a job only ever in a pause/cancel requested state if the job registry interrupts job by a context cancellation, bubbling up in a non nil jobErr? I'm just not sure when the state machine returns non-nil errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the job completes it will exit with a nil error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ty. On master, I may eventually add some documentation to reflect this.
Since #89014 the job system reset a job's claim when transitioning it
from pause-requested to paused and from cancel-requested to
reverting. The job system signals these transitions to the running
Resumer by cancelling the job's context and does not wait for the
resumer to exit. Once the claim is clear, another node can adopt the
job and start running it's OnFailOrCancel callback. As a result,
clearing the context makes it more likely that OnFailOrCancel
executions will overlap with Resume executions.
In general, Jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But, making it more likely isn't in our
interest.
Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.
Release note: None
Release justification: Lowers the risk of anomalous job behaviour resulting
from a Job's cancellation code running while the job is still executing.
Backport:
Please see individual PRs for details.
/cc @cockroachdb/release