Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-22.2: jobs: clear job claim after execution #93475

Conversation

stevendanna
Copy link
Collaborator

@stevendanna stevendanna commented Dec 12, 2022

Since #89014 the job system reset a job's claim when transitioning it
from pause-requested to paused and from cancel-requested to
reverting. The job system signals these transitions to the running
Resumer by cancelling the job's context and does not wait for the
resumer to exit. Once the claim is clear, another node can adopt the
job and start running it's OnFailOrCancel callback. As a result,
clearing the context makes it more likely that OnFailOrCancel
executions will overlap with Resume executions.

In general, Jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But, making it more likely isn't in our
interest.

Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.

Release note: None

Release justification: Lowers the risk of anomalous job behaviour resulting
from a Job's cancellation code running while the job is still executing.

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

Since cockroachdb#89014 the job system reset a job's claim when transitioning it
from pause-requested to paused and from cancel-requested to
reverting. The job system signals these transitions to the running
Resumer by cancelling the job's context and does not wait for the
resumer to exit. Once the claim is clear, another node can adopt the
job and start running it's OnFailOrCancel callback. As a result,
clearing the context makes it more likely that OnFailOrCancel
executions will overlap with Resume executions.

In general, Jobs need to assume that Resume may still be running while
OnFailOrCancel is called. But, making it more likely isn't in our
interest.

Here, we only clear the lease when we exit the job state machine.
This makes it much more likely that OnFailOrCancel doesn't start until
Resume has returned.

Release note: None
Release note: None
The job system clears the lease asyncronously after cancelation. This
lease clearing transaction can cause a restart in the alter changefeed
transaction, which will lead to different feature counter
counts. Thus, we want to wait for the lease clear. However, the lease
clear isn't guaranteed to happen, so we only wait a few seconds for
it.

Release note: None
The explicit transactions in this test can hit transaction retry
errors despite the test conditions all passing.

Here, we wrap the transactions we intend to commit in a retry loop
using client-side retries.

It seems likely that cockroachdb#91563 made transaction retries more likely.

Fixes cockroachdb#92001

Release note: None
Previously we only cleared the claim after the state machine returned
and only if the status wasn't pause-requested or
cancel-requested. This filter on status, however, was unnecessary.

The job may still be in the cancel-requested or pause-requested state
when we go to clear the claim because the transaction that resulted in
the canceled context may not have completed. But, it is still fine to
clear the claim. There are 1 of two cases:

1) Either the transaction that cancelled us fails and we are thus
   still in the state cancel-requested or paused-requested with no
   claim. This is fine. The adoption loop will adopt the job and move
   the state to paused or reverting, just with no context to cancel.

2) The transaction succeeds and we are in paused or reverting without
   a claim set. Just as we wanted.

Here we remove the where clause to always clear the claim when we
return from the state machine.

In the case of (1), when processing the cancel-requested or
paused-requested state the second time, we may still want the claim
cleared. Here, we make sure it gets cleared even in the case where
there is no running job that actually needs to be canceled.

Fixes cockroachdb#92112

Release note: None
@stevendanna stevendanna requested review from a team as code owners December 12, 2022 23:44
@stevendanna stevendanna requested review from HonoreDB and removed request for a team December 12, 2022 23:44
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@stevendanna
Copy link
Collaborator Author

Not sure what happened on the Github CI generated code check, but it passed on a rerun: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_CheckGeneratedCode/7943816?showRootCauses=false

@ajwerner ajwerner changed the title jobs: clear job claim after execution release-22.2: jobs: clear job claim after execution Dec 13, 2022
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 3 files at r4, 3 of 4 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @benbardin, @dt, and @HonoreDB)

// maybeClearLease clears the claim on the given job, provided that
// the current lease matches our liveness Session.
func (r *Registry) maybeClearLease(job *Job, jobErr error) {
if jobErr == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noob q: why don't we clear the lease if jobErr == nil? Is a job only ever in a pause/cancel requested state if the job registry interrupts job by a context cancellation, bubbling up in a non nil jobErr? I'm just not sure when the state machine returns non-nil errors.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the job completes it will exit with a nil error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ty. On master, I may eventually add some documentation to reflect this.

@stevendanna stevendanna merged commit 1f8f9fe into cockroachdb:release-22.2 Dec 13, 2022
@stevendanna stevendanna deleted the backport22.2-91563-91884-92005-92121 branch December 13, 2022 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants