-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
jobs: improve job adoption #53589
jobs: improve job adoption #53589
Conversation
Release justification: bug fixes and low-risk updates to new functionality Release note: None
be28a4b
to
cf3b6e7
Compare
In the PR which adopted the sqlliveness sessions, we shoved all of the stages of adopting jobs into the same stage and we invoked that stage on each adoption interval and on each sent to the adoption channel. These stages are: * Cancel jobs * Serve pause and cancel requests * Delete claims due to dead sessions * Claim jobs * Process claimed jobs This is problematic for tests which send on the adoption channel at a high rate. One important thing to note is that all jobs which are sent on the adoption channel are already claimed. After this PR we move the first three steps above into the cancellation loop we were already running. We also increase the default interval for that loop as it was exceedingly frequent at 1s for no obvious reason. Much of the testing changes are due to this cancelation loop duration change. The tests in this package now run 3x faster (10s vs 30s). Then, upon sends on the adoption channel, we just process claimed jobs. When the adoption interval rolls around, then we attempt to both claim and process jobs. Release justification: bug fixes and low-risk updates to new functionality Release note: None
cf3b6e7
to
8020c95
Compare
I've |
On master the
With this PR it is:
Better but not amazing. cc @rafiss |
Actually setting the GC TTL and the merge queue setting gets us to:
The CPU profiles are rather hilarious still. I'm working on an RFC to deal with the root cause of that pain. I'll post the basic finding which I should have very soon. |
A couple more tweaks to the KV layer gets us to:
|
if r.adoptionDisabled(ctx) { | ||
r.deprecatedCancelAll(ctx) | ||
return | ||
removeClaimsFromDeadSessions := func(ctx context.Context, s sqlliveness.Session) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused what is happening here - this is the old adoption logic. Why are we changing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow this comment. This loop was both the old and the new logic. I've now separated them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new logic is in claimAndProcessJobs
return stop.ErrUnavailable | ||
case <-ctx.Done(): | ||
return ctx.Err() | ||
if !usingSQLLiveness || i == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a comment explaining the if condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
This commit turns out to make a reasonably big difference. Release justification: bug fixes and low-risk updates to new functionality Release note: None
Release justification: low risk, high benefit changes to existing functionality Release note: None
8020c95
to
df0e881
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
ping me directly if it is easier
if r.adoptionDisabled(ctx) { | ||
r.deprecatedCancelAll(ctx) | ||
return | ||
removeClaimsFromDeadSessions := func(ctx context.Context, s sqlliveness.Session) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new logic is in claimAndProcessJobs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 12 of 15 files at r2, 2 of 3 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @spaskob)
pkg/jobs/adopt.go, line 77 at r4 (raw file):
// resumeClaimedJobs invokes r.resumeJob for each job in claimedToResume. It // does so concurrently. func (r *Registry) resumeClaimedJobs(
an observation: the slow bit of this operation is fetching the payload of the job to be run from the jobs table; an alternative approach may be to fetch all these in one query. I am not sure if this is better, just food for thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
bors r=spaskob
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @spaskob)
pkg/jobs/adopt.go, line 77 at r4 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
an observation: the slow bit of this operation is fetching the payload of the job to be run from the jobs table; an alternative approach may be to fetch all these in one query. I am not sure if this is better, just food for thought.
In a world where the interfaces were different, I think you're right. I'd love to inject the set of currently running jobs under the sql query via a builtin or virtual table and then pull the job payload directly but that feels more disruptive at this point. I think this will work nicely for this release.
pkg/jobs/registry.go, line 594 at r2 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
The new logic is in
claimAndProcessJobs
Done.
Build failed (retrying...): |
bors r+ |
Already running a review |
is this causing |
Maybe. bors r- while I investigate |
Canceled. |
The table data is deleted asynchronously by the GC job. It's not obvious to me why this wasn't flakey before. The work in cockroachdb#53589 to generally speed up job adoption seems to have revealed the flake that I think existed before the change. This test used to fail under stress 2 minutes. I've run it for 6 now and so far so good. Release justification: non-production code changes Release note: None
53711: sql: fix TestTruncateCompletion r=rohany a=ajwerner The table data is deleted asynchronously by the GC job. It's not obvious to me why this wasn't flakey before. The work in #53589 to generally speed up job adoption seems to have revealed the flake that I think existed before the change. This test used to fail under stress 2 minutes. I've run it for 6 now and so far so good. Release justification: non-production code changes Release note: None 53714: roachtest: add expected passes to ORM tests, stabilize sqlalchemy r=rafiss a=rafiss fixes #53598 Release note: None Release justification: test-only change Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Rafi Shamim <[email protected]>
bors r+ |
Build succeeded: |
touches #52556 |
This was broken in cockroachdb#53589. Having this broken made the backup tests much slower. Release justification: bug fixes and low-risk updates to new functionality Release note: None
53872: jobs: fix TestingNudgeAdoptionQueue r=ajwerner a=ajwerner This was broken in #53589. Having this broken made the backup tests much slower. Release justification: bug fixes and low-risk updates to new functionality Release note: None Co-authored-by: Andrew Werner <[email protected]>
These were missed in cockroachdb#53589. These settings would really be much better served by a cluster setting. Release justification: non-production code change. Release note: None
53898: sql,importccl: adopt jobs.TestingSetAdoptAndCancelIntervals r=ajwerner a=ajwerner These were missed in #53589. These settings would really be much better served by a cluster setting. Release justification: non-production code change. Release note: None Co-authored-by: Andrew Werner <[email protected]>
jobs: don't hold mutex during adoption, launch in parallel
jobs: break up new stages of job lifecycle movement
In the PR which adopted the sqlliveness sessions, we shoved all of the stages
of adopting jobs into the same stage and we invoked that stage on each adoption
interval and on each sent to the adoption channel.
These stages are:
This is problematic for tests which send on the adoption channel at a high
rate. One important thing to note is that all jobs which are sent on the
adoption channel are already claimed.
After this PR we move the first three steps above into the cancellation
loop we were already running. We also increase the default interval for
that loop as it was exceedingly frequent at 1s for no obvious reason.
Much of the testing changes are due to this cancelation loop duration
change. The tests in this package now run 3x faster (10s vs 30s).
Then, upon sends on the adoption channel, we just process claimed jobs.
When the adoption interval rolls around, then we attempt to both claim
and process jobs.
Release justification: bug fixes and low-risk updates to new functionality
Release note: None