jobs: improve job adoption #53589

Merged: 4 commits merged into cockroachdb:master from ajwerner/fix-jobs-badness on Sep 1, 2020

Conversation

ajwerner (Contributor):

jobs: don't hold mutex during adoption, launch in parallel

jobs: break up new stages of job lifecycle movement

In the PR which adopted the sqlliveness sessions, we shoved all of the stages
of adopting jobs into a single step and invoked that step on each adoption
interval and on each send to the adoption channel.

These stages are:

  • Cancel jobs
  • Serve pause and cancel requests
  • Delete claims due to dead sessions
  • Claim jobs
  • Process claimed jobs

This is problematic for tests which send on the adoption channel at a high
rate. One important thing to note is that all jobs which are sent on the
adoption channel are already claimed.

After this PR we move the first three steps above into the cancellation
loop we were already running. We also increase the default interval for
that loop, since running it every 1s was exceedingly frequent for no
obvious reason.

Many of the test changes are due to this cancellation-loop interval
change. The tests in this package now run 3x faster (10s vs. 30s).

Then, upon sends on the adoption channel, we just process already-claimed
jobs. When the adoption interval rolls around, we attempt to both claim
and process jobs.

Release justification: bug fixes and low-risk updates to new functionality
Release note: None

Release justification: bug fixes and low-risk updates to new functionality
Release note: None
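
To make the new split concrete, here is a minimal sketch of the control flow described above. The struct, field, and helper names are illustrative placeholders (the real Registry in pkg/jobs differs), and the intervals are stand-ins:

```go
package jobs

import (
	"context"
	"time"
)

// registryLoops sketches the post-PR division of work. All names and
// intervals here are illustrative placeholders, not the actual
// pkg/jobs API.
type registryLoops struct {
	cancelInterval time.Duration // now larger than the old 1s default
	adoptInterval  time.Duration
	adoptionCh     chan struct{}

	cancelJobs                   func(context.Context)
	servePauseAndCancelRequests  func(context.Context)
	removeClaimsFromDeadSessions func(context.Context)
	claimJobs                    func(context.Context)
	processClaimedJobs           func(context.Context)
}

func (r *registryLoops) run(ctx context.Context) {
	cancelTicker := time.NewTicker(r.cancelInterval)
	defer cancelTicker.Stop()
	adoptTicker := time.NewTicker(r.adoptInterval)
	defer adoptTicker.Stop()
	for {
		select {
		case <-cancelTicker.C:
			// The first three stages run on the (slower) cancellation loop.
			r.cancelJobs(ctx)
			r.servePauseAndCancelRequests(ctx)
			r.removeClaimsFromDeadSessions(ctx)
		case <-r.adoptionCh:
			// Jobs sent on the adoption channel are already claimed,
			// so only process claimed jobs here.
			r.processClaimedJobs(ctx)
		case <-adoptTicker.C:
			// On the adoption interval, both claim and process.
			r.claimJobs(ctx)
			r.processClaimedJobs(ctx)
		case <-ctx.Done():
			return
		}
	}
}
```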
@cockroach-teamcity (Member):

This change is Reviewable

ajwerner force-pushed the ajwerner/fix-jobs-badness branch 2 times, most recently from be28a4b to cf3b6e7 on August 28, 2020 14:28
ajwerner force-pushed the ajwerner/fix-jobs-badness branch from cf3b6e7 to 8020c95 on August 28, 2020 15:02
ajwerner (Contributor, Author):

I've roachprod-stressraced this for 16 minutes so far.

ajwerner requested a review from spaskob August 28, 2020 15:03
ajwerner marked this pull request as ready for review August 28, 2020 15:03
ajwerner requested a review from a team August 28, 2020 15:03
ajwerner requested a review from a team as a code owner August 28, 2020 15:03
ajwerner requested review from adityamaru and removed request for a team August 28, 2020 15:03
ajwerner (Contributor, Author):

On master the typeorm test takes:

--- PASS: typeorm (2425.84s)
--- PASS: typeorm (2510.47s)
--- PASS: typeorm (2554.91s)
PASS

With this PR it is:

--- PASS: typeorm (1774.17s)
--- PASS: typeorm (1786.25s)
--- PASS: typeorm (1821.85s)
PASS

Better but not amazing. cc @rafiss

ajwerner (Contributor, Author):

Actually setting the GC TTL and the merge queue setting gets us to:

--- PASS: typeorm (1444.19s)
--- PASS: typeorm (1463.58s)
--- PASS: typeorm (1530.58s)

The CPU profiles are still rather hilarious. I'm working on an RFC to deal with the root cause of that pain; I'll post the basic findings, which I should have very soon.
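
The thread doesn't show the exact statements used. A sketch of what lowering the GC TTL and disabling the merge queue might look like, assuming a local insecure test cluster and the lib/pq driver; gc.ttlseconds and kv.range_merge.queue_enabled are real CockroachDB knobs, but the values here are arbitrary:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Connection string assumes a local insecure test cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Lower the GC TTL so dropped data is collected quickly in tests
	// (the default is many hours).
	if _, err := db.Exec(
		`ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds = 600`,
	); err != nil {
		log.Fatal(err)
	}
	// Disable the range merge queue so background merges don't compete
	// with the test workload.
	if _, err := db.Exec(
		`SET CLUSTER SETTING kv.range_merge.queue_enabled = false`,
	); err != nil {
		log.Fatal(err)
	}
}
```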

ajwerner (Contributor, Author):

A couple more tweaks to the KV layer gets us to:

=== RUN   typeorm
=== RUN   typeorm
=== RUN   typeorm
--- PASS: typeorm (1094.87s)
--- PASS: typeorm (1094.73s)
--- PASS: typeorm (1084.94s)

ajwerner (Contributor, Author):

In sum, this + one of #53605 or #53603 + #53606 gets us to:

--- PASS: typeorm (1094.87s)
--- PASS: typeorm (1094.73s)
--- PASS: typeorm (1084.94s)

if r.adoptionDisabled(ctx) {
	r.deprecatedCancelAll(ctx)
	return
}
removeClaimsFromDeadSessions := func(ctx context.Context, s sqlliveness.Session) {
spaskob (Contributor):

I am confused about what is happening here; this is the old adoption logic. Why are we changing it?

ajwerner (Contributor, Author):

I'm not sure I follow this comment. This loop was both the old and the new logic. I've now separated them.

spaskob (Contributor):

The new logic is in claimAndProcessJobs

	return stop.ErrUnavailable
case <-ctx.Done():
	return ctx.Err()
}
if !usingSQLLiveness || i == 0 {
spaskob (Contributor):

please add a comment explaining the if condition

ajwerner (Contributor, Author):

Done.

This commit turns out to make a reasonably big difference.

Release justification: bug fixes and low-risk updates to new functionality
Release note: None
Release justification: low risk, high benefit changes to existing functionality
Release note: None
ajwerner force-pushed the ajwerner/fix-jobs-badness branch from 8020c95 to df0e881 on August 31, 2020 12:50
@spaskob (Contributor) left a comment:

PTAL
ping me directly if it is easier


@spaskob (Contributor) left a comment:

Reviewed 12 of 15 files at r2, 2 of 3 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @spaskob)


pkg/jobs/adopt.go, line 77 at r4 (raw file):

// resumeClaimedJobs invokes r.resumeJob for each job in claimedToResume. It
// does so concurrently.
func (r *Registry) resumeClaimedJobs(

an observation: the slow bit of this operation is fetching the payload of the job to be run from the jobs table; an alternative approach may be to fetch all these in one query. I am not sure if this is better, just food for thought.
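
A rough sketch of that suggestion, for illustration only; the query shape follows system.jobs (id, payload), but this is not code from the PR:

```go
import (
	"context"
	"database/sql"

	"github.com/lib/pq"
)

// fetchPayloadsBatched sketches the suggestion above: fetch the
// payloads for all claimed jobs in a single query rather than issuing
// one query per job.
func fetchPayloadsBatched(
	ctx context.Context, db *sql.DB, ids []int64,
) (map[int64][]byte, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT id, payload FROM system.jobs WHERE id = ANY($1)`, pq.Array(ids))
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	payloads := make(map[int64][]byte, len(ids))
	for rows.Next() {
		var id int64
		var payload []byte
		if err := rows.Scan(&id, &payload); err != nil {
			return nil, err
		}
		payloads[id] = payload
	}
	return payloads, rows.Err()
}
```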

@ajwerner (Contributor, Author) left a comment:

TFTR!

bors r=spaskob

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @spaskob)


pkg/jobs/adopt.go, line 77 at r4 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

an observation: the slow bit of this operation is fetching the payload of the job to be run from the jobs table; an alternative approach may be to fetch all these in one query. I am not sure if this is better, just food for thought.

In a world where the interfaces were different, I think you're right. I'd love to inject the set of currently running jobs under the SQL query via a builtin or virtual table and then pull the job payload directly, but that feels more disruptive at this point. I think this will work nicely for this release.
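
For illustration, the query that idea gestures at might look like the following; crdb_internal.jobs_owned_by_me is invented for this sketch and does not exist:

```go
// Hypothetical: expose the registry's set of currently running jobs to
// SQL so that claiming can skip them and pull payloads in one
// statement. crdb_internal.jobs_owned_by_me is invented for the sketch.
const claimAndFetchQuery = `
SELECT id, payload
  FROM system.jobs
 WHERE status = 'running'
   AND id NOT IN (SELECT id FROM crdb_internal.jobs_owned_by_me)`
```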


pkg/jobs/registry.go, line 594 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

The new logic is in claimAndProcessJobs

Done.

@craig (bot) commented Aug 31, 2020:

Build failed (retrying...):

ajwerner (Contributor, Author):

bors r+

@craig (bot) commented Aug 31, 2020:

Already running a review

@otan (Contributor) commented Aug 31, 2020:

is this causing TestTruncateCompletion to fail in https://teamcity.cockroachdb.com/viewLog.html?buildId=2236228&buildTypeId=Cockroach_UnitTests?

ajwerner (Contributor, Author):

Maybe.

bors r-

while I investigate

@craig (bot) commented Aug 31, 2020:

Canceled.

ajwerner added a commit to ajwerner/cockroach that referenced this pull request Aug 31, 2020
The table data is deleted asynchronously by the GC job. It's not obvious to me
why this wasn't flaky before. The work in cockroachdb#53589 to generally speed up job
adoption seems to have revealed a flake that I think existed before the
change. This test used to fail under stress within 2 minutes; I've now run it
for 6 and so far so good.

Release justification: non-production code changes

Release note: None
craig bot pushed a commit that referenced this pull request Sep 1, 2020
53711: sql: fix TestTruncateCompletion r=rohany a=ajwerner

The table data is deleted asynchronously by the GC job. It's not obvious to me
why this wasn't flaky before. The work in #53589 to generally speed up job
adoption seems to have revealed a flake that I think existed before the
change. This test used to fail under stress within 2 minutes; I've now run it
for 6 and so far so good.

Release justification: non-production code changes

Release note: None

53714: roachtest: add expected passes to ORM tests, stabilize sqlalchemy r=rafiss a=rafiss

fixes #53598 

Release note: None
Release justification: test-only change

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
ajwerner (Contributor, Author) commented Sep 1, 2020:

bors r+

@craig (bot) commented Sep 1, 2020:

Build succeeded:

@craig (bot) merged commit d331642 into cockroachdb:master on Sep 1, 2020
@rafiss (Collaborator) commented Sep 1, 2020:

touches #52556

ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 3, 2020
This was broken in cockroachdb#53589, which made the backup tests much
slower.

Release justification: bug fixes and low-risk updates to new functionality

Release note: None
craig bot pushed a commit that referenced this pull request Sep 3, 2020
53872: jobs: fix TestingNudgeAdoptionQueue r=ajwerner a=ajwerner

This was broken in #53589, which made the backup tests much
slower.

Release justification: bug fixes and low-risk updates to new functionality

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 3, 2020
These were missed in cockroachdb#53589.

These settings would really be much better served by a cluster setting.

Release justification: non-production code change.

Release note: None
craig bot pushed a commit that referenced this pull request Sep 8, 2020
53898: sql,importccl: adopt jobs.TestingSetAdoptAndCancelIntervals r=ajwerner a=ajwerner

These were missed in #53589.

These settings would really be much better served by a cluster setting.

Release justification: non-production code change.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>