jobs: improve job adoption #53589

Merged: 4 commits merged into cockroachdb:master from ajwerner/fix-jobs-badness on Sep 1, 2020

Conversation

ajwerner (Contributor):

jobs: don't hold mutex during adoption, launch in parallel

jobs: break up new stages of job lifecycle movement

In the PR which adopted the sqlliveness sessions, we shoved all of the stages
of adopting jobs into a single step and invoked that step on each adoption
interval and on each send to the adoption channel.

These stages are:

  • Cancel jobs
  • Serve pause and cancel requests
  • Delete claims due to dead sessions
  • Claim jobs
  • Process claimed jobs

This is problematic for tests which send on the adoption channel at a high
rate. One important thing to note is that all jobs which are sent on the
adoption channel are already claimed.

After this PR we move the first three steps above into the cancellation
loop we were already running. We also increase the default interval for
that loop, since running it every 1s was exceedingly frequent for no
obvious reason.

Many of the test changes are due to this cancellation-loop interval
change. The tests in this package now run 3x faster (10s vs. 30s).

Then, upon sends on the adoption channel, we just process already-claimed
jobs. When the adoption interval rolls around, we attempt to both claim
and process jobs.

Release justification: bug fixes and low-risk updates to new functionality
Release note: None

Release justification: bug fixes and low-risk updates to new functionality
Release note: None
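
To make the new split concrete, here is a minimal sketch of the control flow described above. The struct, field, and helper names are illustrative placeholders (the real Registry in pkg/jobs differs), and the intervals are stand-ins:

```go
package jobs

import (
	"context"
	"time"
)

// registryLoops sketches the post-PR division of work. All names and
// intervals here are illustrative placeholders, not the actual
// pkg/jobs API.
type registryLoops struct {
	cancelInterval time.Duration // now larger than the old 1s default
	adoptInterval  time.Duration
	adoptionCh     chan struct{}

	cancelJobs                   func(context.Context)
	servePauseAndCancelRequests  func(context.Context)
	removeClaimsFromDeadSessions func(context.Context)
	claimJobs                    func(context.Context)
	processClaimedJobs           func(context.Context)
}

func (r *registryLoops) run(ctx context.Context) {
	cancelTicker := time.NewTicker(r.cancelInterval)
	defer cancelTicker.Stop()
	adoptTicker := time.NewTicker(r.adoptInterval)
	defer adoptTicker.Stop()
	for {
		select {
		case <-cancelTicker.C:
			// The first three stages run on the (slower) cancellation loop.
			r.cancelJobs(ctx)
			r.servePauseAndCancelRequests(ctx)
			r.removeClaimsFromDeadSessions(ctx)
		case <-r.adoptionCh:
			// Jobs sent on the adoption channel are already claimed,
			// so only process claimed jobs here.
			r.processClaimedJobs(ctx)
		case <-adoptTicker.C:
			// On the adoption interval, both claim and process.
			r.claimJobs(ctx)
			r.processClaimedJobs(ctx)
		case <-ctx.Done():
			return
		}
	}
}
```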
@cockroach-teamcity (Member):

This change is Reviewable

ajwerner force-pushed the ajwerner/fix-jobs-badness branch 2 times, most recently from be28a4b to cf3b6e7 on August 28, 2020 14:28
ajwerner force-pushed the ajwerner/fix-jobs-badness branch from cf3b6e7 to 8020c95 on August 28, 2020 15:02
ajwerner (Contributor, Author):

I've roachprod-stressraced this for 16 minutes so far.

ajwerner requested a review from spaskob August 28, 2020 15:03
ajwerner marked this pull request as ready for review August 28, 2020 15:03
ajwerner requested a review from a team August 28, 2020 15:03
ajwerner requested a review from a team as a code owner August 28, 2020 15:03
ajwerner requested review from adityamaru and removed request for a team August 28, 2020 15:03
ajwerner (Contributor, Author):

On master the typeorm test takes:

--- PASS: typeorm (2425.84s)
--- PASS: typeorm (2510.47s)
--- PASS: typeorm (2554.91s)
PASS

With this PR it is:

--- PASS: typeorm (1774.17s)
--- PASS: typeorm (1786.25s)
--- PASS: typeorm (1821.85s)
PASS

Better but not amazing. cc @rafiss

ajwerner (Contributor, Author):

Actually setting the GC TTL and the merge queue setting gets us to:

--- PASS: typeorm (1444.19s)
--- PASS: typeorm (1463.58s)
--- PASS: typeorm (1530.58s)

The CPU profiles are still rather hilarious. I'm working on an RFC to deal with the root cause of that pain; I'll post the basic findings, which I should have very soon.
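
The thread doesn't show the exact statements used. A sketch of what lowering the GC TTL and disabling the merge queue might look like, assuming a local insecure test cluster and the lib/pq driver; gc.ttlseconds and kv.range_merge.queue_enabled are real CockroachDB knobs, but the values here are arbitrary:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Connection string assumes a local insecure test cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Lower the GC TTL so dropped data is collected quickly in tests
	// (the default is many hours).
	if _, err := db.Exec(
		`ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds = 600`,
	); err != nil {
		log.Fatal(err)
	}
	// Disable the range merge queue so background merges don't compete
	// with the test workload.
	if _, err := db.Exec(
		`SET CLUSTER SETTING kv.range_merge.queue_enabled = false`,
	); err != nil {
		log.Fatal(err)
	}
}
```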

ajwerner (Contributor, Author):

A couple more tweaks to the KV layer gets us to:

=== RUN   typeorm
=== RUN   typeorm
=== RUN   typeorm
--- PASS: typeorm (1094.87s)
--- PASS: typeorm (1094.73s)
--- PASS: typeorm (1084.94s)

ajwerner (Contributor, Author):

In sum, this + one of #53605 or #53603 + #53606 gets us to:

--- PASS: typeorm (1094.87s)
--- PASS: typeorm (1094.73s)
--- PASS: typeorm (1084.94s)

if r.adoptionDisabled(ctx) {
	r.deprecatedCancelAll(ctx)
	return
}
removeClaimsFromDeadSessions := func(ctx context.Context, s sqlliveness.Session) {
spaskob (Contributor):

I am confused about what is happening here; this is the old adoption logic. Why are we changing it?

ajwerner (Contributor, Author):

I'm not sure I follow this comment. This loop was both the old and the new logic. I've now separated them.

spaskob (Contributor):

The new logic is in claimAndProcessJobs

	return stop.ErrUnavailable
case <-ctx.Done():
	return ctx.Err()
}
if !usingSQLLiveness || i == 0 {
spaskob (Contributor):

please add a comment explaining the if condition

ajwerner (Contributor, Author):

Done.

This commit turns out to make a reasonably big difference.

Release justification: bug fixes and low-risk updates to new functionality
Release note: None
Release justification: low risk, high benefit changes to existing functionality
Release note: None
ajwerner force-pushed the ajwerner/fix-jobs-badness branch from 8020c95 to df0e881 on August 31, 2020 12:50
@spaskob (Contributor) left a comment:

PTAL
ping me directly if it is easier


@spaskob (Contributor) left a comment:

Reviewed 12 of 15 files at r2, 2 of 3 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @ajwerner, and @spaskob)


pkg/jobs/adopt.go, line 77 at r4 (raw file):

// resumeClaimedJobs invokes r.resumeJob for each job in claimedToResume. It
// does so concurrently.
func (r *Registry) resumeClaimedJobs(

an observation: the slow bit of this operation is fetching the payload of the job to be run from the jobs table; an alternative approach may be to fetch all these in one query. I am not sure if this is better, just food for thought.
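
A rough sketch of that suggestion, for illustration only; the query shape follows system.jobs (id, payload), but this is not code from the PR:

```go
import (
	"context"
	"database/sql"

	"github.com/lib/pq"
)

// fetchPayloadsBatched sketches the suggestion above: fetch the
// payloads for all claimed jobs in a single query rather than issuing
// one query per job.
func fetchPayloadsBatched(
	ctx context.Context, db *sql.DB, ids []int64,
) (map[int64][]byte, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT id, payload FROM system.jobs WHERE id = ANY($1)`, pq.Array(ids))
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	payloads := make(map[int64][]byte, len(ids))
	for rows.Next() {
		var id int64
		var payload []byte
		if err := rows.Scan(&id, &payload); err != nil {
			return nil, err
		}
		payloads[id] = payload
	}
	return payloads, rows.Err()
}
```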

@ajwerner (Contributor, Author) left a comment:

TFTR!

bors r=spaskob

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru and @spaskob)


pkg/jobs/adopt.go, line 77 at r4 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

an observation: the slow bit of this operation is fetching the payload of the job to be run from the jobs table; an alternative approach may be to fetch all these in one query. I am not sure if this is better, just food for thought.

In a world where the interfaces were different, I think you're right. I'd love to inject the set of currently running jobs under the SQL query via a builtin or virtual table and then pull the job payload directly, but that feels more disruptive at this point. I think this will work nicely for this release.
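
For illustration, the query that idea gestures at might look like the following; crdb_internal.jobs_owned_by_me is invented for this sketch and does not exist:

```go
// Hypothetical: expose the registry's set of currently running jobs to
// SQL so that claiming can skip them and pull payloads in one
// statement. crdb_internal.jobs_owned_by_me is invented for the sketch.
const claimAndFetchQuery = `
SELECT id, payload
  FROM system.jobs
 WHERE status = 'running'
   AND id NOT IN (SELECT id FROM crdb_internal.jobs_owned_by_me)`
```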


pkg/jobs/registry.go, line 594 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

The new logic is in claimAndProcessJobs

Done.

@craig (bot) commented Aug 31, 2020:

Build failed (retrying...):

ajwerner (Contributor, Author):

bors r+

@craig (bot) commented Aug 31, 2020:

Already running a review

@otan (Contributor) commented Aug 31, 2020:

is this causing TestTruncateCompletion to fail in https://teamcity.cockroachdb.com/viewLog.html?buildId=2236228&buildTypeId=Cockroach_UnitTests?

ajwerner (Contributor, Author):

Maybe.

bors r-

while I investigate

@craig (bot) commented Aug 31, 2020:

Canceled.

ajwerner added a commit to ajwerner/cockroach that referenced this pull request Aug 31, 2020
The table data is deleted asynchronously by the GC job. It's not obvious to me
why this wasn't flaky before. The work in cockroachdb#53589 to generally speed up job
adoption seems to have revealed a flake that I think existed before the
change. This test used to fail under stress within 2 minutes; I've now run it
for 6 and so far so good.

Release justification: non-production code changes

Release note: None
craig bot pushed a commit that referenced this pull request Sep 1, 2020
53711: sql: fix TestTruncateCompletion r=rohany a=ajwerner

The table data is deleted asynchronously by the GC job. It's not obvious to me
why this wasn't flaky before. The work in #53589 to generally speed up job
adoption seems to have revealed a flake that I think existed before the
change. This test used to fail under stress within 2 minutes; I've now run it
for 6 and so far so good.

Release justification: non-production code changes

Release note: None

53714: roachtest: add expected passes to ORM tests, stabilize sqlalchemy r=rafiss a=rafiss

fixes #53598 

Release note: None
Release justification: test-only change

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
ajwerner (Contributor, Author) commented Sep 1, 2020:

bors r+

@craig (bot) commented Sep 1, 2020:

Build succeeded:

@craig (bot) merged commit d331642 into cockroachdb:master on Sep 1, 2020
@rafiss (Collaborator) commented Sep 1, 2020:

touches #52556

ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 3, 2020
This was broken in cockroachdb#53589, which made the backup tests much
slower.

Release justification: bug fixes and low-risk updates to new functionality

Release note: None
craig bot pushed a commit that referenced this pull request Sep 3, 2020
53872: jobs: fix TestingNudgeAdoptionQueue r=ajwerner a=ajwerner

This was broken in #53589, which made the backup tests much
slower.

Release justification: bug fixes and low-risk updates to new functionality

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 3, 2020
These were missed in cockroachdb#53589.

These settings would really be much better served by a cluster setting.

Release justification: non-production code change.

Release note: None
craig bot pushed a commit that referenced this pull request Sep 8, 2020
53898: sql,importccl: adopt jobs.TestingSetAdoptAndCancelIntervals r=ajwerner a=ajwerner

These were missed in #53589.

These settings would really be much better served by a cluster setting.

Release justification: non-production code change.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>