Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: start index drop job for primary key changes immediately #47624

Merged
merged 1 commit into from
Apr 22, 2020

Conversation

thoszhang
Copy link
Contributor

@thoszhang thoszhang commented Apr 17, 2020

Previously, the job to drop old indexes after a primary key change was
created at the end of the schema change and left for the job registry to
adopt at some later time. This PR switches the job creation over to
CreateStartableJobWithTxn, which allows for starting the job
immediately after the transaction which creates the job is committed.

This eliminates one source of waiting time before other schema changes,
which are queued behind the mutation to drop the indexes, can run. Note,
however, that successive schema changes don't start running immediately
after the mutations ahead of them in the queue are all cleared; if
they've ever hit a retry error due to not being first in line, then they
will have to wait to be re-adopted by the job registry. This is why the
existing primary key change tests still need to have the adopt interval
set to a low value in order to finish quickly.

Addresses #45150.

Release note (performance improvement): The cleanup job which runs after
a primary key change to remove old indexes, which blocks other schema
changes from running, now starts immediately after the primary key swap
is complete. This reduces the amount of waiting time before subsequent
schema changes can run.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@rohany
Copy link
Contributor

rohany commented Apr 17, 2020

I think the alter column set data type implementation will want something similar to this too

@blathers-crl
Copy link

blathers-crl bot commented Apr 17, 2020

❌ The GitHub CI (Cockroach) build has failed on ff660830.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@thoszhang thoszhang force-pushed the pk-drop-index-startable-job branch from ff66083 to fc96d9d Compare April 18, 2020 03:17
@thoszhang thoszhang marked this pull request as ready for review April 18, 2020 03:18
@thoszhang
Copy link
Contributor Author

FWIW, on my laptop, TestMultiplePrimaryKeyChanges takes about 1m to run on master without setTestJobsAdoptInterval, and 30s to run after this change. (It takes about 1s to run with setTestJobsAdoptInterval.) The remaining delay is due to retries from schema change jobs not being first in line.

@thoszhang thoszhang requested review from rohany and ajwerner April 18, 2020 03:22
// change operations. This is wrapped in SucceedsSoon to handle cases where
// dropped indexes are expected to be GC'ed immediately after the schema
// change completes.
testutils.SucceedsSoon(t, func() error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was the PrimaryKeyChangeWithOperations flaky now (the old indexes got cleaned up too fast?)

@@ -2992,14 +3001,68 @@ CREATE TABLE t.test (k INT NOT NULL, v INT);
})
}

// TestPrimaryKeyDropIndexNotCancelable tests that the job to drop indexes after
// a primary key change is not cancelable.
func TestPrimaryKeyDropIndexNotCancelable(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the job finish now in the logic test so we have to move it here?

update := func(txn *kv.Txn, descs map[sqlbase.ID]*sqlbase.MutableTableDescriptor) error {
// Reset vars here because update function can be called multiple times in a retry.
isRollback = false
indexGCJobs = nil
pkChangeIndexDropJob = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change this to a slice like "childJobs" or something so that it is easy to add new jobs to start immediately?

@rohany
Copy link
Contributor

rohany commented Apr 18, 2020

Thanks for getting to this lucy!

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Seems sane to me. I'll let @rohany sign off.

Reviewed 2 of 3 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @lucy-zhang, and @rohany)


pkg/sql/schema_changer.go, line 786 at r1 (raw file):

Previously, rohany (Rohan Yadav) wrote…

Can we change this to a slice like "childJobs" or something so that it is easy to add new jobs to start immediately?

Seems like a reasonable idea to me

Previously, the job to drop old indexes after a primary key change was
created at the end of the schema change and left for the job registry to
adopt at some later time. This PR switches the job creation over to
`CreateStartableJobWithTxn`, which allows for starting the job
immediately after the transaction which creates the job is committed.

This eliminates one source of waiting time before other schema changes,
which are queued behind the mutation to drop the indexes, can run. Note,
however, that successive schema changes don't start running immediately
after the mutations ahead of them in the queue are all cleared; if
they've ever hit a retry error due to not being first in line, then they
will have to wait to be re-adopted by the job registry. This is why the
existing primary key change tests still need to have the adopt interval
set to a low value in order to finish quickly.

Release note (performance improvement): The cleanup job which runs after
a primary key change to remove old indexes, which blocks other schema
changes from running, now starts immediately after the primary key swap
is complete. This reduces the amount of waiting time before subsequent
schema changes can run.
@thoszhang thoszhang force-pushed the pk-drop-index-startable-job branch from fc96d9d to 9f647aa Compare April 21, 2020 17:57
Copy link
Contributor Author

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rohany)


pkg/sql/schema_changer.go, line 786 at r1 (raw file):

Previously, ajwerner wrote…

Seems like a reasonable idea to me

Done.


pkg/sql/schema_changer_test.go, line 411 at r1 (raw file):

Previously, rohany (Rohan Yadav) wrote…

Was the PrimaryKeyChangeWithOperations flaky now (the old indexes got cleaned up too fast?)

Yeah, but the problem wasn't the indexes being GC'ed too quickly per se, since the test previously had the default GC TTL set for the table. Without setting the GC TTL to a low value, I kept getting a value that was lower than 2 times the number of rows in the table but that didn't decrease over time. I think what's going on is that some writes to the table are now happening after the index starts being dropped, so the dropped non-public index doesn't necessarily have the expected KV count. Does that seem plausible to you?


pkg/sql/schema_changer_test.go, line 3006 at r1 (raw file):

Previously, rohany (Rohan Yadav) wrote…

Does the job finish now in the logic test so we have to move it here?

Yeah, that's why.

@blathers-crl
Copy link

blathers-crl bot commented Apr 21, 2020

❌ The GitHub CI (Cockroach) build has failed on 9f647aa1.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Copy link
Contributor

@rohany rohany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @lucy-zhang)


pkg/sql/schema_changer_test.go, line 411 at r1 (raw file):

Previously, lucy-zhang (Lucy Zhang) wrote…

Yeah, but the problem wasn't the indexes being GC'ed too quickly per se, since the test previously had the default GC TTL set for the table. Without setting the GC TTL to a low value, I kept getting a value that was lower than 2 times the number of rows in the table but that didn't decrease over time. I think what's going on is that some writes to the table are now happening after the index starts being dropped, so the dropped non-public index doesn't necessarily have the expected KV count. Does that seem plausible to you?

Yes, i think that makes sense. The test originally had a multiple of 2 (one multiple for the new primary key and one for the old key). Changing the multiple to 1 and having the gc trigger immediately seems correct.

@thoszhang
Copy link
Contributor Author

TFTRs

bors r+

@thoszhang
Copy link
Contributor Author

(hopefully the infra flake is gone)

@craig
Copy link
Contributor

craig bot commented Apr 21, 2020

Build failed (retrying...)

@craig
Copy link
Contributor

craig bot commented Apr 21, 2020

Build failed (retrying...)

@craig
Copy link
Contributor

craig bot commented Apr 21, 2020

Build failed (retrying...)

@craig
Copy link
Contributor

craig bot commented Apr 21, 2020

Build failed (retrying...)

@craig
Copy link
Contributor

craig bot commented Apr 21, 2020

Build failed

@thoszhang
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 22, 2020

Build succeeded

@craig craig bot merged commit 2bcb72d into cockroachdb:master Apr 22, 2020
@thoszhang thoszhang deleted the pk-drop-index-startable-job branch April 22, 2020 03:37
@rohany
Copy link
Contributor

rohany commented Apr 22, 2020

I'm wondering, is this something that is backportable to a point release of 20.1? It makes the previous problem of an apparent hang after primary key changes almost disappear.

@thoszhang
Copy link
Contributor Author

Yeah, I opened a backport PR #47818 (though I figured it wasn't worth getting reviews until we actually start merging to the release branch for 20.1.1, which I believe we're not doing right now).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants