
OnlineDDL: better scheduling/cancellation logic #8603

Merged

Conversation

shlomi-noach
Contributor

Description

This PR improves the logic around tracking running migrations and cancellation of stale/broken migrations.

Previously, we tracked:

  • the last gh-ost executed migration,
  • the last pt-osc executed migration, and
  • the last vrepl executed migration.

This led to some spaghetti code, because vrepl doesn't behave the same way as gh-ost and pt-osc: it doesn't start and end in the same function. In fact, it doesn't even have to start and end in the same tablet.

This PR introduces ownedRunningMigrations: a map keyed by migration UUID that indicates which migrations this executor expects to be running right now and is happy to own, irrespective of strategy. In the new logic:

  • A migration that is found to be running, but is not expected to be running, is auto-terminated
  • The executor is happy to "adopt" a vrepl migration that is found to be running (this is in preparation for better handling of PRS/ERS)
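To illustrate the idea, here is a minimal, self-contained sketch in Go of the ownedRunningMigrations mechanism described above. All names and signatures here are illustrative only, not the actual Vitess API:

```go
package main

import (
	"fmt"
	"sync"
)

// executor is a toy stand-in for the OnlineDDL executor. Its
// ownedRunningMigrations map records, by UUID, which migrations it
// expects to be running and is willing to own.
type executor struct {
	mu                     sync.Mutex
	ownedRunningMigrations map[string]bool
}

func newExecutor() *executor {
	return &executor{ownedRunningMigrations: map[string]bool{}}
}

// own marks a migration as expected/owned: either because this executor
// started it, or because it adopted a running vrepl stream (e.g. after PRS).
func (e *executor) own(uuid string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.ownedRunningMigrations[uuid] = true
}

// reviewRunning applies the scheduling rule from this PR: any migration
// found running that is not owned gets auto-terminated; owned migrations
// are left alone. It returns the UUIDs it chose to terminate.
func (e *executor) reviewRunning(runningUUIDs []string) (terminated []string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	for _, uuid := range runningUUIDs {
		if !e.ownedRunningMigrations[uuid] {
			terminated = append(terminated, uuid)
		}
	}
	return terminated
}

func main() {
	e := newExecutor()
	e.own("uuid-vrepl-1") // started by, or adopted by, this executor
	// "uuid-stale-2" is running but unexpected: it gets terminated.
	fmt.Println(e.reviewRunning([]string{"uuid-vrepl-1", "uuid-stale-2"}))
	// prints: [uuid-stale-2]
}
```

Note how the map is strategy-agnostic: gh-ost, pt-osc and vrepl migrations are all tracked the same way, which is what removes the per-strategy spaghetti.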

This PR also produces better error messages around cancelled migrations. What used to be a generic "auto cancel" is now a reasoned, justified report.

We also clean up some boilerplate code: previously, CancelMigration accepted a bool flag indicating whether a running migration should be force-terminated. Since we always pass true for this flag, we remove the flag and always terminate the migration if it is found to be running.
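The cleanup amounts to a small API simplification, which can be sketched as follows (names and the running-migration registry are illustrative, not the actual Vitess code):

```go
package main

import "fmt"

// running is a toy registry of migrations currently executing.
var running = map[string]bool{"uuid-1": true}

// cancelMigration cancels a migration. The old version took an extra
// bool flag ("terminate if running") that callers always set to true;
// with the flag removed, a running migration is always terminated.
func cancelMigration(uuid string) (foundRunning bool) {
	if running[uuid] {
		delete(running, uuid) // unconditionally terminate
		return true
	}
	return false
}

func main() {
	fmt.Println(cancelMigration("uuid-1")) // true: was running, terminated
	fmt.Println(cancelMigration("uuid-1")) // false: already gone
}
```

Dropping an always-true flag like this removes a branch that no caller exercised, which is exactly the kind of dead configurability the PR description calls boilerplate.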

The existing test suite covers the logic reasonably well. Last week we had a production issue where a couple of migrations were auto-cancelled without good reason. At the very least, this PR will give us more information, but I believe it will also solve the issue. I'm not sure how to reproduce that production issue, so this is to confess that some cases must elude the existing tests.

Related Issue(s)

#6926

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@shlomi-noach
Contributor Author

As it turns out, this improvement to the scheduler also unlocked the would-be next step: automatic recoveries for NativeDDL.

This seems to work now! You may begin a NativeDDL (OnlineDDL/VReplication) migration, promote a new primary halfway through, and the migration will auto-recover, resume, and complete on the new primary.

endtoend tests needed.

Member

@deepthi deepthi left a comment


Nice work! I can approve once the endtoend test has been added.

@shlomi-noach
Contributor Author

PRS tests are WIP. Meanwhile, I'm turning this PR into a draft.

@shlomi-noach shlomi-noach marked this pull request as draft August 11, 2021 06:05
@shlomi-noach
Contributor Author

Added an endtoend test to confirm that a vrepl migration survives PRS. The test works as follows:

  • force throttling
  • kick vrepl migration
  • verify it is throttled
  • verify vreplication is in Copying state (it's not doing anything because it's throttled)
  • PRS on one shard
  • unthrottle, and expect the migrations to run to completion
  • verify that on the PRS'd shard, the migration did indeed complete on the newly promoted tablet
  • verify that on the non-PRS'd shard, everything executed as normal
  • do the same again, reinstating the original primary for consistency
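The scenario above can be modeled as a toy state machine in Go; this is only a simulation of the flow (throttle → PRS → unthrottle → completion), not the actual endtoend test code, and all names are illustrative:

```go
package main

import "fmt"

// migration models a throttled vrepl migration that survives a primary
// reparent (PRS) and completes on the new primary once unthrottled.
type migration struct {
	state     string // "Copying" while throttled, "Complete" when done
	tablet    string // the tablet currently running the migration
	throttled bool
}

// prs simulates a PlannedReparentShard: the adopted migration resumes
// on the newly promoted primary tablet.
func (m *migration) prs(newPrimary string) {
	m.tablet = newPrimary
}

// unthrottle lifts throttling; the migration then runs to completion.
func (m *migration) unthrottle() {
	m.throttled = false
	if m.state == "Copying" {
		m.state = "Complete"
	}
}

func main() {
	m := &migration{state: "Copying", tablet: "tablet-100", throttled: true}
	m.prs("tablet-101") // reparent mid-migration
	m.unthrottle()      // expect the migration to run to completion
	fmt.Println(m.state, m.tablet) // prints: Complete tablet-101
}
```

The key assertion in the real test is the same as in this sketch: after PRS and unthrottling, the migration is Complete on the newly promoted tablet, not abandoned on the old primary.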

This... works!

Ready for review

@shlomi-noach shlomi-noach marked this pull request as ready for review August 12, 2021 06:26
Member

@deepthi deepthi left a comment


All I found is a typo in a comment. Otherwise LGTM

```diff
@@ -1329,7 +1336,12 @@ func (e *Executor) readPendingMigrationsUUIDs(ctx context.Context) (uuids []stri
 }
 
 // terminateMigration attempts to interrupt and hard-stop a running migration
-func (e *Executor) terminateMigration(ctx context.Context, onlineDDL *schema.OnlineDDL, lastMigrationUUID string) (foundRunning bool, err error) {
+func (e *Executor) terminateMigration(ctx context.Context, onlineDDL *schema.OnlineDDL) (foundRunning bool, err error) {
 	// It's possible the killing the migratoin fails for whatever reason, in which case
```
Member


Suggested change:

```diff
-// It's possible the killing the migratoin fails for whatever reason, in which case
+// It's possible the killing the migration fails for whatever reason, in which case
```

Contributor Author


fixed

@shlomi-noach shlomi-noach merged commit 740cff3 into vitessio:main Aug 17, 2021
@shlomi-noach shlomi-noach deleted the onlineddl-tracking-running-migrations branch August 17, 2021 07:39