OnlineDDL: better scheduling/cancellation logic #8603
Signed-off-by: Shlomi Noach <[email protected]>
As it turns out, this improvement to the scheduler also unlocked the would-be next step: automatic recoveries for NativeDDL. This seems to work now! You may begin a NativeDDL (OnlineDDL/VReplication) migration, promote a new primary halfway through, and the migration will auto-recover, resume, and complete on the new primary.
Nice work! I can approve once the endtoend test has been added.
…her tablet, updating to its own tablet Signed-off-by: Shlomi Noach <[email protected]>
PRS tests are WIP. Meanwhile turning this PR into draft.
Added
This... works! Ready for review.
All I found is a typo in a comment. Otherwise LGTM
go/vt/vttablet/onlineddl/executor.go
@@ -1329,7 +1336,12 @@ func (e *Executor) readPendingMigrationsUUIDs(ctx context.Context) (uuids []stri
 }

 // terminateMigration attempts to interrupt and hard-stop a running migration
-func (e *Executor) terminateMigration(ctx context.Context, onlineDDL *schema.OnlineDDL, lastMigrationUUID string) (foundRunning bool, err error) {
+func (e *Executor) terminateMigration(ctx context.Context, onlineDDL *schema.OnlineDDL) (foundRunning bool, err error) {
+	// It's possible the killing the migratoin fails for whatever reason, in which case
-	// It's possible the killing the migratoin fails for whatever reason, in which case
+	// It's possible the killing the migration fails for whatever reason, in which case
fixed
Description
This PR improves the logic around tracking running migrations and cancelling of stale/broken migrations.
Previously, we tracked:
- `gh-ost` executed migration, and
- `pt-osc` executed migration, and
- `vrepl` executed migration

This led to some spaghetti code because `vrepl` doesn't behave the same way as `gh-ost` and `pt-osc`: it doesn't start and end in the same function. It actually doesn't have to start and end in the same tablet.

This PR introduces `ownedRunningMigrations`: a map whose keys are UUIDs, which indicates "what migrations this executor expects to be running right now and is happy to own", irrespective of the strategy. In the new logic:
- … `vrepl` that is found to be running (this is in preparation for better handling of PRS/ERS)

This PR also produces better error messages around cancelled migrations. What used to be `"auto cancel"` is now a reasoned, justified report.

We also clean up some boilerplate code: previously `CancelMigration` accepted a `bool` flag indicating whether a running migration should be force-terminated. We find that we always pass `true` for this flag, so we remove the flag and just always terminate the migration if found to be running.

The existing test suite covers the logic reasonably well. Last week we had a production issue where a couple of migrations were `auto cancel`led without good reason. At the very least this PR will give us more information, but I believe it will also solve the issue. I'm not sure how to reproduce that production issue, so this is just to confess that some cases must elude the existing tests.

Related Issue(s)
#6926
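For readers following the Description, the idea of an `ownedRunningMigrations` map keyed by UUID could be sketched roughly as below. This is only an illustration under assumptions, not the code in this PR: the `executor` type, the `own`, `disown` and `unowned` helpers, and the wording produced by `cancelReason` are all hypothetical names invented here. What the real executor does with a migration it finds running but does not own is not shown.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// executor is a hypothetical stand-in for the online DDL executor.
// It is NOT the Vitess type; it only illustrates ownership tracking.
type executor struct {
	mu sync.Mutex
	// Keys are migration UUIDs that this executor expects to be running
	// right now and is happy to own, irrespective of the strategy.
	ownedRunningMigrations map[string]bool
}

func newExecutor() *executor {
	return &executor{ownedRunningMigrations: map[string]bool{}}
}

// own records that this executor owns the given running migration.
func (e *executor) own(uuid string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.ownedRunningMigrations[uuid] = true
}

// disown forgets a migration, e.g. once it completes or fails.
func (e *executor) disown(uuid string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	delete(e.ownedRunningMigrations, uuid)
}

// unowned returns, in sorted order, the UUIDs from running that this
// executor does not own.
func (e *executor) unowned(running []string) []string {
	e.mu.Lock()
	defer e.mu.Unlock()
	var result []string
	for _, uuid := range running {
		if !e.ownedRunningMigrations[uuid] {
			result = append(result, uuid)
		}
	}
	sort.Strings(result)
	return result
}

// cancelReason builds a reasoned message instead of a bare "auto cancel".
// The wording is an assumption made for this sketch.
func cancelReason(uuid string) string {
	return fmt.Sprintf("migration %s is running but not owned by this executor", uuid)
}

func main() {
	e := newExecutor()
	e.own("uuid-a")
	e.own("uuid-b")
	e.disown("uuid-b")
	for _, uuid := range e.unowned([]string{"uuid-a", "uuid-b", "uuid-c"}) {
		fmt.Println(cancelReason(uuid))
	}
}
```

Keying the map by UUID rather than keeping one field per strategy is what lets the same bookkeeping cover `gh-ost`, `pt-osc` and `vrepl` alike in this sketch.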
Checklist
Deployment Notes