
OnlineDDL: better scheduling/cancellation logic #8603

Merged

Conversation

shlomi-noach
Contributor

Description

This PR improves the logic around tracking running migrations and cancellation of stale/broken migrations.

Previously, we tracked:

  • the last gh-ost executed migration,
  • the last pt-osc executed migration, and
  • the last vrepl executed migration.

This led to some spaghetti code, because vrepl doesn't behave the same way as gh-ost and pt-osc: it doesn't start and end in the same function. In fact, it doesn't even have to start and end in the same tablet.

This PR introduces ownedRunningMigrations: a map keyed by migration UUID that indicates which migrations this executor expects to be running right now and is happy to own, irrespective of strategy. In the new logic:

  • A migration that is found to be running, but is not expected to be running, is auto-terminated
  • The executor is happy to "adopt" a vrepl migration that is found to be running (this is in preparation for better handling of PRS/ERS)
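To illustrate the idea, here is a minimal, self-contained sketch in Go of the ownedRunningMigrations mechanism described above. All names and signatures here are illustrative only, not the actual Vitess API:

```go
package main

import (
	"fmt"
	"sync"
)

// executor is a toy stand-in for the OnlineDDL executor. Its
// ownedRunningMigrations map records, by UUID, which migrations it
// expects to be running and is willing to own.
type executor struct {
	mu                     sync.Mutex
	ownedRunningMigrations map[string]bool
}

func newExecutor() *executor {
	return &executor{ownedRunningMigrations: map[string]bool{}}
}

// own marks a migration as expected/owned: either because this executor
// started it, or because it adopted a running vrepl stream (e.g. after PRS).
func (e *executor) own(uuid string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.ownedRunningMigrations[uuid] = true
}

// reviewRunning applies the scheduling rule from this PR: any migration
// found running that is not owned gets auto-terminated; owned migrations
// are left alone. It returns the UUIDs it chose to terminate.
func (e *executor) reviewRunning(runningUUIDs []string) (terminated []string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	for _, uuid := range runningUUIDs {
		if !e.ownedRunningMigrations[uuid] {
			terminated = append(terminated, uuid)
		}
	}
	return terminated
}

func main() {
	e := newExecutor()
	e.own("uuid-vrepl-1") // started by, or adopted by, this executor
	// "uuid-stale-2" is running but unexpected: it gets terminated.
	fmt.Println(e.reviewRunning([]string{"uuid-vrepl-1", "uuid-stale-2"}))
	// prints: [uuid-stale-2]
}
```

Note how the map is strategy-agnostic: gh-ost, pt-osc and vrepl migrations are all tracked the same way, which is what removes the per-strategy spaghetti.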

This PR also produces better error messages around cancelled migrations. What used to be a generic "auto cancel" is now a reasoned, justified report.

We also clean up some boilerplate code: previously, CancelMigration accepted a bool flag indicating whether a running migration should be force-terminated. Since we always pass true for this flag, we remove the flag and always terminate the migration if it is found to be running.
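The cleanup amounts to a small API simplification, which can be sketched as follows (names and the running-migration registry are illustrative, not the actual Vitess code):

```go
package main

import "fmt"

// running is a toy registry of migrations currently executing.
var running = map[string]bool{"uuid-1": true}

// cancelMigration cancels a migration. The old version took an extra
// bool flag ("terminate if running") that callers always set to true;
// with the flag removed, a running migration is always terminated.
func cancelMigration(uuid string) (foundRunning bool) {
	if running[uuid] {
		delete(running, uuid) // unconditionally terminate
		return true
	}
	return false
}

func main() {
	fmt.Println(cancelMigration("uuid-1")) // true: was running, terminated
	fmt.Println(cancelMigration("uuid-1")) // false: already gone
}
```

Dropping an always-true flag like this removes a branch that no caller exercised, which is exactly the kind of dead configurability the PR description calls boilerplate.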

The existing test suite covers the logic reasonably well. Last week we had a production issue where a couple of migrations were auto-cancelled without good reason. At the very least, this PR will give us more information, but I believe it will also solve the issue. I'm not sure how to reproduce that production issue, so this is to confess that some cases must elude the existing tests.

Related Issue(s)

#6926

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@shlomi-noach
Contributor Author

As it turns out, this improvement to the scheduler also unlocked the would-be next step: automatic recoveries for NativeDDL.

This seems to work now! You may begin a NativeDDL (OnlineDDL/VReplication) migration, promote a new primary halfway through, and the migration will auto-recover, resume, and complete on the new primary.

endtoend tests needed.

Member

@deepthi deepthi left a comment


Nice work! I can approve once the endtoend test has been added.

@shlomi-noach
Contributor Author

PRS tests are WIP. Meanwhile, I'm turning this PR into a draft.

@shlomi-noach shlomi-noach marked this pull request as draft August 11, 2021 06:05
@shlomi-noach
Contributor Author

Added an endtoend test to confirm that a vrepl migration survives PRS. The test works as follows:

  • force throttling
  • kick vrepl migration
  • verify it is throttled
  • verify vreplication is in Copying state (it's not doing anything because it's throttled)
  • PRS on one shard
  • unthrottle, and expect the migrations to run to completion
  • verify that on the PRS'd shard, the migration did indeed complete on the newly promoted tablet
  • verify that on the non-PRS'd shard, everything executed as normal
  • do the same again, reinstating the original primary for consistency
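The scenario above can be modeled as a toy state machine in Go; this is only a simulation of the flow (throttle → PRS → unthrottle → completion), not the actual endtoend test code, and all names are illustrative:

```go
package main

import "fmt"

// migration models a throttled vrepl migration that survives a primary
// reparent (PRS) and completes on the new primary once unthrottled.
type migration struct {
	state     string // "Copying" while throttled, "Complete" when done
	tablet    string // the tablet currently running the migration
	throttled bool
}

// prs simulates a PlannedReparentShard: the adopted migration resumes
// on the newly promoted primary tablet.
func (m *migration) prs(newPrimary string) {
	m.tablet = newPrimary
}

// unthrottle lifts throttling; the migration then runs to completion.
func (m *migration) unthrottle() {
	m.throttled = false
	if m.state == "Copying" {
		m.state = "Complete"
	}
}

func main() {
	m := &migration{state: "Copying", tablet: "tablet-100", throttled: true}
	m.prs("tablet-101") // reparent mid-migration
	m.unthrottle()      // expect the migration to run to completion
	fmt.Println(m.state, m.tablet) // prints: Complete tablet-101
}
```

The key assertion in the real test is the same as in this sketch: after PRS and unthrottling, the migration is Complete on the newly promoted tablet, not abandoned on the old primary.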

This... works!

Ready for review

@shlomi-noach shlomi-noach marked this pull request as ready for review August 12, 2021 06:26
Member

@deepthi deepthi left a comment


All I found is a typo in a comment. Otherwise LGTM

```diff
@@ -1329,7 +1336,12 @@ func (e *Executor) readPendingMigrationsUUIDs(ctx context.Context) (uuids []stri
 }
 
 // terminateMigration attempts to interrupt and hard-stop a running migration
-func (e *Executor) terminateMigration(ctx context.Context, onlineDDL *schema.OnlineDDL, lastMigrationUUID string) (foundRunning bool, err error) {
+func (e *Executor) terminateMigration(ctx context.Context, onlineDDL *schema.OnlineDDL) (foundRunning bool, err error) {
 	// It's possible the killing the migratoin fails for whatever reason, in which case
```
Member


Suggested change:

```diff
-// It's possible the killing the migratoin fails for whatever reason, in which case
+// It's possible the killing the migration fails for whatever reason, in which case
```

Contributor Author


fixed

@shlomi-noach shlomi-noach merged commit 740cff3 into vitessio:main Aug 17, 2021
@shlomi-noach shlomi-noach deleted the onlineddl-tracking-running-migrations branch August 17, 2021 07:39