Add Support for Restoring Specific Backups #8824

Merged
merged 6 commits into from
Sep 26, 2021

mattlord
Contributor

@mattlord mattlord commented Sep 15, 2021

Description

By default, Vitess only makes practical use of the latest backup of a given shard. While this makes perfect sense for the common use cases, there are times when you may want to restore a specific backup. For example:

  1. To extract a portion of the data that can then be merged with the current state. For example, you realize that you accidentally deleted some records from a table last week, and you need to perform a restore so that you can copy those specific records back to the live data set.
  2. To perform validation, forensics, or analysis on the system state at that time.
  3. A specific PITR for whatever reason ...

This work supports that in two different ways:

  1. You can start a tablet with: -restore_from_backup -restore_from_backup_ts 2021-04-29.133050
  2. You can use vtctl with: vtctlclient -server=<vtctld-server>:<vtctld-port> RestoreFromBackup -backup_timestamp=2021-04-29.133050 <tablet-alias>

How this compares with PITR

While Vitess supports a method to accomplish PITR, it's a fairly involved process with the details left up to the user (e.g. binlog servers). This work offers an alternative that can be used in a variety of circumstances. For example, suppose we realize that we made a mistake 1-2 days ago and want to look at the previous state of the data on one tablet in the shard. We could do that with:

$ vtctlclient -server=<vtctld-server>:<vtctld-port> RestoreFromBackup \
    -backup_timestamp=$(date -d"2 days ago" +"%Y-%m-%d").000000 <tablet-alias> && \
      vtctlclient -server=<vtctld-server>:<vtctld-port> StopReplication <tablet-alias>

This stops MySQL replication on the tablet as soon as the restore completes and prevents it from being restarted automatically (via tablet repair). You can then do whatever you like on that tablet before deciding to:

  1. Throw it away (perhaps we added this new tablet just for this purpose)
  2. Let it catch up with replication normally (by starting replication again via vtctlclient StartReplication)
  3. Perform a Restore on it using the latest shard backup
  4. ...
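
The `-backup_timestamp` value in the command above is built with GNU `date`, which is not available everywhere. As a minimal sketch (not part of this PR), the same value could be produced in Go. The layout constant is assumed to match `mysqlctl.BackupTimestampFormat`, the helper name `midnightDaysAgo` and the variable `exampleNow` are hypothetical, and the sketch uses UTC where the shell command uses the local time zone:

```go
package main

import (
	"fmt"
	"time"
)

// backupTimestampFormat is the yyyy-MM-dd.HHmmss format from this PR written
// as a Go layout. Assumption: it matches mysqlctl.BackupTimestampFormat.
const backupTimestampFormat = "2006-01-02.150405"

// exampleNow is a fixed instant used only to make the example reproducible.
var exampleNow = time.Date(2021, time.September, 17, 3, 46, 0, 0, time.UTC)

// midnightDaysAgo (hypothetical helper) returns the backup timestamp for
// 00:00:00 UTC on the day that lies `days` days before `now`.
func midnightDaysAgo(now time.Time, days int) string {
	d := now.UTC().AddDate(0, 0, -days)
	midnight := time.Date(d.Year(), d.Month(), d.Day(), 0, 0, 0, 0, time.UTC)
	return midnight.Format(backupTimestampFormat)
}

func main() {
	// In real use, pass time.Now() instead of the fixed example instant.
	fmt.Println(midnightDaysAgo(exampleNow, 2))
}
```

The printed string can be passed as-is to `-backup_timestamp` or `-restore_from_backup_ts`.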

Co-authored-by: Guido Iaquinti [email protected]
Signed-off-by: Matt Lord [email protected]

Related Issue(s)

This is a continuation of: #7998
This solves: #4905

Checklist

@mattlord mattlord force-pushed the RestoreFromTimestamp branch from 597aa29 to 50938fd Compare September 15, 2021 22:01
@guidoiaquinti
Member

Once tests are added LGTM. Thank you for moving this forward!

@mattlord mattlord force-pushed the RestoreFromTimestamp branch 5 times, most recently from 1caad45 to 863df25 Compare September 17, 2021 02:17
@mattlord
Contributor Author

mattlord commented Sep 17, 2021

I've tested and verified the tablet flag as well as the vtctlclient flag:

Backups

$ vtctlclient ListBackups commerce/0
2021-09-17.034338.zone1-0000000101
2021-09-17.034524.zone1-0000000101

Updated help output

$ vtctlclient RestoreFromBackup --help
Usage: RestoreFromBackup [-backup_timestamp=yyyy-MM-dd.HHmmss] <tablet alias>

Stops mysqld and restores the data from the latest backup or if a timestamp is specified then the most recent backup at or before that time.

  -backup_timestamp string
    	Use the backup taken at or before this timestamp rather than using the latest backup.

Invalid timestamp

$ vtctlclient RestoreFromBackup -backup_timestamp=222 zone1-101
RestoreFromBackup Error: rpc error: code = Unknown desc = TabletManager.RestoreFromBackup on zone1-0000000101 error: unable to parse the backup timestamp value provided of '222': unable to parse the backup timestamp value provided of '222'
E0917 03:46:24.807775    3246 main.go:76] remote error: rpc error: code = Unknown desc = TabletManager.RestoreFromBackup on zone1-0000000101 error: unable to parse the backup timestamp value provided of '222': unable to parse the backup timestamp value provided of '222'

Uses latest / 2021-09-17.034524

$ vtctlclient RestoreFromBackup zone1-101
I0917 03:46:49.446409    3252 main.go:67] I0917 03:46:49.445436 backup.go:264] I0917 03:46:49.444950 backup.go:270] Restore: looking for a suitable backup to restore
I0917 03:46:49.446658    3252 main.go:67] I0917 03:46:49.445792 backup.go:264] I0917 03:46:49.445654 backupengine.go:222] Restore: found latest backup commerce/0 2021-09-17.034524.zone1-0000000101 to restore
...
I0917 03:46:57.013383    3252 main.go:67] I0917 03:46:57.013171 backup.go:264] I0917 03:46:57.012888 backup.go:356] Restore: restarting mysqld after mysql_upgrade

Uses the older one based on the timestamp

$ vtctlclient RestoreFromBackup -backup_timestamp=2021-09-17.034338 zone1-101
I0917 03:48:57.366954    5613 main.go:67] I0917 03:48:57.366055 backup.go:264] I0917 03:48:57.363727 backup.go:270] Restore: looking for a suitable backup to restore
I0917 03:48:57.367262    5613 main.go:67] I0917 03:48:57.366087 backup.go:264] I0917 03:48:57.364443 backupengine.go:224] Restore: found backup commerce/0 2021-09-17.034338.zone1-0000000101 to restore using the specified timestamp of '2021-09-17.034338'
...
I0917 03:49:04.905961    5613 main.go:67] I0917 03:49:04.905797 backup.go:264] I0917 03:49:04.905601 backup.go:356] Restore: restarting mysqld after mysql_upgrade
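
The selection rule exercised by the two runs above (latest backup by default, otherwise the most recent backup taken at or before the given timestamp) can be sketched roughly as follows. This is an illustration only, not the code in `backupengine.go`: `pickBackup` and `backupTime` are hypothetical helpers, the layout constant is assumed to match `mysqlctl.BackupTimestampFormat`, and backup names are assumed to follow the `<timestamp>.<tablet alias>` pattern seen in the `ListBackups` output.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Assumption: matches mysqlctl.BackupTimestampFormat (yyyy-MM-dd.HHmmss).
const backupTimestampFormat = "2006-01-02.150405"

// backupTime extracts the timestamp from a backup name such as
// "2021-09-17.034338.zone1-0000000101" (timestamp, then tablet alias).
func backupTime(name string) (time.Time, error) {
	parts := strings.SplitN(name, ".", 3)
	if len(parts) != 3 {
		return time.Time{}, fmt.Errorf("unexpected backup name %q", name)
	}
	return time.Parse(backupTimestampFormat, parts[0]+"."+parts[1])
}

// pickBackup (hypothetical helper) returns the newest backup taken at or
// before ts. An empty ts selects the latest backup.
func pickBackup(names []string, ts string) (string, error) {
	var cutoff time.Time // the zero value means "use the latest"
	if ts != "" {
		var err error
		if cutoff, err = time.Parse(backupTimestampFormat, ts); err != nil {
			return "", fmt.Errorf("unable to parse the backup timestamp value provided of '%s'", ts)
		}
	}
	var best string
	var bestTime time.Time
	for _, name := range names {
		t, err := backupTime(name)
		if err != nil {
			continue // ignore names that do not follow the expected pattern
		}
		if !cutoff.IsZero() && t.After(cutoff) {
			continue // taken after the requested time
		}
		if best == "" || t.After(bestTime) {
			best, bestTime = name, t
		}
	}
	if best == "" {
		return "", fmt.Errorf("no backup found at or before '%s'", ts)
	}
	return best, nil
}

func main() {
	names := []string{
		"2021-09-17.034338.zone1-0000000101",
		"2021-09-17.034524.zone1-0000000101",
	}
	for _, ts := range []string{"", "2021-09-17.034338", "222"} {
		name, err := pickBackup(names, ts)
		fmt.Printf("ts=%q backup=%q err=%v\n", ts, name, err)
	}
}
```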

Now I have a good idea of what to test and how in the test suite. 🙂

@mattlord mattlord force-pushed the RestoreFromTimestamp branch 2 times, most recently from 40a66cf to 0d9f2c8 Compare September 17, 2021 03:21
@mattlord mattlord force-pushed the RestoreFromTimestamp branch from 0d9f2c8 to bf78488 Compare September 17, 2021 03:35
Member

@deepthi deepthi left a comment

Looking good, except protobuf definition is missing.

go/vt/proto/tabletmanagerdata/tabletmanagerdata.pb.go Outdated Show resolved Hide resolved
@mattlord mattlord force-pushed the RestoreFromTimestamp branch from e61794f to 4b80776 Compare September 17, 2021 23:42
And add missing protobuf change

Signed-off-by: Matt Lord <[email protected]>
@mattlord mattlord force-pushed the RestoreFromTimestamp branch 5 times, most recently from 9db70ac to 25fa195 Compare September 21, 2021 01:11
@mattlord mattlord force-pushed the RestoreFromTimestamp branch from dff915a to f8ec301 Compare September 21, 2021 16:08
mm == minutes (two digits)
MM == month (two digits)

Signed-off-by: Matt Lord <[email protected]>
Member

@deepthi deepthi left a comment

LGTM.
Let's get another set of eyes before merging.
Can you please create a website PR to document the new usage (vttablet flag & vtctl command)?

Comment on lines +332 to +333
err = localCluster.VtctlclientProcess.ExecuteCommand("DeleteTablet", "-allow_primary=true", replica2.Alias)
require.Nil(t, err)
Member

This is a clever way to simulate the test scenario 👍

@deepthi deepthi requested review from sougou and ajm188 September 24, 2021 00:01
mattlord added a commit to vitessio/website that referenced this pull request Sep 24, 2021
This documents the work done in:
  vitessio/vitess#8824

Signed-off-by: Matt Lord <[email protected]>
@mattlord
Contributor Author

> Can you please create a website PR to document the new usage (vttablet flag & vtctl command)?

vitessio/website#834 (noted in the description now and linked in comments)

Contributor

@ajm188 ajm188 left a comment

I wanted to bring up one comment about the types we're using, but if you don't feel it's worth it to change then I'm happy to merge as-is.

Comment on lines 159 to 165
// Check if we should use the latest (default) or a specified backup timestamp for the restore
if restoreFromBackupTs != "" {
	startTime, err = time.Parse(mysqlctl.BackupTimestampFormat, restoreFromBackupTs)
	if err != nil {
		return vterrors.New(vtrpcpb.Code_INVALID_ARGUMENT, fmt.Sprintf("unable to parse the backup timestamp value provided of '%s'", restoreFromBackupTs))
	}
}
Contributor

I skimmed the rest of the PR and it doesn't seem like anyone brought this up. Another way to do this would be to push the parsing further up to the CLI and enforce stricter types closer to this function. That would look like this:

  • commandRestoreBackup in go/vt/vtctl/vtctl.go parses its backupTimestampStr into a time.Time, using mysqlctl.BackupTimestampFormat
  • we pass time.Time to all the tmc functions
  • when making the gRPC request, we change that protobuf definition to be a vttime.Time, and call protoutil.TimeToProto to convert the time.Time to a vttime.Time, and convert back on the tabletmanager side.

What do you think?
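
A rough sketch of the flow proposed above: parse once at the CLI, carry a `time.Time`, and convert to and from a wire type around the RPC. This is not the code that was merged. `wireTime` is a hypothetical stand-in for the `vttime.Time` protobuf message, `toWire` and `fromWire` stand in for the `protoutil` conversions, and mapping the zero time to `nil` is an assumption made here to represent "use the latest backup".

```go
package main

import (
	"fmt"
	"time"
)

// Assumption: matches mysqlctl.BackupTimestampFormat (yyyy-MM-dd.HHmmss).
const backupTimestampFormat = "2006-01-02.150405"

// wireTime is a hypothetical stand-in for the vttime.Time protobuf message.
type wireTime struct {
	Seconds     int64
	Nanoseconds int32
}

// parseBackupTimestamp runs on the CLI side, so a malformed value is rejected
// before any RPC is made. An empty string yields the zero time (use latest).
func parseBackupTimestamp(s string) (time.Time, error) {
	if s == "" {
		return time.Time{}, nil
	}
	t, err := time.Parse(backupTimestampFormat, s)
	if err != nil {
		return time.Time{}, fmt.Errorf("unable to parse the backup timestamp value provided of '%s'", s)
	}
	return t, nil
}

// toWire converts a time.Time for the request. The zero time becomes nil.
func toWire(t time.Time) *wireTime {
	if t.IsZero() {
		return nil
	}
	return &wireTime{Seconds: t.Unix(), Nanoseconds: int32(t.Nanosecond())}
}

// fromWire converts back on the tablet manager side. nil becomes the zero time.
func fromWire(w *wireTime) time.Time {
	if w == nil {
		return time.Time{}
	}
	return time.Unix(w.Seconds, int64(w.Nanoseconds)).UTC()
}

func main() {
	for _, s := range []string{"2021-04-29.133050", "", "222"} {
		t, err := parseBackupTimestamp(s)
		if err != nil {
			fmt.Println("rejected before any RPC:", err)
			continue
		}
		restored := fromWire(toWire(t))
		fmt.Printf("input=%q restored=%v useLatest=%v\n", s, restored, restored.IsZero())
	}
}
```

With this shape, the tablet side only ever sees a typed time, and the zero value keeps the existing default of restoring the latest backup.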

Contributor Author

I briefly considered this but wasn't sure of the benefits (avoiding the RPC call with an invalid vtctl CLI parameter is one). I don't mind making this change if you prefer it? Sounds like you do or you wouldn't have mentioned it 😄 ... so I'll work on that now. Thanks!

Contributor Author

@mattlord mattlord Sep 24, 2021

OK, after trying it again I'm remembering WHY I didn't go this route last time. 🙂

What makes it feel awkward/clunky to use time.Time coming from the vtctl side is that there's also a corresponding vttablet flag that's used directly in the tmc:

restoreFromBackupTs = flag.String("restore_from_backup_ts", "", "(init restore parameter) if set, restore the latest backup taken at or before this timestamp. Example: '2021-04-29.133050'")

As things are now, it's uniform handling (always a string) on the tmc side -- whether the value came from a vtctl call and RPC or from the vttablet flag. Maybe I'm missing something though?

I understand the pull to use a more precise type... but it also feels like it makes the code more complex with little practical benefit. If you feel strongly about it though I can continue.

Contributor Author

OK, I just switched to this method. Thank you for the nudge! I feel better about it too, and it wasn't really that awkward in the end. :-)

@mattlord mattlord force-pushed the RestoreFromTimestamp branch 5 times, most recently from 0920d09 to 67db03f Compare September 25, 2021 04:25
@mattlord mattlord force-pushed the RestoreFromTimestamp branch from 67db03f to f9c6938 Compare September 25, 2021 05:47
Contributor

@ajm188 ajm188 left a comment

lgtm!

@@ -636,9 +636,21 @@ func (tm *TabletManager) handleRestore(ctx context.Context) (bool, error) {
// Open the state manager after restore is done.
defer tm.tmState.Open()

// Zero date will cause us to use the latest, which is the default
Contributor

👍

@ajm188 ajm188 merged commit 84aad0d into vitessio:main Sep 26, 2021
@mattlord mattlord deleted the RestoreFromTimestamp branch September 27, 2021 14:28
mattlord added a commit to planetscale/vitess that referenced this pull request Sep 27, 2021
Follow-up to: vitessio#8824

I noticed this oversight after it was merged. :-)

Signed-off-by: Matt Lord <[email protected]>
Labels
Component: Backup and Restore Component: vtctl Type: Enhancement Logical improvement (somewhere between a bug and feature)