This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

BR failed to backup stats #679

Closed
3pointer opened this issue Jan 6, 2021 · 14 comments

Comments

@3pointer
Collaborator

3pointer commented Jan 6, 2021

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    If possible, provide a recipe for reproducing the error.
    Upgrade the TiDB cluster from v4.0.0 to v4.0.9, then use BR v4.0.9 to back up.

  2. What did you expect to see?
    Backup succeeds.

  3. What did you see instead?

[ERROR] [backup.go:25] ["failed to backup"] [error="[types:1406]Data Too Long, field len 5, data len 6"] [errorVerbose="[types:1406]Data Too Long, field len 5, data len 6
github.com/pingcap/errors.AddStack
	github.com/pingcap/[email protected]/errors.go:174
github.com/pingcap/errors.(*Error).GenWithStack
	github.com/pingcap/[email protected]/normalize.go:147
github.com/pingcap/tidb/types.ProduceStrWithSpecifiedTp
	github.com/pingcap/[email protected]/types/datum.go:995
github.com/pingcap/tidb/types.(*Datum).convertToString
	github.com/pingcap/[email protected]/types/datum.go:942
github.com/pingcap/tidb/types.(*Datum).ConvertTo
	github.com/pingcap/[email protected]/types/datum.go:821
github.com/pingcap/tidb/statistics/handle.(*Handle).histogramFromStorage
	github.com/pingcap/[email protected]/statistics/handle/handle.go:667
github.com/pingcap/tidb/statistics/handle.(*Handle).columnStatsFromStorage
	github.com/pingcap/[email protected]/statistics/handle/handle.go:460
github.com/pingcap/tidb/statistics/handle.(*Handle).tableStatsFromStorage
	github.com/pingcap/[email protected]/statistics/handle/handle.go:539
github.com/pingcap/tidb/statistics/handle.(*Handle).tableStatsToJSON
	github.com/pingcap/[email protected]/statistics/handle/dump.go:89
github.com/pingcap/tidb/statistics/handle.(*Handle).DumpStatsToJSON
	github.com/pingcap/[email protected]/statistics/handle/dump.go:76
github.com/pingcap/br/pkg/backup.BuildBackupRangeAndSchema
	github.com/pingcap/br@/pkg/backup/client.go:340
github.com/pingcap/br/pkg/task.RunBackup
	github.com/pingcap/br@/pkg/task/backup.go:257
github.com/pingcap/br/cmd.runBackupCommand
	github.com/pingcap/br@/cmd/backup.go:24
github.com/pingcap/br/cmd.newFullBackupCommand.func1
	github.com/pingcap/br@/cmd/backup.go:85
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/[email protected]/command.go:842
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/[email protected]/command.go:950
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/[email protected]/command.go:887
main.main
	github.com/pingcap/br@/main.go:58
runtime.main
	runtime/proc.go:203
runtime.goexit
	runtime/asm_amd64.s:1357"]
  4. What version of BR and TiDB/TiKV/PD are you using?
    BR v4.0.9
    TiDB v4.0.9 (upgraded from v4.0.0)
@3pointer 3pointer added the type/bug Something isn't working label Jan 6, 2021
@3pointer
Collaborator Author

3pointer commented Jan 6, 2021

The reason for the failure is that DumpStatsToJSON is not compatible when TiDB is upgraded from v4.0.0.
As a workaround, users can add --ignore-stats=true to the command to skip backing up stats. We also need to find a better way to back up stats with compatibility in mind.

@kennytm
Collaborator

kennytm commented Jan 6, 2021

This is a bug in TiDB. Still, such failure probably shouldn't abort backup/restore since stats aren't essential.

Also, the dump-stats error should log which table is the cause.

br/pkg/backup/client.go

Lines 348 to 358 in 458d542

var stats []byte
if !ignoreStats {
	jsonTable, err := h.DumpStatsToJSON(dbInfo.Name.String(), tableInfo, nil)
	if err != nil {
		return nil, nil, errors.Trace(err)
	}
	stats, err = json.Marshal(jsonTable)
	if err != nil {
		return nil, nil, errors.Trace(err)
	}
}
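The suggestion above (don't abort, do name the table) admits a fairly direct shape. Below is a minimal, self-contained sketch; the `dumpStats` stub and `buildStats` helper are hypothetical stand-ins for BR's internals, not its actual API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dumpStats is a hypothetical stand-in for h.DumpStatsToJSON; the real call
// lives in the TiDB statistics handle.
func dumpStats(db, table string) (map[string]any, error) {
	if table == "broken" {
		return nil, fmt.Errorf("[types:1406]Data Too Long, field len 5, data len 6")
	}
	return map[string]any{"database_name": db, "table_name": table}, nil
}

// buildStats downgrades a stats-dump failure to a warning that names the
// offending table, so the backup can proceed with stats == nil.
func buildStats(db, table string, ignoreStats bool) []byte {
	if ignoreStats {
		return nil
	}
	jsonTable, err := dumpStats(db, table)
	if err != nil {
		// Log which table caused the failure instead of aborting the backup.
		fmt.Printf("[WARN] backup stats failed, skipping [db=%s] [table=%s] [err=%v]\n", db, table, err)
		return nil
	}
	stats, err := json.Marshal(jsonTable)
	if err != nil {
		fmt.Printf("[WARN] marshal stats failed, skipping [db=%s] [table=%s] [err=%v]\n", db, table, err)
		return nil
	}
	return stats
}

func main() {
	fmt.Println(buildStats("test", "broken", false) == nil) // true: failure skipped, not fatal
	fmt.Println(len(buildStats("test", "t1", false)) > 0)   // true: stats marshalled
}
```

Since restore already skips a nil Stats field (see the restore snippet below in this thread), returning nil here keeps the two sides consistent.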

@IANTHEREAL
Collaborator

IANTHEREAL commented Jan 6, 2021

What is the trigger mechanism of the dumpStats failure?

Stats shouldn't cause the backup to fail. If the stats can be backed up and restored, good. If BR can't back up stats, it needs to output a backup summary log indicating that backing up stats failed; and if there are no stats when restoring, then BR needs to be able to run analyze table.

Stats information is stored in the backupmeta. What if we back up an entire large-scale cluster? How do we deal with a very large backupmeta? Does that make sense?

@kennytm
Collaborator

kennytm commented Jan 6, 2021

If the stats field is nil during restore, it will just be skipped.

br/pkg/restore/client.go

Lines 816 to 826 in 458d542

if table.Stats != nil {
	log.Info("start loads analyze after validate checksum",
		zap.Stringer("db name", tbl.OldTable.DB.Name),
		zap.Stringer("name", tbl.OldTable.Info.Name),
		zap.Int64("old id", tbl.OldTable.Info.ID),
		zap.Int64("new id", tbl.Table.ID),
	)
	if err := rc.statsHandler.LoadStatsFromJSON(rc.dom.InfoSchema(), table.Stats); err != nil {
		log.Error("analyze table failed", zap.Any("table", table.Stats), zap.Error(err))
	}
}

Missing stats won't prevent the user from accessing the data; the queries just won't be optimized. ANALYZE TABLE can be performed separately.
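A restore-side fallback could be as simple as collecting an ANALYZE statement for every table whose Stats field came back nil. A sketch under that assumption, using a simplified `table` struct rather than BR's real types:

```go
package main

import "fmt"

// table is a simplified stand-in for BR's restored-table metadata.
type table struct {
	DB, Name string
	Stats    []byte // nil when the backup carried no stats for this table
}

// analyzeFallback returns the ANALYZE statements to run for tables restored
// without stats, so the cluster ends up with usable optimizer statistics.
func analyzeFallback(tables []table) []string {
	var stmts []string
	for _, t := range tables {
		if t.Stats == nil {
			stmts = append(stmts, fmt.Sprintf("ANALYZE TABLE `%s`.`%s`", t.DB, t.Name))
		}
	}
	return stmts
}

func main() {
	got := analyzeFallback([]table{
		{DB: "test", Name: "t1", Stats: []byte("{}")}, // stats restored from backup
		{DB: "test", Name: "t2"},                      // no stats in the backup
	})
	fmt.Println(got)
}
```

Whether BR runs these statements itself or merely prints them for the user is exactly the trade-off debated in the rest of this thread.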

@IANTHEREAL
Collaborator

OK, that works as a workaround.

For a better user experience, restore needs to support analyze table for tables without stats. Many users may not know about analyze table, and restore should be a complete operation: the cluster should be in a good state after restoring.

@kennytm
Collaborator

kennytm commented Jan 7, 2021

Running "analyze table" requires scanning the tables again though, which is slow (especially when restoring from an old backup where the stats are all missing).

@IANTHEREAL
Collaborator

That's true, but analyze table is necessary. So we should make sure that backing up statistics succeeds in the vast majority of situations.

@IANTHEREAL
Collaborator

IANTHEREAL commented Jan 7, 2021

As discussed with @kennytm, the consensus is as follows:

  1. the most important thing is that BR ensures a high success rate for backing up stats, so that most statistics can be backed up successfully; the success rate should reach 99.9%
  2. a table restored without stats can be considered unavailable, and requires human or restore-program intervention
  3. the ignore-stats option of backup is mainly for testing. If someone backs up with ignore-stats, restore does not need to execute analyze table automatically
  4. BR restore should provide an option skip-analyze whose default value is false
  5. backups from TiDB (version < 4.0.9) can skip analyze table
  6. if there are table backups that do not contain stats information, the user should be reminded that restore will take a long time
  7. if restore fails to execute an analyze table statement, the log will prompt the user to manually execute analyze table, and restore will continue with other tables.

@kennytm
Collaborator

kennytm commented Jan 7, 2021

my preference is:

4. the option should be --analyze (default is true) which could be set to false for those who really can't wait
5. no need to do version check. we do analyze even if restoring pre-4.0.9 backup archives
8. we should clearly document this analyze behavior.
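The preferred option shape above (--analyze, default true) can be sketched with the standard flag package; the flag name comes from this discussion, not from a released BR version:

```go
package main

import (
	"flag"
	"fmt"
)

// parseRestoreFlags sketches the proposed restore option: --analyze defaults
// to true and can be set to false by those who really can't wait.
func parseRestoreFlags(args []string) (analyze bool, err error) {
	fs := flag.NewFlagSet("restore", flag.ContinueOnError)
	a := fs.Bool("analyze", true, "run ANALYZE TABLE for tables restored without stats")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *a, nil
}

func main() {
	a1, _ := parseRestoreFlags(nil)                         // no flag given: default applies
	a2, _ := parseRestoreFlags([]string{"--analyze=false"}) // explicit opt-out
	fmt.Println(a1, a2)
}
```

Making analyze the default matches point 5: no version check is needed, since the same behavior applies to pre-4.0.9 archives.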

@IANTHEREAL
Collaborator

I have no problem with these preferences ⏩

For versions < 4.0.9, I have no special requirements. In order to restore quickly, we can allow restores of backups from versions below 4.0.9 to skip analyze table; that can be a choice. But I don't want many stats backup failures after version 4.0.9, to avoid overly long restore times.

@IANTHEREAL
Collaborator

IANTHEREAL commented Jan 18, 2021

I think we made some wrong decisions before, solving the immediate problem out of inertia. I have a new proposal. The product factors considered are as follows:
one is that users should not need to know about statistics at all;
the other is that we should ensure a high success rate for statistics backup and restore.

I suggest not introducing more complicated logic. Because ignore-stats already exists, we can temporarily shelve it, but I recommend not introducing skip-analyze; we do not want this parameter in the product long-term, and it would only be a temporary solution. What we really need is a fast, highly reliable statistics backup and restore mechanism, unless we simply can't build one.

The recommended processing logic is as follows:

  1. if backing up statistics fails, the backup fails
  2. if restoring statistics fails, the restore fails, but it can tell the user that the data has been successfully restored and that they can manually execute analyze table; we could even support restoring statistics separately

@kennytm @3pointer

@kennytm
Collaborator

kennytm commented Jan 19, 2021

So we'll be changing how we back up stats, and also supporting system tables like RBAC info and global variables.

  1. We will back up all relevant mysql.* tables via normal SST backup. This includes user information, statistics, etc. The schema is first renamed to tidb_br_temp_mysql.*.
    Since mysql.* tables are then treated like a normal backup, this should never trigger processing logic (1), as they are now equal.

  2. After restoring the tidb_br_temp_mysql.*, we will use the SQL REPLACE INTO mysql.xxx SELECT ... FROM tidb_br_temp_mysql.xxx WHERE ... to insert the data back into the real mysql.* tables.
    These SQL may fail but won't impact overall data as logic (2) suggests.

To further support logic (2), and support backward/forward compatibility,

  1. BR will never execute ANALYZE.
  2. We will add a flag --system-tables='minimal,privileges,statistics' to backup and restore to allow skipping part of the system tables when they exist.
  3. Version compatibility:
    • 4.0.8 backup → New restore: If the system table is completely not found in backup archive but --system-tables specifies it, a non-fatal warning is issued about what should be done. For --system-tables=statistics specifically, we will list all selected tables for restore as potential targets of ANALYZE.
    • 4.0.9 backup → New restore: The Stats field will just be ignored. BR will ask user to do ANALYZE manually. Also the --ignore-stats flag becomes no-op.
    • New backup → 4.0.9 restore: We must never back up the system tables under the name mysql.* directly, since this would corrupt older clusters that don't understand how to work with them. Therefore, during backup we must already rename mysql.* to tidb_br_temp_mysql.*.
  4. Feature compatibility:
    • We haven't investigated if incremental backup can handle system tables. There are two things to consider:
      1. the incremental backed up system tables do not contain complete information.
      2. if a version upgrade happened between last-backupts and backupts, there may be DDLs operating on mysql.* specifically. They should either be skipped (current situation) or renamed.
    • Raw backup/restore is simply irrelevant.
    • Online restore - The restore SST part is the same as normal restore. The execute-SQL part will be done directly in the BR node. We may use REPLACE LOW_PRIORITY INTO … if it makes any difference. Each group of SQL execution should happen inside a transaction. There may be a problem when a system table is too large to fit in a single transaction. That is, without large transaction support, we may have to split those REPLACE statements into multiple self-consistent batches, and this could complicate the logic quite a lot.
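The batching concern in the last bullet can be made concrete with a small generator. This is only an illustrative sketch: the table name, LIMIT/OFFSET pagination, and row counts are assumptions, and a real implementation would batch on a key range rather than OFFSET to keep each batch self-consistent.

```go
package main

import "fmt"

// batchReplaceSQL sketches splitting the copy-back from the renamed backup
// schema (tidb_br_temp_mysql.*) into bounded batches, for clusters without
// large-transaction support. OFFSET pagination is purely illustrative.
func batchReplaceSQL(table string, totalRows, batchSize int) []string {
	var stmts []string
	for start := 0; start < totalRows; start += batchSize {
		stmts = append(stmts, fmt.Sprintf(
			"REPLACE INTO mysql.%[1]s SELECT * FROM tidb_br_temp_mysql.%[1]s LIMIT %d OFFSET %d",
			table, batchSize, start))
	}
	return stmts
}

func main() {
	// 2500 rows with a batch size of 1000 → three REPLACE statements.
	for _, s := range batchReplaceSQL("stats_buckets", 2500, 1000) {
		fmt.Println(s)
	}
}
```

Each generated statement would then run in its own transaction, which is exactly where the "multiple self-consistent batches" complexity comes from.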

@SunRunAway
Contributor

  1. mysql.user may contain account information which the customer may no longer remember.
  2. TiDB allows a customer to create a table in the mysql schema.

I suggest we back up the system tables, and allow opting out when restoring.

@kennytm @IANTHEREAL

@SunRunAway
Contributor

The second reason is very important and critical.

As a matter of principle, we should back up everything and restore selectively, instead of backing up selectively.
