backupccl: reintroduce previously offline tables with manifest.DescriptorChanges

getReintroducedSpans finds all tables included in the current and previous
backup that may have undergone a non-mvcc operation. The current backup will
then back up these tables' spans from ts = 0, as the previous backup may have
missed certain non-mvcc operations (e.g. ClearRange, non-mvcc AddSSTable).
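
For context, a rough self-contained sketch (hypothetical toy types, not the actual
makeSimpleImportSpans / createIntroducedSpanFrontier machinery) of why reintroduction
matters to restore: once a backup reintroduces a table's span, a restore should only
use files for that span from the introducing backup onward, because earlier backups
in the chain may be missing the non-mvcc writes.

```go
package main

import "fmt"

// Toy chain: each backup lists file names per table ID and the table IDs it
// introduced (backed up from ts=0). Hypothetical types for illustration only.
type toyBackup struct {
	Files      map[int][]string
	Introduced map[int]bool
}

// filesForTable walks the chain and keeps only files at or after the last
// backup that introduced the table's span.
func filesForTable(chain []toyBackup, tableID int) []string {
	var out []string
	for _, b := range chain {
		if b.Introduced[tableID] {
			out = nil // earlier files are superseded by the from-ts=0 backup
		}
		out = append(out, b.Files[tableID]...)
	}
	return out
}

func main() {
	chain := []toyBackup{
		{Files: map[int][]string{52: {"full/1.sst"}}},
		{Files: map[int][]string{52: {"inc1/2.sst"}}, Introduced: map[int]bool{52: true}},
		{Files: map[int][]string{52: {"inc2/3.sst"}}},
	}
	fmt.Println(filesForTable(chain, 52)) // [inc1/2.sst inc2/3.sst]
}
```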

To find these tables, getReintroducedSpans must find all tables covered in the
previous backup that were in the offline state at previous backup start time.

This function assumed that the previous backup's manifest.Descriptors field
would contain all tables covered in the previous backup; however, while
investigating cockroachdb#88042, we discovered that this assumption does not
hold on pre-22.2 branches. During revision history backups, a table with an
in-progress import (i.e. offline at backup time) can get backed up and included
in manifest.DescriptorChanges but not in manifest.Descriptors. As a result,
getReintroducedSpans missed reintroducing spans from such a table, which means
backup chains may have missed backing up some data.
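
As a rough illustration of the gap (a self-contained toy with hypothetical types,
not the real backuppb.BackupManifest fields or the patch itself): a table that is
offline for the whole backup interval can show up only in DescriptorChanges, so
building the "offline in last backup" set from Descriptors alone misses it, while
also consulting the highest-version revision per table in DescriptorChanges
catches it.

```go
package main

import "fmt"

// Toy stand-ins for the two manifest fields (hypothetical simplified types;
// the real ones live on backuppb.BackupManifest).
type tableRev struct {
	ID      int
	Version int
	Offline bool
}

type toyManifest struct {
	Descriptors       []tableRev // descriptors gathered at backup start time
	DescriptorChanges []tableRev // revisions tracked over the backup interval
}

func main() {
	// A pre-22.2 revision history backup taken while table 52 had an
	// in-progress IMPORT: the offline revision is tracked in
	// DescriptorChanges, but the table is absent from Descriptors.
	prev := toyManifest{
		Descriptors:       []tableRev{{ID: 51, Version: 3, Offline: false}},
		DescriptorChanges: []tableRev{{ID: 52, Version: 7, Offline: true}},
	}

	// Pre-patch: the "offline in last backup" set was built from Descriptors
	// only, so table 52 is never considered for reintroduction.
	offlineOld := map[int]bool{}
	for _, d := range prev.Descriptors {
		if d.Offline {
			offlineOld[d.ID] = true
		}
	}

	// Post-patch: also take the highest-version revision per table from
	// DescriptorChanges and mark the table offline if that revision is offline.
	latest := map[int]tableRev{}
	for _, r := range prev.DescriptorChanges {
		if cur, ok := latest[r.ID]; !ok || cur.Version < r.Version {
			latest[r.ID] = r
		}
	}
	offlineNew := map[int]bool{}
	for id, r := range latest {
		if r.Offline {
			offlineNew[id] = true
		}
	}

	fmt.Println(offlineOld[52], offlineNew[52]) // false true
}
```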

It's worth noting this bug could only affect a backup chain in 22.2 if the
backup that observed the table going offline was taken on a pre-22.2
cluster. Because 22.2 backups explicitly back up all offline tables, all tables
in manifest.DescriptorChanges should also be in manifest.Descriptors. In other
words, before this patch, the following sequence could lead to a corrupt backup
chain:
- t0: begin IMPORT on foo, on a pre-22.2 cluster
- t1: conduct a revision history backup of foo
- t2: complete or cancel the import of foo, via a non-mvcc operation
- t3: upgrade the cluster to 22.2
- t4: conduct a revision history incremental backup of foo
     - foo's span will not get re-introduced

This patch fixes getReintroducedSpans to ensure it reintroduces tables included
in manifest.DescriptorChanges whose last revision brought the table offline. In
addition, this patch adds a check to restore planning which causes the restore
to fail immediately if any restore target is missing an introduced span. The
patch tests this check on a corrupt backup taken on v21.2.
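
The new restore-planning check can be sketched in isolation as follows (a minimal,
self-contained sketch with hypothetical toy types; the real check,
checkMissingIntroducedSpans in pkg/ccl/backupccl/targets.go, additionally consults
DescriptorChanges, the restore AOST, and span/table-ID decoding):

```go
package main

import "fmt"

// Toy manifest: the table IDs a backup covers and the table IDs whose spans it
// reintroduced (backed up from ts=0). Hypothetical types for illustration only.
type toyBackup struct {
	TableIDs           []int
	IntroducedTableIDs []int
}

func toSet(ids []int) map[int]bool {
	s := make(map[int]bool, len(ids))
	for _, id := range ids {
		s[id] = true
	}
	return s
}

// checkChain mirrors the spirit of the new planning check: if a restore-target
// table appears in backup i but was not in backup i-1, then backup i must have
// introduced its span, otherwise the restore fails up front.
func checkChain(requiredTables map[int]bool, chain []toyBackup) error {
	for i := 1; i < len(chain); i++ {
		prev := toSet(chain[i-1].TableIDs)
		introduced := toSet(chain[i].IntroducedTableIDs)
		for _, id := range chain[i].TableIDs {
			if requiredTables[id] && !prev[id] && !introduced[id] {
				return fmt.Errorf("table %d is missing an introduced span in backup %d", id, i)
			}
		}
	}
	return nil
}

func main() {
	// Backup 1 picks up table 52 (it was offline, hence absent, during backup 0)
	// but never introduces its span: the corrupt-chain case this patch now
	// rejects at restore planning time.
	chain := []toyBackup{
		{TableIDs: []int{51}},
		{TableIDs: []int{51, 52}},
	}
	fmt.Println(checkChain(map[int]bool{52: true}, chain))
}
```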

Release note (bug fix): fixes part of Tech Advisory
https://cockroachlabs.atlassian.net/browse/TSE-198

Release justification: bug fix
msbutler committed Oct 6, 2022
1 parent 058cd57 commit 6b2b45d
Showing 71 changed files with 515 additions and 27 deletions.
66 changes: 56 additions & 10 deletions pkg/ccl/backupccl/backup_planning.go
@@ -87,6 +87,21 @@ var featureBackupEnabled = settings.RegisterBoolSetting(
featureflag.FeatureFlagEnabledDefault,
).WithPublic()

// includeTableSpans returns true if the backup should include spans for the
// given table descriptor.
func includeTableSpans(table *descpb.TableDescriptor) bool {
// We do not backup spans for views here as they do not contain data.
//
// Additionally, because we do not split ranges at view boundaries, it is
// possible that the range the view span belongs to is shared by another
// object in the cluster (that we may or may not be backing up) that might
// have its own bespoke zone configurations, namely one with a short GC TTL.
// This could lead to a situation where the short GC TTL on the range we are
// not backing up causes our protectedts verification to fail when attempting
// to backup the view span.
return table.IsPhysicalTable()
}

// forEachPublicIndexTableSpan constructs a span for each public index of the
// provided table and runs the given function on each of them. The added map is
// used to track duplicates. Duplicate indexes are not passed to the provided
@@ -97,16 +112,7 @@ func forEachPublicIndexTableSpan(
codec keys.SQLCodec,
f func(span roachpb.Span),
) {
// We do not backup spans for views here as they do not contain data.
//
// Additionally, because we do not split ranges at view boundaries, it is
// possible that the range the view span belongs to is shared by another
// object in the cluster (that we may or may not be backing up) that might
// have its own bespoke zone configurations, namely one with a short GC TTL.
// This could lead to a situation where the short GC TTL on the range we are
// not backing up causes our protectedts verification to fail when attempting
// to backup the view span.
if !table.IsPhysicalTable() {
if !includeTableSpans(table) {
return
}

@@ -1022,8 +1028,27 @@ func getReintroducedSpans(
) ([]roachpb.Span, error) {
reintroducedTables := make(map[descpb.ID]struct{})

// First, create a map that indicates which tables from the previous backup
// were offline when the last backup was taken. To create this map, we must
// iterate two fields in the _last_ backup's manifest:
//
// 1. manifest.Descriptors contains a list of descriptors _explicitly_
// included in the backup, gathered at backup startTime. This includes all
// public descriptors, and in the case of cluster backups, offline
// descriptors.
//
// 2. manifest.DescriptorChanges contains a list of descriptor changes tracked
// in the backup. While investigating #88042, it was discovered that during
// revision history database and table.* backups, a table can get included in
// manifest.DescriptorChanges when it transitions from an online to offline
// state, causing its spans to get backed up. However, if this table was
// offline when the backup began, it's excluded from manifest.Descriptors.
// Therefore, to find all descriptors covered in the backup that were offline
// at backup time, we must find all tables in manifest.DescriptorChanges whose
// last change brought the table offline.
offlineInLastBackup := make(map[descpb.ID]struct{})
lastBackup := prevBackups[len(prevBackups)-1]

for _, desc := range lastBackup.Descriptors {
// TODO(pbardea): Also check that lastWriteTime is set once those are
// populated on the table descriptor.
@@ -1032,6 +1057,27 @@
}
}

// Gather all the descriptors that were offline at the endTime of the last
// backup, according to the backup's descriptor changes. If the last descriptor
// change in the previous backup interval put the table offline, then that
// table was offline at the endTime of the last backup.
latestTableDescChangeInLastBackup := make(map[descpb.ID]*descpb.TableDescriptor)
for _, rev := range lastBackup.DescriptorChanges {
if table, _, _, _, _ := descpb.FromDescriptor(rev.Desc); table != nil {
if trackedRev, ok := latestTableDescChangeInLastBackup[table.GetID()]; !ok {
latestTableDescChangeInLastBackup[table.GetID()] = table
} else if trackedRev.Version < table.Version {
latestTableDescChangeInLastBackup[table.GetID()] = table
}
}
}

for _, table := range latestTableDescChangeInLastBackup {
if table.Offline() {
offlineInLastBackup[table.GetID()] = struct{}{}
}
}

// If the table was offline in the last backup, but becomes PUBLIC, then it
// needs to be re-included since we may have missed non-transactional writes.
tablesToReinclude := make([]catalog.TableDescriptor, 0)
5 changes: 5 additions & 0 deletions pkg/ccl/backupccl/restore_planning.go
@@ -1589,6 +1589,11 @@ func doRestorePlan(
"use SHOW BACKUP to find correct targets")
}

if err := checkMissingIntroducedSpans(sqlDescs, mainBackupManifests, endTime,
p.ExecCfg().Codec); err != nil {
return err
}

var revalidateIndexes []jobspb.RestoreDetails_RevalidateIndex
for _, desc := range sqlDescs {
tbl, ok := desc.(catalog.TableDescriptor)
36 changes: 19 additions & 17 deletions pkg/ccl/backupccl/restore_span_covering_test.go
@@ -194,20 +194,8 @@ const noSpanTargetSize = 0
func TestRestoreEntryCoverExample(t *testing.T) {
defer leaktest.AfterTest(t)()

mockTenantID := roachpb.MakeTenantID(10)
mockTenantPrefix := keys.MakeTenantPrefix(mockTenantID)

sp := func(start, end string) roachpb.Span {
// During makeSimpleImportSpans, a reIntroducedTableID is extracted from each passed in
// span, unless the span contains a tenantPrefix. If the span contains
// neither a tenant prefix nor a table prefix, makeSimpleImportSpans will
// return an error. Since the input spans in this test do not contain a
// table prefix, add a mock tenant prefix to each span to prevent spurious
// errors in makeSimpleImportSpans.
sp, err := keys.RewriteSpanToTenantPrefix(roachpb.Span{Key: roachpb.Key(start),
EndKey: roachpb.Key(end)}, mockTenantPrefix)
require.NoError(t, err)
return sp
return roachpb.Span{Key: roachpb.Key(start), EndKey: roachpb.Key(end)}
}
f := func(start, end, path string) backuppb.BackupManifest_File {
return backuppb.BackupManifest_File{Span: sp(start, end), Path: path}
@@ -239,10 +227,10 @@ func TestRestoreEntryCoverExample(t *testing.T) {
}
}

introducedSpanFrontier, err := createIntroducedSpanFrontier(backups, hlc.Timestamp{})
emptySpanFrontier, err := spanUtils.MakeFrontier(roachpb.Span{})
require.NoError(t, err)

cover := makeSimpleImportSpans(spans, backups, nil, introducedSpanFrontier, nil, noSpanTargetSize)
cover := makeSimpleImportSpans(spans, backups, nil, emptySpanFrontier, nil, noSpanTargetSize)
require.Equal(t, []execinfrapb.RestoreSpanEntry{
{Span: sp("a", "c"), Files: paths("1", "4", "6")},
{Span: sp("c", "e"), Files: paths("2", "4", "6")},
@@ -251,12 +239,26 @@
{Span: sp("l", "m"), Files: paths("9")},
}, cover)

coverSized := makeSimpleImportSpans(spans, backups, nil, introducedSpanFrontier, nil, 2<<20)
coverSized := makeSimpleImportSpans(spans, backups, nil, emptySpanFrontier, nil, 2<<20)
require.Equal(t, []execinfrapb.RestoreSpanEntry{
{Span: sp("a", "f"), Files: paths("1", "2", "4", "6")},
{Span: sp("f", "i"), Files: paths("3", "5", "6", "8")},
{Span: sp("l", "m"), Files: paths("9")},
}, coverSized)

// Check that introduced spans are properly elided.
backups[2].IntroducedSpans = []roachpb.Span{sp("a", "f")}
introducedSpanFrontier, err := createIntroducedSpanFrontier(backups, hlc.Timestamp{})
require.NoError(t, err)

coverIntroduced := makeSimpleImportSpans(spans, backups, nil, introducedSpanFrontier, nil,
noSpanTargetSize)
require.Equal(t, []execinfrapb.RestoreSpanEntry{
{Span: sp("a", "f"), Files: paths("6")},
{Span: sp("f", "i"), Files: paths("3", "5", "6", "8")},
{Span: sp("l", "m"), Files: paths("9")},
}, coverIntroduced)

}

type mockBackupInfo struct {
@@ -462,7 +464,7 @@ func TestRestoreEntryCoverReIntroducedSpans(t *testing.T) {
for _, reIntroTable := range reIntroducedTables {
var coveredReIntroducedGroup roachpb.SpanGroup
for _, entry := range cover {
// If a restoreSpanEntry overlaps with re-introduced table span,
// If a restoreSpanEntry overlaps with re-introduced span,
// assert the entry only contains files from the incremental backup.
if reIntroTable.TableSpan(codec).Overlaps(entry.Span) {
coveredReIntroducedGroup.Add(entry.Span)
116 changes: 116 additions & 0 deletions pkg/ccl/backupccl/targets.go
@@ -332,6 +332,122 @@ func fullClusterTargetsBackup(
return fullClusterDescs, fullClusterDBIDs, nil
}

// checkMissingIntroducedSpans asserts that each backup's IntroducedSpans
// contain all tables that require an introduction (i.e. a backup from ts=0), if
// they are also restore targets. This specifically entails checking the
// following invariant on each table we seek to restore: if the nth backup
// contains an online table in its backupManifest.Descriptors field or
// backupManifest.DescriptorChanges field but the n-1th does not contain it in
// its Descriptors field, then that table must be covered by the nth backup's
// manifest.IntroducedSpans field. Note: this check assumes that
// mainBackupManifests are sorted by EndTime and only applies to backups with a
// start time that is less than the restore AOST.
func checkMissingIntroducedSpans(
restoringDescs []catalog.Descriptor,
mainBackupManifests []backuppb.BackupManifest,
endTime hlc.Timestamp,
codec keys.SQLCodec,
) error {
// Gather the tables we'd like to restore.
requiredTables := make(map[descpb.ID]struct{})
for _, restoringDesc := range restoringDescs {
if _, ok := restoringDesc.(catalog.TableDescriptor); ok {
requiredTables[restoringDesc.GetID()] = struct{}{}
}
}

for i, b := range mainBackupManifests {
if !endTime.IsEmpty() && endTime.Less(b.StartTime) {
// No need to check a backup that began after the restore AOST.
return nil
}

if i == 0 {
// The full backup does not contain reintroduced spans.
continue
}

// Gather the tables included in the previous backup.
prevTables := make(map[descpb.ID]struct{})
for _, desc := range mainBackupManifests[i-1].Descriptors {
if table, _, _, _, _ := descpb.FromDescriptor(&desc); table != nil {
prevTables[table.GetID()] = struct{}{}
}
}

// Gather the tables that were reintroduced in the current backup (i.e.
// backed up from ts=0).
tablesIntroduced := make(map[descpb.ID]struct{})
for _, span := range mainBackupManifests[i].IntroducedSpans {
_, tableID, err := codec.DecodeTablePrefix(span.Key)
if err != nil {
return err
}
tablesIntroduced[descpb.ID(tableID)] = struct{}{}
}

requiredIntroduction := func(table *descpb.TableDescriptor) error {
if _, required := requiredTables[table.GetID()]; !required {
// The table is not required for the restore, so there's no need to check coverage.
return nil
}

if !includeTableSpans(table) {
return nil
}

if _, introduced := tablesIntroduced[table.GetID()]; introduced {
// The table was introduced, thus the table is properly covered at
// this point in the backup chain.
return nil
}

if _, inPrevBackup := prevTables[table.GetID()]; inPrevBackup {
// The table was included in the previous backup, which implies the
// table was online at prevBackup.EndTime, and therefore does not need
// to be introduced.
return nil
}
tableError := errors.Newf("table %q cannot be safely restored from this backup."+
" This backup is affected by issue #88042, which produced incorrect backups after an IMPORT."+
" To continue the restore, you can either:"+
" 1) restore to a system time before the import completed, %v;"+
" 2) restore with a newer backup chain (a full backup [+ incrementals])"+
" taken after the current backup target;"+
" 3) or remove table %v from the restore targets.",
table.Name, mainBackupManifests[i].StartTime.GoTime().String(), table.Name)
return errors.WithIssueLink(tableError, errors.IssueLink{
IssueURL: "https://www.cockroachlabs.com/docs/advisories/a88042",
Detail: `An incremental database backup with revision history can incorrectly backup data for a table
that was running an IMPORT at the time of the previous incremental in this chain of backups.`,
})
}

// If the table in the current backup was not in prevBackup.Descriptors,
// then it needs to be introduced.
for _, desc := range mainBackupManifests[i].Descriptors {
if table, _, _, _, _ := descpb.FromDescriptor(&desc); table != nil {
if err := requiredIntroduction(table); err != nil {
return err
}
}
}

// As described in the docstring for getReintroducedSpans(), there are cases
// where a descriptor may appear in manifest.DescriptorChanges but not
// manifest.Descriptors. If a descriptor is online at any moment during the
// backup interval, it needs to be reintroduced.
for _, desc := range mainBackupManifests[i].DescriptorChanges {
if table, _, _, _, _ := descpb.FromDescriptor(desc.Desc); table != nil && table.Public() {
if err := requiredIntroduction(table); err != nil {
return err
}
}
}
}
return nil
}

// selectTargets loads all descriptors from the selected backup manifest(s),
// filters the descriptors based on the targets specified in the restore, and
// calculates the max descriptor ID in the backup.