Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backupccl: offline table data in revision history backups can leak into restored cluster #88042

Closed
msbutler opened this issue Sep 16, 2022 · 2 comments · Fixed by #89981
Closed
Assignees
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery

Comments

@msbutler
Copy link
Collaborator

msbutler commented Sep 16, 2022

Before 22.2, backups were assumed to exclude data from offline, importing, tables; however, backups with revision history will contain offline table data because getRelevantDescChanges will include offline table descriptors contained in the target database(s). Note that it makes sense to include the offline table in the backup to ensure the user can conduct a RESTORE AOST=beforeImportStartTime, which would restore the importing table to it's pre-import state. However, the inclusion of offline table data in the backup can also lead to corrupt data on restore.

Consider the following sequences:

Sequence 1: with a rolled back import
t0: begin IMPORT on foo
t1: backup foo with revision history - captures foo's pre-import state and some importing data
t2: rollback import foo via non-mvcc clear range
t3: incremental backup foo with revision history

  • fails to reintroduce foo

t4: restore foo to latest time

  • b/c of the non-mvcc clear range, the incremental backup is completely naive to the rollback, thus, the importing data will get restored.

Sequence 2: with a completed import
t0: begin IMPORT on foo
t1: backup foo with revision history - captures foo's pre-import state and some importing data
t2: complete IMPORT on foo
t3: incremental backup foo with revision history

  • fails to reintroduce foo. if the IMPORT used non-mvcc AddSSTable, then this incremental backup could have missed
    spans from the completed import, leading to data loss.

t4: restore foo to latest time

  • foo could get restored with some of the imported data

Important note: in either scenario, if another incremental backup completed between t0 and t2, the backup/restore would work just fine. I.e. if an incremental backup observed the table offline at the start and end of its interval, there's no bug.

This bug closely relates to #87305 which is also apparent in 22.2, except this one also manifests on earlier releases and only for backups with revision history. Further, this bug is actually worse than #87305, because here, the incremental backup at t3 does not reintroduce foo's spans, rendering the backup unrestorable, and currently, with undetectable data corruption.

Here's the root cause:

  • foo only appears in the revs input to getReintroducedSpans, not in tables because tables is created from prevBackup.Descriptors but as described above, foo is excluded from this field. This matters because when we look through revs to add to tablesToReinclude and reintroducedTables, we only add a rev if it was already in the offlineInLastBackup map which is constructed with the tables variable.

The implications:

  • This backup chain cannot get restored to a valid state because we did not reintroduce foo.
  • The next full backup that runs will be fine.
  • This bug should not affect a backup chain that has been taken fully on 22.2, where we backup all offline spans, but could affect a chain with backups taken on earlier versions.

Jira issue: CRDB-19657

@msbutler msbutler added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery T-disaster-recovery labels Sep 16, 2022
@msbutler msbutler self-assigned this Sep 16, 2022
msbutler added a commit to msbutler/cockroach that referenced this issue Sep 19, 2022
Currently RESTORE may restore invalid backup data from a backed up table that
underwent an IMPORT rollback. See cockroachdb#87305 for a detailed explanation.

This patch ensures that RESTORE elides older backup data that were deleted via
a non-MVCC operation. Because incremental backups always reintroduce spans
(i.e. backs them up from timestamp 0) that may have undergone a non-mvcc
operation, restore can identify restoring spans with potentially corrupt data
in the backup chain and only ingest the spans' reintroduced data to any system
time, without the corrupt data.

Here's the basic impliemenation in Restore:
- For each table we want to restore
   - identify the last time, l, the table was re-introduced, using the manifests
   - dont restore the table using a backup if backup.EndTime < l

This implementation rests on the following assumption: the input spans for each
restoration flow (created in createImportingDescriptors) and the
restoreSpanEntries (created by makeSimpleImportSpans) do not span across
multiple tables. Given this assumption, makeSimpleImportSpans skips adding
files from a backups for a given input span that was reintroduced in a
subsequent backup.

It's worth noting that all significant refactoring occurs on code run by
the restore coordinator; therefore, no special care needs to be taken for
mixed / cross version backups. In other words, if the coordinator has updated,
the cluster restores properly; else, the bug will exist on the restored cluster.
It's also worth noting that other forms of this bug are apparent on older
cluster versions (cockroachdb#88042, cockroachdb#88043) and has not been noticed by customers; thus,
there is no need to fail a mixed version restore to protect the customer from
this already existing bug.

Fixes cockroachdb#87305

Release justification: bug fix

Release note: none
msbutler added a commit to msbutler/cockroach that referenced this issue Sep 21, 2022
Currently RESTORE may restore invalid backup data from a backed up table that
underwent an IMPORT rollback. See cockroachdb#87305 for a detailed explanation.

This patch ensures that RESTORE elides older backup data that were deleted via
a non-MVCC operation. Because incremental backups always reintroduce spans
(i.e. backs them up from timestamp 0) that may have undergone a non-mvcc
operation, restore can identify restoring spans with potentially corrupt data
in the backup chain and only ingest the spans' reintroduced data to any system
time, without the corrupt data.

Here's the basic impliemenation in Restore:
- For each span we want to restore
   - identify the last time, l, the span was introduced, using the manifests
   - dont restore the span using a backup if backup.EndTime < l

This implementation rests on the following assumption: the input spans for each
restoration flow (created in createImportingDescriptors) and the
restoreSpanEntries (created by makeSimpleImportSpans) do not span across
multiple tables. Given this assumption, makeSimpleImportSpans skips adding
files from a backups for a given input span that was reintroduced in a
subsequent backup.

It's worth noting that all significant refactoring occurs on code run by
the restore coordinator; therefore, no special care needs to be taken for
mixed / cross version backups. In other words, if the coordinator has updated,
the cluster restores properly; else, the bug will exist on the restored cluster.
It's also worth noting that other forms of this bug are apparent on older
cluster versions (cockroachdb#88042, cockroachdb#88043) and has not been noticed by customers; thus,
there is no need to fail a mixed version restore to protect the customer from
this already existing bug.

Informs cockroachdb#87305

Release justification: bug fix

Release note: fix for TA advisory
https://cockroachlabs.atlassian.net/browse/TSE-198
msbutler added a commit to msbutler/cockroach that referenced this issue Sep 21, 2022
Currently RESTORE may restore invalid backup data from a backed up table that
underwent an IMPORT rollback. See cockroachdb#87305 for a detailed explanation.

This patch ensures that RESTORE elides older backup data that were deleted via
a non-MVCC operation. Because incremental backups always reintroduce spans
(i.e. backs them up from timestamp 0) that may have undergone a non-mvcc
operation, restore can identify restoring spans with potentially corrupt data
in the backup chain and only ingest the spans' reintroduced data to any system
time, without the corrupt data.

Here's the basic impliemenation in Restore:
- For each span we want to restore
   - identify the last time, l, the span was introduced, using the manifests
   - dont restore the span using a backup if backup.EndTime < l

This implementation rests on the following assumption: the input spans for each
restoration flow (created in createImportingDescriptors) and the
restoreSpanEntries (created by makeSimpleImportSpans) do not span across
multiple tables. Given this assumption, makeSimpleImportSpans skips adding
files from a backups for a given input span that was reintroduced in a
subsequent backup.

It's worth noting that all significant refactoring occurs on code run by
the restore coordinator; therefore, no special care needs to be taken for
mixed / cross version backups. In other words, if the coordinator has updated,
the cluster restores properly; else, the bug will exist on the restored cluster.
It's also worth noting that other forms of this bug are apparent on older
cluster versions (cockroachdb#88042, cockroachdb#88043) and has not been noticed by customers; thus,
there is no need to fail a mixed version restore to protect the customer from
this already existing bug.

Informs cockroachdb#87305

Release justification: bug fix

Release note (bug fix): fix for TA advisory
https://cockroachlabs.atlassian.net/browse/TSE-198
craig bot pushed a commit that referenced this issue Sep 22, 2022
87312: backupccl: elide spans from backups that were subsequently reintroduced r=dt,adityamaru a=msbutler

Currently RESTORE may restore invalid backup data from a backed up table that
underwent an IMPORT rollback. See #87305 for a detailed explanation.

This patch ensures that RESTORE elides older backup data that were deleted via
a non-MVCC operation. Because incremental backups always reintroduce spans
(i.e. backs them up from timestamp 0) that may have undergone a non-mvcc
operation, restore can identify restoring spans with potentially corrupt data
in the backup chain and only ingest the spans' reintroduced data to any system
time, without the corrupt data.

Here's the basic impliemenation in Restore:
- For each span we want to restore
   - identify the last time, l, the span was introduced, using the manifests
   - dont restore the span using a backup if backup.EndTime < l

This implementation rests on the following assumption: the input spans for each
restoration flow (created in createImportingDescriptors) and the
restoreSpanEntries (created by makeSimpleImportSpans) do not span across
multiple tables. Given this assumption, makeSimpleImportSpans skips adding
files from a backups for a given input span that was reintroduced in a
subsequent backup.

It's worth noting that all significant refactoring occurs on code run by
the restore coordinator; therefore, no special care needs to be taken for
mixed / cross version backups. In other words, if the coordinator has updated,
the cluster restores properly; else, the bug will exist on the restored cluster.
It's also worth noting that other forms of this bug are apparent on older
cluster versions (#88042, #88043) and has not been noticed by customers; thus,
there is no need to fail a mixed version restore to protect the customer from
this already existing bug.

Informs #87305

Release justification: bug fix

Release note: fix for TA advisory
https://cockroachlabs.atlassian.net/browse/TSE-198

88384: server: return elapsed time for active executions r=xinhaoz a=xinhaoz

Previously, we calculated the time elapsed for an active stmt or txn based on the start time returned from the server and the time the response was last received. Calculating this value on the client is not reliable and can lead to negative values when the server time is slightly ahead. This commit fixes this issue by including the time elapsed as part of the active txns and stmts response.

Release note (bug fix): time elapsed for active txns and stmts is never negative.

88449: kvserver: fix flaky test for consistency checks r=erikgrinaker a=pavelkalinnikov

There was a race in selecting between a canceled context.Done and 0-time timer.

Fixes #88133

Release justification: flaky test fix
Release note: None

Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Sep 22, 2022
Currently RESTORE may restore invalid backup data from a backed up table that
underwent an IMPORT rollback. See #87305 for a detailed explanation.

This patch ensures that RESTORE elides older backup data that were deleted via
a non-MVCC operation. Because incremental backups always reintroduce spans
(i.e. backs them up from timestamp 0) that may have undergone a non-mvcc
operation, restore can identify restoring spans with potentially corrupt data
in the backup chain and only ingest the spans' reintroduced data to any system
time, without the corrupt data.

Here's the basic impliemenation in Restore:
- For each span we want to restore
   - identify the last time, l, the span was introduced, using the manifests
   - dont restore the span using a backup if backup.EndTime < l

This implementation rests on the following assumption: the input spans for each
restoration flow (created in createImportingDescriptors) and the
restoreSpanEntries (created by makeSimpleImportSpans) do not span across
multiple tables. Given this assumption, makeSimpleImportSpans skips adding
files from a backups for a given input span that was reintroduced in a
subsequent backup.

It's worth noting that all significant refactoring occurs on code run by
the restore coordinator; therefore, no special care needs to be taken for
mixed / cross version backups. In other words, if the coordinator has updated,
the cluster restores properly; else, the bug will exist on the restored cluster.
It's also worth noting that other forms of this bug are apparent on older
cluster versions (#88042, #88043) and has not been noticed by customers; thus,
there is no need to fail a mixed version restore to protect the customer from
this already existing bug.

Informs #87305

Release justification: bug fix

Release note (bug fix): fix for TA advisory
https://cockroachlabs.atlassian.net/browse/TSE-198
msbutler added a commit to msbutler/cockroach that referenced this issue Sep 23, 2022
…ptorChanges

getReintroducedSpans finds all tables included in the current and previous
backup that may have undergone a non-mvcc operation. The current backup will
then back up these tables' spans from ts = 0, as the previous backup may have
missed certain non-mvcc written.

To find these tables, getReintroducedSpans must find all tables covered in the
previous backup that were in the offline state at previous backup start time.

This function assumed that the previous backup's
manifest.Descriptors field would contain all tables covered in the previous
backup; however, while investigating cockroachdb#88042, we discovered that this assumption
is not correct. During revision history backups, a table with an in-progress
import (e.g. offline at backup time) can get backed up and included in
manifest.DescriptorChanges but not in manifest.Descriptors. This implies that
getReintroducedSpans missed reintroducing spans from this table, implying that
backup chains have missed backing up some data.

This patch fixes getReintroducedSpans to ensure it reintroduces tables included
in manifest.DescriptorChanges whose last revision brought the table offline.

Release note(bug fix): fix apart of Tech Advisory
https://cockroachlabs.atlassian.net/browse/TSE-198

Release justification: bug fix
msbutler added a commit to msbutler/cockroach that referenced this issue Sep 23, 2022
…ptorChanges

getReintroducedSpans finds all tables included in the current and previous
backup that may have undergone a non-mvcc operation. The current backup will
then back up these tables' spans from ts = 0, as the previous backup may have
missed certain non-mvcc written.

To find these tables, getReintroducedSpans must find all tables covered in the
previous backup that were in the offline state at previous backup start time.

This function assumed that the previous backup's
manifest.Descriptors field would contain all tables covered in the previous
backup; however, while investigating cockroachdb#88042, we discovered that this assumption
is not correct. During revision history backups, a table with an in-progress
import (e.g. offline at backup time) can get backed up and included in
manifest.DescriptorChanges but not in manifest.Descriptors. This implies that
getReintroducedSpans missed reintroducing spans from this table, implying that
backup chains have missed backing up some data.

This patch fixes getReintroducedSpans to ensure it reintroduces tables included
in manifest.DescriptorChanges whose last revision brought the table offline.

Release note(bug fix): fix apart of Tech Advisory
https://cockroachlabs.atlassian.net/browse/TSE-198

Release justification: bug fix
msbutler added a commit to msbutler/cockroach that referenced this issue Sep 23, 2022
Currently RESTORE may restore invalid backup data from a backed up table that
underwent an IMPORT rollback. See cockroachdb#87305 for a detailed explanation.

This patch ensures that RESTORE elides older backup data that were deleted via
a non-MVCC operation. Because incremental backups always reintroduce spans
(i.e. backs them up from timestamp 0) that may have undergone a non-mvcc
operation, restore can identify restoring spans with potentially corrupt data
in the backup chain and only ingest the spans' reintroduced data to any system
time, without the corrupt data.

Here's the basic impliemenation in Restore:
- For each span we want to restore
   - identify the last time, l, the span was introduced, using the manifests
   - dont restore the span using a backup if backup.EndTime < l

This implementation rests on the following assumption: the input spans for each
restoration flow (created in createImportingDescriptors) and the
restoreSpanEntries (created by makeSimpleImportSpans) do not span across
multiple tables. Given this assumption, makeSimpleImportSpans skips adding
files from a backups for a given input span that was reintroduced in a
subsequent backup.

It's worth noting that all significant refactoring occurs on code run by
the restore coordinator; therefore, no special care needs to be taken for
mixed / cross version backups. In other words, if the coordinator has updated,
the cluster restores properly; else, the bug will exist on the restored cluster.
It's also worth noting that other forms of this bug are apparent on older
cluster versions (cockroachdb#88042, cockroachdb#88043) and has not been noticed by customers; thus,
there is no need to fail a mixed version restore to protect the customer from
this already existing bug.

Informs cockroachdb#87305

Release justification: bug fix

Release note (bug fix): fix for TA advisory
https://cockroachlabs.atlassian.net/browse/TSE-198
@dt dt added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-21.1 branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 labels Sep 26, 2022
msbutler added a commit to msbutler/cockroach that referenced this issue Sep 27, 2022
…ptorChanges

getReintroducedSpans finds all tables included in the current and previous
backup that may have undergone a non-mvcc operation. The current backup will
then back up these tables' spans from ts = 0, as the previous backup may have
missed certain non-mvcc written.

To find these tables, getReintroducedSpans must find all tables covered in the
previous backup that were in the offline state at previous backup start time.

This function assumed that the previous backup's
manifest.Descriptors field would contain all tables covered in the previous
backup; however, while investigating cockroachdb#88042, we discovered that this assumption
is not correct. During revision history backups, a table with an in-progress
import (e.g. offline at backup time) can get backed up and included in
manifest.DescriptorChanges but not in manifest.Descriptors. This implies that
getReintroducedSpans missed reintroducing spans from this table, implying that
backup chains have missed backing up some data.

This patch fixes getReintroducedSpans to ensure it reintroduces tables included
in manifest.DescriptorChanges whose last revision brought the table offline. In
addition, this patch adds a check to restore planning which causes the restore
to fail immedidiately if any restore target is missing an introduced span.

Release note(bug fix): fix apart of Tech Advisory
https://cockroachlabs.atlassian.net/browse/TSE-198

Release justification: bug fix
msbutler added a commit to msbutler/cockroach that referenced this issue Oct 6, 2022
…ptorChanges

getReintroducedSpans finds all tables included in the current and previous
backup that may have undergone a non-mvcc operation. The current backup will
then back up these tables' spans from ts = 0, as the previous backup may have
missed certain non-mvcc operations (e.g. ClearRange, non-mvcc AddSSTable).

To find these tables, getReintroducedSpans must find all tables covered in the
previous backup that were in the offline state at previous backup start time.

This function assumed that the previous backup's
manifest.Descriptors field would contain all tables covered in the previous
backup; however, while investigating cockroachdb#88042, we discovered that this assumption
is not correct in pre 22.2 branches. During revision history backups, a table
with an in-progress import (e.g. offline at backup time) can get backed up and
included in manifest.DescriptorChanges but not in manifest.Descriptors. This
implies getReintroducedSpans missed reintroducing spans from this table,
implying that backup chains have missed backing up some data.

It's worth noting this bug could only affect a backup chain in 22.2 iff the
backup that observed the table go offline was taken on a 22.1
cluster. Because 22.2 backups explicitly back up all offline tables, all tables
in manifest.DescriptorChanges should also be in manifest.Descriptors. In other
words, before this patch, the following sequence could lead to a corrupt backup
chain:
- t0: begin IMPORT on foo, on a pre 22.2 cluster
- t1: conduct a revision history backup of foo
- t2: complete or cancel the import of foo, via a non-mvcc operation
- t3: upgrade the cluster to 22.2
- t4: conduct a revision history incremental backup of foo
     - foo's span will not get re-introduced

This patch fixes getReintroducedSpans to ensure it reintroduces tables included
in manifest.DescriptorChanges whose last revision brought the table offline. In
addition, this patch adds a check to restore planning which causes the restore
to fail immedidiately if any restore target is missing an introduced span. The
patch tests this check on a corrupt backup taken on v21.2.

Release note(bug fix): fix apart of Tech Advisory
https://cockroachlabs.atlassian.net/browse/TSE-198

Release justification: bug fix
@msbutler
Copy link
Collaborator Author

22.1 release blocker resolved by #88488, #89443 and #89688

@msbutler msbutler removed the branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 label Oct 11, 2022
blathers-crl bot pushed a commit that referenced this issue Oct 11, 2022
…ptorChanges

getReintroducedSpans finds all tables included in the current and previous
backup that may have undergone a non-mvcc operation. The current backup will
then back up these tables' spans from ts = 0, as the previous backup may have
missed certain non-mvcc operations (e.g. ClearRange, non-mvcc AddSSTable).

To find these tables, getReintroducedSpans must find all tables covered in the
previous backup that were in the offline state at previous backup start time.

This function assumed that the previous backup's
manifest.Descriptors field would contain all tables covered in the previous
backup; however, while investigating #88042, we discovered that this assumption
is not correct in pre 22.2 branches. During revision history backups, a table
with an in-progress import (e.g. offline at backup time) can get backed up and
included in manifest.DescriptorChanges but not in manifest.Descriptors. This
implies getReintroducedSpans missed reintroducing spans from this table,
implying that backup chains have missed backing up some data.

It's worth noting this bug could only affect a backup chain in 22.2 iff the
backup that observed the table go offline was taken on a 22.1
cluster. Because 22.2 backups explicitly back up all offline tables, all tables
in manifest.DescriptorChanges should also be in manifest.Descriptors. In other
words, before this patch, the following sequence could lead to a corrupt backup
chain:
- t0: begin IMPORT on foo, on a pre 22.2 cluster
- t1: conduct a revision history backup of foo
- t2: complete or cancel the import of foo, via a non-mvcc operation
- t3: upgrade the cluster to 22.2
- t4: conduct a revision history incremental backup of foo
     - foo's span will not get re-introduced

This patch fixes getReintroducedSpans to ensure it reintroduces tables included
in manifest.DescriptorChanges whose last revision brought the table offline. In
addition, this patch adds a check to restore planning which causes the restore
to fail immedidiately if any restore target is missing an introduced span. The
patch tests this check on a corrupt backup taken on v21.2.

Release note(bug fix): fix apart of Tech Advisory
https://cockroachlabs.atlassian.net/browse/TSE-198

Release justification: bug fix
msbutler added a commit that referenced this issue Oct 12, 2022
…ptorChanges

getReintroducedSpans finds all tables included in the current and previous
backup that may have undergone a non-mvcc operation. The current backup will
then back up these tables' spans from ts = 0, as the previous backup may have
missed certain non-mvcc operations (e.g. ClearRange, non-mvcc AddSSTable).

To find these tables, getReintroducedSpans must find all tables covered in the
previous backup that were in the offline state at previous backup start time.

This function assumed that the previous backup's
manifest.Descriptors field would contain all tables covered in the previous
backup; however, while investigating #88042, we discovered that this assumption
is not correct in pre 22.2 branches. During revision history backups, a table
with an in-progress import (e.g. offline at backup time) can get backed up and
included in manifest.DescriptorChanges but not in manifest.Descriptors. This
implies getReintroducedSpans missed reintroducing spans from this table,
implying that backup chains have missed backing up some data.

It's worth noting this bug could only affect a backup chain in 22.2 iff the
backup that observed the table go offline was taken on a 22.1
cluster. Because 22.2 backups explicitly back up all offline tables, all tables
in manifest.DescriptorChanges should also be in manifest.Descriptors. In other
words, before this patch, the following sequence could lead to a corrupt backup
chain:
- t0: begin IMPORT on foo, on a pre 22.2 cluster
- t1: conduct a revision history backup of foo
- t2: complete or cancel the import of foo, via a non-mvcc operation
- t3: upgrade the cluster to 22.2
- t4: conduct a revision history incremental backup of foo
     - foo's span will not get re-introduced

This patch fixes getReintroducedSpans to ensure it reintroduces tables included
in manifest.DescriptorChanges whose last revision brought the table offline. In
addition, this patch adds a check to restore planning which causes the restore
to fail immedidiately if any restore target is missing an introduced span. The
patch tests this check on a corrupt backup taken on v21.2.

Release note(bug fix): fix apart of Tech Advisory
https://cockroachlabs.atlassian.net/browse/TSE-198

Release justification: bug fix
msbutler added a commit to msbutler/cockroach that referenced this issue Oct 12, 2022
…ptorChanges

getReintroducedSpans finds all tables included in the current and previous
backup that may have undergone a non-mvcc operation. The current backup will
then back up these tables' spans from ts = 0, as the previous backup may have
missed certain non-mvcc written.

To find these tables, getReintroducedSpans must find all tables covered in the
previous backup that were in the offline state at previous backup start time.

This function assumed that the previous backup's
manifest.Descriptors field would contain all tables covered in the previous
backup; however, while investigating cockroachdb#88042, we discovered that this assumption
is not correct. During revision history backups, a table with an in-progress
import (e.g. offline at backup time) can get backed up and included in
manifest.DescriptorChanges but not in manifest.Descriptors. This implies that
getReintroducedSpans missed reintroducing spans from this table, implying that
backup chains have missed backing up some data.

This patch fixes getReintroducedSpans to ensure it reintroduces tables included
in manifest.DescriptorChanges whose last revision brought the table offline. In
addition, this patch adds a check to restore planning which causes the restore
to fail immedidiately if any restore target is missing an introduced span.

Release note(bug fix): fix apart of Tech Advisory
https://cockroachlabs.atlassian.net/browse/TSE-198

Release justification: bug fix
msbutler added a commit to msbutler/cockroach that referenced this issue Oct 12, 2022
Currently RESTORE may restore invalid backup data from a backed up table that
underwent an IMPORT rollback. See cockroachdb#87305 for a detailed explanation.

This patch ensures that RESTORE elides older backup data that were deleted via
a non-MVCC operation. Because incremental backups always reintroduce spans
(i.e. backs them up from timestamp 0) that may have undergone a non-mvcc
operation, restore can identify restoring spans with potentially corrupt data
in the backup chain and only ingest the spans' reintroduced data to any system
time, without the corrupt data.

Here's the basic impliemenation in Restore:
- For each span we want to restore
   - identify the last time, l, the span was introduced, using the manifests
   - dont restore the span using a backup if backup.EndTime < l

This implementation rests on the following assumption: the input spans for each
restoration flow (created in createImportingDescriptors) and the
restoreSpanEntries (created by makeSimpleImportSpans) do not span across
multiple tables. Given this assumption, makeSimpleImportSpans skips adding
files from a backups for a given input span that was reintroduced in a
subsequent backup.

It's worth noting that all significant refactoring occurs on code run by
the restore coordinator; therefore, no special care needs to be taken for
mixed / cross version backups. In other words, if the coordinator has updated,
the cluster restores properly; else, the bug will exist on the restored cluster.
It's also worth noting that other forms of this bug are apparent on older
cluster versions (cockroachdb#88042, cockroachdb#88043) and has not been noticed by customers; thus,
there is no need to fail a mixed version restore to protect the customer from
this already existing bug.

Informs cockroachdb#87305

Release justification: bug fix

Release note (bug fix): fix for TA advisory
https://cockroachlabs.atlassian.net/browse/TSE-198
@msbutler msbutler removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-21.2 labels Oct 12, 2022
@msbutler
Copy link
Collaborator Author

removing remaining release blocker on 21.2, as #89019 has merged

The last related PR is on 22.2.0 which, once merged, will close a related GA blocker #88043

msbutler added a commit to msbutler/cockroach that referenced this issue Oct 14, 2022
…ptorChanges

getReintroducedSpans finds all tables included in the current and previous
backup that may have undergone a non-mvcc operation. The current backup will
then back up these tables' spans from ts = 0, as the previous backup may have
missed certain non-mvcc operations (e.g. ClearRange, non-mvcc AddSSTable).

To find these tables, getReintroducedSpans must find all tables covered in the
previous backup that were in the offline state at previous backup start time.

This function assumed that the previous backup's
manifest.Descriptors field would contain all tables covered in the previous
backup; however, while investigating cockroachdb#88042, we discovered that this assumption
is not correct in pre 22.2 branches. During revision history backups, a table
with an in-progress import (e.g. offline at backup time) can get backed up and
included in manifest.DescriptorChanges but not in manifest.Descriptors. This
implies getReintroducedSpans missed reintroducing spans from this table,
implying that backup chains have missed backing up some data.

It's worth noting this bug could only affect a backup chain in 22.2 iff the
backup that observed the table go offline was taken on a 22.1
cluster. Because 22.2 backups explicitly back up all offline tables, all tables
in manifest.DescriptorChanges should also be in manifest.Descriptors. In other
words, before this patch, the following sequence could lead to a corrupt backup
chain:
- t0: begin IMPORT on foo, on a pre 22.2 cluster
- t1: conduct a revision history backup of foo
- t2: complete or cancel the import of foo, via a non-mvcc operation
- t3: upgrade the cluster to 22.2
- t4: conduct a revision history incremental backup of foo
     - foo's span will not get re-introduced

This patch fixes getReintroducedSpans to ensure it reintroduces tables included
in manifest.DescriptorChanges whose last revision brought the table offline. In
addition, this patch adds a check to restore planning which causes the restore
to fail immedidiately if any restore target is missing an introduced span. The
patch tests this check on a corrupt backup taken on v21.2.

Release note(bug fix): fix apart of Tech Advisory
https://cockroachlabs.atlassian.net/browse/TSE-198

Release justification: bug fix
msbutler added a commit to msbutler/cockroach that referenced this issue Oct 20, 2022
…ptorChanges

getReintroducedSpans finds all tables included in the current and previous
backup that may have undergone a non-mvcc operation. The current backup will
then back up these tables' spans from ts = 0, as the previous backup may have
missed certain non-mvcc operations (e.g. ClearRange, non-mvcc AddSSTable).

To find these tables, getReintroducedSpans must find all tables covered in the
previous backup that were in the offline state at previous backup start time.

This function assumed that the previous backup's
manifest.Descriptors field would contain all tables covered in the previous
backup; however, while investigating cockroachdb#88042, we discovered that this assumption
is not correct in pre 22.2 branches. During revision history backups, a table
with an in-progress import (e.g. offline at backup time) can get backed up and
included in manifest.DescriptorChanges but not in manifest.Descriptors. This
implies getReintroducedSpans missed reintroducing spans from this table,
implying that backup chains have missed backing up some data.

It's worth noting this bug could only affect a backup chain in 22.2 iff the
backup that observed the table go offline was taken on a 22.1
cluster. Because 22.2 backups explicitly back up all offline tables, all tables
in manifest.DescriptorChanges should also be in manifest.Descriptors. In other
words, before this patch, the following sequence could lead to a corrupt backup
chain:
- t0: begin IMPORT on foo, on a pre 22.2 cluster
- t1: conduct a revision history backup of foo
- t2: complete or cancel the import of foo, via a non-mvcc operation
- t3: upgrade the cluster to 22.2
- t4: conduct a revision history incremental backup of foo
     - foo's span will not get re-introduced

This patch fixes getReintroducedSpans to ensure it reintroduces tables included
in manifest.DescriptorChanges whose last revision brought the table offline. In
addition, this patch adds a check to restore planning which causes the restore
to fail immedidiately if any restore target is missing an introduced span. The
patch tests this check on a corrupt backup taken on v21.2.

Release note(bug fix): fix apart of Tech Advisory
https://cockroachlabs.atlassian.net/browse/TSE-198

Release justification: bug fix
@craig craig bot closed this as completed in 72ac655 Oct 20, 2022
adityamaru added a commit to adityamaru/cockroach that referenced this issue Jan 10, 2023
This test was testing a bug that was discovered during
backups in pre-22.2 versions cockroachdb#88042.
Since this bug does not exist on 22.2+ versions
this test will not encounter the expected corruption
on backups that are restoreable in this cluster.

Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Mar 10, 2023
This test was testing a bug that was discovered during
backups in pre-22.2 versions cockroachdb#88042.
Since this bug does not exist on 22.2+ versions
this test will not encounter the expected corruption
on backups that are restoreable in this cluster.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants