-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
backupccl: offline table data in revision history backups can leak into restored cluster #88042
Labels
A-disaster-recovery
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-disaster-recovery
Comments
msbutler
added
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
A-disaster-recovery
T-disaster-recovery
labels
Sep 16, 2022
msbutler
added a commit
to msbutler/cockroach
that referenced
this issue
Sep 19, 2022
Currently RESTORE may restore invalid backup data from a backed up table that underwent an IMPORT rollback. See cockroachdb#87305 for a detailed explanation. This patch ensures that RESTORE elides older backup data that were deleted via a non-MVCC operation. Because incremental backups always reintroduce spans (i.e. backs them up from timestamp 0) that may have undergone a non-mvcc operation, restore can identify restoring spans with potentially corrupt data in the backup chain and only ingest the spans' reintroduced data to any system time, without the corrupt data. Here's the basic impliemenation in Restore: - For each table we want to restore - identify the last time, l, the table was re-introduced, using the manifests - dont restore the table using a backup if backup.EndTime < l This implementation rests on the following assumption: the input spans for each restoration flow (created in createImportingDescriptors) and the restoreSpanEntries (created by makeSimpleImportSpans) do not span across multiple tables. Given this assumption, makeSimpleImportSpans skips adding files from a backups for a given input span that was reintroduced in a subsequent backup. It's worth noting that all significant refactoring occurs on code run by the restore coordinator; therefore, no special care needs to be taken for mixed / cross version backups. In other words, if the coordinator has updated, the cluster restores properly; else, the bug will exist on the restored cluster. It's also worth noting that other forms of this bug are apparent on older cluster versions (cockroachdb#88042, cockroachdb#88043) and has not been noticed by customers; thus, there is no need to fail a mixed version restore to protect the customer from this already existing bug. Fixes cockroachdb#87305 Release justification: bug fix Release note: none
msbutler
added a commit
to msbutler/cockroach
that referenced
this issue
Sep 21, 2022
Currently RESTORE may restore invalid backup data from a backed up table that underwent an IMPORT rollback. See cockroachdb#87305 for a detailed explanation. This patch ensures that RESTORE elides older backup data that were deleted via a non-MVCC operation. Because incremental backups always reintroduce spans (i.e. backs them up from timestamp 0) that may have undergone a non-mvcc operation, restore can identify restoring spans with potentially corrupt data in the backup chain and only ingest the spans' reintroduced data to any system time, without the corrupt data. Here's the basic impliemenation in Restore: - For each span we want to restore - identify the last time, l, the span was introduced, using the manifests - dont restore the span using a backup if backup.EndTime < l This implementation rests on the following assumption: the input spans for each restoration flow (created in createImportingDescriptors) and the restoreSpanEntries (created by makeSimpleImportSpans) do not span across multiple tables. Given this assumption, makeSimpleImportSpans skips adding files from a backups for a given input span that was reintroduced in a subsequent backup. It's worth noting that all significant refactoring occurs on code run by the restore coordinator; therefore, no special care needs to be taken for mixed / cross version backups. In other words, if the coordinator has updated, the cluster restores properly; else, the bug will exist on the restored cluster. It's also worth noting that other forms of this bug are apparent on older cluster versions (cockroachdb#88042, cockroachdb#88043) and has not been noticed by customers; thus, there is no need to fail a mixed version restore to protect the customer from this already existing bug. Informs cockroachdb#87305 Release justification: bug fix Release note: fix for TA advisory https://cockroachlabs.atlassian.net/browse/TSE-198
msbutler
added a commit
to msbutler/cockroach
that referenced
this issue
Sep 21, 2022
Currently RESTORE may restore invalid backup data from a backed up table that underwent an IMPORT rollback. See cockroachdb#87305 for a detailed explanation. This patch ensures that RESTORE elides older backup data that were deleted via a non-MVCC operation. Because incremental backups always reintroduce spans (i.e. backs them up from timestamp 0) that may have undergone a non-mvcc operation, restore can identify restoring spans with potentially corrupt data in the backup chain and only ingest the spans' reintroduced data to any system time, without the corrupt data. Here's the basic impliemenation in Restore: - For each span we want to restore - identify the last time, l, the span was introduced, using the manifests - dont restore the span using a backup if backup.EndTime < l This implementation rests on the following assumption: the input spans for each restoration flow (created in createImportingDescriptors) and the restoreSpanEntries (created by makeSimpleImportSpans) do not span across multiple tables. Given this assumption, makeSimpleImportSpans skips adding files from a backups for a given input span that was reintroduced in a subsequent backup. It's worth noting that all significant refactoring occurs on code run by the restore coordinator; therefore, no special care needs to be taken for mixed / cross version backups. In other words, if the coordinator has updated, the cluster restores properly; else, the bug will exist on the restored cluster. It's also worth noting that other forms of this bug are apparent on older cluster versions (cockroachdb#88042, cockroachdb#88043) and has not been noticed by customers; thus, there is no need to fail a mixed version restore to protect the customer from this already existing bug. Informs cockroachdb#87305 Release justification: bug fix Release note (bug fix): fix for TA advisory https://cockroachlabs.atlassian.net/browse/TSE-198
craig bot
pushed a commit
that referenced
this issue
Sep 22, 2022
87312: backupccl: elide spans from backups that were subsequently reintroduced r=dt,adityamaru a=msbutler Currently RESTORE may restore invalid backup data from a backed up table that underwent an IMPORT rollback. See #87305 for a detailed explanation. This patch ensures that RESTORE elides older backup data that were deleted via a non-MVCC operation. Because incremental backups always reintroduce spans (i.e. backs them up from timestamp 0) that may have undergone a non-mvcc operation, restore can identify restoring spans with potentially corrupt data in the backup chain and only ingest the spans' reintroduced data to any system time, without the corrupt data. Here's the basic impliemenation in Restore: - For each span we want to restore - identify the last time, l, the span was introduced, using the manifests - dont restore the span using a backup if backup.EndTime < l This implementation rests on the following assumption: the input spans for each restoration flow (created in createImportingDescriptors) and the restoreSpanEntries (created by makeSimpleImportSpans) do not span across multiple tables. Given this assumption, makeSimpleImportSpans skips adding files from a backups for a given input span that was reintroduced in a subsequent backup. It's worth noting that all significant refactoring occurs on code run by the restore coordinator; therefore, no special care needs to be taken for mixed / cross version backups. In other words, if the coordinator has updated, the cluster restores properly; else, the bug will exist on the restored cluster. It's also worth noting that other forms of this bug are apparent on older cluster versions (#88042, #88043) and has not been noticed by customers; thus, there is no need to fail a mixed version restore to protect the customer from this already existing bug. Informs #87305 Release justification: bug fix Release note: fix for TA advisory https://cockroachlabs.atlassian.net/browse/TSE-198 88384: server: return elapsed time for active executions r=xinhaoz a=xinhaoz Previously, we calculated the time elapsed for an active stmt or txn based on the start time returned from the server and the time the response was last received. Calculating this value on the client is not reliable and can lead to negative values when the server time is slightly ahead. This commit fixes this issue by including the time elapsed as part of the active txns and stmts response. Release note (bug fix): time elapsed for active txns and stmts is never negative. 88449: kvserver: fix flaky test for consistency checks r=erikgrinaker a=pavelkalinnikov There was a race in selecting between a canceled context.Done and 0-time timer. Fixes #88133 Release justification: flaky test fix Release note: None Co-authored-by: Michael Butler <[email protected]> Co-authored-by: Xin Hao Zhang <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]>
blathers-crl bot
pushed a commit
that referenced
this issue
Sep 22, 2022
Currently RESTORE may restore invalid backup data from a backed up table that underwent an IMPORT rollback. See #87305 for a detailed explanation. This patch ensures that RESTORE elides older backup data that were deleted via a non-MVCC operation. Because incremental backups always reintroduce spans (i.e. backs them up from timestamp 0) that may have undergone a non-mvcc operation, restore can identify restoring spans with potentially corrupt data in the backup chain and only ingest the spans' reintroduced data to any system time, without the corrupt data. Here's the basic impliemenation in Restore: - For each span we want to restore - identify the last time, l, the span was introduced, using the manifests - dont restore the span using a backup if backup.EndTime < l This implementation rests on the following assumption: the input spans for each restoration flow (created in createImportingDescriptors) and the restoreSpanEntries (created by makeSimpleImportSpans) do not span across multiple tables. Given this assumption, makeSimpleImportSpans skips adding files from a backups for a given input span that was reintroduced in a subsequent backup. It's worth noting that all significant refactoring occurs on code run by the restore coordinator; therefore, no special care needs to be taken for mixed / cross version backups. In other words, if the coordinator has updated, the cluster restores properly; else, the bug will exist on the restored cluster. It's also worth noting that other forms of this bug are apparent on older cluster versions (#88042, #88043) and has not been noticed by customers; thus, there is no need to fail a mixed version restore to protect the customer from this already existing bug. Informs #87305 Release justification: bug fix Release note (bug fix): fix for TA advisory https://cockroachlabs.atlassian.net/browse/TSE-198
msbutler
added a commit
to msbutler/cockroach
that referenced
this issue
Sep 23, 2022
…ptorChanges getReintroducedSpans finds all tables included in the current and previous backup that may have undergone a non-mvcc operation. The current backup will then back up these tables' spans from ts = 0, as the previous backup may have missed certain non-mvcc written. To find these tables, getReintroducedSpans must find all tables covered in the previous backup that were in the offline state at previous backup start time. This function assumed that the previous backup's manifest.Descriptors field would contain all tables covered in the previous backup; however, while investigating cockroachdb#88042, we discovered that this assumption is not correct. During revision history backups, a table with an in-progress import (e.g. offline at backup time) can get backed up and included in manifest.DescriptorChanges but not in manifest.Descriptors. This implies that getReintroducedSpans missed reintroducing spans from this table, implying that backup chains have missed backing up some data. This patch fixes getReintroducedSpans to ensure it reintroduces tables included in manifest.DescriptorChanges whose last revision brought the table offline. Release note(bug fix): fix apart of Tech Advisory https://cockroachlabs.atlassian.net/browse/TSE-198 Release justification: bug fix
msbutler
added a commit
to msbutler/cockroach
that referenced
this issue
Sep 23, 2022
…ptorChanges getReintroducedSpans finds all tables included in the current and previous backup that may have undergone a non-mvcc operation. The current backup will then back up these tables' spans from ts = 0, as the previous backup may have missed certain non-mvcc written. To find these tables, getReintroducedSpans must find all tables covered in the previous backup that were in the offline state at previous backup start time. This function assumed that the previous backup's manifest.Descriptors field would contain all tables covered in the previous backup; however, while investigating cockroachdb#88042, we discovered that this assumption is not correct. During revision history backups, a table with an in-progress import (e.g. offline at backup time) can get backed up and included in manifest.DescriptorChanges but not in manifest.Descriptors. This implies that getReintroducedSpans missed reintroducing spans from this table, implying that backup chains have missed backing up some data. This patch fixes getReintroducedSpans to ensure it reintroduces tables included in manifest.DescriptorChanges whose last revision brought the table offline. Release note(bug fix): fix apart of Tech Advisory https://cockroachlabs.atlassian.net/browse/TSE-198 Release justification: bug fix
msbutler
added a commit
to msbutler/cockroach
that referenced
this issue
Sep 23, 2022
Currently RESTORE may restore invalid backup data from a backed up table that underwent an IMPORT rollback. See cockroachdb#87305 for a detailed explanation. This patch ensures that RESTORE elides older backup data that were deleted via a non-MVCC operation. Because incremental backups always reintroduce spans (i.e. backs them up from timestamp 0) that may have undergone a non-mvcc operation, restore can identify restoring spans with potentially corrupt data in the backup chain and only ingest the spans' reintroduced data to any system time, without the corrupt data. Here's the basic impliemenation in Restore: - For each span we want to restore - identify the last time, l, the span was introduced, using the manifests - dont restore the span using a backup if backup.EndTime < l This implementation rests on the following assumption: the input spans for each restoration flow (created in createImportingDescriptors) and the restoreSpanEntries (created by makeSimpleImportSpans) do not span across multiple tables. Given this assumption, makeSimpleImportSpans skips adding files from a backups for a given input span that was reintroduced in a subsequent backup. It's worth noting that all significant refactoring occurs on code run by the restore coordinator; therefore, no special care needs to be taken for mixed / cross version backups. In other words, if the coordinator has updated, the cluster restores properly; else, the bug will exist on the restored cluster. It's also worth noting that other forms of this bug are apparent on older cluster versions (cockroachdb#88042, cockroachdb#88043) and has not been noticed by customers; thus, there is no need to fail a mixed version restore to protect the customer from this already existing bug. Informs cockroachdb#87305 Release justification: bug fix Release note (bug fix): fix for TA advisory https://cockroachlabs.atlassian.net/browse/TSE-198
dt
added
release-blocker
Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
branch-release-21.1
branch-release-22.1
Used to mark GA and release blockers, technical advisories, and bugs for 22.1
labels
Sep 26, 2022
msbutler
added a commit
to msbutler/cockroach
that referenced
this issue
Sep 27, 2022
…ptorChanges getReintroducedSpans finds all tables included in the current and previous backup that may have undergone a non-mvcc operation. The current backup will then back up these tables' spans from ts = 0, as the previous backup may have missed certain non-mvcc written. To find these tables, getReintroducedSpans must find all tables covered in the previous backup that were in the offline state at previous backup start time. This function assumed that the previous backup's manifest.Descriptors field would contain all tables covered in the previous backup; however, while investigating cockroachdb#88042, we discovered that this assumption is not correct. During revision history backups, a table with an in-progress import (e.g. offline at backup time) can get backed up and included in manifest.DescriptorChanges but not in manifest.Descriptors. This implies that getReintroducedSpans missed reintroducing spans from this table, implying that backup chains have missed backing up some data. This patch fixes getReintroducedSpans to ensure it reintroduces tables included in manifest.DescriptorChanges whose last revision brought the table offline. In addition, this patch adds a check to restore planning which causes the restore to fail immedidiately if any restore target is missing an introduced span. Release note(bug fix): fix apart of Tech Advisory https://cockroachlabs.atlassian.net/browse/TSE-198 Release justification: bug fix
msbutler
added a commit
to msbutler/cockroach
that referenced
this issue
Oct 6, 2022
…ptorChanges getReintroducedSpans finds all tables included in the current and previous backup that may have undergone a non-mvcc operation. The current backup will then back up these tables' spans from ts = 0, as the previous backup may have missed certain non-mvcc operations (e.g. ClearRange, non-mvcc AddSSTable). To find these tables, getReintroducedSpans must find all tables covered in the previous backup that were in the offline state at previous backup start time. This function assumed that the previous backup's manifest.Descriptors field would contain all tables covered in the previous backup; however, while investigating cockroachdb#88042, we discovered that this assumption is not correct in pre 22.2 branches. During revision history backups, a table with an in-progress import (e.g. offline at backup time) can get backed up and included in manifest.DescriptorChanges but not in manifest.Descriptors. This implies getReintroducedSpans missed reintroducing spans from this table, implying that backup chains have missed backing up some data. It's worth noting this bug could only affect a backup chain in 22.2 iff the backup that observed the table go offline was taken on a 22.1 cluster. Because 22.2 backups explicitly back up all offline tables, all tables in manifest.DescriptorChanges should also be in manifest.Descriptors. In other words, before this patch, the following sequence could lead to a corrupt backup chain: - t0: begin IMPORT on foo, on a pre 22.2 cluster - t1: conduct a revision history backup of foo - t2: complete or cancel the import of foo, via a non-mvcc operation - t3: upgrade the cluster to 22.2 - t4: conduct a revision history incremental backup of foo - foo's span will not get re-introduced This patch fixes getReintroducedSpans to ensure it reintroduces tables included in manifest.DescriptorChanges whose last revision brought the table offline. In addition, this patch adds a check to restore planning which causes the restore to fail immedidiately if any restore target is missing an introduced span. The patch tests this check on a corrupt backup taken on v21.2. Release note(bug fix): fix apart of Tech Advisory https://cockroachlabs.atlassian.net/browse/TSE-198 Release justification: bug fix
msbutler
removed
the
branch-release-22.1
Used to mark GA and release blockers, technical advisories, and bugs for 22.1
label
Oct 11, 2022
blathers-crl bot
pushed a commit
that referenced
this issue
Oct 11, 2022
…ptorChanges getReintroducedSpans finds all tables included in the current and previous backup that may have undergone a non-mvcc operation. The current backup will then back up these tables' spans from ts = 0, as the previous backup may have missed certain non-mvcc operations (e.g. ClearRange, non-mvcc AddSSTable). To find these tables, getReintroducedSpans must find all tables covered in the previous backup that were in the offline state at previous backup start time. This function assumed that the previous backup's manifest.Descriptors field would contain all tables covered in the previous backup; however, while investigating #88042, we discovered that this assumption is not correct in pre 22.2 branches. During revision history backups, a table with an in-progress import (e.g. offline at backup time) can get backed up and included in manifest.DescriptorChanges but not in manifest.Descriptors. This implies getReintroducedSpans missed reintroducing spans from this table, implying that backup chains have missed backing up some data. It's worth noting this bug could only affect a backup chain in 22.2 iff the backup that observed the table go offline was taken on a 22.1 cluster. Because 22.2 backups explicitly back up all offline tables, all tables in manifest.DescriptorChanges should also be in manifest.Descriptors. In other words, before this patch, the following sequence could lead to a corrupt backup chain: - t0: begin IMPORT on foo, on a pre 22.2 cluster - t1: conduct a revision history backup of foo - t2: complete or cancel the import of foo, via a non-mvcc operation - t3: upgrade the cluster to 22.2 - t4: conduct a revision history incremental backup of foo - foo's span will not get re-introduced This patch fixes getReintroducedSpans to ensure it reintroduces tables included in manifest.DescriptorChanges whose last revision brought the table offline. In addition, this patch adds a check to restore planning which causes the restore to fail immedidiately if any restore target is missing an introduced span. The patch tests this check on a corrupt backup taken on v21.2. Release note(bug fix): fix apart of Tech Advisory https://cockroachlabs.atlassian.net/browse/TSE-198 Release justification: bug fix
msbutler
added a commit
that referenced
this issue
Oct 12, 2022
…ptorChanges getReintroducedSpans finds all tables included in the current and previous backup that may have undergone a non-mvcc operation. The current backup will then back up these tables' spans from ts = 0, as the previous backup may have missed certain non-mvcc operations (e.g. ClearRange, non-mvcc AddSSTable). To find these tables, getReintroducedSpans must find all tables covered in the previous backup that were in the offline state at previous backup start time. This function assumed that the previous backup's manifest.Descriptors field would contain all tables covered in the previous backup; however, while investigating #88042, we discovered that this assumption is not correct in pre 22.2 branches. During revision history backups, a table with an in-progress import (e.g. offline at backup time) can get backed up and included in manifest.DescriptorChanges but not in manifest.Descriptors. This implies getReintroducedSpans missed reintroducing spans from this table, implying that backup chains have missed backing up some data. It's worth noting this bug could only affect a backup chain in 22.2 iff the backup that observed the table go offline was taken on a 22.1 cluster. Because 22.2 backups explicitly back up all offline tables, all tables in manifest.DescriptorChanges should also be in manifest.Descriptors. In other words, before this patch, the following sequence could lead to a corrupt backup chain: - t0: begin IMPORT on foo, on a pre 22.2 cluster - t1: conduct a revision history backup of foo - t2: complete or cancel the import of foo, via a non-mvcc operation - t3: upgrade the cluster to 22.2 - t4: conduct a revision history incremental backup of foo - foo's span will not get re-introduced This patch fixes getReintroducedSpans to ensure it reintroduces tables included in manifest.DescriptorChanges whose last revision brought the table offline. In addition, this patch adds a check to restore planning which causes the restore to fail immedidiately if any restore target is missing an introduced span. The patch tests this check on a corrupt backup taken on v21.2. Release note(bug fix): fix apart of Tech Advisory https://cockroachlabs.atlassian.net/browse/TSE-198 Release justification: bug fix
msbutler
added a commit
to msbutler/cockroach
that referenced
this issue
Oct 12, 2022
…ptorChanges getReintroducedSpans finds all tables included in the current and previous backup that may have undergone a non-mvcc operation. The current backup will then back up these tables' spans from ts = 0, as the previous backup may have missed certain non-mvcc written. To find these tables, getReintroducedSpans must find all tables covered in the previous backup that were in the offline state at previous backup start time. This function assumed that the previous backup's manifest.Descriptors field would contain all tables covered in the previous backup; however, while investigating cockroachdb#88042, we discovered that this assumption is not correct. During revision history backups, a table with an in-progress import (e.g. offline at backup time) can get backed up and included in manifest.DescriptorChanges but not in manifest.Descriptors. This implies that getReintroducedSpans missed reintroducing spans from this table, implying that backup chains have missed backing up some data. This patch fixes getReintroducedSpans to ensure it reintroduces tables included in manifest.DescriptorChanges whose last revision brought the table offline. In addition, this patch adds a check to restore planning which causes the restore to fail immedidiately if any restore target is missing an introduced span. Release note(bug fix): fix apart of Tech Advisory https://cockroachlabs.atlassian.net/browse/TSE-198 Release justification: bug fix
msbutler
added a commit
to msbutler/cockroach
that referenced
this issue
Oct 12, 2022
Currently RESTORE may restore invalid backup data from a backed up table that underwent an IMPORT rollback. See cockroachdb#87305 for a detailed explanation. This patch ensures that RESTORE elides older backup data that were deleted via a non-MVCC operation. Because incremental backups always reintroduce spans (i.e. backs them up from timestamp 0) that may have undergone a non-mvcc operation, restore can identify restoring spans with potentially corrupt data in the backup chain and only ingest the spans' reintroduced data to any system time, without the corrupt data. Here's the basic impliemenation in Restore: - For each span we want to restore - identify the last time, l, the span was introduced, using the manifests - dont restore the span using a backup if backup.EndTime < l This implementation rests on the following assumption: the input spans for each restoration flow (created in createImportingDescriptors) and the restoreSpanEntries (created by makeSimpleImportSpans) do not span across multiple tables. Given this assumption, makeSimpleImportSpans skips adding files from a backups for a given input span that was reintroduced in a subsequent backup. It's worth noting that all significant refactoring occurs on code run by the restore coordinator; therefore, no special care needs to be taken for mixed / cross version backups. In other words, if the coordinator has updated, the cluster restores properly; else, the bug will exist on the restored cluster. It's also worth noting that other forms of this bug are apparent on older cluster versions (cockroachdb#88042, cockroachdb#88043) and has not been noticed by customers; thus, there is no need to fail a mixed version restore to protect the customer from this already existing bug. Informs cockroachdb#87305 Release justification: bug fix Release note (bug fix): fix for TA advisory https://cockroachlabs.atlassian.net/browse/TSE-198
msbutler
removed
release-blocker
Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
branch-release-21.2
labels
Oct 12, 2022
removing remaining release blocker on 21.2, as #89019 has merged The last related PR is on 22.2.0 which, once merged, will close a related GA blocker #88043 |
msbutler
added a commit
to msbutler/cockroach
that referenced
this issue
Oct 14, 2022
…ptorChanges getReintroducedSpans finds all tables included in the current and previous backup that may have undergone a non-mvcc operation. The current backup will then back up these tables' spans from ts = 0, as the previous backup may have missed certain non-mvcc operations (e.g. ClearRange, non-mvcc AddSSTable). To find these tables, getReintroducedSpans must find all tables covered in the previous backup that were in the offline state at previous backup start time. This function assumed that the previous backup's manifest.Descriptors field would contain all tables covered in the previous backup; however, while investigating cockroachdb#88042, we discovered that this assumption is not correct in pre 22.2 branches. During revision history backups, a table with an in-progress import (e.g. offline at backup time) can get backed up and included in manifest.DescriptorChanges but not in manifest.Descriptors. This implies getReintroducedSpans missed reintroducing spans from this table, implying that backup chains have missed backing up some data. It's worth noting this bug could only affect a backup chain in 22.2 iff the backup that observed the table go offline was taken on a 22.1 cluster. Because 22.2 backups explicitly back up all offline tables, all tables in manifest.DescriptorChanges should also be in manifest.Descriptors. In other words, before this patch, the following sequence could lead to a corrupt backup chain: - t0: begin IMPORT on foo, on a pre 22.2 cluster - t1: conduct a revision history backup of foo - t2: complete or cancel the import of foo, via a non-mvcc operation - t3: upgrade the cluster to 22.2 - t4: conduct a revision history incremental backup of foo - foo's span will not get re-introduced This patch fixes getReintroducedSpans to ensure it reintroduces tables included in manifest.DescriptorChanges whose last revision brought the table offline. In addition, this patch adds a check to restore planning which causes the restore to fail immedidiately if any restore target is missing an introduced span. The patch tests this check on a corrupt backup taken on v21.2. Release note(bug fix): fix apart of Tech Advisory https://cockroachlabs.atlassian.net/browse/TSE-198 Release justification: bug fix
msbutler
added a commit
to msbutler/cockroach
that referenced
this issue
Oct 20, 2022
…ptorChanges getReintroducedSpans finds all tables included in the current and previous backup that may have undergone a non-mvcc operation. The current backup will then back up these tables' spans from ts = 0, as the previous backup may have missed certain non-mvcc operations (e.g. ClearRange, non-mvcc AddSSTable). To find these tables, getReintroducedSpans must find all tables covered in the previous backup that were in the offline state at previous backup start time. This function assumed that the previous backup's manifest.Descriptors field would contain all tables covered in the previous backup; however, while investigating cockroachdb#88042, we discovered that this assumption is not correct in pre 22.2 branches. During revision history backups, a table with an in-progress import (e.g. offline at backup time) can get backed up and included in manifest.DescriptorChanges but not in manifest.Descriptors. This implies getReintroducedSpans missed reintroducing spans from this table, implying that backup chains have missed backing up some data. It's worth noting this bug could only affect a backup chain in 22.2 iff the backup that observed the table go offline was taken on a 22.1 cluster. Because 22.2 backups explicitly back up all offline tables, all tables in manifest.DescriptorChanges should also be in manifest.Descriptors. In other words, before this patch, the following sequence could lead to a corrupt backup chain: - t0: begin IMPORT on foo, on a pre 22.2 cluster - t1: conduct a revision history backup of foo - t2: complete or cancel the import of foo, via a non-mvcc operation - t3: upgrade the cluster to 22.2 - t4: conduct a revision history incremental backup of foo - foo's span will not get re-introduced This patch fixes getReintroducedSpans to ensure it reintroduces tables included in manifest.DescriptorChanges whose last revision brought the table offline. In addition, this patch adds a check to restore planning which causes the restore to fail immedidiately if any restore target is missing an introduced span. The patch tests this check on a corrupt backup taken on v21.2. Release note(bug fix): fix apart of Tech Advisory https://cockroachlabs.atlassian.net/browse/TSE-198 Release justification: bug fix
adityamaru
added a commit
to adityamaru/cockroach
that referenced
this issue
Jan 10, 2023
This test was testing a bug that was discovered during backups in pre-22.2 versions cockroachdb#88042. Since this bug does not exist on 22.2+ versions this test will not encounter the expected corruption on backups that are restoreable in this cluster. Release note: None
adityamaru
added a commit
to adityamaru/cockroach
that referenced
this issue
Mar 10, 2023
This test was testing a bug that was discovered during backups in pre-22.2 versions cockroachdb#88042. Since this bug does not exist on 22.2+ versions this test will not encounter the expected corruption on backups that are restoreable in this cluster. Release note: None
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-disaster-recovery
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-disaster-recovery
Before 22.2, backups were assumed to exclude data from offline, importing, tables; however, backups with revision history will contain offline table data because
getRelevantDescChanges
will include offline table descriptors contained in the target database(s). Note that it makes sense to include the offline table in the backup to ensure the user can conduct a RESTORE AOST=beforeImportStartTime, which would restore the importing table to it's pre-import state. However, the inclusion of offline table data in the backup can also lead to corrupt data on restore.Consider the following sequences:
Sequence 1: with a rolled back import
t0: begin IMPORT on foo
t1: backup foo with revision history - captures foo's pre-import state and some importing data
t2: rollback import foo via non-mvcc clear range
t3: incremental backup foo with revision history
t4: restore foo to latest time
Sequence 2: with a completed import
t0: begin IMPORT on foo
t1: backup foo with revision history - captures foo's pre-import state and some importing data
t2: complete IMPORT on foo
t3: incremental backup foo with revision history
spans from the completed import, leading to data loss.
t4: restore foo to latest time
Important note: in either scenario, if another incremental backup completed between t0 and t2, the backup/restore would work just fine. I.e. if an incremental backup observed the table offline at the start and end of its interval, there's no bug.
This bug closely relates to #87305 which is also apparent in 22.2, except this one also manifests on earlier releases and only for backups with revision history. Further, this bug is actually worse than #87305, because here, the incremental backup at t3 does not reintroduce foo's spans, rendering the backup unrestorable, and currently, with undetectable data corruption.
Here's the root cause:
revs
input togetReintroducedSpans
, not intables
becausetables
is created fromprevBackup.Descriptors
but as described above,foo
is excluded from this field. This matters because when we look through revs to add totablesToReinclude
andreintroducedTables
, we only add arev
if it was already in theofflineInLastBackup
map which is constructed with thetables
variable.The implications:
Jira issue: CRDB-19657
The text was updated successfully, but these errors were encountered: