Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backupccl: restore/pause/* roachtest fails on checkForKeyCollisions error #98779

Closed
msbutler opened this issue Mar 16, 2023 · 7 comments
Closed
Assignees
Labels
A-disaster-recovery branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-technical-advisory Caused a technical advisory S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. T-disaster-recovery

Comments

@msbutler
Copy link
Collaborator

msbutler commented Mar 16, 2023

The restore/pause/tpce/15GB/aws/nodes=4/cpus=8 test on this commit failed after a restore resumed and re-ingested some keys that were already added before the pause. These keys should be allowed to collide when the disallowShaddowingBelow option is set to a TS of 1. . This failure is also likely reproducible on the larger restore/pause/tpce/80GB/aws/nodes=4/cpus=8 test on master. Not I was testing a crdb sha before adding the checkpoint frontier in #97862

@itsbilal mentioned that some updates were made to checkForKeyCollision code in the past week, so there may have been a bug introduced in these changes.

A debug zip is here.

Jira issue: CRDB-25516

@msbutler msbutler added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-disaster-recovery GA-blocker T-storage Storage Team branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 labels Mar 16, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 16, 2023

cc @cockroachdb/disaster-recovery

@blathers-crl blathers-crl bot added A-storage Relating to our storage engine (Pebble) on-disk storage. A-disaster-recovery labels Mar 16, 2023
@msbutler msbutler changed the title backupccl: restore/pause/* roachtest fails on false positive checkForKeyCollisions error backupccl: restore/pause/* roachtest fails on checkForKeyCollisions error Mar 16, 2023
@msbutler msbutler self-assigned this Mar 17, 2023
@msbutler
Copy link
Collaborator Author

msbutler commented Mar 17, 2023

with the slimmer restore with pause test, and with a more verbose checkForKeyCollision error, I've verified the colliding key has a different value than the one initially ingested. On the fifth run of:

roachtest run restore/pause/tpce/15GB/aws/nodes=4/cpus=8 --user=$CLUSTER --cockroach=artifacts/cockroach  --workload artifacts/workload --cloud aws --count 10
RESTORE job 848502650761150465: stepping through state failed with error: importing 146 ranges: addsstable [/Table/116/2/‹43000000016›/‹"TWR"›/‹2006-02-21T15:47:16.309Z›/‹20000001     7269455›/‹0›,/Table/116/2/‹43000009994›/‹"TRCMPRB"›/‹2006-02-22T10:58:43.105Z›/‹200000017292658›/‹0›/‹NULL›): checking for key collisions: ingested key collides with an existing one: /Table/116/2/‹43000001121›/‹"TWLB"›/‹2006-02-23T14:08:12.808Z›/     ‹200000017373801›/‹0›; extValue: [‹236› ‹172› ‹155› ‹5› ‹3› ‹85› ‹4› ‹52› ‹138› ‹11› ‹70› ‹19› ‹200› ‹1›]; sstValue [‹112› ‹192› ‹45› ‹117› ‹3› ‹85› ‹4› ‹52› ‹138› ‹11› ‹70› ‹19› ‹144› ‹3›]; timestamp ‹1679012880.841814117,0›

This could be real bad. It implies that the restore_data_processor does not always return the latest mvcc value for a given key.

Next steps:

  • Check if I can repro this on 22.2. If I can, I'll block the release.
  • Check if this occurs before some of the generative split and scatter work.

@msbutler msbutler added release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 and removed A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Mar 17, 2023
@msbutler
Copy link
Collaborator Author

I've hit the bug on 22.2 as well.

@msbutler
Copy link
Collaborator Author

msbutler commented Mar 17, 2023

On the sha before the generative split and scatter processor merged, this test is passing under stress. So, I suspect one of the commits described in this backport is leading to this potential data corruption.

@itsbilal
Copy link
Contributor

Nice find, thanks for digging into this @msbutler ! Feel free to ping me again if you need more storage eyes on this.

rhu713 pushed a commit to rhu713/cockroach that referenced this issue Mar 20, 2023
…rtSpans

Currently generateAndSendImportSpans has a bug in which, on resume, the first
spans in the covering may be missing files. This happens because when the
algorithm iteratively generates a covering for the current span, it needs to
also append the covering files from the previous cover span. This becomes an
issue when spans below the watermark are skipped, thus not allowing the files
covering those span to be considered as a "previous" cover span.

This patch fixes the issue by never skipping spans even in the presence of a
watermark. For the case where generateAndSendImportSpans generates a cover
whose span is below the watermark, it just chooses to not send it to the
output channel.

Fixes cockroachdb#98779

Release note (bug fix): fixed a bug introduced in 22.2.6 in which a restore
job, on resume, can miss files for the first few spans that are being restored.
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Mar 20, 2023
Set the default value of bulkio.restore.use_simple_import_spans to true as
the generative import spans algorithm has a bug that causes it to emit import
spans with missing files on resume.

Fixes: cockroachdb#98779

Release note (bug fix): sets bulkio.restore.use_simple_import_spans to true. If
the setting is false, restore can emit miss files from the first few spans on
job resume.
@msbutler
Copy link
Collaborator Author

removing release blocker as Rui's tests are green #99304

msbutler added a commit to msbutler/cockroach that referenced this issue Mar 28, 2023
Previously, restore roachtests had little ability to detect data corruption
regressions across runs. This patch introduces this ability. Specifically,
this commit allows the restore roachtest writer to easily run a stripped
fingerprint after a restore, and assert a match to the hardcoded fingerprint
in the test spec.

For now, the fingerprint check is only run on the restore roachtests that
restore 15GB of data. The check takes about the same amount of time it takes to
run the restore (around 3 minutes), so before we use it on larger tests, we
ought to consider adding performance improvements to the fingerprinting tool.
These tests include:
- restore/nodeShutdown/coordinator
- restore/pause/tpce/15GB/aws/nodes=4/cpus=8 (used to restore 80GB)
- restore/tpce/15GB/aws/nodes=4/cpus=8 (new test)
- restore/nodeShutdown/worker (used to restore 80GB)
- restore/nodeShutdown/coordinator (used to restore 80GB)

This patch also changes the node shutdown tests and the paused restore test to
run the smaller 15GB tpce fixture, as it speeds the test run up.

Informs cockroachdb#98779

Release note: none
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Mar 29, 2023
…ryCover

Add some testing with randomized completed spans to
TestRestoreEntryCover. This testing should demonstrate the correctness
of generateAndSendImportSpans in the presence of completed spans.

Informs: cockroachdb#98779

Release note: None
msbutler added a commit to msbutler/cockroach that referenced this issue Mar 30, 2023
Previously, restore roachtests had little ability to detect data corruption
regressions across runs. This patch introduces this ability. Specifically,
this commit allows the restore roachtest writer to easily run a stripped
fingerprint after a restore, and assert a match to the hardcoded fingerprint
in the test spec.

For now, the fingerprint check is only run on the restore roachtests that
restore 15GB of data. The check takes about the same amount of time it takes to
run the restore (around 3 minutes), so before we use it on larger tests, we
ought to consider adding performance improvements to the fingerprinting tool.
These tests include:
- restore/nodeShutdown/coordinator
- restore/pause/tpce/15GB/aws/nodes=4/cpus=8 (used to restore 80GB)
- restore/tpce/15GB/aws/nodes=4/cpus=8 (new test)
- restore/nodeShutdown/worker (used to restore 80GB)
- restore/nodeShutdown/coordinator (used to restore 80GB)

This patch also changes the node shutdown tests and the paused restore test to
run the smaller 15GB tpce fixture, as it speeds the test run up.

Informs cockroachdb#98779

Release note: none
craig bot pushed a commit that referenced this issue Mar 30, 2023
98983: server: only set default tenant if login successful r=knz a=dhartunian

Previously, we would always set the default tenant cookie to the default tenant cluster setting regardless of what tenants the user logged-in to successfully.

This change ensures that the default tenant selection is only used when the successful logins include that tenant. Otherwise, we select the first tenant from the list of successful logins.

Epic: CRDB-12100
Release note: None

99792: backupccl: fingerprint 15GB restore roachtests r=rhu713 a=msbutler

Previously, restore roachtests had little ability to detect data corruption regressions across runs. This patch introduces this ability. Specifically, this commit allows the restore roachtest writer to easily run a stripped fingerprint after a restore, and assert a match to the hardcoded fingerprint in the test spec.

For now, the fingerprint check is only run on the restore roachtests that restore 15GB of data. The check takes about the same amount of time it takes to run the restore (around 3 minutes), so before we use it on larger tests, we ought to consider adding performance improvements to the fingerprinting tool. These tests include:
- restore/nodeShutdown/coordinator
- restore/pause/tpce/15GB/aws/nodes=4/cpus=8 (used to restore 80GB)
- restore/tpce/15GB/aws/nodes=4/cpus=8 (new test)
- restore/nodeShutdown/worker (used to restore 80GB)
- restore/nodeShutdown/coordinator (used to restore 80GB)

This patch also changes the node shutdown tests and the paused restore test to run the smaller 15GB tpce fixture, as it speeds the test run up.

Informs #98779

Release note: none

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Mar 30, 2023
Previously, restore roachtests had little ability to detect data corruption
regressions across runs. This patch introduces this ability. Specifically,
this commit allows the restore roachtest writer to easily run a stripped
fingerprint after a restore, and assert a match to the hardcoded fingerprint
in the test spec.

For now, the fingerprint check is only run on the restore roachtests that
restore 15GB of data. The check takes about the same amount of time it takes to
run the restore (around 3 minutes), so before we use it on larger tests, we
ought to consider adding performance improvements to the fingerprinting tool.
These tests include:
- restore/nodeShutdown/coordinator
- restore/pause/tpce/15GB/aws/nodes=4/cpus=8 (used to restore 80GB)
- restore/tpce/15GB/aws/nodes=4/cpus=8 (new test)
- restore/nodeShutdown/worker (used to restore 80GB)
- restore/nodeShutdown/coordinator (used to restore 80GB)

This patch also changes the node shutdown tests and the paused restore test to
run the smaller 15GB tpce fixture, as it speeds the test run up.

Informs #98779

Release note: none
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Apr 3, 2023
…ryCover

Add some testing with randomized completed spans to
TestRestoreEntryCover. This testing should demonstrate the correctness
of generateAndSendImportSpans in the presence of completed spans.

Informs: cockroachdb#98779

Release note: None
craig bot pushed a commit that referenced this issue Apr 3, 2023
99304: backupccl: add test with randomized completed spans to TestRestoreEntryCover r=rhu713 a=rhu713

Add some testing with randomized completed spans to TestRestoreEntryCover. This testing should demonstrate the correctness of generateAndSendImportSpans in the presence of completed spans.

Informs: #98779

Release note: None

100287: kvserver: add `leases.requests.latency` metric r=erikgrinaker a=erikgrinaker

This patch adds a histogram of lease request latencies. It includes all request types (acquisitions, transfers, and extensions) and all outcomes (successes and errors), but only considers the coalesced lease requests regardless of the number of waiters and how long they have been waiting for.

Epic: none
Release note (ops change): Added a metric `leases.requests.latency` recording a histogram of lease request latencies.

100450: roachtest: unskip `acceptance/gossip/peerings` r=erikgrinaker a=erikgrinaker

Addressed by bfed880.

Resolves #96091.
Touches #100213.

Epic: none
Release note: None

Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Apr 3, 2023
…ryCover

Add some testing with randomized completed spans to
TestRestoreEntryCover. This testing should demonstrate the correctness
of generateAndSendImportSpans in the presence of completed spans.

Informs: #98779

Release note: None
@dt dt closed this as completed Apr 5, 2023
@msbutler msbutler added the S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. label May 5, 2023
@rytaft rytaft added C-technical-advisory Caused a technical advisory branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 and removed branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 GA-blocker labels Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-technical-advisory Caused a technical advisory S-0-corruption-or-data-loss Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading. T-disaster-recovery
Projects
No open projects
Archived in project
Development

No branches or pull requests

6 participants