Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: add SHOW FINGERPRINT TABLE with exclude_columns option #129497

Closed
msbutler opened this issue Aug 22, 2024 · 1 comment · Fixed by #129766
Closed

sql: add SHOW FINGERPRINT TABLE with exclude_columns option #129497

msbutler opened this issue Aug 22, 2024 · 1 comment · Fixed by #129766
Assignees
Labels
A-disaster-recovery C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) P-2 Issues/test failures with a fix SLA of 3 months T-disaster-recovery

Comments

@msbutler
Copy link
Collaborator

msbutler commented Aug 22, 2024

To fingerprint LDR tables, we need to exclude the hidden mvcc_origin_timestamp column. To do so, we ought to add a new option to the sql level fingerprint command that will exclude user provided columns from the fingerprint calculation. Essentially, we need to hydrate the ignoredColumns in the BuildFingerprintQueryForIndex function call.

We should also get started with calling BuildFingerprintQueryForIndex in the LDR roachtests as well. the fingerprints should match once the workload has stopped on both clusters, the LDR checkpoints are past time.now, and there's nothing in the DLQ.

Jira issue: CRDB-41568

@msbutler msbutler added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-disaster-recovery labels Aug 22, 2024
Copy link

blathers-crl bot commented Aug 22, 2024

cc @cockroachdb/disaster-recovery

@exalate-issue-sync exalate-issue-sync bot added the P-2 Issues/test failures with a fix SLA of 3 months label Aug 26, 2024
navsetlur added a commit to navsetlur/cockroach that referenced this issue Aug 28, 2024
…NGERPRINT

Currently tables used for logical data replication can't be fingerprinted
as they contain a column crdb_replication_origin_timestamp which will vary
based on the internal MVCC timestamp. Adding an option to SHOW
EXPERIMENTAL FINGERPRINT will allow LDR tables to be fingerprinted by excluding
the replication column. A follow up will use this to verify tables in roachtests

Release note: none
Fixes: cockroachdb#129497
craig bot pushed a commit that referenced this issue Aug 29, 2024
129766: logical: Add the EXCLUDE COLUMNS option to SHOW EXPERIMENTAL_FI… r=msbutler a=navsetlur

…NGERPRINT

Currently tables used for logical data replication can't be fingerprinted
as they contain a column crdb_replication_origin_timestamp which will vary
based on the internal MVCC timestamp. Adding an option to SHOW
EXPERIMENTAL FINGERPRINT will allow LDR tables to be fingerprinted by excluding
the replication column. A follow up will use this to verify tables in LDR roachtests

Release note: none
Fixes: #129497

129844: roachtest: remove COCKROACH_DISABLE_LEADER_FOLLOWS_LEASEHOLDER from `failover` tests r=nvanbenschoten a=nvanbenschoten

Fixes #125259.

This commit removes the use of COCKROACH_DISABLE_LEADER_FOLLOWS_LEASEHOLDER from the `failover/partial/lease-leader` tests. Instead of using the cluster setting to disable leader follows leaseholder, the tests now creates the kind of situation where leader and leaseholder could be split indefinitely when using epoch-based leases.

This is more realistic and allows us to test lease+leader failover with leader leases when the leaseholder gets partitioned from its followers but remains connected to the liveness range.

I confirmed that by default, both the epoch-based and expiration-based lease variants of this test still fail at 60s due to stuck requests. This is what we see in roachperf on master as well:
- https://roachperf.crdb.dev/?filter=leader&view=failover%2Fpartial%2Flease-leader&tab=gce
- https://roachperf.crdb.dev/?filter=leader&view=failover%2Fpartial%2Flease-leader%2Flease%3Dexpiration&tab=gce

However, notice that the expiration-based variant temporarily improved when we enabled DistSender circuit breakers, which add a timeout to all KV requests. This is because the expiration-based lease setup can actually recover range availability, we just don't currently unwedge requests that were stuck before the recovery. This is of arguable importance, as most real-world users run with application side timeouts. Regardless, I see the same improvement if I re-enable DistSender circuit breakers. With them, `failover/partial/lease-leader` stays at 60s (infinite) recovery time while `failover/partial/lease-leader/lease=expiration` drops to recovering in ~20s.

Release note: None

Co-authored-by: Naveen Setlur <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in 84f5cec Aug 29, 2024
@github-project-automation github-project-automation bot moved this from Backlog to Done in Disaster Recovery Backlog Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) P-2 Issues/test failures with a fix SLA of 3 months T-disaster-recovery
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

2 participants