kv: remove COCKROACH_DISABLE_LEADER_FOLLOWS_LEASEHOLDER #125259

Closed
nvanbenschoten opened this issue Jun 6, 2024 · 1 comment · Fixed by #129844

nvanbenschoten commented Jun 6, 2024

Part of #123847.

From prototype doc:

To validate this requirement, we manually test the indefinite outage scenario from the failover/partial/lease-leader test (reference) against the prototype. We can’t run the roachtest directly, because it (ab)uses the COCKROACH_DISABLE_LEADER_FOLLOWS_LEASEHOLDER environment variable to force a split leader / epoch-leaseholder configuration. Instead, we do something better and create the type of network partition that would lead to a split leader+leaseholder configuration in a real environment.

To do so, our test creates a network partition between a leader+leaseholder and its two followers. However, it keeps the leader+leaseholder node connected to the node liveness range. This allows epoch-based leases to be extended even when the leaseholder cannot contact the rest of its peers. The outage persists as long as the lease is held by the partitioned node.

We will update the roachtest to do something similar and not rely on the test-only environment variable.
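To make the partition shape described above concrete, here is a minimal, hypothetical Go sketch (not the roachtest implementation) of installing such an asymmetric partition with iptables on the leaseholder host: traffic to and from the two followers is dropped, while the liveness-range nodes stay reachable so epoch-based leases keep getting extended. The helper name and node addresses are illustrative assumptions.

```go
package main

import (
	"fmt"
	"os/exec"
)

// partitionFromFollowers installs DROP rules on the leaseholder host for each
// follower address, in both directions. Because no rule targets the
// liveness-range nodes, the leaseholder can still heartbeat its liveness
// record and keep extending its epoch-based lease.
func partitionFromFollowers(followerAddrs []string) error {
	for _, addr := range followerAddrs {
		rules := [][]string{
			{"-A", "INPUT", "-s", addr, "-j", "DROP"},  // drop packets arriving from the follower
			{"-A", "OUTPUT", "-d", addr, "-j", "DROP"}, // drop packets destined for the follower
		}
		for _, rule := range rules {
			args := append([]string{"iptables"}, rule...)
			if out, err := exec.Command("sudo", args...).CombinedOutput(); err != nil {
				return fmt.Errorf("iptables %v failed: %v: %s", rule, err, out)
			}
		}
	}
	return nil
}

func main() {
	// Placeholder addresses for the two followers of the partitioned
	// leaseholder; the liveness-range nodes are intentionally not listed.
	if err := partitionFromFollowers([]string{"10.0.0.2", "10.0.0.3"}); err != nil {
		fmt.Println("error:", err)
	}
}
```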

Jira issue: CRDB-39350

Epic CRDB-37522

@nvanbenschoten nvanbenschoten added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. A-kv-replication Relating to Raft, consensus, and coordination. A-testing Testing tools and infrastructure T-kv KV Team A-leader-leases Related to the introduction of leader leases labels Jun 6, 2024
@nvanbenschoten nvanbenschoten self-assigned this Jun 6, 2024

blathers-crl bot commented Jun 6, 2024

cc @cockroachdb/replication

@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024
craig bot pushed a commit that referenced this issue Aug 29, 2024
129766: logical: Add the EXCLUDE COLUMNS option to SHOW EXPERIMENTAL_FINGERPRINT r=msbutler a=navsetlur

Currently, tables used for logical data replication can't be fingerprinted because they contain a crdb_replication_origin_timestamp column that varies with the internal MVCC timestamp. Adding an EXCLUDE COLUMNS option to SHOW EXPERIMENTAL_FINGERPRINT allows LDR tables to be fingerprinted by excluding the replication column (a usage sketch follows this PR description). A follow-up will use this to verify tables in LDR roachtests.

Release note: none
Fixes: #129497
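
For illustration, here is a minimal, hypothetical Go sketch of how a test might fingerprint an LDR table while skipping the replication column. The connection string and table name are placeholders, and the WITH EXCLUDE COLUMNS clause is an assumed rendering of the option described in this PR; only the base SHOW EXPERIMENTAL_FINGERPRINTS FROM TABLE statement is pre-existing CockroachDB syntax.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Placeholder connection string for a local, insecure cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The EXCLUDE COLUMNS clause below is an assumed rendering of the option
	// added by #129766; the surrounding statement is standard CockroachDB SQL.
	rows, err := db.Query(`SHOW EXPERIMENTAL_FINGERPRINTS FROM TABLE t
		WITH EXCLUDE COLUMNS = ('crdb_replication_origin_timestamp')`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	// Print one fingerprint per index; comparing these across the two LDR
	// sides would verify that the table contents match.
	for rows.Next() {
		var indexName string
		var fingerprint sql.NullInt64
		if err := rows.Scan(&indexName, &fingerprint); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d\n", indexName, fingerprint.Int64)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```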

129844: roachtest: remove COCKROACH_DISABLE_LEADER_FOLLOWS_LEASEHOLDER from `failover` tests r=nvanbenschoten a=nvanbenschoten

Fixes #125259.

This commit removes the use of COCKROACH_DISABLE_LEADER_FOLLOWS_LEASEHOLDER from the `failover/partial/lease-leader` tests. Instead of using the environment variable to disable leader-follows-leaseholder behavior, the tests now create the kind of situation where the leader and leaseholder could be split indefinitely when using epoch-based leases.

This is more realistic and allows us to test lease+leader failover with leader leases when the leaseholder gets partitioned from its followers but remains connected to the liveness range.

I confirmed that by default, both the epoch-based and expiration-based lease variants of this test still fail at 60s due to stuck requests. This is what we see in roachperf on master as well:
- https://roachperf.crdb.dev/?filter=leader&view=failover%2Fpartial%2Flease-leader&tab=gce
- https://roachperf.crdb.dev/?filter=leader&view=failover%2Fpartial%2Flease-leader%2Flease%3Dexpiration&tab=gce

However, notice that the expiration-based variant temporarily improved when we enabled DistSender circuit breakers, which add a timeout to all KV requests. This is because the expiration-based lease setup can actually recover range availability; we just don't currently unwedge requests that were stuck before the recovery. This is of arguable importance, as most real-world users run with application-side timeouts (see the sketch at the end of this comment). Regardless, I see the same improvement if I re-enable DistSender circuit breakers. With them, `failover/partial/lease-leader` stays at 60s (infinite) recovery time, while `failover/partial/lease-leader/lease=expiration` drops to recovering in ~20s.

Release note: None

Co-authored-by: Naveen Setlur <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
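
As an aside on the application-side timeouts mentioned above, here is a minimal, hypothetical Go sketch of what such a timeout looks like from the client: a context deadline bounds how long a request can stay stuck, independent of any server-side recovery. The connection string and query are placeholders.

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder connection string.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Bound the query with a 10s deadline. If the range backing t is
	// unavailable, the call returns a context error instead of hanging
	// indefinitely, which is why stuck-but-recoverable requests matter less
	// to applications that already use timeouts.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	var count int
	if err := db.QueryRowContext(ctx, "SELECT count(*) FROM t").Scan(&count); err != nil {
		log.Printf("query failed (possibly timed out): %v", err)
		return
	}
	log.Printf("count = %d", count)
}
```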
@craig craig bot closed this as completed in 8da8a91 Aug 29, 2024
@github-project-automation github-project-automation bot moved this from Incoming to Closed in KV Aug 29, 2024